Digging deep into tennis stats to forecast this year's Australian Open (and beyond)


For quite a few sports fans -- typically of the nerdier bent -- numbers can give us mooring. They give us an extra layer of perspective or context that allows us to fully understand what we're watching or writing about. Team A and Team B are both good, but how good are they? That was an amazing upset we just watched, but how unexpected was it? Numbers have colored the user experience of most major sports for years. Historically, however, tennis has lagged behind.

Mind you, thanks to data providers like IBM, we have more information in this sport than we've ever had before. In a given match, we see spray charts for players' first serves, their depth behind the baseline on returns, etc. That is interesting and worthwhile. But on any mainstream level, we skipped the part where we use numbers and probabilities to rank players by surface and derive quality win probabilities.

That's not to say such numbers don't exist, however. My friend Colin Davy, a data scientist, two-time MIT Sloan Sports Analytics Conference Hackathon winner, and the creator of the bet win probability calculator for the Action Network, began playing with a system of tennis rankings years ago. He has used it to explore slam projections, aging curves, and other topics, and he recently took it out of mothballs to help me answer some questions about the current state of the men's and women's tours.

The primary goal is simple: grade players based on their actual performance instead of just giving them accomplishment points (which are, in turn, derived in part from how lucky their draw is), as the WTA and ATP points systems do. A dominant victory (6-2, 6-1) is going to be more predictive than an unlikely one (7-6, 0-6, 7-6), and a performance against a top-level player is going to be graded on a different curve than one against No. 293. Most importantly for tennis' purposes, Davy's system gives a player different ratings for different playing surfaces.

Compared to the ATP rankings, these ratings are likely to reward long-term steadiness over the type of single-season -- or even single-slam -- hot streak that Daniil Medvedev enjoyed last summer. They're going to remain skeptical of rising players for a bit longer.

(They're also not going to automatically punish injured players like Juan Martin del Potro or Kei Nishikori for their respective absences.)

While Serena Williams has still been excellent since her return from having a child -- she has, after all, made four of the past six slam finals -- her percentile rating here, especially compared to the Big Three on the men's side (Rafael Nadal, Novak Djokovic, Roger Federer), pretty clearly hints at the vulnerability we've seen from her in those slam finals, all straight-set losses. Still, she and Ashleigh Barty stand out a bit from a crowded field.

Perhaps the most interesting immediate takeaway from these ratings: The men's Big Three are the only players on either tour with an overall percentile rating in the 90s. It's really, really hard to establish a level a couple of standard deviations off from the rest of the field, and it makes their overall accomplishments even more remarkable than they already seemed.

That said, these greats' ratings have fallen from respective career peaks: Djokovic's hard-court rating was in the 99th percentile in mid-2016, Federer's was in the 98th just two years ago, and Williams' was in or near the 98th percentile through 2014. (At that end of the bell curve, a small numerical difference can represent a good amount of space.) Only Nadal's current rating is particularly close to his career high.
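To see why a single percentile point matters so much at the top, here is a minimal sketch that converts percentiles into standard deviations above the mean. It assumes ratings follow a roughly normal bell curve, which is an illustration on my part and not something Davy's system is confirmed to do. The function names are hypothetical.

```python
from statistics import NormalDist

# Assumption: ratings are roughly normally distributed, so each percentile
# corresponds to a z-score (standard deviations above the mean).
standard_normal = NormalDist()

def z_score(percentile: float) -> float:
    """Standard deviations above the mean for a given percentile (0-100)."""
    return standard_normal.inv_cdf(percentile / 100)

def gap(lower: float, upper: float) -> float:
    """Distance, in standard deviations, between two percentiles."""
    return z_score(upper) - z_score(lower)

middle_gap = gap(50, 51)  # one point in the middle of the pack
top_gap = gap(98, 99)     # one point at the elite end

print(f"50th -> 51st percentile: {middle_gap:.3f} standard deviations")
print(f"98th -> 99th percentile: {top_gap:.3f} standard deviations")
```

Under that assumption, the step from the 98th to the 99th percentile covers roughly ten times the distance of a step in the middle of the field.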

Projecting the 2020 Australian Open

OK, great, we have some numbers now. What should we do with them? The first thing we can do is forecast the Australian Open, which began this week in Melbourne.

Odds of winning the men's tournament (before play began):

• Djokovic (42%), Nadal (21%), Federer (20%), Medvedev (6%). No one else is above 1%.

This pretty quickly tells us about Djokovic's and Nadal's respective draws. Though they share almost the same hard-court rating, Djokovic has a 77% chance of making the semifinals, and Nadal is at only 58%. Federer's draw is also favorable (69% chance of reaching the semis). Medvedev's quarter, which features Andrey Rublev, Alexander Zverev, Stanislas Wawrinka and David Goffin, among others, is easily the muddiest of the bunch.

Odds of winning the women's tournament (before play began):

• Serena Williams (30%), Karolina Pliskova (10%), Barty (9%), Naomi Osaka (8%), Simona Halep (7%), Petra Kvitova (4%), Aryna Sabalenka (3%), Elina Svitolina (3%), Madison Keys (3%), Sofia Kenin (3%).

Here's your semi-regular reminder that the men's best-of-five format is inherently less conducive to upsets than the women's best-of-three. So the women's odds are going to feature a broader distribution of possibilities even before you factor in the lack of a women's Big Three.
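To put a rough number on that format effect, here is a minimal sketch that turns a per-set win probability into a match win probability. It assumes every set is independent and that the favorite's chance of winning a set never changes, which is a simplification and not how Davy's system is known to work. The function name and the 60% figure are hypothetical.

```python
from math import comb

def match_win_probability(set_win_prob: float, best_of: int) -> float:
    """Chance of winning a best-of-N match, assuming independent sets."""
    sets_needed = best_of // 2 + 1
    set_loss_prob = 1 - set_win_prob
    # Sum over the number of sets the winner drops along the way.
    return sum(
        comb(sets_needed - 1 + lost, lost)
        * set_win_prob ** sets_needed
        * set_loss_prob ** lost
        for lost in range(sets_needed)
    )

# Hypothetical favorite who wins 60% of sets against this opponent.
favorite_bo3 = match_win_probability(0.6, best_of=3)
favorite_bo5 = match_win_probability(0.6, best_of=5)

print(f"Upset chance, best-of-three: {1 - favorite_bo3:.1%}")
print(f"Upset chance, best-of-five:  {1 - favorite_bo5:.1%}")
```

With those inputs, the underdog's chances drop from about 35% in best-of-three to about 32% in best-of-five. That small edge compounds over seven rounds.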

It's pretty clear this draw was intensely favorable for Williams and Pliskova, at the expense of Osaka and Barty (who will have to claw past Kvitova, Keys, and Maria Sakkari, among others, just to get out of her quarter of the draw).

Projecting the next two years' worth of slams

Why stop at forecasting one tournament, though? Since we're constantly wondering about when the sport's all-timers are going to finally be overtaken by a younger generation, let's see what Davy's numbers have to say about the next two years' worth of slams.

Projecting slams, independent of the draws, requires some generalizations -- seeding approximations, generic draw difficulty. We saw above just how draw-dependent one's title odds can be, so keep in mind that the odds for each tournament will shift dramatically when the draw comes out.

Still, Davy simulated each upcoming slam 2,000 times, based on approximately where ratings stand now and, using aging curves, where they will stand in the future. This exercise produces some interesting results.
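For a sense of what simulating a slam 2,000 times involves, here is a minimal sketch of a bracket simulation. Everything in it is an assumption for illustration: the eight-player field, the made-up ratings, the bracket order and the Elo-style win probability curve are all hypothetical, and none of it is Davy's actual model.

```python
import random
from collections import Counter

def win_probability(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Elo-style logistic curve (an assumption, not Davy's formula)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

def simulate_tournament(bracket, ratings, rng):
    """Play one single-elimination bracket; adjacent entries meet each round."""
    players = list(bracket)
    while len(players) > 1:
        winners = []
        for a, b in zip(players[0::2], players[1::2]):
            p = win_probability(ratings[a], ratings[b])
            winners.append(a if rng.random() < p else b)
        players = winners
    return players[0]

def title_odds(bracket, ratings, n_sims=2000, seed=0):
    """Share of simulated tournaments won by each player."""
    rng = random.Random(seed)
    titles = Counter(
        simulate_tournament(bracket, ratings, rng) for _ in range(n_sims)
    )
    return {player: titles[player] / n_sims for player in bracket}

# Hypothetical ratings for an eight-player field (bracket size must be a power of two).
ratings = {"A": 2100, "B": 2050, "C": 1950, "D": 1900,
           "E": 1850, "F": 1800, "G": 1750, "H": 1700}
bracket = ["A", "H", "D", "E", "C", "F", "B", "G"]

odds = title_odds(bracket, ratings)
for player, share in sorted(odds.items(), key=lambda item: -item[1]):
    print(f"{player}: {share:.1%}")
```

Changing the bracket order while keeping the ratings fixed is a quick way to see how much a draw alone can move a player's title odds.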

Projected slam favorites over the next two years (ATP)

• 2020 French Open: Nadal 23%, Djokovic 17%
• 2020 Wimbledon: Djokovic 17%, Federer 17%
• 2020 US Open: Federer 19%, Djokovic 16%
• 2021 Australian Open: Djokovic 33%, Nadal 29%
• 2021 French Open: Nadal 22%, Djokovic 18%
• 2021 Wimbledon: Djokovic 20%, Federer 17%
• 2021 US Open: Djokovic 18%, Federer 18%

The specific percentages could change with a different sample size, but this lays out the favorites, and ... they're pretty familiar.
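As a rough guide to how much wiggle room a 2,000-run sample leaves, here is a minimal sketch using the standard error of a proportion. It covers only the noise from the number of simulations, not any uncertainty in the ratings themselves, and the function name is hypothetical.

```python
from math import sqrt

def simulation_standard_error(probability: float, n_sims: int) -> float:
    """Standard error of a win share estimated from n_sims independent runs."""
    return sqrt(probability * (1 - probability) / n_sims)

# A 17% title chance estimated from 2,000 simulated tournaments.
se = simulation_standard_error(0.17, 2000)
margin = 1.96 * se  # approximate 95% interval half-width

print(f"Standard error: {se:.2%}")
print(f"Rough 95% range: {0.17 - margin:.1%} to {0.17 + margin:.1%}")
```

By this measure, a 17% figure carries a margin of a bit more than a point and a half in either direction, so a one-point gap between two players is within the noise.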

Average projected slam titles over the next two years (ATP)

By simply adding the percentages together (a 23% chance = 0.23 titles), we can come up with a loose average of how many slams each player is projected to win.

• Djokovic 1.8 (best odds: 2020 Aussie 42%)
• Nadal 1.1 (best odds: 2021 Aussie 29%)
• Federer 1.1 (best odds: 2020 Aussie 20%)
• Medvedev 0.4 (best odds: 2020 US Open 9%)
• Stefanos Tsitsipas 0.4 (best odds: 2020 US Open 8%)
• Zverev 0.3 (best odds: 2020 French Open 7%)
• Alex De Minaur 0.2 (best odds: 2021 US Open 6%)
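The addition behind these totals can be checked for Djokovic, the one player whose odds are listed above for all eight slams. This is a minimal sketch using only the published percentages; the variable names are my own.

```python
# Djokovic's title odds as listed above for each of the next eight slams.
djokovic_odds = {
    "2020 Australian Open": 0.42,
    "2020 French Open": 0.17,
    "2020 Wimbledon": 0.17,
    "2020 US Open": 0.16,
    "2021 Australian Open": 0.33,
    "2021 French Open": 0.18,
    "2021 Wimbledon": 0.20,
    "2021 US Open": 0.18,
}

# Each probability counts as that fraction of a title, so the sum is the
# average number of slams he would win across many replays of these two years.
expected_titles = sum(djokovic_odds.values())
print(f"Djokovic projected titles: {expected_titles:.1f}")
```

The sum comes to 1.81, which matches the 1.8 shown in the list.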

There are two ways to look at this.

1. The new overlords are the old overlords. Even with aging curves baked into the numbers, Djokovic, Nadal, and Federer are in most cases the top three favorites (or three of the top four) for any given slam.

2. On average, the Big Three are still projected to win only four of the next eight slams. Part of this is because these projections are, by nature, conservative. But these totals also serve as a reminder that the Big Three's collective form is indeed down from career peaks. And as Djokovic, Nadal and Federer continue to age, they might be more dependent on strong draws and easy first-week runs than they used to be. The door could be open for rising stars to begin stealing some upsets.

Projected slam favorites over the next two years (WTA)

• 2020 French Open: Halep 13%, Barty 10%
• 2020 Wimbledon: Williams 13%, Barty 12%
• 2020 US Open: Williams 19%, Osaka 12%
• 2021 Australian Open: Williams 16%, Osaka 13%
• 2021 French Open: Halep 11%, Barty 9%
• 2021 Wimbledon: Williams 14%, Barty 11%
• 2021 US Open: Williams 20%, Osaka 11%

Average projected slam titles over the next two years (WTA)

• Serena Williams 1.3 (best odds: 2020 Aussie 30%)
• Barty 0.8 (best odds: 2021 Aussie 13%)
• Osaka 0.7 (best odds: 2021 Aussie 13%)
• Halep 0.7 (best odds: 2020 French Open 13%)
• Pliskova 0.5 (best odds: 2020 Aussie 10%)
• Sabalenka 0.4 (best odds: 2021 US Open 9%)
• Svitolina 0.3 (best odds: 2021 French Open 5%)
• Kiki Bertens 0.3 (best odds: 2020 French Open 9%)
• Bianca Andreescu 0.3 (best odds: 2021 Aussie 6%)
• Kvitova 0.3 (best odds: 2020 Wimbledon 5%)

The typical WTA slam is a Thunderdome of players with similar talent levels, attempting to win seven consecutive best-of-three matches. That Serena has won 23 of these is mind-blowing, and these numbers back up just how difficult projecting a tournament winner can be. The numbers definitely think Williams has at least one more title in her, however. Of course, they were probably even more confident about that a year ago.

Here's one last stats-based observation:

Coco Gauff: unicorn?

These numbers are based on long-term trends, aging curves, etc., and there really aren't many precedents for what Gauff, the 15-year-old who played limited events but still made the fourth round at Wimbledon and the third round at the US Open, accomplished last year.

Gauff has quickly climbed into the top 70 of the WTA rankings and is 79th in Davy's rankings. But the odds are good that she'll spend the next couple of years learning the types of lessons that all young, high-upside players have to learn. If she were to continue one-upping herself in slam performances, perhaps reaching a quarterfinal or semifinal, then going even further, it would be pretty definitive proof of her once-in-a-generation potential. The stats don't see that happening, but they're not really designed to see the unicorns.