Statistical Modeling of ATP matches

-

http://thesportsquotient.com/tennis/2016/9/14/statistical-modeling-of-atp-singles-matches-part-i

When it comes to statistically representing sports, tennis is remarkably well-suited. Just consider, for a moment, the complexities in other sports that render them statistically unwieldy: football has over twenty people on the field at a given time and is played with a prolate spheroid (which means bounces on loose balls are close to random); basketball has ten players on the court, all of whom affect the game simultaneously, even when playing off the ball; baseball can be modeled pretty well up to the point of contact, but becomes complicated once the ball is in play. Tennis, on the other hand, remains a mathematically and statistically straightforward game, whether played by Federer or a toddler.

In this first post of an ongoing series on statistical modeling of ATP singles matches, I will lay out the mathematical understanding of tennis that makes it easy to model and introduce a basic version of a tennis model. For anyone interested in a more granular discussion, a good place to start is O’Malley, “Probability Formulas and Statistical Analysis in Tennis” (2008).

First, a brief and simplified overview of the current statistical scholarship on modeling tennis matches: Current models describe a tennis match with a hierarchical Markov model, since the game has a hierarchical scoring system. The picture below illustrates this point.




We can create a statistical model rather easily by assuming that points within a match are independent and identically distributed (IID). IID means that one point does not affect another and that the probability distribution for each point is the same. Intelligent people can certainly question this assumption, but we’ll leave it for now, since it is critical to the statistical model that follows. Given the probabilities of a player winning a point on serve and on return, it is possible to derive a Markov chain for any match. Much work has been published on inferring those probabilities from past data. The models presented in the literature have been reasonably successful, correctly predicting the match winner 68% to 70% of the time.
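
To make the IID assumption concrete, here is a minimal Python sketch (Python rather than Stata, purely for convenience) that plays out a single game point by point. The function names, the seed, and the example value of 0.62 are illustrative assumptions and do not come from any published model.

```python
import random


def simulate_game(p, rng):
    """Play one game; the player wins every point with the same probability p."""
    won, lost = 0, 0
    while True:
        # IID: each point is an independent draw with an unchanging probability p.
        if rng.random() < p:
            won += 1
        else:
            lost += 1
        # A game needs at least four points and a two-point margin.
        if won >= 4 and won - lost >= 2:
            return True
        if lost >= 4 and lost - won >= 2:
            return False


def estimate_game_win_prob(p, n_games=20000, seed=0):
    """Monte Carlo estimate of the chance of winning a game."""
    rng = random.Random(seed)
    wins = sum(simulate_game(p, rng) for _ in range(n_games))
    return wins / n_games


if __name__ == "__main__":
    print(estimate_game_win_prob(0.62))
```

If points were not IID (say, a player tightening up on break points), the single `p` inside the loop would have to depend on the score, and the simple formulas that follow would no longer apply.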

Using the IID assumption, we can build out our model so long as we know the probability that a given player wins a point on serve and on return. How do we get these probabilities (which we’ll call p for serve and q for return)? From historical data. Our estimates of p and q are the proportions of points the player has won on serve and on return over some past period (whatever window the modeler sees fit).
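
As a rough illustration of that estimation step, the Python sketch below turns raw point counts into p and q. The function name and the example totals are hypothetical, chosen only to show the arithmetic; they are not taken from any real player's record.

```python
def point_win_rates(serve_won, serve_played, return_won, return_played):
    """Estimate p (serve) and q (return) as simple proportions of points won."""
    if serve_played <= 0 or return_played <= 0:
        raise ValueError("need at least one point played on serve and on return")
    p = serve_won / serve_played
    q = return_won / return_played
    return p, q


# Made-up totals over some sample window (not real player data).
p, q = point_win_rates(serve_won=650, serve_played=1000,
                       return_won=400, return_played=1000)
print(p, q)  # 0.65 0.4
```

The choice of window matters: a longer one gives more points and a steadier estimate, while a shorter one tracks current form more closely.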

Once we have p and q, we’ll need formulas to flesh out the probabilities that a player wins a series of points (i.e. a game), a series of games (i.e. a set), and a series of sets (i.e. a match). Notice that all we are doing is carrying the point-level probabilities up a chain, with each level yielding the probability for the next (from point to game to set to match). This means we’ll have to consider the various combinations for how a player could win at each level. For example, a player can win a game by winning four points without losing any, winning four points and losing one, winning four points and losing two, or winning from deuce. The equations that capture this are below (O’Malley, 2008).
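
The equation images did not carry over here, so as a stand-in, here is a Python sketch of the game-level calculation, built directly from the four ways to win a game listed above. It is the standard result under the IID assumption rather than a transcription of O’Malley’s paper, and the function name `game_win_prob` is an assumption.

```python
def game_win_prob(p):
    """Probability of winning a game when each point is won with probability p (IID)."""
    lose = 1.0 - p
    win_to_love = p ** 4                   # four points, none lost
    win_to_15 = 4 * p ** 4 * lose          # loses one of the first four points
    win_to_30 = 10 * p ** 4 * lose ** 2    # loses two of the first five points
    reach_deuce = 20 * p ** 3 * lose ** 3  # three points each
    # From deuce: win two straight points, or split the next two and start over.
    win_from_deuce = p ** 2 / (1.0 - 2.0 * p * lose)
    return win_to_love + win_to_15 + win_to_30 + reach_deuce * win_from_deuce


print(round(game_win_prob(0.6), 4))  # 0.7357
```

Note how a modest edge at the point level (60%) grows into a much larger edge at the game level (about 74%); the same amplification repeats at each step up the chain.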



After carrying this logic through the various twists and turns of a tennis match, you are left with two rather simple formulas: one for the probability that a player wins a 3-set match and one for a 5-set match, given p and q. S(p,q) denotes the probability of a player winning a set.
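
The formula images are missing here, but the formulas can be read back out of the Stata code further down. The version below assumes `serviceptsw` and `returnptsw` hold p and q; the labels M_3 and M_5 are names of convenience, not O’Malley’s notation.

```latex
\begin{align*}
S(p,q) &= \frac{p\,q}{1 - \bigl[p(1-q) + q(1-p)\bigr]} \\
M_3 &= S^2\,\bigl[1 + 2(1 - S)\bigr] \\
M_5 &= S^3\,\bigl[1 + 3(1 - S) + 6(1 - S)^2\bigr]
\end{align*}
```

In M_3, the bracket counts winning in straight sets plus the two orders in which one set can be dropped. In M_5, it counts winning in three, four, or five sets.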





Now that the mathematics has been established, we have to figure out how to operationalize these formulas. I’ll use historical data from tennisinsight.com and the statistical software Stata to construct the model from the formulas above. I’ve included my Stata code below for the O’Malley formulas.

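* Set-win probability S(p,q); serviceptsw and returnptsw are assumed to be proportions between 0 and 1
gen OmalleySPQ = (serviceptsw * returnptsw) / (1 - (serviceptsw * (1 - returnptsw) + returnptsw * (1 - serviceptsw)))

* 3-set match: win in two sets, or drop one of the first two and win the third
gen Omalley3set = OmalleySPQ^2 * (1 + 2 * (1 - OmalleySPQ))

* 5-set match: win in three, four, or five sets
gen Omalley5set = OmalleySPQ^3 * (1 + 3 * (1 - OmalleySPQ) + 6 * (1 - OmalleySPQ)^2)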
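
For readers without Stata, a rough Python equivalent of the three `gen` lines is sketched below. It assumes `serviceptsw` and `returnptsw` are proportions between 0 and 1 (the p and q from earlier). The function names are illustrative, and the example inputs are made up rather than taken from the data set.

```python
def set_win_prob(p, q):
    """Mirror of OmalleySPQ: p = share of serve points won, q = share of return points won."""
    denom = 1.0 - (p * (1.0 - q) + q * (1.0 - p))
    if denom == 0.0:
        raise ValueError("undefined when (p, q) is (1, 0) or (0, 1)")
    return (p * q) / denom


def match_win_prob(s, best_of=3):
    """Mirror of Omalley3set / Omalley5set for a set-win probability s."""
    lose = 1.0 - s
    if best_of == 3:
        return s ** 2 * (1 + 2 * lose)
    if best_of == 5:
        return s ** 3 * (1 + 3 * lose + 6 * lose ** 2)
    raise ValueError("best_of must be 3 or 5")


s = set_win_prob(0.5, 0.5)
print(s, match_win_prob(s, 3), match_win_prob(s, 5))  # 0.5 0.5 0.5
```

A player who wins exactly half the points on serve and on return comes out at 50% at every level, which is a useful first check that the formulas are wired up correctly.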

Having run this model on the previous 12 months of tennisinsight data, current as of 9/13/16, I got the following ranking of top players. Notice the discrepancies between the O’Malley probabilities and the ATP rankings.



Player           Probability of Winning 3-Set Match   Probability of Winning 5-Set Match   ATP Rank
Novak Djokovic   .7066                                .7515                                1
Roger Federer    .6492                                .6840                                7
Andy Murray      .6260                                .6561                                2
Milos Raonic     .6085                                .6346                                6
Rafael Nadal     .6081                                .6342                                4
Gael Monfils     .5957                                .6191                                8
Kei Nishikori    .5891                                .6109                                5
Stan Wawrinka    .5880                                .6094                                3
Marin Cilic      .5850                                .6053                                11
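
One quick sanity check on the table, sketched in Python: because both probability columns come from the same S, the 3-set figure can be inverted numerically to recover S and then pushed through the 5-set formula. The bisection helper and all function names are assumptions for illustration; the numbers are copied from the rows above.

```python
def match3(s):
    return s ** 2 * (1 + 2 * (1 - s))


def match5(s):
    return s ** 3 * (1 + 3 * (1 - s) + 6 * (1 - s) ** 2)


def set_prob_from_match3(m3, tol=1e-12):
    """Invert match3 by bisection; match3 is increasing on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if match3(mid) < m3:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# (3-set probability, 5-set probability) as printed in the table.
rows = {
    "Novak Djokovic": (0.7066, 0.7515),
    "Roger Federer": (0.6492, 0.6840),
    "Andy Murray": (0.6260, 0.6561),
    "Milos Raonic": (0.6085, 0.6346),
    "Rafael Nadal": (0.6081, 0.6342),
    "Gael Monfils": (0.5957, 0.6191),
    "Kei Nishikori": (0.5891, 0.6109),
    "Stan Wawrinka": (0.5880, 0.6094),
    "Marin Cilic": (0.5850, 0.6053),
}

for player, (m3, m5) in rows.items():
    s = set_prob_from_match3(m3)
    print(f"{player}: S ~ {s:.4f}, implied 5-set {match5(s):.4f}, table {m5:.4f}")
```

The implied 5-set values should land within about 0.001 of the printed ones, which suggests the two columns are internally consistent with the Stata formulas. It says nothing about whether S itself is a good estimate of a player's chance of winning a set.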

O’Malley models do not incorporate the skill of the specific opponent a player faces in a match. These models provide a probability of victory based only on the individual player’s ability to win on serve and return against previous opponents included in the historical data sample. Undoubtedly, this is a major shortcoming of the model, which may help explain the discrepancies with the ATP rankings. In the next installment of this series, we’ll explore a model that implements a head-to-head comparison and run through some example match-ups.
 

Red Rick

Bionic Poster


Seriously though, I really don't like a game I play and love being reduced to just a bunch of numbers.

Also, nice quiz on that site

"Who won the US Open on the women's side in 2016"

Are you ****ing serious?
 

Meles

Bionic Poster
Would love to see surface ratings. ELO drives me crazy with this. At least filter it down to hard courts.

So, Monfils has won 77% of his matches this year. Djokovic and Murray are much higher. Even Stan checks in at 76% in 2016. Shouldn't their probabilities be much higher? That seems to imply something needs fixing.:rolleyes:
 

Shaolin

G.O.A.T.


Seriously though, I really don't like a game I play and love being reduced to just a bunch of numbers.

Also, nice quiz on that site

"Who won the US Open on the women's side in 2016"

Are you ****ing serious?

Yeah, sad seeing a beautiful game reduced to a bunch of numbers. It's like comparing someone's cremated remains to their living self.
 

esgee48

G.O.A.T.
Markovian modeling requires some form of stability in the underlying data. The probabilities change over time and from match to match as players have their ups and downs, get hot, get injured, etc. Your data especially has to be relatively stable over time, which it is not, given the fall-off as a player ages or gets injured. You used the last 12 months of data (2016), which is not very useful since the results don't agree with the rankings. That means some other variable is missing from your analysis. It is nice work and kudos for doing it. 2 cents.
 

Meles

Bionic Poster
Looks like plagiarism at its best.:confused:
 

mightyrick

Legend
We can create a statistical model rather easily by assuming that points within a match are independent and identically distributed (IID). IID means that one point does not affect another and that the probability distribution for each point is the same.
The entirety of that massively huge post is undermined by these two sentences.
 

esgee48

G.O.A.T.
Looks like plagiarism at its best.:confused:
Thank You! I had to go back almost 2 decades to days of yore to think about Markov models. Back then, we knew we had to fit the predictions to actual experience to weight the Markov predictions from each period so that the model could account for variables not factored into the data crunching. The curve fitting was needed because really old data (current and old n/a products) should have no bearing on what is currently happening in a portfolio of newer and current products. Minimize the error term on a least-squares basis!
 

Meles

Bionic Poster
I was referring to the OP, but yeah the curve from the article does not fit reality. So is Dedan's some kind of super stats troll or do you claim that title?:rolleyes:
 
@Red Rick, @mightyrick, @Meles, et al

Thanks for the feedback gents.

The OP was actually a blog post I came across by a local young man (Westchester County, NY) whom I happen to know very well, and upon reading it I thought to myself: "what the hell...let's throw this good geek's analysis out there for review and comment..." As for the guy himself, he's an absolute gem as a person and, dare I say, a soon-to-be graduate of an upper-tier Ivy ('P') with a pair of serious dough-re-mi offers (NYC/SF) waiting for him. That he's a Mets fan is about the only thing I can hold against him... : )

That said, I was curious to read the replies his "numbers game" piece would engender. Would they be analytical in nature? Peevishly pedantic in tone (i.e. replete with the adorable gifs and rolling-eye emojis, lol)? Well, after reading the (varying) replies.... res ipsa loquitur ("the thing speaks for itself")
:cool::cool:
 

Red Rick

Bionic Poster

I have no doubt he can crunch some numbers, but this


O’Malley models do not incorporate the skill of the specific opponent a player faces in a match. These models provide a probability of victory based only on the individual player’s ability to win on serve and return against previous opponents included in the historical data sample. Undoubtedly, this is a major shortcoming of the model, which may help explain the discrepancies with the ATP rankings. In the next installment of this series, we’ll explore a model that implements a head-to-head comparison and run through some example match-ups.
Makes me question the entire purpose of it. Tennis has at least 2 players, and here you basically take away the opponent. I understand you can use a Markov model and write it down mathematically, but you can't turn tennis into something that's not tennis (a one man game).
 

esgee48

G.O.A.T.
I was referring to the OP, but yeah the curve from the article does not fit reality. So is Dedan's some kind of super stats troll or do you claim that title?:rolleyes:
Nope. Have degrees in Chemical Engineering (emphasis on catalytics) and Finance (Quant emphasis now called Financial Engineering) and most of the stuff was in Investment analysis/Investment banking type work, model and scorecard developments, which were proprietary in nature. The math was easy (data was a mess), but trying to explain why a model would not work in given situations to a non quant type/boss was a nightmare. THINK Dilbert! :p
 

Meles

Bionic Poster
Very cool. Well, this model doesn't look like it's working too well. Envious of your degrees, but I'm not the engineering type. Chemistry and Mathematical Economics. Statistics can be very powerful stuff, but it seems underutilized despite the technology around today. Seems like Dilbert bosses hold sway. I'm sure Financial Engineering is really something. Any way to get an edge in the world of finance.
 

donquijote

G.O.A.T.
So who is going to win Shanghai?

I am sure the bookmakers (or the main one) have some huge computer code to guess the winner and the score but it seems to be the best kept secret in the world.
 