2023 Match Result Prediction Competition

schmke

Legend
OK, I think we need to be clear. Schmke and NTRP are not predicting 72% of the USTA matches. They are correctly predicting 72% of the fraction of matches that they even attempt to predict.

I think some amount of build-up of information is required. And it seems WTN has no build-up at all - it is making predictions for people who have never played a rated match in their system. They just assign them a number based on who knows what. But USTA ignores a ton of matches and only even tries to make predictions on a select number of matches. If you selected like they do, you should be able to hit 80%; maybe 85% is a stretch.
Fair point about what matches each algorithm will predict.

But saying WTN has no information is patently false; they themselves claimed they went back 5 years (now closer to 6 since they've been around a bit). So of course there is build-up, and they have published ratings based on that history. Additionally, WTN looks across more matches than NTRP does (and thus my ratings), as they rate juniors and collegiate players based on various tournament matches played outside of league play. Might they publish some number prematurely? Or might it be valuable to incorporate how big the game zone is for a player in the predictions? Sure, but to say they have no build-up and are just picking a number out of the air misrepresents their rating IMHO. They arguably should have as much data as UTR to form an opinion on a player's rating, and certainly have more match data than I do.

As far as what I'm predicting, yes, some players are brand new and don't have a rating and I don't predict those matches as there is nothing to base a prediction on. But I am predicting matches where players have played enough matches (which isn't that many) to establish a rating, and saying only a "fraction" have been predicted is also probably a mischaracterization, as "fraction" probably carries a connotation of 1/4 or 1/3, which would be a small fraction. Technically, 3/4 or 75% is also a fraction, and that is closer to the share of matches I am predicting, which is certainly not picking and choosing which to predict.

If you look at the number of matches predicted, WTN had the most, which I believe means they had ratings for some of the self-rated players from their junior/high school play, so where I had them as self-rated without a rating and didn't predict their match, WTN had a rating and did. It may be that WTNs from junior/high-school play don't translate well to league play with adults and that contributes to their poor showing. But if a rating professes to be the single rating for the "World", I think it is fair to judge its performance on how it predicts matches when players transition from junior to adult play. If they predict these matches poorly, it may be an indication they have "islands" of players where the relative ratings are not accurate and thus aren't really achieving the goal of an accurate single rating.

Now, doing that is very hard, but UTR also has ratings for juniors coming into league play and they have done far better. And my algorithm has also done better, albeit without predicting some of the matches involving brand-new players.
 

TennisOTM

Professional
WTN does no better at predicting matches involving only players with recent USTA adult league results. Basically, those are the matches where all four systems gave a prediction. The standings order is the same over those ~140 matches: UTR 73%, TLS 70%, TR 68%, WTN 59%.

This tells me that WTN's problems go far beyond the potential issue of using outdated ratings based on old matches or initial guesses. Even when there are plentiful recent results by which older ratings should be washed out, their numbers still don't work well.

Also, I don't think it's true that WTN assigns you a rating before you have any match results in their system. There were plenty of new league players who did not have a WTN rating (or had only singles or only doubles rating) prior to the 2023 season. They've produced no prediction for about 20% of the matches played.
 
I am definitely surprised that WTN is doing so badly.

Making a half-decent rating isn't hard. Elo is out there, Glicko is out there, Microsoft has published their TrueSkill 2 paper; if you do something halfway reasonable based on one of those you should get a decent rating system. This isn't rocket science. Even if you do something totally new, you can at least compare your accuracy to those and see if you're close...
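For anyone curious, here is a minimal sketch of the kind of Elo-style baseline being described (K=32 and the 400-point scale are the conventional chess values; this is purely illustrative and not any vendor's actual algorithm):

```python
# Minimal Elo-style rating sketch (illustrative only; chess-style constants).
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return new ratings for A and B after one match."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta

# An upset moves ratings more than an expected result does:
print(update(1600, 1500, a_won=False))  # favorite lost -> big swing
print(update(1600, 1500, a_won=True))   # favorite won  -> small swing
```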
 

Chalkdust

Professional
I am definitely surprised that WTN is doing so badly.

Making a half-decent rating isn't hard. Elo is out there, Glicko is out there, Microsoft has published their TrueSkill 2 paper; if you do something halfway reasonable based on one of those you should get a decent rating system. This isn't rocket science. Even if you do something totally new, you can at least compare your accuracy to those and see if you're close...
Yes but...

Let's go with chess Elo, since people generally think this is a good ratings system.
And let's say we are trying to predict the outcomes of matches at a local chess tournament.

Suppose the tournament is 'open' and has attracted a bunch of players from across the skill spectrum, including beginners and all the way to grand masters. Throw out draws, and look at only wins vs losses. What kind of match prediction rate do we think we'd get based on comparing opponent's elo ratings?

Now imagine that tournament entry is restricted to just players with an elo of 1500 - 1700. What kind of match prediction rate do we think we'd get now?

Hint: Much lower, because when players are close to one another in rating, anything can and does happen.

This is why trying to determine the accuracy of a rating system purely by looking at its predictive ability is somewhat nonsense, unless you apply a bunch more statistical analysis to it.
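To put rough numbers on that, using the standard Elo expected-score formula (a back-of-the-envelope illustration, not a claim about any tennis system):

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

For a 600-point gap that gives $E_A \approx 0.97$, for a 100-point gap $E_A \approx 0.64$, and for a 50-point gap only $E_A \approx 0.57$. So in a 1500-1700 field, even a perfectly calibrated Elo would only "call" the winner correctly somewhere in the mid-50s to mid-60s percent of the time.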
 
Yes but...

Let's go with chess Elo, since people generally think this is a good ratings system.
And let's say we are trying to predict the outcomes of matches at a local chess tournament.

Suppose the tournament is 'open' and has attracted a bunch of players from across the skill spectrum, including beginners and all the way to grand masters. Throw out draws, and look at only wins vs losses. What kind of match prediction rate do we think we'd get based on comparing opponent's elo ratings?

Now imagine that tournament entry is restricted to just players with an elo of 1500 - 1700. What kind of match prediction rate do we think we'd get now?

Hint: Much lower, because when players are close to one another in rating, anything can and does happen.

This is why trying to determine the accuracy of a rating system purely by looking at its predictive ability is somewhat nonsense, unless you apply a bunch more statistical analysis to it.
Well, yeah, if you're doing the math, of course you should compare by something like log loss rather than classification accuracy, but again, not rocket science.
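As a small illustration of why log loss (or a similar proper scoring rule) can separate systems that raw classification accuracy can't, here is a sketch with made-up predictions; all the numbers below are hypothetical:

```python
import math

def log_loss(probs, outcomes):
    """Mean negative log-likelihood of the predicted win probabilities."""
    return -sum(math.log(p if won else 1.0 - p)
                for p, won in zip(probs, outcomes)) / len(probs)

def accuracy(probs, outcomes):
    """Fraction of matches where the predicted favorite (p > 0.5) won."""
    return sum((p > 0.5) == won for p, won in zip(probs, outcomes)) / len(probs)

# Two hypothetical systems predicting the same four matches:
outcomes = [True, True, False, True]          # did the "first" player win?
system_a = [0.70, 0.65, 0.40, 0.80]           # well calibrated
system_b = [0.99, 0.51, 0.49, 0.51]           # same picks, worse calibration
for name, probs in (("A", system_a), ("B", system_b)):
    print(name, accuracy(probs, outcomes), round(log_loss(probs, outcomes), 3))
# Both score 1.0 on accuracy, but A has the (better) lower log loss.
```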

...or have a consistent prediction cohort - if you have a particular cohort of matches, you CAN compare the classification accuracy on that particular set of matches across rating systems.
 

TennisOTM

Professional
Yes but...

Let's go with chess Elo, since people generally think this is a good ratings system.
And let's say we are trying to predict the outcomes of matches at a local chess tournament.

Suppose the tournament is 'open' and has attracted a bunch of players from across the skill spectrum, including beginners and all the way to grand masters. Throw out draws, and look at only wins vs losses. What kind of match prediction rate do we think we'd get based on comparing opponent's elo ratings?

Now imagine that tournament entry is restricted to just players with an elo of 1500 - 1700. What kind of match prediction rate do we think we'd get now?

Hint: Much lower, because when players are close to one another in rating, anything can and does happen.

This is why trying to determine the accuracy of a rating system purely by looking at its predictive ability is somewhat nonsense, unless you apply a bunch more statistical analysis to it.
I agree that a single correct prediction rate without context doesn't tell you much. But if you compare multiple rating systems by how well they do predicting the same matches, and one does much better than the other, does that not tell you something?
 

Chalkdust

Professional
...or have a consistent prediction cohort - if you have a particular cohort of matches, you CAN compare the classification accuracy on that particular set of matches across rating systems.

I agree that a single correct prediction rate without context doesn't tell you much. But if you compare multiple rating systems by how well they do predicting the same matches, and one does much better than the other, does that not tell you something?

Well, imagine you ended up with a 100% prediction success rate. Perfect rating system? Well not necessarily, because of course we always expect some paper upsets due to play on the day. So to what extent is this high hit rate due to the quality of the system, and to what extent due to luck?

Over a large enough number of large enough sample sets, I would agree with you both.

But on a single smallish sample set such as being discussed in this thread, I don't think you can infer with any degree of confidence that system A with a 75% prediction rate is actually any better than system B with a 70% rate, because of the normal result variances expected within each individual match.
 

schmke

Legend
Well, imagine you ended up with a 100% prediction success rate. Perfect rating system? Well not necessarily, because of course we always expect some paper upsets due to play on the day. So to what extent is this high hit rate due to the quality of the system, and to what extent due to luck?

Over a large enough number of large enough sample sets, I would agree with you both.

But on a single smallish sample set such as being discussed in this thread, I don't think you can infer with any degree of confidence that system A with a 75% prediction rate is actually any better than system B with a 70% rate, because of the normal result variances expected within each individual match.
Fair point about sample sizes. But what would you consider to not be "smallish" or be a sufficient sample size then? I believe what was looked at here is across around 250 matches where the systems all rated the matches.
 

Chalkdust

Professional
Fair point about sample sizes. But what would you consider to not be "smallish" or be a sufficient sample size then? I believe what was looked at here is across around 250 matches where the systems all rated the matches.
Good question. Meaningful sample size is also going to be dependent on what we expect the match ratings delta distribution to look like. Meaning, if our sample is skewed towards matches where the ratings difference is large, then we can get away with a smaller sample set, since the variance in "how the players are playing that day" is less likely to cause an upset. Then again the nuances between competing rating systems would also be more obscured in this case (you'd expect them to identify the right favorite when there are 'big' differences in actual ability).

If we have a lot of matches where the players are close in ratings, then we should expect more upsets, and so need a larger sample set to compensate. If we have many sample sets, then we can expect to have sets representative of various ratings delta distributions, which would also increase the confidence. Unfortunately I am too far removed from my stats days to take a crack at it - I can just say that this is a much harder problem than it might initially seem.
 
Yeah, that's why I'm singling out WTN. I think with the sample size we have, I wouldn't make a distinction between schmke's 73% and UTR's 72% and TLS at 70%, but I'm pretty sure that there's enough data to know they're all doing clearly better than WTN at 60%.

I'm not going to pull out my calculator and do the calculations for what the margin of error on that percentage is given the sample size, but I suspect that a 10+% difference over 150+ matches is going to be significant.
 

TennisOTM

Professional
Using the "Wald interval" from this page: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval, I get the following for 95% confidence intervals for the success rates from the last update:

UTR: 72% (66% to 77%)
TLS: 70% (63% to 78%)
TR: 64% (58% to 70%)
WTN: 60% (54% to 65%)

So we'd seem to have pretty high confidence at this sample size that UTR > WTN is not just a matter of luck, as those two confidence intervals don't overlap. For the other pairwise comparisons, we probably do need more data to be confident in the comparison.
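For anyone who wants to reproduce the arithmetic, here is a sketch of that Wald calculation; the 180-70 record below is just a hypothetical ~250-match example, not one of the actual competitors' records:

```python
import math

def wald_interval(wins: int, losses: int, z: float = 1.96):
    """95% Wald confidence interval for a binomial success rate."""
    n = wins + losses
    p = wins / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p, p - half_width, p + half_width

# Hypothetical record: 180-70 over 250 matches.
p, lo, hi = wald_interval(180, 70)
print(f"{p:.0%} ({lo:.0%} to {hi:.0%})")  # -> 72% (66% to 78%)
```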
 

Moon Shooter

Hall of Fame
My first rated tennis match - I never played tennis until I was 48 years old - WTN started me at something like a 29.9. I was a self-rate 3.0. A 3.5 self-rate male had like a 31 for his very first match. These ratings had weight and affected our partners and opponents. I am not sure how they determined to give me the 29.9. I'm going by memory, so don't get too wrapped up in the exact numbers. That is why I'm saying they seem to pull numbers out of the sky. Why give a self-rate 3.5 a 31 and a self-rate 3.0 a 29? 3.5 players in my area are typically in the 23-27 range.
 

Moon Shooter

Hall of Fame
So my point above is that although it is possible that there is a problem with the algorithm, it is also possible that WTN ratings are being significantly thrown off by this bizarre assignment of an initial rating number.

I also think the lack of any coed matches that could equalize the men's and women's pools of numbers is a problem. Once that gets worked out by allowing those matches, the ratings will be much improved.

Schmke said WTN could use all the matches UTR does. Yes, any rating system could if they chose, but I think they reached an agreement with USTA, and USTA limits what goes into WTN for American players. In any case I have not seen any non-USTA matches from any other adult leagues in WTN, but I believe arrangements can be made - I am not sure of the requirements USTA puts on it.
 

Moon Shooter

Hall of Fame
Fair point about what matches each algorithm will predict.

….

As far as what I'm predicting, yes, some players are brand new and don't have a rating and I don't predict those matches as there is nothing to base a prediction on.
…..
If you look at the number of matches predicted, WTN had the most, which I believe means they had ratings for some of the self-rated players from their junior/high school play, …

Nope, it seems they just pull a number out of the sky. As a self-rate 3.0 they gave me a 29; they gave a self-rate 3.5 male something like a 31. In our area 3.5s are in the 22-27 range. That is one of the reasons I think it is doing significantly worse.
 

TennisOTM

Professional
My first rated tennis match - I never played tennis until I was 48 years old - WTN started me at something like a 29.9. I was a self-rate 3.0. A 3.5 self-rate male had like a 31 for his very first match. These ratings had weight and affected our partners and opponents. I am not sure how they determined to give me the 29.9. I'm going by memory, so don't get too wrapped up in the exact numbers. That is why I'm saying they seem to pull numbers out of the sky. Why give a self-rate 3.5 a 31 and a self-rate 3.0 a 29? 3.5 players in my area are typically in the 23-27 range.
Ah I see what you're saying now. When there is a brand new player signed up for a USTA league, WTN will find them and make a profile page, but there is not any rating number on it if they have no prior data on the person. However, when their first match result shows up, you do see a number next to their name in the match score (which is supposedly their pre-match rating). To be clear, I am not using any of those pre-first-match-ever ratings in the competition. But, it does seem like their algorithm is using these initial guesses to calculate ratings going forward.

For example, I see one guy whose first-ever WTN match results happened this spring. He played two singles matches, winning both in straight sets against guys who did have prior history and were rated 17.4 and 19.4. Yet the new player had pre-match ratings of 24.9 and 23.3, and now is at 22.3 after the second win. The only data they have on this guy is that he beat those two players, and yet he is now rated worse than both of them, apparently because of an initial guess based on nothing. How does this make any sense??
 

schmke

Legend
Ah I see what you're saying now. When there is a brand new player signed up for a USTA league, WTN will find them and make a profile page, but there is not any rating number on it if they have no prior data on the person. However, when their first match result shows up, you do see a number next to their name in the match score (which is supposedly their pre-match rating). To be clear, I am not using any of those pre-first-match-ever ratings in the competition. But, it does seem like their algorithm is using these initial guesses to calculate ratings going forward.

For example, I see one guy whose first-ever WTN match results happened this spring. He played two singles matches, winning both in straight sets against guys who did have prior history and were rated 17.4 and 19.4. Yet the new player had pre-match ratings of 24.9 and 23.3, and now is at 22.3 after the second win. The only data they have on this guy is that he beat those two players, and yet he is now rated worse than both of them, apparently because of an initial guess based on nothing. How does this make any sense??
Agree it doesn't make sense, but does explain some of the wildly inaccurate ratings WTN has.
 

Moon Shooter

Hall of Fame
Yes, that is exactly what I mean. It is not a defense of WTN; it is instead an indictment of the overall system. But since these initial guesses have nothing to do with the algorithm, I would not blame this on the algorithm. It is just some inexplicably dumb error. Why do dumb stuff like that which will just mess up the rating system?

That said, I still think more data is better, and WTN is correct in their philosophy of adding more data such as mixed doubles and matches from more than 12 months ago. They need to have more matches that equalize the men's and women's pools. Once they have that and fix this unforced error of assigning random ratings at the start, I think they will have a very good rating system.
 

schmke

Legend
Yes, that is exactly what I mean. It is not a defense of WTN; it is instead an indictment of the overall system. But since these initial guesses have nothing to do with the algorithm, I would not blame this on the algorithm. It is just some inexplicably dumb error. Why do dumb stuff like that which will just mess up the rating system?

That said, I still think more data is better, and WTN is correct in their philosophy of adding more data such as mixed doubles and matches from more than 12 months ago. They need to have more matches that equalize the men's and women's pools. Once they have that and fix this unforced error of assigning random ratings at the start, I think they will have a very good rating system.
But a good algorithm would correct an inaccurate initial guess, both by not giving that initial guess much weight but also by rapidly converging on a rating that mirrors the results. WTN doesn't seem to do that very well.
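To illustrate what "not giving the initial guess much weight" can look like in practice, here is a toy uncertainty-weighted update on a WTN-like 1-40 scale (lower = better). Everything here - the logistic scale of 6, the 0.7 shrink factor, the starting uncertainty - is made up for illustration; it is not WTN's (or my) actual algorithm:

```python
# Toy illustration: a new player gets a large uncertainty, so early results
# quickly override a bad initial guess. All constants are arbitrary.
def update(rating, uncertainty, opp_rating, won, scale=6.0, min_unc=0.5):
    """One uncertainty-weighted update on a WTN-like 1-40 scale (lower = better)."""
    p_win = 1.0 / (1.0 + 10 ** ((rating - opp_rating) / scale))
    step = uncertainty * ((1.0 if won else 0.0) - p_win)
    return rating - step, max(min_unc, uncertainty * 0.7)  # winning lowers the number

# New player guessed at 24.9 beats opponents rated 17.4 and 19.4:
rating, unc = 24.9, 6.0
for opp in (17.4, 19.4):
    rating, unc = update(rating, unc, opp, won=True)
    print(round(rating, 1), round(unc, 1))  # converges to ~17 after two wins
```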
 

Moon Shooter

Hall of Fame
But a good algorithm would correct an inaccurate initial guess, both by not giving that initial guess much weight but also by rapidly converging on a rating that mirrors the results. WTN doesn't seem to do that very well.
Yes, good point. Too much weight is given to old matches, and to these early matches played right after a rating was assigned based on no data.
 

TennisOTM

Professional
Competition update 14:

This update includes results from 18+ 4.0 league district playoffs and from 55+ 8.0 league. Here are the updated standings and "W-L" records for each rating system so far in 2023 (singles and doubles combined):

UTR: 183-71 (72%)
TLS: 113-53 (68%)
TR: 170-97 (64%)
WTN: 203-136 (60%)

UTR is on a good run heading into the summer, going 19-6 since the last update, widening its lead over TLS, which went 9-9 over the same span. UTR is now in first place in both doubles at 141-58 (71%) and singles at 42-13 (76%).

I believe we can conclude with quite strong statistical confidence that UTR is better than WTN at predicting 4.0 men's matches, i.e. the 72% vs. 60% difference is very unlikely to have arisen by luck alone over this number of matches. We also have pretty strong confidence that UTR > TR, and TLS > WTN.

Next up we have the start of 40+ 4.0 league, as well as the start of summer tournament season - I'll include sanctioned USTA 4.0 NTRP singles and doubles tournaments among local players, which will count for year-end ratings here.
 

TennisOTM

Professional
Competition update 15:

This update includes men's results from 40+ 4.0 league, 55+ 8.0 league, and a NTRP 4.0 singles & doubles tournament (Utah State Open). Here are the updated standings and "W-L" records for each rating system so far in 2023 (singles and doubles combined):

UTR: 223-77 (74%)
TLS: 124-61 (67%)
TR: 187-111 (63%)
WTN: 227-157 (59%)

UTR improved to 74% since last update to expand its lead - every other contender's winning percentage dropped. UTR is leading in both doubles at 166-61 (73%) and singles at 57-16 (78%).

TLS, TR, and WTN are all currently doing worse at singles than they are at doubles. WTN is just 54-51 (51%) at singles. Ouch.
 

JT_2eighty

Hall of Fame
I was curious, for 4.0 leagues and UTR, what is the standard range you are seeing? I am not well versed in UTR, but see guys anywhere from 5.xx to 7.xx? I realize there is probably a grey area, but when do the 7.xx's typically hit the 4.5 NTRP level? 7.50? Or higher?
 

Idaho MEP

Rookie
WTN is just 54-51 (51%) at singles. Ouch.

WTN seems like it's less about rating ability level and more like a ranking based on accumulating points. The guy who has only played two matches in their system -- but it's two solid victories -- gets ranked below the two guys he just beat because they have each won hundreds of matches over the last five years.
 
I was curious, for 4.0 leagues and UTR, what is the standard range you are seeing? I am not well versed in UTR, but see guys anywhere from 5.xx to 7.xx? I realize there is probably a grey area, but when do the 7.xx's typically hit the 4.5 NTRP level? 7.50? Or higher?

Nah, 7.anything is already fine at 4.5
 

schmke

Legend
WTN seems like it's less about rating ability level and more like a ranking based on accumulating points. The guy who has only played two matches in their system -- but it's two solid victories -- gets ranked below the two guys he just beat because they have each won hundreds of matches over the last five years.
I don't think that is accurate. WTN is a rating based on performance in match play, not a ranking based on accumulation of points. That said, they clearly don't rate as well as others from a prediction standpoint.
 

TennisOTM

Professional
I was curious, for 4.0 leagues and UTR, what is the standard range you are seeing? I am not well versed in UTR, but see guys anywhere from 5.xx to 7.xx? I realize there is probably a grey area, but when do the 7.xx's typically hit the 4.5 NTRP level? 7.50? Or higher?
From what I see, the vast majority of 4.0 men are 5.xx or 6.xx. I'd say 5.25 - 6.75 is a decent rule of thumb for the 4.0 range, with the low 5's and high 6's being grey areas at either end. I saw a couple of players in the low UTR 7's squeak through with a 4.0C, but seems pretty rare.
 

Idaho MEP

Rookie
I don't think that is accurate. WTN is a rating based on performance in match play, not a ranking based on accumulation of points. That said, they clearly don't rate as well as others from a prediction standpoint.

My point is that I think the reason their rating system is so poor at prediction is because they severely underrate players with smaller data sets. I'm not saying they're literally using an accumulation of points system; I'm saying the algorithm privileges high volume for some reason. It may be quite intentional (i.e., "You've got to earn a high rating..."). I've seen multiple occasions when there are players with just a few matches on their resume rated below people they just beat (badly). And then there's a cascading effect: players that play against high quality players with low volume data sets don't get positive credit where it seems like they should (and where other systems such as UTR give them a lot more credit).
 

JT_2eighty

Hall of Fame
Competition update 15:

This update includes men's results from 40+ 4.0 league, 55+ 8.0 league, and a NTRP 4.0 singles & doubles tournament (Utah State Open). Here are the updated standings and "W-L" records for each rating system so far in 2023 (singles and doubles combined):

UTR: 223-77 (74%)
TLS: 124-61 (67%)
TR: 187-111 (63%)
WTN: 227-157 (59%)

UTR improved to 74% since last update to expand its lead - every other contender's winning percentage dropped. UTR is leading in both doubles at 166-61 (73%) and singles at 57-16 (78%).

TLS, TR, and WTN are all currently doing worse at singles than they are at doubles. WTN is just 54-51 (51%) at singles. Ouch.
I had a question about TR & TLS, as those are mainly geared to approximate USTA's NTRP dynamic ratings, is that correct?

It seems that while UTR is more reliable at predicting outcomes, which do you find (TR vs TLS) more accurately predicts a player who is about to get promoted to the next level?
 

TennisOTM

Professional
I had a question about TR & TLS, as those are mainly geared to approximate USTA's NTRP dynamic ratings, is that correct?

It seems that while UTR is more reliable at predicting outcomes, which do you find (TR vs TLS) more accurately predicts a player who is about to get promoted to the next level?
I checked this one year for players in my area, and TLS was a bit more accurate - they had fewer "misses" for predicting who would get bumped (though still had a fair number of misses). TR is much better at including all matches and providing regular updates, however.
 

Idaho MEP

Rookie
From what I see, the vast majority of 4.0 men are 5.xx or 6.xx. I'd say 5.25 - 6.75 is a decent rule of thumb for the 4.0 range, with the low 5's and high 6's being grey areas at either end. I saw a couple of players in the low UTR 7's squeak through with a 4.0C, but seems pretty rare.
What if someone is, say, ~7.2 in singles, and ~5.2 in doubles?? (Asking for a friend...:oops:)
 

TennisOTM

Professional
What if someone is, say, ~7.2 in singles, and ~5.2 in doubles?? (Asking for a friend...:oops:)
Yeah this is one way that a UTR 7+ player (in singles or doubles) might still be a NTRP 4.0. You could take an average of the two ratings (weighted by how many matches played in each), and that would be a rough equivalent of what NTRP does.
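For example, with hypothetical counts of 10 singles matches and 30 doubles matches in the rating window, the match-count-weighted average would be

$$\frac{7.2 \times 10 + 5.2 \times 30}{10 + 30} = \frac{228}{40} = 5.7,$$

which lands back in the typical 4.0 range discussed above.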
 

TennisOTM

Professional
Competition update 16:

This update includes more 4.0 men's league results from 40+ and 55+. Here are the updated standings and "W-L" records for each rating system so far in 2023 (singles and doubles combined):

UTR: 225-82 (73%)
TLS: 128-66 (66%)
TR: 195-115 (63%)
WTN: 239-159 (60%)

UTR continued its dominance in singles, improving to 60-16 (79%), while also continuing to lead the pack in doubles at 165-66 (71%). At this point I'm just about ready to say that UTR is running away with the victory. TLS was leapfrogging them across many of the earlier updates, but now they've fallen back, and the current 73% vs. 66% difference is statistically significant at the .05 level (less than 5% probability that difference could arise by chance alone).

I've written a lot about how badly WTN is doing, but Tennisrecord is doing quite poorly as well in both singles (62%) and doubles (63%). Given how often captains supposedly use TR to make roster and lineup decisions, might it be worth it for them to consider using UTR instead?
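For those curious about that significance claim, here is a sketch of a pooled two-proportion z-test on the headline W-L records above (the actual check may have been done only on commonly-predicted matches, so treat this as an approximation; a one-sided test of "UTR better than TLS" comes out around p = 0.04):

```python
import math

def one_sided_p(w1: int, n1: int, w2: int, n2: int):
    """One-sided p-value (H1: rate1 > rate2) from a pooled two-proportion z-test."""
    p1, p2 = w1 / n1, w2 / n2
    pooled = (w1 + w2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    return z, 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

# UTR 225-82 vs TLS 128-66 (the records above):
z, p = one_sided_p(225, 225 + 82, 128, 128 + 66)
print(round(z, 2), round(p, 3))  # roughly z = 1.75, p = 0.04
```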
 
I've written a lot about how badly WTN is doing, but Tennisrecord is doing quite poorly as well in both singles (62%) and doubles (63%). Given how often captains supposedly use TR to make roster and lineup decisions, might it be worth it for them to consider using UTR instead?
Most captains I know, and I know a fair share, use UTR mainly to scout lineup decisions - it captures all the foreign matches, the new guy who just moved here from wherever, the tournaments, a larger range of numerical ratings, yada yada - and use TR mostly to freak out about bump-ups.
 

Moon Shooter

Hall of Fame
My point is that I think the reason their rating system is so poor at prediction is because they severely underrate players with smaller data sets. I'm not saying they're literally using an accumulation of points system; I'm saying the algorithm privileges high volume for some reason. It may be quite intentional (i.e., "You've got to earn a high rating..."). I've seen multiple occasions when there are players with just a few matches on their resume rated below people they just beat (badly). And then there's a cascading effect: players that play against high quality players with low volume data sets don't get positive credit where it seems like they should (and where other systems such as UTR give them a lot more credit).

What I have noticed is that WTN simply assigns new 3.5-rated players a 31 WTN. Most 3.5 players in my area are in the 22-27 range. This would not be so bad if WTN did not weight that starting rating so heavily, but it does. So not only does that player start with a rating that is way out of whack, but if you happen to play against him your rating will also suffer. On the other hand, if you play with him your rating will get a big windfall.

Also, people who have been playing for years often end up with ridiculously good ratings. A strong 3.5 player in my area had something like a 4 WTN rating, which is absurd.

Perhaps that partly explains what you are seeing?
 

Moon Shooter

Hall of Fame
Yes but...

Let's go with chess Elo, since people generally think this is a good ratings system.
And let's say we are trying to predict the outcomes of matches at a local chess tournament.

Suppose the tournament is 'open' and has attracted a bunch of players from across the skill spectrum, including beginners and all the way to grand masters. Throw out draws, and look at only wins vs losses. What kind of match prediction rate do we think we'd get based on comparing opponent's elo ratings?

Now imagine that tournament entry is restricted to just players with an elo of 1500 - 1700. What kind of match prediction rate do we think we'd get now?

Hint: Much lower, because when players are close to one another in rating, anything can and does happen.

This is why trying to determine the accuracy of a rating system purely by looking at its predictive ability is somewhat nonsense, unless you apply a bunch more statistical analysis to it.

I think chess and NTRP are pretty close. Consider that USTA basically lumps 80% of all male amateur adult tennis players in the 0-4.0 camp.
That would be the equivalent of rating 0-1700 players.

So USTA basically has 50 rating points per 0.5 NTRP, i.e. roughly 200 rating points between 2.0 and 4.0, while chess has about 1700 rating points for that same group - roughly 8.5x the granularity. So comparing 1500-1700 would be like taking a pool of tennis players who are 3.75-4.00 and seeing how well NTRP predicts the outcomes. Let's just look at the extremes, where a 1700 is playing a 1500 and a 3.75 player is playing a 4.0 player. According to schmke, when the difference is 0.25, NTRP predicts the winner correctly about 78% of the time. I would think the 1700 is going to beat the 1500 about that often.

https://computerratings.blogspot.com/2023/02/how-accurate-are-dynamic-ntrp-ratings.html

I think the only way to test a rating system is to look at its prediction rate.
 

TennisOTM

Professional
Competition update 17:

This update includes results from a sanctioned (Level 4) NTRP 4.0 men's singles & doubles tournament, as well as a few weeks of 40+ men's 4.0 league. Here are updated standings and "W-L" records for each rating system so far in 2023:

Singles and doubles combined:
UTR: 265-101 (72%)
TLS: 154-79 (66%)
TR: 234-130 (64%)
WTN: 280-194 (59%)

No real change from the last update. UTR continues to lead the pack by a fairly significant margin. They are leading in both singles and doubles prediction. Their lead in singles is particularly impressive:

Singles only:
UTR: 76-22 (78%)
TR: 79-42 (65%)
TLS: 49-27 (64%)
WTN: 70-66 (51%)

UTR is perhaps partly benefitting from having a separate singles rating that is not influenced by doubles results, a feature that TR and TLS (and USTA) do not have. On the other hand WTN also has a separate singles rating, which is not helping them at all. It is pretty baffling how poorly they are doing in singles prediction, really no better than a coin flip. Will they be better next year after their recent overhaul? I'm not optimistic...
 

schmke

Legend
For WTN, are you still using ratings from before the season started? Or have you adjusted to using ratings from the revised WTN algorithm?
 

Moon Shooter

Hall of Fame
UTR and NTRP seem to be the only respectable rating systems. These levels are huge, so there will be many mismatches; plus at these levels I think they have many matches, so there is plenty of data - and to still be hitting under 70% is bad.
 

TennisOTM

Professional
For WTN, are you still using ratings from before the season started? Or have you adjusted to using ratings from the revised WTN algorithm?
I'm using the ratings I recorded in mid-January as the predictors for the whole year.

I'm considering, at the end of the year, recording another snapshot of every system's year-end ratings for the same players, and then seeing how well those numbers "predict the past" for the same matches. Could be interesting to see how much improvement there'd be in that result, especially for WTN.
 

Moon Shooter

Hall of Fame
Yes but...

Let's go with chess Elo, since people generally think this is a good ratings system.
And let's say we are trying to predict the outcomes of matches at a local chess tournament.

Suppose the tournament is 'open' and has attracted a bunch of players from across the skill spectrum, including beginners and all the way to grand masters. Throw out draws, and look at only wins vs losses. What kind of match prediction rate do we think we'd get based on comparing opponent's elo ratings?

Now imagine that tournament entry is restricted to just players with an elo of 1500 - 1700. What kind of match prediction rate do we think we'd get now?

Hint: Much lower, because when players are close to one another in rating, anything can and does happen.

This is why trying to determine the accuracy of a rating system purely by looking at its predictive ability is somewhat nonsense, unless you apply a bunch more statistical analysis to it.

The chess rating system is much better. But in chess there is typically much more data. Keep in mind that if you are about 100 points higher than another player you might be predicted to win about 75% of the games (depending on the rating system). If the higher-rated player won 100% of the games after 20 games, that would not be good; it would show the rating system is off. And what would happen is the higher-rated player would win rating points and the lower would lose them if the match went 20-0. The players would see this in their rating. In chess everyone sees how it works and so it has legitimacy - something tennis ratings lack.

As for these tennis ratings, they either ignore a bunch of data (USTA ignores all mixed and many matches that are USTA matches, and UTR ignores all matches older than 12 months) or just make unforced errors, like WTN assigning random numbers to new players and actually giving those numbers weight, which affects their partners' and opponents' ratings.



I am definitely surprised that WTN is doing so badly.

Making a half-decent rating isn't hard. Elo is out there, Glicko is out there, Microsoft has published their TrueSkill 2 paper; if you do something halfway reasonable based on one of those you should get a decent rating system. This isn't rocket science. Even if you do something totally new, you can at least compare your accuracy to those and see if you're close...

Heather from USTA seemed to say some things that led me to suspect they want to downplay the rating - and therefore are not focused on accuracy. And some of the incredibly stupid things we see make it seem like USTA and other rating systems are deliberately trying to sabotage their own systems.
 

TennisOTM

Professional
Competition update 18:

This update includes all remaining results from 40+ men's 4.0 league, including district playoff matches, plus a few matches from men's 65+ 8.0 league. Here are updated standings and "W-L" records for each rating system so far in 2023:

Singles and doubles combined:
UTR: 284-110 (72%)
TLS: 162-86 (65%)
TR: 251-139 (64%)
WTN: 305-202 (60%)

UTR is holding steady at 72% in first place, while TLS continues its gradual decline out of contention. Tennisrecord may have a chance to move into second place by year end. Still one more large sanctioned tournament to come, plus some more ratings-eligible league play over the next couple months.
 

TennisOTM

Professional
Competition update 19:

This update includes new match results from men's 65+ 8.0 league, now completed, plus a small sanctioned tournament and some matches from a local ratings-eligible fall outdoor league. Here are updated standings and "W-L" records for each rating system so far in 2023:

Singles and doubles combined:
UTR: 299-111 (73%)
TLS: 179-87 (66%)
TR: 262-142 (65%)
WTN: 322-204 (61%)

All four competitors ticked up a percentage point since the last update, with a relatively predictable set of recent match results. The only real drama remaining is: who will finish second? Tennisrecord and TLS, the two publicly-available systems that attempt to mimic NTRP, are in a close head-to-head battle coming down to the wire. They are even closer than shown above when limiting to the common matches predicted.

Still one more sanctioned tournament coming up, plus some more ratings-eligible league play. The last matches for the 2023 competition will be played on October 21.
 

travlerajm

Talk Tennis Guru
I just caught up to this thread.

It seems like it should be possible to develop an algorithm that does significantly better than 74%. I believe >80% prediction accuracy for usta league matches ought to be possible.

Maybe @schmke can develop a new algorithm that makes optimizing prediction accuracy as the primary objective?
 

Klitz

Rookie
I just caught up to this thread.

It seems like it should be possible to develop an algorithm that does significantly better than 74%. I believe >80% prediction accuracy for usta league matches ought to be possible.

Maybe @schmke can develop a new algorithm that makes optimizing prediction accuracy as the primary objective?
According to this forum, significant ratings manipulation is pervasive throughout USTA league play. Given the consensus and frequency of first hand accounts of this behavior, it leads me to believe, at some level, it must be true.

This would mean that the data set that any algorithm is using to make predictions is corrupted. Does this not put a ceiling on the maximum accuracy a model could theoretically achieve?

Additionally, I believe that as each year passes, more captains/players are becoming aware of and utilizing these tactics that were "pioneered" ~8 yrs ago...?

I suspect that Mr. Schmke and others are constantly massaging/refining their algorithm to attempt to remove/identify corrupted data in an effort to increase prediction accuracy...?

I theorize that if rating manipulation was in fact on the rise, that a current match prediction model would produce successful match predictions more frequently in past years compared to more recent years. For example, for the exact same model, predictions in 2016 would be more accurate than 2017, 2018,etc...?

Has anyone tried this?
 

travlerajm

Talk Tennis Guru
According to this forum, significant ratings manipulation is pervasive throughout USTA league play. Given the consensus and frequency of first hand accounts of this behavior, it leads me to believe, at some level, it must be true.

This would mean that the data set that any algorithm is using to make predictions is corrupted. Does this not put a ceiling on the maximum accuracy a model could theoretically achieve?

Additionally, I believe that as each year passes, more captains/players are becoming aware of and utilizing these tactics that were "pioneered" ~8 yrs ago...?

I suspect that Mr. Schmke and others are constantly massaging/refining their algorithm to attempt to remove/identify corrupted data in an effort to increase prediction accuracy...?

I theorize that if rating manipulation was in fact on the rise, that a current match prediction model would produce successful match predictions more frequently in past years compared to more recent years. For example, for the exact same model, predictions in 2016 would be more accurate than 2017, 2018,etc...?

Has anyone tried this?
My opinion is that ratings manipulation accounts for a very small fraction, perhaps less than 2%, of total usta league matches.

I stand by my belief that 80+ % prediction accuracy is absolutely achievable.

UTR is doing relatively poorly compared to this in part because of inherent flaws in the algorithm. Patching those flaws would probably be enough to get over the 80% threshold.
 

Klitz

Rookie
My opinion is that ratings manipulation accounts for a very small fraction, perhaps less than 2%, of total usta league matches.

I stand by my belief that 80+ % prediction accuracy is absolutely achievable.

UTR is doing relatively poorly compared to this in part because of inherent flaws in the algorithm. Patching those flaws would probably be enough to get over the 80% threshold.
Based on my limited understanding of the rating system, the "2%" would end up distorting far more than just those initial matches.

The innocent person(s) partaking in the match have their own ratings inaccurately adjusted from the manipulated match as well. This innocent player(s) then proceeds to take their inaccurate rating to a subsequent match, etc, etc...

Like the butterfly effect of corrupting rating data. However, every additional match separated from the manipulated match would have a smaller and smaller impact, assuming these subsequent matches were on the up and up.
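As a rough illustration of that damping, here is a toy Elo-style calculation (standard chess constants, not any tennis system's actual math) of how much an artificially inflated rating is expected to distort each successive "innocent" opponent down the chain:

```python
# Toy Elo chain: everyone's true level is equal, but one "manipulated" player
# enters overrated by 100 points. Each hop shows the expected error passed on.
K = 32

def expected(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

error = 100.0  # rating error of the player you just faced
for hop in range(1, 5):
    # True win chance is 50/50, but the model expects a loss to the overrated
    # player, so the innocent opponent gains points on average.
    error = K * (0.5 - expected(0.0, error))
    print(f"hop {hop}: expected rating error ~ +{error:.2f} points")
# ~ +4.5, +0.2, +0.01, ... : the distortion dies off within a couple of hops.
```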
 

Chalkdust

Professional
I stand by my belief that 80+ % prediction accuracy is absolutely achievable.

I just find this whole discussion funny.

Personally, I know my level can vary quite significantly from day to day, depending on how I'm feeling, how much I've been playing, how much I've been sleeping, and about 100 other factors.
I would guess that on any given day I can be as much as 0.2 DNTRP above or below my 'average'. Probably that's less usual and as good/bad as it gets, but I'd guess that most days I'm maybe 0.1 up or down.

Are you guys saying that you are consistently playing at your DNTRP each and every outing? I think that's hogwash.

As a result any prediction where the two players have a relatively close DNTRP is not going to have a high accuracy rate even if the rating algorithm is perfect, because of the day to day variance in level of both yourself and your opponent.
 

travlerajm

Talk Tennis Guru
I just find this whole discussion funny.

Personally, I know my level can vary quite significantly from day to day, depending on how I'm feeling, how much I've been playing, how much I've been sleeping, and about 100 other factors.
I would guess that on any given day I can be as much as 0.2 DNTRP above or below my 'average'. Probably that's less usual and as good/bad as it gets, but I'd guess that most days I'm maybe 0.1 up or down.

Are you guys saying that you are consistently playing at your DNTRP each and every outing? I think that's hogwash.

As a result any prediction where the two players have a relatively close DNTRP is not going to have a high accuracy rate even if the rating algorithm is perfect, because of the day to day variance in level of both yourself and your opponent.
That’s why 90% might be tricky.
 

schmke

Legend
I just find this whole discussion funny.

Personally, I know my level can vary quite significantly from day to day, depending on how I'm feeling, how much I've been playing, how much I've been sleeping, and about 100 other factors.
I would guess that on any given day I can be as much as 0.2 DNTRP above or below my 'average'. Probably that's less usual and as good/bad as it gets, but I'd guess that most days I'm maybe 0.1 up or down.

Are you guys saying that you are consistently playing at your DNTRP each and every outing? I think that's hogwash.

As a result any prediction where the two players have a relatively close DNTRP is not going to have a high accuracy rate even if the rating algorithm is perfect, because of the day to day variance in level of both yourself and your opponent.
This is true; in doing my ratings/reports I regularly observe that a player's match results will vary +/- 0.2 if not a bit more. Variability in how one plays, match-ups, partners, etc. all influence this, and this is normal.

Still, even with this variability and my attempt to more or less replicate what the USTA does (i.e. with this algorithm I haven't tried to improve predictive accuracy), it has ~73% accuracy. Given the variability we see for players, one might think 73% is very good, but you also have a fair number of matches between players farther apart where that variability doesn't change the result from expected.

Here is a table of accuracy based on the gap between players/pairings so you can see how, as you'd expect, the farther apart the players are, the more accurate the prediction.

Rating Gap     Overall  Singles  Doubles
Overall        73%      73%      72%
<= 0.05        54%      53%      54%
0.05 - 0.15    66%      63%      66%
0.15 - 0.25    79%      76%      80%
0.25 - 0.35    87%      85%      89%
0.35 - 0.45    92%      89%      94%
> 0.45         96%      95%      97%

The above is from analysis I did on my blog earlier this year for matches played in 2022.

I did similar analysis in 2015 and it wasn't a whole lot different.

Winning %

Gap            2014    2015
0.00 - 0.05    53%     53%
0.05 - 0.15    63%     62%
0.15 - 0.25    75%     73%
0.25 - 0.35    84%     83%
0.35 - 0.45    90%     89%
0.45 - 0.55    93%     93%
0.55 - 0.65    95%     94%
0.65 - 0.75    96%     96%
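For anyone who wants to build a table like this for their own data, here is a rough sketch of the bucketing; the match list at the bottom is hypothetical and the bucket edges just mirror the table above:

```python
# Sketch: bucket matches by pre-match rating gap and compute how often the
# rating favorite actually won in each bucket.
def accuracy_by_gap(matches, edges=(0.05, 0.15, 0.25, 0.35, 0.45)):
    """matches: iterable of (rating_gap, favorite_won) pairs."""
    buckets = {}
    for gap, favorite_won in matches:
        for i, edge in enumerate(edges):
            if gap <= edge:
                label = f"<= {edge}" if i == 0 else f"{edges[i - 1]} - {edge}"
                break
        else:
            label = f"> {edges[-1]}"
        wins, total = buckets.get(label, (0, 0))
        buckets[label] = (wins + favorite_won, total + 1)
    return {label: wins / total for label, (wins, total) in buckets.items()}

# Hypothetical data: (gap between players/pairings, did the favorite win?)
matches = [(0.03, True), (0.03, False), (0.20, True), (0.20, True), (0.50, True)]
print(accuracy_by_gap(matches))
```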
 