Warning (21.11.2012)
This post is now somewhat outdated. I have switched my data source for ratings, which means that everything I wrote about Elo ratings is more or less irrelevant.
This was supposed to be my post number 3000, but I guess it's a bit late.
So, if you've been following the GSL lately, you may have seen me make posts like this one where I post a list with a bunch of numbers that seemingly comes out of nowhere. I often get questions about various details, so I'll do a bit of a FAQ here.
I have taken a cue from Orek in regards to post structure, since this might get quite long.
Abstract
In this post, I talk a bit about Elo ratings, how they work and why they are good (and bad). I then show how you can use them to predict the results of tournaments, and I present a pet project of mine that allows you to do just that, more or less without needing to know the details involved (though I would recommend that you briefly look over them anyway.)
The second-to-last section contains a plea for suggestions and contributions.
Finally I answer some stray questions, mostly about how to interpret statistical predictions.
Read this if you are interested in Elo ratings, what they can do and what they can't do.
What are Elo ratings?
The Elo rating system is one of several systems attempting to estimate the strength of players in two-player games. It was designed by Hungarian physicist Arpad Elo for use with chess, and it remains in use today by the International Chess Federation.
Why do we need such a system?
We don't need it, per se, but it's always interesting to have a strictly objective source when discussing hypothetical situations of player A vs player B, who is the best in the world right now, and bonjwaness.
Estimating playing strength in such a manner is a difficult problem, because we have incomplete information. In each month, for example, a player might not play more than 20 games against only a handful of opponents. Given two players who have not played each other for a long time, how would you assess a hypothetical match between the two?
So how do Elo ratings work?
This is how it works in chess:
- Each player gets an initial rating. This number can be anything, but TLPD appears to use 2000.
- When two players with ratings A and B compete, a formula gives the expected score for each player (0 for a loss, 1 for a win). Generally, they play more than 1 game (let's say they play G games), player A scores a and player B scores b. Of course, a + b = G.
- For each player, their performance rating is computed. This is the rating at which the expected score is equal to the actual score. For example, if player A won three out of four games against a player rated 2000, his performance rating is the rating where this result would be expected if the rating of player B were spot on (in this case, somewhat more than 2000).
- If the match was a sweep (say player A won 4-0), he technically gets a performance rating of infinity. In this case, the performance rating of player A is set to the actual rating of player B plus some fixed number (around 400 I believe.)
- The ratings of player A and B are then adjusted somewhat in the direction of their performance ratings (either up or down, depending on whether they scored better or worse than expected.) The magnitude of the adjustment depends on the difference between performance rating and current rating, as well as a volatility factor called a K-factor in the chess world, which is a subject of much debate.
Some things to note about Elo ratings.
- Even if player A beats player B, he can lose rating points if his win was not as clear as expected.
- Elo ratings do not tell you how likely it is that player A beats player B, only the expected scores of each.
- However, if the game only allows scores of 1 and 0 (unlike chess, for example, where a draw is possible), we can work out this probability in a way such that the system is consistent. The upside is that in Starcraft, where draws are impossible, Elo ratings allow us to estimate the probability that player A will beat player B in a single game.
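To make this concrete, here is a minimal sketch of the textbook Elo formulas. TLPD's exact implementation is hidden from view, so treat this as an illustration rather than a reproduction; the K-factor of 24 is an assumption, and in a draw-free game the expected-score function doubles as a single-game win probability estimate.

```python
# Minimal sketch of textbook Elo; not necessarily what TLPD actually does.
def expected_score(rating_a, rating_b):
    """Expected score per game for player A; in a draw-free game, this is
    also the estimated probability that A beats B in a single game."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, score_a, games, k=24):
    """Player A's new rating after scoring score_a points out of `games`."""
    expected = games * expected_score(rating_a, rating_b)
    return rating_a + k * (score_a - expected)

# A 2000-rated player goes 3-1 against another 2000-rated player:
print(expected_score(2000, 2000))   # 0.5
print(update(2000, 2000, 3, 4))     # rating moves up, since 3 > expected 2.0
```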
Differences between chess and Starcraft.
Players tend to have more volatile ratings in Starcraft than in chess. This is not a problem with the system, but rather a problem with the game, or the players. For this reason, Elo ratings in Starcraft tend to lag behind the player's true skill more so than in chess. The general rule of thumb is that young up-and-coming players tend to be underrated, and older players past their peaks tend to be overrated. The same lag applies to slumps.
Bottom line is, there is more reason to be skeptical of Elo ratings in Starcraft than in chess.
The other key difference between chess and Starcraft is that Starcraft has an additional variable that is worth tracking, namely a player's race. TLPD does this, by giving each player a matchup-specific rating, as well as a global one. This is potentially useful information that should not be discarded, but unfortunately it's not at all clear how to combine these numbers into a "total" Elo rating for any given matchup.
We could just use the global rating, but that would be discarding information about matchups, which we know is not insignificant.
We could just use the matchup-specific rating, but this rating is based on a pool of games that are less frequent than the global rating, and so we should have less faith in it.
We could use an average of the global and matchup-specific rating. The potential problem here is that for an up-and-coming player (for example), the global rating will react to his performance faster than the matchup-specific ratings, because it has three times as many games to go on. This can lead to strange situations where the global rating is higher than all the matchup-specific ratings. How are we to interpret this?
The system I am currently using (after convincing myself that the above was flawed) is this: take the average of all three matchup-specific ratings and subtract it from each of them (in math-speak: normalize them to have zero mean), scale them by a factor H and add the result to the global rating. This ensures that the global rating is a measure of the true skill of the player, while the matchup-specific ratings merely indicate how much this skill varies from matchup to matchup. Low or high values of H indicate that we have little or much faith in the matchup-specific ratings compared to the global one. I am currently using H = 0.3, because I believe the global rating is quite a bit more accurate.
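As a sketch, this is what that combination looks like in code. The function and variable names are mine, not TLPD's, and H = 0.3 as above:

```python
# Sketch of the matchup adjustment described above (names are mine, not TLPD's).
def matchup_ratings(global_rating, vp, vt, vz, h=0.3):
    """Combine a global rating with matchup-specific ratings (vs. P, T, Z)."""
    mean = (vp + vt + vz) / 3.0
    # Normalize the matchup ratings to zero mean, scale by H, add to the global rating.
    return {race: global_rating + h * (r - mean)
            for race, r in (('vP', vp), ('vT', vt), ('vZ', vz))}

# Example: a player whose vZ rating is well above the other two matchups.
print(matchup_ratings(2200, vp=2150, vt=2180, vz=2300))
```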
Differences between Team Liquid and FIDE.
Added to this, the TLPD ratings work somewhat differently from the FIDE ones.
First of all, the exact implementation details of TLPD are hidden from view.
Secondly, and this is a real kicker, FIDE ratings (and in fact all other chess rating systems) are organized in periods, whereas the TLPD ones are not, they are updated on the fly, game by game. FIDE uses the ratings at the end of the previous period in all calculations during the next one, updating the ratings only after the end of another period. These periods keep getting more frequent (currently ratings are published every month.)
The upside of this is that TLPD ratings reflect somewhat the volatility of the skills that they are trying to measure. The downside is that they often overreact. This should be kept in mind, but it's not clear to me whether it means TLPD ratings should be more or less trusted.
Elo works best for...
Elo ratings work best for a (potentially large) group of players that compete relatively frequently and uniformly against each other. This describes the Korean SC2 scene well.
They do not work so well if there are subgroups with infrequent competition across groups. Rating comparisons would be accurate within each group, but poor across groups. This is why TLPD has a different list for international SC2. It is not at all clear that there is sufficient contact between the two scenes to combine the lists into one and use it to compare the ratings of an international player who has never been to Korea with those of a Korean player who has never been abroad.
This is also why ratings for secluded scenes within the international list should be taken with a grain of salt. There is a lot of contact between EU, NA and SEA, for example, but not so much with Taiwan and China.
Could we do better than TLPD Elo ratings?
Probably. The Glicko systems are popular. They include not only ratings, but also information such as rating deviation (a measure of how much faith we have in a player's rating), and, for the Glicko II system, volatility, which is a measure of how variable a player's skill level is.
In this system, games against players with small rating deviations (RDs) will have a larger impact on your rating than those against players with large RDs. If you win against a player we know nothing about, that says very little about your skill level, but if you beat Mvp (whom we are pretty sure is quite good), that says quite a lot.
Furthermore, if your own RD is small, your rating will tend to change more slowly, since the information content of new games is small compared to the information content of your current rating, whereas for a new player, almost all new information comes from new games.
As time passes, your RD will tend upwards, and you can only keep it stable by playing games.
The Glicko II system also introduces a volatility measure, which is supposed to prevent RDs getting too small. The volatility is an attempt to measure the natural variability in a player's skill.
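For illustration, here is the RD-inflation step from Glickman's published description of Glicko (the generic formula, not whatever SC2 Charts run; the constant c is a tuning parameter and the value below is arbitrary):

```python
import math

# RD creeps back up towards its maximum the longer a player sits idle.
def inflate_rd(rd, periods=1, c=35.0, rd_max=350.0):
    """Rating deviation after `periods` rating periods without games."""
    return min(math.sqrt(rd ** 2 + periods * c ** 2), rd_max)

print(inflate_rd(50))               # an established player: RD grows a little
print(inflate_rd(50, periods=100))  # a long absence: RD hits the cap
```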
There actually is an available implementation of the Glicko system here. There are some reasons why I prefer to use Elo ratings from TLPD at the moment, however.
First, SC2 Charts do not report any data other than ratings. To estimate the probability of a player winning a game, I need RDs too.
Secondly, I don't quite know which scale they use. The top ratings go upwards of 36,000 points; when I implemented my own Glicko system, I had ratings not much higher than 2300.
Thirdly, it doesn't appear that they took the effort to separate scenes. This might not be a big problem.
Fourth, they don't have matchup-specific ratings.
Finally, their search function does not appear to be working. This is a bit of a bother. I type a player's name, hit enter, and whatever I type, I always end up here. Am I just doing something wrong?
So the answer is yes, but for now, I will stick to Elo.
Read this if you are interested in tournament formats and probability.
The story so far.
Given player A and player B, Elo ratings taken from TLPD give us a decent (but not perfect) estimate of the probability that player A beats player B in a single game.
If we extend this to a tournament (with some kind of hitherto unspecified format), we can use these numbers to estimate the probability of any given outcome of that tournament. This involves the use of maths.
A best-of-N match
This is the simplest and most commonly used format out there, and it's often used as a subformat for more exotic types (e.g. each 'game' in a round robin group is actually a best-of-3, or whatever). So this is a good place to start.
Let the two players A and B be given, and assume that each will win any given set with probabilities pa and pb, respectively.
What is the probability that player A will win with A sets to B, with A > B?
Well, the binomial distribution comes to mind. The probability that player A wins A games out of A+B is
C(A + B,A) pa^A pb^B.
(Here, C(m,n) is the binomial coefficient.)
This is almost, but not quite, what we are looking for, because it counts some nonsense cases: WWWL, for example, is counted as a 3-1 win in a best-of-5, when the match would actually have ended 3-0. To get the right figure, we need to impose the condition that player A wins the last game. So we are actually looking for the probability that player A wins A-1 of the first A+B-1 games and then wins the last one:
P(A,B) = C(A+B-1,A-1) pa^(A-1) pb^B pa = C(A+B-1,A-1) pa^A pb^B
And there's a similar formula for player B's view:
Q(A,B) = C(A+B-1,B-1) pa^A pb^B
Now you can calculate all sorts of things. What is the probability that player A wins a best-of-7?
P(4,0) + P(4,1) + P(4,2) + P(4,3)
What is the probability that the game will go to the last set?
P(4,3) + Q(3,4)
And so on...
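Here is a short Python sketch of these formulas. The helper names are mine, and pa is the single-game win probability for player A (presumably taken from the Elo ratings):

```python
from math import comb

def p(a, b, pa):
    """Probability that player A wins the match with a final score of a-b."""
    pb = 1.0 - pa
    return comb(a + b - 1, a - 1) * pa ** a * pb ** b

def match_win(n, pa):
    """Probability that player A wins a best-of-N match (N odd)."""
    need = n // 2 + 1
    return sum(p(need, b, pa) for b in range(need))

pa = 0.6
print(match_win(7, pa))                 # P(4,0) + P(4,1) + P(4,2) + P(4,3)
print(p(4, 3, pa) + p(4, 3, 1 - pa))    # goes the distance: P(4,3) + Q(3,4) by symmetry
```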
Single elimination brackets
This is also one of the easier formats to deal with. Given a single match somewhere in the bracket, a player will emerge from each of two sub-brackets (left and right), they will play a best-of-N match against each other, and the winner will advance to the next match.
Assume players L1 through LM and R1 through RM win the left and right sub-brackets with probabilities P(L1) through P(LM) and P(R1) through P(RM) respectively. We can use the law of total probability to find the probability that player L1 (for example) wins the full bracket:
P(L1 wins) = sum(k=1...M) P(L1) * P(Rk) * P(L1 beats Rk in a best-of-N)
What we have then is a system for determining the probability of a given player winning a bracket of R rounds, if we know his probability of winning a bracket of R-1 rounds. To complete the picture, the keyword is of course recursion. The base case is a bracket of only one round, which is just a best-of-N match on its own (see the previous point).
As far as other statistics go, it's also relatively easy to compute the life expectancy of a player, by which I mean the number of rounds he or she can expect to win. (A player with life zero will lose the first match, a player with life 1 will win the first match and lose the second, and a player with life R will win the whole bracket. The life expectancy is thus a number between 0 and R.)
The life expectancy of a player in a round-R bracket is:
P(win round 1) + P(win round 2) + ... + P(win round R)
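A rough sketch of that recursion follows. The beat(p, q) argument stands in for the Elo-based best-of-N win probability from the previous point; everything else is illustrative:

```python
def bracket_win_probs(players, beat):
    """Probability of each player winning a single elimination bracket.

    `players` is a list in bracket order whose length is a power of two;
    beat(p, q) is the probability that p beats q in one best-of-N match.
    """
    if len(players) == 1:          # base case: a lone player "wins" trivially
        return {players[0]: 1.0}
    half = len(players) // 2
    left = bracket_win_probs(players[:half], beat)   # P(L_i wins left sub-bracket)
    right = bracket_win_probs(players[half:], beat)  # P(R_k wins right sub-bracket)
    probs = {}
    # Law of total probability over the possible opponents from the other side.
    for p, pl in left.items():
        probs[p] = pl * sum(pr * beat(p, r) for r, pr in right.items())
    for r, pr in right.items():
        probs[r] = pr * sum(pl * beat(r, p) for p, pl in left.items())
    return probs

# Toy example: A beats everyone 60% of the time, all other pairings are even.
beat = lambda p, q: 0.6 if p == 'A' else (0.4 if q == 'A' else 0.5)
print(bracket_win_probs(['A', 'B', 'C', 'D'], beat))
```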
Round robin groups
Round robin groups are pretty cool because the matches that have to be played are set beforehand and will never change. They are also shitty because of all the different types of tiebreakers floating around, which must also be considered.
The general framework I'm considering here is a group consisting of M players, where each player plays everyone else in a best-of-N match. At the end, the only results that matter are the match and set scores (either summed over all matches or only among tied players), and the players are ranked according to some scheme of tiebreakers.
Since a best-of-N match has N+1 possible outcomes, and there are M(M-1)/2 matches in total, a group like this can play out a total of
(N+1)^(M(M-1)/2)
different ways. How big could this number be? Well, the largest GSL up-and-down groups have six players and best-of-1, giving 2^15 = 32768 possibilities, which is quite doable. The NASL groups, which are the largest ones I know of, have nine players and best-of-3, giving 4^36 ~ 10^21, which is well beyond the realm of reasonability.
(Thanks to Rannasha who pointed out my mistake here.)
Suppose now that we have enumerated (somehow) all possible ways a group can play out, and each scenario has a probability associated with it, which is just the product of the probability of each match having that outcome. All we have to do then is to rank all the players for each outcome.
This is a really fun programming problem that many of you would enjoy trying on your own, I am sure. Basically, what I did was to iterate through a list of tiebreakers in prioritized order, sort the table according to the tiebreaker, identify the unresolved ties (if any), and recurse, calling the tiebreaker function on each unresolved sub-table with the next tiebreaker.
Things I realized that are important to keep in mind:
- Sometimes, you need to recurse using the same tiebreaker again. For example, if four players are tied and the tiebreaker being used is internal match score, the tie might be broken with two players on 2-1 and two players on 1-2, but the sub-ties are not resolvable in this recursion. The next recursion will be called on both sub-ties, but you need to use the same tiebreaker (internal match score) again, because then the ties will of course be broken. Basically: if the tiebreaker is of the "internal" type (not "total"), and the sub-ties are strictly smaller than the original tie, recurse with the same tiebreaker, not the next one.
- Some ties are unbreakable. In this case, rematches must be simulated, but that's pretty easy if your code is modular enough. Just dump the offending players into a new group and run again.
- Some ties are still unbreakable. Consider the three-player group with everyone going 1-1. You would have an infinite recursion in this case, because the super-tie is always possible on every level. What to do? Simple: since the super-tie results in the same group again, simply ignore these situations. Pretend they never happened. The only problem this will cause is that the probabilities of each outcome don't add up to 1 any more. To fix this, simply rescale all the relevant quantities to "bridge the gap," so to speak.
Note that even though there aren't too many possible outcomes in a large group, it will still take a long time to compute because of the tiebreaker mechanism I use. Say, a 16-player group will have several possible 15-player ties that must be replayed, all of which have 13-player ties, and so on. I tried it myself, and it's not fun. Some obvious tweaks can be made to make it more efficient, but I'm not sure at the moment how helpful they'll be.
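As a heavily simplified sketch of the enumeration idea: the code below brute-forces a best-of-1 round robin group and reports each player's probability of finishing alone in first place by match wins. The full tiebreaker recursion described above is omitted, and the names are mine:

```python
from itertools import combinations, product

def clear_winner_probs(players, beat):
    """Brute-force a best-of-1 round robin; probability of an outright winner."""
    matches = list(combinations(players, 2))
    result = {p: 0.0 for p in players}
    # Each best-of-1 match has two outcomes: 0 = first listed player wins.
    for outcome in product((0, 1), repeat=len(matches)):
        prob, wins = 1.0, {p: 0 for p in players}
        for (p, q), o in zip(matches, outcome):
            winner = p if o == 0 else q
            prob *= beat(p, q) if o == 0 else beat(q, p)
            wins[winner] += 1
        best = max(wins.values())
        leaders = [p for p in players if wins[p] == best]
        if len(leaders) == 1:        # a tie for first would need the tiebreakers
            result[leaders[0]] += prob
    return result

beat = lambda p, q: 0.5              # toy model: every game is a coin flip
print(clear_winner_probs(['A', 'B', 'C', 'D'], beat))
```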
MSL-style groups
Many of the newer guys might not know what I'm talking about. To be precise, it's the four-player group format used by the GSL in their rounds of 32 and 16. It's basically a tiny four-player double elimination bracket without a grand finals.
I will tackle the more general double elimination brackets in the next point, but MSL groups deserve their own treatment for reasons that will hopefully be more clear soon.
The reason you can't deal with an MSL group the same way you deal with single elimination brackets is because of the mixing of losers and winners. If you use the same formulas as I gave two points back (with the obvious changes) you will, for example, see that there is a nonzero probability that a player faces himself in the final match! This is obviously bullshit, but is still counted because the formulas assume that the players from each sub-bracket are independent. This is true in a single elimination bracket, but not in double elimination.
This makes implementing double elimination brackets somewhat awkward, but MSL groups are still so small that it's pretty easy to do. What I did basically boils down to stacked loops: for each possible outcome in the two first matches, and for each possible outcome in the resulting winner's and loser's matches, and for each possible outcome in the final match... etc.
Just sum up the probabilities for each player to advance (in one of two spots), and you're done! (I couldn't really think of a lot of other interesting statistics to compute.)
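The stacked loops look roughly like this. beat(p, q) is again the best-of-N win probability, and the structure mirrors the GSL group: two opening matches, a winners' match, a losers' match, and a final match (this is my own illustration, not the code from my package):

```python
def advance_probs(a, b, c, d, beat):
    """Probability of each player taking one of the two advancement spots."""
    advance = {p: 0.0 for p in (a, b, c, d)}
    for w1, l1 in ((a, b), (b, a)):                  # opening match 1
        p1 = beat(w1, l1)
        for w2, l2 in ((c, d), (d, c)):              # opening match 2
            p2 = p1 * beat(w2, l2)
            for ww, wl in ((w1, w2), (w2, w1)):      # winners' match
                p3 = p2 * beat(ww, wl)
                advance[ww] += p3                    # its winner advances in first place
                for lw, ll in ((l1, l2), (l2, l1)):  # losers' match
                    p4 = p3 * beat(lw, ll)
                    # final match: winners' match loser vs losers' match winner
                    advance[wl] += p4 * beat(wl, lw)
                    advance[lw] += p4 * beat(lw, wl)
    return advance

beat = lambda p, q: 0.5
print(advance_probs('A', 'B', 'C', 'D', beat))       # everyone at 50% in this toy case
```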
Double elimination brackets
Because of the problem I pointed out in the previous point, pretty much the best way to deal with general double elimination brackets is to brute-force them the same way we did round robins, but the largest double elimination brackets can be really huge. Luckily, it doesn't matter what the exact score in a match is; the only important thing is to track the winners (I'm looking at you, extended series). Thus, a double elimination bracket with M players has roughly
2^(2M)
ways to play out. (There are 2M matches because each player must lose two matches to drop out. Add and subtract a handful... doesn't really matter, and for each match only two outcomes matter.) This is perhaps doable for 16 players, but not more than that.
The biggest challenge here, compared to previous formats, is essentially just to set up the matches and the links between them. Once you've done that, it's a matter of enumerating the outcomes and tallying scores much like how we did it with round robin groups, except there are no pesky tiebreakers to keep us awake at night.
Note that many double elimination formats try, as much as they can, to keep players from meeting twice, and they do this by dropping players down into the loser's bracket in different ways. The most popular way I've seen this done is to reverse the drop-down order every other round, but there could be others that I haven't seen. This could have an impact.
For double elimination brackets, the most interesting statistics that I thought of are again probability to win and life expectancy. For the latter, you can't count the number of rounds won, because people can drop out at the same place in the loser's bracket, having played different numbers of matches (the winner's bracket is shorter). Better would be just to keep track of which loser's bracket round someone drops out in.
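For the bookkeeping part, a sketch of the match-and-link structure might look like the following (this is illustrative, not how my package does it): each match only needs to know where its winner and loser go next, and enumerating outcomes is then a matter of walking these links.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Match:
    name: str
    players: list = field(default_factory=lambda: [None, None])
    winner_to: Optional[tuple] = None   # (match, slot), or None for the champion
    loser_to: Optional[tuple] = None    # (match, slot), or None if eliminated

    def report(self, winner_index):
        """Record a result and push both players along their links."""
        winner = self.players[winner_index]
        loser = self.players[1 - winner_index]
        for player, dest in ((winner, self.winner_to), (loser, self.loser_to)):
            if dest is not None:
                target, slot = dest
                target.players[slot] = player

# Tiny example: two winners' bracket matches feed a winners' final,
# and both losers drop into the same losers' bracket match.
wf = Match('winners final')
lb = Match('losers round 1')
m1 = Match('WB match 1', ['A', 'B'], winner_to=(wf, 0), loser_to=(lb, 0))
m2 = Match('WB match 2', ['C', 'D'], winner_to=(wf, 1), loser_to=(lb, 1))
m1.report(0)   # A beats B
m2.report(1)   # D beats C
print(wf.players, lb.players)   # ['A', 'D'] ['B', 'C']
```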
Monte Carlo simulations
In the cases where groups and double elimination brackets get too big for exact computations, we will have to switch to the Monte Carlo method. I haven't implemented this yet, but the idea is quite simple: run a reasonably large number of simulations, and base your predictions on those.
I'll get back on this point later, I guess.
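The idea in code, for the simplest possible case of a single best-of-N match (everything here is a toy example; the point is just the counting):

```python
import random

def simulate_match(pa, n):
    """Play out one best-of-N match where A wins each game with probability pa."""
    need, a, b = n // 2 + 1, 0, 0
    while a < need and b < need:
        if random.random() < pa:
            a += 1
        else:
            b += 1
    return a > b

def monte_carlo(pa, n, trials=100_000):
    """Estimate A's probability of winning a best-of-N by simulation."""
    return sum(simulate_match(pa, n) for _ in range(trials)) / trials

print(monte_carlo(0.6, 7))   # should land near the exact best-of-7 figure (~0.71)
```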
Other?
This is a list of tournament formats that I would like to work on, but which have been deprioritized.
- Combined tournaments (such as the GSL.) Problems: Group selection ceremonies are an element that I just can't predict, and I don't know if randomizing is a good way to model them. There are also so many different outcomes to a large tournament that the only way to simulate it would be using Monte Carlo methods.
- MLG. There's just so much that the MLG does different from anyone else that I haven't bothered looking at it yet. It's not just the extended series, but the whole championship bracket is pretty warped.
Read this if you want to do your own number crunching.
I have a software package with all my work so far. It supports everything I've written above, namely exact simulation of the five discussed tournament formats. ("Exact" here meaning under the assumption that the Elo model is accurate.) It also gets Elo ratings straight from TLPD (to be used with caution, although you can also enter them manually). To top it off, it lets you enter match results (partial or complete) as they come in and see the statistics update in real time.
I've also taken some pains to make it user-friendly, although I concede that I'm only a hobby-level programmer; my professional background is in mathematics, not software development.
The git repository is here: https://github.com/TheBB/simul
Now, I'm not a git know-it-all either, but I'm pretty sure everyone can just clone this repository and get going. If you don't know how to do that, there is a downloads page.
Everything is written with Python 3 in mind. On Linux it should run pretty much straight out of the box.
It has not been tested on Windows or Mac OS. I'm confident that everything should work there, except possibly the readline stuff. Feedback is welcome.
The main executable script is simul.py.
For instructions for use, please refer to the readme file bundled with the code. It is also displayed on the repository homepage with some nicer formatting.
Since I am the only one to have ever used this program, it's probably lacking quite a bit in terms of usability. Comments are welcome!
Read this if you have suggestions or if you want to contribute.
Suggestions?
Oh god, yes, please! I would love nothing more. You can post them in this thread, or send me a PM. Almost anything goes.
Contributions?
Clone the repository and code away. Once you have something cool, let me know and I might merge it in. If you're not sure about your idea beforehand, feel free to talk it over with me first! (Post here or PM. We can exchange e-mail addresses.)
Beware though: my code contains essentially zero comments. This can be headache-inducing for others, I'm sure, and it's a habit I've gotten used to since most of my work involves writing code only for myself. I will get around to commenting eventually. If enough people are interested, I will do it sooner rather than later.
Other FAQs
Your predictions were wrong!
Well, even setting aside the fact that we are trying to predict random phenomena (so some errors are bound to happen), there are hundreds of reasons why this could happen, but one of the most frequent is that you thought the model predicted something that it actually didn't. As with any statistical tool, this will be most useful if you know how to interpret its outputs.
One of the most frequently occurring mistakes is related to the MSL group. Suppose you have four players ranked such that A and B are roughly equal, and C and D are roughly equal, but A and B are both somewhat more skilled than C and D. A and B play each other in the first match, and so do C and D. You might get something like this:
A: 55%
B: 55%
C: 45%
D: 45%
Now, which pair of players is most likely to advance? If you answered A and B... you might be wrong! Two players who are paired up in the opening match advance together much less often than two players who aren't.
Another user error is to exaggerate percentages. Suppose you have a match with two players, where player A is favoured with a certain percentage. I suggest the following interpretations:
- 50-60%: Even
- 60-70%: Slight advantage to advantage
- 70-80%: Advantage to heavy advantage
- 80-90%: Heavy to overwhelming advantage
- 90-100%: Overwhelming advantage
Your numbers don't make any sense!
Setting aside the possibility of a bug in the code, if you accept that my Elo model is a perfect predictor, then the numbers are correct, like it or not. Yes, it could mean that someone with a better score in a round robin group has a lower probability of winning than someone with a worse score, if the player with the worse score has a sufficiently high rating to compensate for it. If you, Mr. Average Joe himself, are up 2-0 against Mvp in a best-of-5, would you consider yourself favoured to win? Probably not.
If you think the numbers are off, you must accept that something is off with the Elo model. That, of course, is perfectly possible, and in some cases even to be expected. (See the discussion about Elo ratings.)
But Mvp doesn't care about ratings!
This is an annoying one, because it is so wrong and so right.
It is wrong because if someone overperforms, it only means that their rating was lower than it ought to be, and it is subsequently adjusted upwards. Nothing very exceptional about this; it happens literally all the time, and it is not related to anyone's ability to simply switch off their opponent's skill at will. Ratings are meant to measure and predict, not determine.
It is also right, because match context does matter. Most players actually do perform better (or worse) depending on a range of factors, and the effects vary:
- is the match streamed live?
- is the match played in front of an audience?
- is it a very high-profile match (like a GSL final)?
- is jetlag involved?
- can you BM MarineKing?
How accurate are your predictions?
To be honest, I haven't looked closely at the accuracy. I started doing these from the semifinals in GSL Season 4, where I predicted a Rain vs Taeja final (we know how that went). I did correctly predict that Life would beat Mvp. After that, I did quite well in Code A.
The round of 32 in this season's GSL has been somewhat surprising (not just to me, but to many others). Apart from Bogus and Soulkey (I mentioned how KeSPA players have inaccurate ratings at the moment), the least likely player to advance who did advance was Ryung at 41%.
Hero, Squirtle and Taeja were massive favourites to advance. Famously, they did not.
Interestingly enough, Hero and Squirtle, next to Life the most likely to advance, both fell to Soulkey and Bogus, who are the most surprising advancees.
All in all I think I am doing reasonably well, all things considered. I am probably better than a monkey (50-50 all the time) and not quite as good as a skilled liquibetter.
Could you not update Elo ratings during a tournament?
Well, I could. The only problem is that it violates the whole periodization argument. I know it's violated anyway by TLPD, but every little bit helps. Then again, an argument could be made that updating ratings throughout the tournament helps predictions when severely under- or overrated players are playing (such as KeSPA members at the moment).
What about map imbalance?
Well, it's a factor that is worth tracking (and which can be objectively measured), but I'm not sure how to do it, or how useful it would be. It's also not at all clear how it affects formats where the loser can pick the next map (so that a 4-0 sweep, for example, is somewhat less likely than if all maps were perfectly balanced).
It's not something that I plan to look at, but never say never.
Did you just take Elo seriously lol
Keeping in mind everything I said about taking things with a pinch of salt: yes, I did.
lol....
You keep spelling ELO wrong.
No, I fucking don't. It's not an acronym, so stop pretending like it is.