On the most recent State of the Game podcast, there was discussion of
MLG's extended series rule in their double elimination
tournament. This post explores the effects of the extended series rule
on tournament outcomes, using a simplified model of players and
tournaments. Several tournament formats are explored: round robin,
single elimination, double elimination, and double elimination with
extended series. Performance is measured by averaging over many
simulations, using several distance metrics from the 'ideal ranking'
of players. Results show a small but measurable improvement in
performance when using the extended series rule; with 64 players in a
best-of-three format, the 'best player' wins 1% more often (25%
compared to 24%) using the extended series rule than with simple
double elimination. However, the improvement from the extended series
rule is marginal compared to the overall tournament format; in
single-elimination, the best player wins 19% of the time, and in
round-robin the best player wins 47% of the time.
1. INTRODUCTION
Skip this section if you are familiar with the debate about MLG's extended series rule.
MLG is the largest Starcraft II tournament in North America, and
consequently its tournament format has a large impact on the
competitive scene. MLG employs a fairly standard double-elimination
tournament format, with each round determined by a best-of-three
series. However, MLG has an additional wrinkle called 'extended
series', which many people find counter-intuitive. To explain these
complexities, let's start with an overview of different tournament
formats.
A single elimination tournament is the simplest format that most
people are familiar with. Play proceeds in rounds, with all players
starting in the same round. Players are then paired in each round and
play a series. The winner proceeds to the next round, and the loser is
eliminated from the tournament. This format has the advantage of
determining a champion in very few games (O(log(# of players))), but
the disadvantage that bad luck can knock out good players at an early
stage.
To help with this problem, double elimination tournaments ensure that
any player must lose twice in order to knocked out of the
tournament. This is done by having two brackets: 'winners' and
'losers'. All players begin in the winners' bracket, and after losing
once are sent to the losers'. Players in the loser's bracket play each
other, as well as all players who join the losers' bracket from the
winner's bracket. Therefore, players in the losers' bracket play
twice as many series as those in the winners'.
MLG has extended the double elimination format with an 'extended
series' rule that is invoked when players meet twice in a single
tournament. If players meet in the winners' bracket, and later again
in the losers' bracket, then instead of playing a new best-of-three
series, their series from the winners' bracket is resumed as a
best-of-seven series. Example: If Alice beats Bob 2-1 in the winners'
bracket, and they meet again in the losers' bracket, then they will
play a best-of-seven series to determine the winner with a starting
score 2-1 in favor of Alice. Alice has to win two games to proceed,
and Bob has to win three.
This rule is intended to avoid some paradoxical outcomes, as well as
statistically increase the likelihood that the 'better player'
continues in the tournament. It is possible in standard double
elimination for Alice to defeat Bob 2-0 in the winners', and Bob to
defeat Alice in the losers' 2-1. The "overall series" between Alice
and Bob is 3-2 in Alice's favor, but Bob continues and Alice does
not.
Similarly, another argument is that double elimination exists in order
to give better players a 'second chance' to continue in the tournament
when defeated by inferior players, but this logic does not apply when
the same players meet again. In this case, it makes more sense (so the
argument goes) to extend the series to determine the 'better player'.
Despite these arguments, the extended series has generated
controversy because in many instances the tournament setting is very
different when the series resumes, and many people find it
unentertaining and counter-intuitive.
In particular, the extended series between Liquid`Tyler and PainUser
at MLG Dallas demonstrates some of the problems. In their series in
the winners' bracket, Liquid`Tyler fell victim to a mistake of the
tournament organizers, and was forced to restart a game that he had a
clear advantage. Liquid`Tyler subsequently lost the series 2-0, which
some have argued was due to the psychological effect of the game
restart. When they later met in the losers' bracket, Liquid`Tyler was
at a significant disadvantage, and lost the extended series 2-4, but
would have won a best-of-three.
This post is organized into several sections. Section 2 describes how
these results were gathered, and the various models used. Section 3
describes the experimental setup. Section 4 presents the
results. Section 5 concludes, and Section 6 shows where to follow up
on this if you are interested.
1.1 SCOPE
This post is an in-depth analysis of the statistical performance of
different tournament formats. It is not concerned with many other
important questions, for example:
* What is the purpose of tournaments, beyond determining skill of
players?
* Is the extended series rule entertaining?
* Is the extended series rule morally justified?
* Players aren't strictly 'better' or 'worse' than each other -- or,
at least, this relationship isn't transitive between players.
* The tournament setting can change when an extended series resumes.
These questions will and have been addressed elsewhere.
2. DESCRIPTION
This post explores the accuracy of several tournament formats,
focusing on the impact of the extended series rule. This is done using
simulation, running through many thousands of tournaments and
comparing the average results. This section describes the player
model, tournament model, and accuracy metrics used in the results.
2.1 PLAYER MODEL
Players are modeled using a simple randomized model. The goal is to
have players of greater or lesser skill, but have each player vary
somewhat in their performance. Players therefore consist of two
numbers: mean performance and deviation. Performance for a single
player is randomly generated each game, and lies in the range
[mean - dev, mean + dev].
The mean performance lies between 0 and 2, and the deviation is always
1. This ensures that the worst player can always beat the best player,
however at the extremes this is unlikely.
A players performance is calculated as follows:
performance = mean + dev * rand^2 * plusminus
Where rand is a uniformly-distributed number in [0,1] and plusminus is
seleted from {-1,1} with even probability. This formula makes the mass
of the probability distributed concentrated around the mean, making
the better players win more often.
To generate a set of players for a tournament, each player's mean is
selected uniformly from [0,2]. This is probably inaccurate -- player's
mean performance is likely distributed on a normal curve. The player
model is probably the biggest weakness in this study, however I still
believe the first-order effects are well captured in the analysis.
2.1 TOURNAMENT MODEL
The rules for each tournament are faithfully replicated in the
simulation, however there are some modelling choices here as well. The
most significant is the seeding of players in each tournament. I have
chosen to use the "ideal seeding", as determined by players' mean
performance, as the initial seeding for players. This removes a source
of inaccuracy from elimination tournaments, and so the results should
be taken as an upper bound for their performance.
Four tournament types are considered: single elimination, double
elimination, double elimination with extended series, and round
robin. The focus of this post is on the effect of extended series, but
single elimination and round robin are included in order to give some
context for these results.
A round robin tournament is one where every player plays every
other. Players are then ranked according to their number of wins. This
tournament produces a complete ranking, first through last, and
because everyone plays everyone, it is very accurate. The down side is
that it requires a lot of games (O(# players)) and is less exciting
than other tournament formats. However, because it is so accurate, it
can be used to calibrate the accuracy of elimination tournaments by
showing a "speed of light" for tournament efficacy.
Similarly, single elimination tournaments show the other end of the
spectrum. They are very fickle in their results, and show relatively
how much of an improvement the extended series rule makes over
standard double elimination.
2.2 MEASURING ACCURACY
One of the principle challenges is determining how to measure
performance of a tournament -- how can we say that one tournament is
"better" than another? The approach taken is to have each tournament
produce a ranking of players, first through last, and compare this
ranking to the ideal ranking, as determined by players' mean
performance.
This produces its own challenges, as elimination tournaments do not
strictly produce a ranking. However, taking seeding into account, an
elimination tournament does sort players into categories based on how
far they made it through the tournament. The ranking of players is
determined as players are eliminated from the tournament -- first
eliminated places last, and so on.
Three metrics are used to measure performance: winner, depth, and
2^depth.
* The 'winner' metric determines performance based on a very simple,
intuitive rule: Did the best player win? This metric is simple,
but unfortunately not very useful, because for even
moderately-sized tournaments, the best player rarely wins.
* The 'depth' metric determines performance based on how deep each
player made it in the tournament. Specifically, the player ranking
is divided into groups according to a single-elimination bracket
(first, second, top four, top eight, top sixteen, etc..). Then
each player's expected placement is calculated based on which
group they fall into within the ideal ranking -- the
fifteenth-best player should place into the top sixteen. These
results are compared against the actual placement from simulation,
and the difference from depth for all players is added to produce
the final "distance from ideal".
* The '2^depth' metric is similar to the depth metric, however
before adding up all of the depth-differences, we first calculate
2^(delta)-1. This is done because, intuitively, it is more
significant if the first player is eliminated in the round of 64
than if the 33rd, 34th, 35th, and 36th players make it to the
round of 32, but the 'depth' metric calculates these as being
equally bad. Essentially, this metric exaggerates says that big
differences in depth are more important than many small
differences.
3. METHODOLOGY
Results are gathered by running simulations of one million tournaments
and averaging the results for each tournament. It is generally found
that the trends in each metric are reflected in the others, except for
the 'winner' metric, which is very sensitive to random factors and
sometimes fluctuates independently.
4. RESULTS
4.1 OVERVIEW
Because this discussion was inspired by MLG Dallas, the first result
to consider the overall performance of each tournament format in a
128-player, best-of-three tournament:
Format | Winner | Depth | 2^Depth
---------------+--------+-------+--------
Single | 0.91 | 52.09 | 110.07
Double | 0.88 | 48.31 | 89.83
DoubleExtended | 0.88 | 46.01 | 87.42
RoundRobin | 0.72 | 22.29 | 28.85
Note that these are distance metrics, so lower is always better. For
the 'winner' metric, this number indicates the fraction of the time
that the best player did not win. So, 1 - 'winner' is the chance of
the best player winning the entire tournament.
A slight improvement can be seen from using the extended series in the
depth metrics, however it is marginal compared to the large difference
between single elimination and round robin tournaments. These results
also indicate that double elimination does perform significantly
better than single elimination, however neither come close the
performance a round robin tournament.
4.2 VARYING NUMBER OF GAMES
We can also explore the effect on tournament outcomes when the number
of games in each series is varied. (In this case, the extended series
is also varied.) These results are graphed below.
These results all show pretty much what one would expect -- using more
games in each series improves the accuracy of the tournament
format. However, this also visually show that the elimination
tournaments all perform similarly, and none approach the accuracy of a
round robin tournament. The ordering of performance is very
consistent, however: round robin is best, followed by double
elimination with extended series, double elimination, and single
elimination.
The depth metric doesn't show much separation between the different
elimination tournament formats, but the winner and 2^depth metrics
both show significant separation between single and double elimination
formats. This indicates that the single elimination format produces
more big differences in outcome than the double elimination tournament.
That is, more often the best player does not win, and more often good
players don't make it as far as they should. In this respect, the
extended series seems to make very little difference.
4.3 VARYING NUMBER OF PLAYERS
In this section, we compare the effect on accuracy when changing the
number of players in the tournament. I have to break methodology here
a bit, because I don't have the time to wait for a million simulations
of a 512-player round robin tournament to finish. So instead, I
simulated fifty thousand simulations. Consequently, there is a little
more noise in these results.
These graphs don't show anything particularly revealing compared with
the last section, but they do confirm that the trends hold over a
variety of tournament sizes. Single elimination does worse than double
elimination formats, and round robin is much better than the
elimination formats. This is particularly true with large numbers of
players -- but in this range, it is an unfair comparison, because
round robin plays many more games. Most relevant to this post,
extended series seems to have minimal effect on results for large
numbers of players, particularly when considering 2^depth.
4.4 EFFECT OF EXTENDED SERIES
We now consider the effect of the extended series in
isolation. Specifically, how often is the extended series used, and
how often does is "correct an injustice" from the winners' bracket?
In this case, we consider a 64-player tournament in double elimination
with extended series format. In a standard double-elimination format,
127 matches will be played.
Simulation shows that, on average, 18.8 extended series will be played
in a 64-player tournament. This means that 15% of matches, on average,
will be rematches of players.
Similarly, of these 18.8 matches, 3.03 of them will result in
"corrections". A correction is when the better player loses in the
winners' bracket and wins the extended series to continue in the
tournament. In 2.17 of the matches, the worse player won in the
winners' bracket and won the extended series, meaning the extended
series failed to "correct" the result from the winners'
bracket.
The worst possible outcome is when the better play wins in the
winners' bracket and loses the extended series. The extended series
does well here, only introducing 0.55 such results per tournament, or
4% of the extended series.
Considering the disadvantage that the better player has when entering
the extended series, it does surprisingly well at correcting these
results, succeeding 58% of the time. At the same time, it only
introduces bad results 4% of the time.
I am tempted to conlude that extended series is successful at letting
the better player continue in the tournament, however data is missing
to compare against a standard double elimination tournament. A good
area of extension for this study would be measuring the outcome if a
regular best-of-three were done, and comparing its
correction/injustice rate to the extended series. The ratio from the
extended series (58%/4%) seems pretty hard to beat -- I would expect a
best-of-three to allow the better play to proceed more often, but have
a much higher injustice rate.
5. CONCLUSION
Whe considering individual matches, the extended series appears to
perform well to make sure the better player continues in the
tournament. In this sense, it fulfills its purpose.
But when looking at the larger picture, it appears that the extended
series has little effect on the outcome. While the extended series
rule does slightly improve outcomes, these differences are not
particularly significant compared to the overall double elimination
format.
What is clear from these results is that both elimination formats
leave much to be desired when compared to a round robin
tournament. Although round-robin is impractical due its large number
of games, other tournament formats such as swiss-style or those with
rounds play deserve further consideration.
Another future area of work is considering the performance of a
points-based system of several double elimination tournaments, like
MLG employs for its full Starcraft II season.
6. SEE ALSO
Wikipedia on tournament formats:
http://en.wikipedia.org/wiki/Single-elimination_tournament
http://en.wikipedia.org/wiki/Swiss_style_tournament
6.1 SOURCE CODE
The source code is available via git at:
git://github.com/nathanbeckmann/Tournament.git
It is written in Go. Have fun!
EDIT 1: Corrected problem with injustice rate. It is 4%, not 3%.
EDIT 2: Fix example in intro (corrected by Cyber_Cheese).