Statistical Analysis of Extended Series

nzb

United States, 41 Posts

November 12 2010 00:45 GMT

ABSTRACT

On the most recent State of the Game podcast, there was discussion of
MLG's extended series rule in their double elimination
tournament. This post explores the effects of the extended series rule
on tournament outcomes, using a simplified model of players and
tournaments. Several tournament formats are explored: round robin,
single elimination, double elimination, and double elimination with
extended series. Performance is measured by averaging over many
simulations, using several distance metrics from the 'ideal ranking'
of players. Results show a small but measurable improvement in
performance when using the extended series rule; with 64 players in a
best-of-three format, the 'best player' wins one percentage point more
often (25% compared to 24%) using the extended series rule than with
simple double elimination. However, the improvement from the extended series
rule is marginal compared to the overall tournament format; in
single-elimination, the best player wins 19% of the time, and in
round-robin the best player wins 47% of the time.

1. INTRODUCTION

Skip this section if you are familiar with the debate about MLG's extended series rule.

MLG is the largest Starcraft II tournament in North America, and
consequently its tournament format has a large impact on the
competitive scene. MLG employs a fairly standard double-elimination
tournament format, with each round determined by a best-of-three
series. However, MLG has an additional wrinkle called 'extended
series', which many people find counter-intuitive. To explain these
complexities, let's start with an overview of different tournament
formats.

A single elimination tournament is the simplest format that most
people are familiar with. Play proceeds in rounds, with all players
starting in the same round. Players are then paired in each round and
play a series. The winner proceeds to the next round, and the loser is
eliminated from the tournament. This format has the advantage of
determining a champion in very few rounds (O(log(# of players))), but
the disadvantage that bad luck can knock out good players at an early
stage.

To help with this problem, double elimination tournaments ensure that
any player must lose twice in order to be knocked out of the
tournament. This is done by having two brackets: 'winners' and
'losers'. All players begin in the winners' bracket, and after losing
once are sent to the losers' bracket. Players in the losers' bracket
play each other, as well as all players who join the losers' bracket
from the winners' bracket. Therefore, players in the losers' bracket
play twice as many series as those in the winners' bracket.

MLG has extended the double elimination format with an 'extended
series' rule that is invoked when players meet twice in a single
tournament. If players meet in the winners' bracket, and later again
in the losers' bracket, then instead of playing a new best-of-three
series, their series from the winners' bracket is resumed as a
best-of-seven series. Example: If Alice beats Bob 2-1 in the winners'
bracket, and they meet again in the losers' bracket, then they will
play a best-of-seven series to determine the winner with a starting
score 2-1 in favor of Alice. Alice has to win two games to proceed,
and Bob has to win three.
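
The arithmetic of resuming a series can be sketched in a few lines of
Go. This is only an illustration of the rule as described above; the
helper name winsNeeded is made up for this example and is not taken
from MLG's rules or from the simulation's source.

```go
package main

import "fmt"

// winsNeeded returns how many more game wins a player needs to take a
// best-of-n series, given the wins they carry into it.
// (Hypothetical helper, for illustration only.)
func winsNeeded(bestOf, carried int) int {
	target := bestOf/2 + 1 // e.g. 4 wins for a best-of-seven
	if carried >= target {
		return 0
	}
	return target - carried
}

func main() {
	// Alice beat Bob 2-1 in the winners' bracket; the rematch resumes
	// as a best-of-seven from that score.
	fmt.Println("Alice needs", winsNeeded(7, 2)) // Alice needs 2
	fmt.Println("Bob needs", winsNeeded(7, 1))   // Bob needs 3
}
```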

This rule is intended to avoid some paradoxical outcomes, as well as
statistically increase the likelihood that the 'better player'
continues in the tournament. It is possible in standard double
elimination for Alice to defeat Bob 2-0 in the winners', and Bob to
defeat Alice in the losers' 2-1. The "overall series" between Alice
and Bob is 3-2 in Alice's favor, but Bob continues and Alice does
not.

Similarly, another argument is that double elimination exists in order
to give better players a 'second chance' to continue in the tournament
when defeated by inferior players, but this logic does not apply when
the same players meet again. In this case, it makes more sense (so the
argument goes) to extend the series to determine the 'better player'.

Despite these arguments, the extended series has generated
controversy because in many instances the tournament setting is very
different when the series resumes, and many people find it
unentertaining and counter-intuitive.

In particular, the extended series between Liquid`Tyler and PainUser
at MLG Dallas demonstrates some of the problems. In their series in
the winners' bracket, Liquid`Tyler fell victim to a mistake of the
tournament organizers, and was forced to restart a game in which he
had a clear advantage. Liquid`Tyler subsequently lost the series 2-0, which
some have argued was due to the psychological effect of the game
restart. When they later met in the losers' bracket, Liquid`Tyler was
at a significant disadvantage, and lost the extended series 2-4, but
would have won a best-of-three.

This post is organized into several sections. Section 2 describes how
these results were gathered, and the various models used. Section 3
describes the experimental setup. Section 4 presents the
results. Section 5 concludes, and Section 6 shows where to follow up
on this if you are interested.

1.1 SCOPE

This post is an in-depth analysis of the statistical performance of
different tournament formats. It is not concerned with many other
important questions, for example:

* What is the purpose of tournaments, beyond determining skill of
players?

* Is the extended series rule entertaining?

* Is the extended series rule morally justified?

* Players aren't strictly 'better' or 'worse' than each other -- or,
at least, this relationship isn't transitive between players.

* The tournament setting can change when an extended series resumes.

These questions have been, and will continue to be, addressed elsewhere.

2. DESCRIPTION

This post explores the accuracy of several tournament formats,
focusing on the impact of the extended series rule. This is done using
simulation, running through many thousands of tournaments and
comparing the average results. This section describes the player
model, tournament model, and accuracy metrics used in the results.

2.1 PLAYER MODEL

Players are modeled using a simple randomized model. The goal is to
have players of greater or lesser skill, but have each player vary
somewhat in their performance. Players therefore consist of two
numbers: mean performance and deviation. Performance for a single
player is randomly generated each game, and lies in the range
[mean - dev, mean + dev].

The mean performance lies between 0 and 2, and the deviation is always
1. This ensures that the worst player can always beat the best player,
although at the extremes this is unlikely.

A player's performance is calculated as follows:

performance = mean + dev * rand^2 * plusminus

Where rand is a uniformly distributed number in [0,1] and plusminus is
selected from {-1,1} with even probability. Squaring rand concentrates
the mass of the probability distribution around the mean, making
the better players win more often.

To generate a set of players for a tournament, each player's mean is
selected uniformly from [0,2]. This is probably inaccurate -- players'
mean performance is likely distributed on a normal curve. The player
model is probably the biggest weakness in this study; however, I still
believe the first-order effects are well captured in the analysis.

2.2 TOURNAMENT MODEL

The rules for each tournament are faithfully replicated in the
simulation; however, there are some modelling choices here as well. The
most significant is the seeding of players in each tournament. I have
chosen to use the "ideal seeding", as determined by players' mean
performance, as the initial seeding for players. This removes a source
of inaccuracy from elimination tournaments, and so the results should
be taken as an upper bound for their performance.

Four tournament types are considered: single elimination, double
elimination, double elimination with extended series, and round
robin. The focus of this post is on the effect of extended series, but
single elimination and round robin are included in order to give some
context for these results.

A round robin tournament is one where every player plays every
other. Players are then ranked according to their number of wins. This
tournament produces a complete ranking, first through last, and
because everyone plays everyone, it is very accurate. The downside is
that it requires a lot of games (O((# of players)^2)) and is less exciting
than other tournament formats. However, because it is so accurate, it
can be used to calibrate the accuracy of elimination tournaments by
showing a "speed of light" for tournament efficacy.
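
To make the cost difference concrete, here is a small Go sketch that
counts the number of series each format needs. The function names are
made up for this example. The counts follow from simple arguments:
in a round robin every pair meets once, in single elimination every
series eliminates exactly one player, and in double elimination every
series hands out one loss (two per eliminated player, plus at most one
for the champion).

```go
package main

import "fmt"

// roundRobinSeries: every pair of players meets once.
func roundRobinSeries(n int) int { return n * (n - 1) / 2 }

// singleElimSeries: each series eliminates exactly one player,
// and everyone but the champion is eliminated.
func singleElimSeries(n int) int { return n - 1 }

// doubleElimSeries returns the fewest and most series possible:
// each eliminated player loses twice, and the champion loses
// either zero times or once.
func doubleElimSeries(n int) (lo, hi int) {
	return 2*n - 2, 2*n - 1
}

func main() {
	for _, n := range []int{64, 128, 512} {
		lo, hi := doubleElimSeries(n)
		fmt.Printf("%3d players: single %d, double %d-%d, round robin %d\n",
			n, singleElimSeries(n), lo, hi, roundRobinSeries(n))
	}
}
```

For 128 players this gives 127 series for single elimination, 254 or
255 for double elimination, and 8128 for round robin.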

Similarly, single elimination tournaments show the other end of the
spectrum. They are very fickle in their results, and give a baseline
for judging how much of an improvement the extended series rule makes
over standard double elimination.

2.3 MEASURING ACCURACY

One of the principal challenges is determining how to measure
performance of a tournament -- how can we say that one tournament is
"better" than another? The approach taken is to have each tournament
produce a ranking of players, first through last, and compare this
ranking to the ideal ranking, as determined by players' mean
performance.

This produces its own challenges, as elimination tournaments do not
strictly produce a ranking. However, taking seeding into account, an
elimination tournament does sort players into categories based on how
far they made it through the tournament. The ranking of players is
determined as players are eliminated from the tournament -- first
eliminated places last, and so on.

Three metrics are used to measure performance: winner, depth, and
2^depth.

* The 'winner' metric determines performance based on a very simple,
intuitive rule: Did the best player win? This metric is simple,
but unfortunately not very useful, because for even
moderately-sized tournaments, the best player rarely wins.

* The 'depth' metric determines performance based on how deep each
player made it in the tournament. Specifically, the player ranking
is divided into groups according to a single-elimination bracket
(first, second, top four, top eight, top sixteen, etc.). Then
each player's expected placement is calculated based on which
group they fall into within the ideal ranking -- the
fifteenth-best player should place into the top sixteen. These
results are compared against the actual placement from simulation,
and the differences in depth for all players are summed to
produce the final "distance from ideal".

* The '2^depth' metric is similar to the depth metric, except that
before adding up all of the depth-differences, we first calculate
2^(delta)-1 for each one. This is done because, intuitively, it is
more significant if the first player is eliminated in the round of
64 than if the 33rd, 34th, 35th, and 36th players make it to the
round of 32, but the 'depth' metric calculates these as being
equally bad. Essentially, this metric says that big differences in
depth are more important than many small differences.
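
The three metrics above can be sketched in Go as follows. This is one
reading of the descriptions, not necessarily the exact code in the
source: it assumes delta is the absolute difference between a player's
ideal group and actual group, and the names depthGroup and distances
are invented for the example.

```go
package main

import (
	"fmt"
	"math"
)

// depthGroup maps a 1-indexed rank to its bracket group:
// 0 = first, 1 = second, 2 = top four, 3 = top eight, and so on.
func depthGroup(rank int) int {
	g := 0
	for size := 1; size < rank; size *= 2 {
		g++
	}
	return g
}

// distances computes the three metrics for one tournament.
// actual[i] is the finishing rank (1-indexed) of the player whose
// ideal rank is i+1. Assumption: delta is an absolute difference.
func distances(actual []int) (winner, depth, pow float64) {
	if len(actual) > 0 && actual[0] != 1 {
		winner = 1 // the best player did not win
	}
	for i, a := range actual {
		delta := math.Abs(float64(depthGroup(i+1) - depthGroup(a)))
		depth += delta
		pow += math.Pow(2, delta) - 1
	}
	return winner, depth, pow
}

func main() {
	// Four players; the best and the fourth-best swap places.
	w, d, p := distances([]int{4, 2, 3, 1})
	fmt.Println(w, d, p) // 1 4 6
}
```

In the example, two players are each off by two groups, so the depth
distance is 4 and the 2^depth distance is (2^2-1) + (2^2-1) = 6.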

3. METHODOLOGY

Results are gathered by running simulations of one million tournaments
and averaging the results for each tournament format. It is generally found
that the trends in each metric are reflected in the others, except for
the 'winner' metric, which is very sensitive to random factors and
sometimes fluctuates independently.

4. RESULTS

4.1 OVERVIEW

Because this discussion was inspired by MLG Dallas, the first result
to consider is the overall performance of each tournament format in a
128-player, best-of-three tournament:

Format         | Winner | Depth | 2^Depth
---------------+--------+-------+--------
Single         |   0.91 | 52.09 |  110.07
Double         |   0.88 | 48.31 |   89.83
DoubleExtended |   0.88 | 46.01 |   87.42
RoundRobin     |   0.72 | 22.29 |   28.85

Note that these are distance metrics, so lower is always better. For
the 'winner' metric, this number indicates the fraction of the time
that the best player did not win. So, 1 - 'winner' is the chance of
the best player winning the entire tournament.

A slight improvement can be seen from using the extended series in the
depth metrics; however, it is marginal compared to the large difference
between single elimination and round robin tournaments. These results
also indicate that double elimination does perform significantly
better than single elimination, but neither comes close to the
performance of a round robin tournament.

4.2 VARYING NUMBER OF GAMES

We can also explore the effect on tournament outcomes when the number
of games in each series is varied. (In this case, the extended series
is also varied.) These results are graphed below.

[Figure: each metric plotted against the number of games per series, for each tournament format]

These results all show pretty much what one would expect -- using more
games in each series improves the accuracy of the tournament
format. However, the graphs also show that the elimination
tournaments all perform similarly, and none approach the accuracy of a
round robin tournament. The ordering of performance is very
consistent, however: round robin is best, followed by double
elimination with extended series, double elimination, and single
elimination.

The depth metric doesn't show much separation between the different
elimination tournament formats, but the winner and 2^depth metrics
both show significant separation between single and double elimination
formats. This indicates that the single elimination format produces
more big differences in outcome than the double elimination tournament.
That is, more often the best player does not win, and more often good
players don't make it as far as they should. In this respect, the
extended series seems to make very little difference.

4.3 VARYING NUMBER OF PLAYERS

In this section, we compare the effect on accuracy when changing the
number of players in the tournament. I have to break methodology here
a bit, because I don't have the time to wait for a million simulations
of a 512-player round robin tournament to finish. So instead, I
ran fifty thousand simulations. Consequently, there is a little
more noise in these results.

[Figure: each metric plotted against the number of players, for each tournament format]

These graphs don't show anything particularly revealing compared with
the last section, but they do confirm that the trends hold over a
variety of tournament sizes. Single elimination does worse than double
elimination formats, and round robin is much better than the
elimination formats. This is particularly true with large numbers of
players -- but in this range, it is an unfair comparison, because
round robin plays many more games. Most relevant to this post,
extended series seems to have minimal effect on results for large
numbers of players, particularly when considering 2^depth.

4.4 EFFECT OF EXTENDED SERIES

We now consider the effect of the extended series in
isolation. Specifically, how often is the extended series used, and
how often does it "correct an injustice" from the winners' bracket?

In this case, we consider a 64-player tournament in double elimination
with extended series format. In a standard double-elimination format,
127 matches will be played.

Simulation shows that, on average, 18.8 extended series will be played
in a 64-player tournament. This means that 15% of matches, on average,
will be rematches of players.

Similarly, of these 18.8 matches, 3.03 of them will result in
"corrections". A correction is when the better player loses in the
winners' bracket and wins the extended series to continue in the
tournament. In 2.17 of the matches, the worse player won in the
winners' bracket and won the extended series, meaning the extended
series failed to "correct" the result from the winners'
bracket.

The worst possible outcome is when the better player wins in the
winners' bracket and loses the extended series. The extended series
does well here, only introducing 0.55 such results per tournament, or
4% of the extended series.

Considering the disadvantage that the better player has when entering
the extended series, it does surprisingly well at correcting these
results, succeeding 58% of the time. At the same time, it only
introduces bad results 4% of the time.
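
For readers who want to check the percentages, the arithmetic can be
reproduced from the averages quoted in this section. One caveat: the
4% figure appears to be measured against the extended series in which
the better player won the first meeting (18.8 - 3.03 - 2.17 = 13.6 per
tournament). That denominator is an inference from the numbers, not
something stated explicitly; dividing by all 18.8 extended series
gives roughly 3% instead. The helper name share is made up for this
sketch.

```go
package main

import "fmt"

// share returns part as a percentage of whole.
func share(part, whole float64) float64 {
	return 100 * part / whole
}

func main() {
	// Per-tournament averages reported for 64 players.
	const (
		matches     = 127.0
		extended    = 18.8
		corrected   = 3.03 // better player lost first, won the extended series
		uncorrected = 2.17 // worse player won first and won the extended series
		injustices  = 0.55 // better player won first, lost the extended series
	)

	// Inferred denominator: series where the better player won first.
	betterWonFirst := extended - corrected - uncorrected

	fmt.Printf("rematches:   %.1f%% of matches\n", share(extended, matches))                  // about 14.8
	fmt.Printf("corrections: %.1f%%\n", share(corrected, corrected+uncorrected))              // about 58.3
	fmt.Printf("injustices:  %.1f%%\n", share(injustices, betterWonFirst))                    // about 4.0
	fmt.Printf("injustices:  %.1f%% of all extended series\n", share(injustices, extended)) // about 2.9
}
```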

I am tempted to conclude that extended series is successful at letting
the better player continue in the tournament, however data is missing
to compare against a standard double elimination tournament. A good
area of extension for this study would be measuring the outcome if a
regular best-of-three were done, and comparing its
correction/injustice rate to the extended series. The ratio from the
extended series (58%/4%) seems pretty hard to beat -- I would expect a
best-of-three to allow the better player to proceed more often, but have
a much higher injustice rate.
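
As a rough first look at that comparison, the chance of winning a
series can be computed exactly under a much simpler model than the one
simulated in this post: each game is won independently with a fixed
probability p. Both that model and the value p = 0.6 are assumptions
for illustration only, and the function name raceWin is invented here.

```go
package main

import "fmt"

// raceWin returns the probability that a player who wins each game
// independently with probability p collects `need` more wins before
// the opponent collects `oppNeed` more wins.
func raceWin(p float64, need, oppNeed int) float64 {
	if need == 0 {
		return 1
	}
	if oppNeed == 0 {
		return 0
	}
	return p*raceWin(p, need-1, oppNeed) + (1-p)*raceWin(p, need, oppNeed-1)
}

func main() {
	p := 0.6 // hypothetical per-game win chance of the better player

	// The better player lost the first series and wants a "correction".
	fmt.Printf("fresh best-of-three:       %.3f\n", raceWin(p, 2, 2))
	fmt.Printf("extended series, down 1-2: %.3f\n", raceWin(p, 3, 2))
	fmt.Printf("extended series, down 0-2: %.3f\n", raceWin(p, 4, 2))

	// The better player won the first series 2-0; chance of an
	// "injustice" (the worse player wins the rematch).
	fmt.Printf("injustice, fresh best-of-three: %.3f\n", raceWin(1-p, 2, 2))
	fmt.Printf("injustice, extended from 0-2:   %.3f\n", raceWin(1-p, 4, 2))
}
```

With p = 0.6 the better player wins a fresh best-of-three about 65% of
the time, but only about 48% (from 1-2) or 34% (from 0-2) of extended
series. In the other direction, the worse player wins a fresh
best-of-three about 35% of the time, but only about 9% of extended
series when starting down 0-2. Under this simplified model, that
matches the expectation above: a best-of-three corrects more often but
also introduces far more injustices.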

5. CONCLUSION

When considering individual matches, the extended series appears to
perform well to make sure the better player continues in the
tournament. In this sense, it fulfills its purpose.

But when looking at the larger picture, it appears that the extended
series has little effect on the outcome. While the extended series
rule does slightly improve outcomes, these differences are not
particularly significant compared to the overall double elimination
format.

What is clear from these results is that both elimination formats
leave much to be desired when compared to a round robin
tournament. Although round-robin is impractical due to its large number
of games, other tournament formats such as Swiss-style or those with
rounds play deserve further consideration.

Another future area of work is considering the performance of a
points-based system of several double elimination tournaments, like
MLG employs for its full Starcraft II season.

6. SEE ALSO

Wikipedia on tournament formats:
http://en.wikipedia.org/wiki/Single-elimination_tournament
http://en.wikipedia.org/wiki/Swiss_style_tournament

6.1 SOURCE CODE

The source code is available via git at:

git://github.com/nathanbeckmann/Tournament.git

It is written in Go. Have fun!

EDIT 1: Corrected problem with injustice rate. It is 4%, not 3%.

EDIT 2: Fix example in intro (corrected by Cyber_Cheese).

Durn

Canada, 360 Posts

November 12 2010 01:00 GMT

I think IdrA summed it up quite well in the State of the Game. Statistics aside, it goes like this hypothetical they used:

IdrA makes a stupid mistake and gets knocked out by NoNy in an early round. 3 rounds later, NoNy makes a silly mistake that idrA wouldn't have made. They meet in the losers bracket, they've both made silly mistakes that the other one wouldn't have made. Why should IdrA be penalized?

nzb

United States, 41 Posts

November 12 2010 01:03 GMT

On November 12 2010 10:00 Durn wrote:
I think IdrA summed it up quite well in the State of the Game. Statistics aside, it goes like this hypothetical they used:

IdrA makes a stupid mistake and gets knocked out by NoNy in an early round. 3 rounds later, NoNy makes a silly mistake that idrA wouldn't have made. They meet in the losers bracket, they've both made silly mistakes that the other one wouldn't have made. Why should IdrA be penalized?

I agree. Even more interesting, let's say (hypothetically) that...

IdrA > Tyler
Tyler > SeleCT
SeleCT > IdrA

There is no "best player" in this group, and now their seeding basically determines who faces who first, and therefore which of them has an advantage in the extended series.

I'd call this one of those things that falls outside the scope of my post.

randplaty

205 Posts

November 12 2010 01:03 GMT

Awesome, awesome study. Thanks for the hard work. Good to know that extended series does have some value... although minimal.

Shakes

Australia, 557 Posts

November 12 2010 01:07 GMT

IdrA's argument is one that has been explicitly excluded from the scope of this analysis (that the "better" player might not be transitive).

Durn

Canada, 360 Posts

November 12 2010 01:08 GMT

I just took a closer look at all your work, and that's actually really awesome. The statistics do make sense when put out in such an organized manner.

I appreciate your hard work, I hope this will get some eyes from MLG haters. I still disagree with it at the core of its concept, but in terms of your statistics, the math points in the right direction.

vohne

Philippines, 197 Posts

November 12 2010 01:13 GMT

In a higher level arena the better player isn't always transitive. That is because there are too many variables that must be taken into consideration, such as race matchups, maps, player conditioning, etc.

Dragar

United Kingdom, 971 Posts

November 12 2010 01:19 GMT

Is it possible to rephrase the question to not assume that the better player is transitive? So that the goal is not to determine the 'best' player, but rather to minimise the effect of matchup ordering, etc?

Nayl

Canada, 413 Posts

November 12 2010 01:22 GMT

IdrA's argument is irrelevant to the actual statistics or logic in the argument.

Extended series exists to make the contest between two players fairer; how these players fare against a third player has no effect.

Also, in his argument, how does he know he wouldn't have made a stupid mistake if he were to advance over nony?

paralleluniverse

4065 Posts

November 12 2010 01:23 GMT

#10

On November 12 2010 10:07 Shakes wrote:

IdrA's argument is one that has been explicitly excluded from the scope of this analysis (that the "better" player might not be transitive).

Not really.

The nontransitivity is taken into account, since performance was modeled as a mean plus or minus a random number. That allows for the possibility that player A will beat player B, player B beats player C, and player C beats player A.

nzb

United States, 41 Posts

November 12 2010 01:26 GMT

#11

On November 12 2010 10:19 Dragar wrote:
Is it possible to rephrase the question to not assume that the better player is transitive? So that the goal is not to determine the 'best' player, but rather to minimise the effect of matchup ordering, etc?

This is definitely possible, you would need some kind of relation for each player to every other. The problem with this is you would end up with a lot of choices in terms of modeling -- because the relationship, while not perfectly transitive, is pretty close. (That is, although the cream of the crop might be extremely intransitive, they are definitely better than most of the other players). Therefore the relation you come up with shouldn't be completely random. This kind of data would probably have to be pulled from actual player statistics, which would actually be a huge improvement to the study overall.

But until that happens, I think keeping it simple is better because you avoid a lot of complexities that don't necessarily improve the results.

rasnj

United States, 1959 Posts

November 12 2010 01:26 GMT

#12

What exactly would be the goal then? I thought about doing this kind of analysis myself, but decided that I couldn't formulate exactly what I wanted the tournament system to accomplish without imposing a total order on the skill levels of the players, and I considered this too far from reality to bother. If you can clearly express the goal of your tournament and a way to determine how far a given ranking is from that goal, then we can probably do some analysis.

zulu_nation8

China, 26351 Posts

November 12 2010 01:26 GMT

#13

I think your study would only be meaningful if people actually assumed a bo7 series does not determine the best player as well as a bo3 series.

nzb

United States, 41 Posts

November 12 2010 01:27 GMT

#14

On November 12 2010 10:23 paralleluniverse wrote:

[quoted post collapsed]

In this sense, the intransitivity is a random fluctuation, and if you played a long enough series you would expect it to go away.

But in reality, there probably are cases of "true intransitivity", where people's play styles match up in weird ways so that A > B, B > C, and C > A.

nzb

United States, 41 Posts

November 12 2010 01:30 GMT

#15

On November 12 2010 10:26 rasnj wrote:

[quoted post collapsed]

Although reality isn't exactly transitive, it is pretty close.

That is, you can be pretty confident saying that IdrA > Gretorp > HDstarcraft (random names, don't take offense). So although there are players near each player's skill that confuse the issue slightly, the large-scale picture is still pretty clear because there is actually some order.

nzb

United States, 41 Posts

November 12 2010 01:32 GMT

#16

On November 12 2010 10:26 zulu_nation8 wrote:
I think your study would only be meaningful if people actually assumed a bo7 series does not determine the best player as well as a bo3 series.

I'm not really sure what you are responding to ...

The point of this is to determine exactly how much of an effect extended series has, both for individual matches and for an entire tournament. I'm pretty sure I haven't seen anyone talk about this with real numbers to back up what they are saying.

paralleluniverse

4065 Posts

November 12 2010 01:32 GMT

#17

On November 12 2010 10:27 nzb wrote:

[quoted post collapsed]

But these *are* random fluctuations in real life. If A > B > C, we would expect that A will beat B will beat C most of the time, and on some few random occasions for this not to hold. I think your model captures this fact well.

Although I wonder why you used such an archaic setup to simulate player performance instead of just simulating from a normal distribution, which can be done in 1 line in any statistical package, and would probably be more correct.

Nayl

Canada, 413 Posts

November 12 2010 01:34 GMT

#18

On November 12 2010 10:30 nzb wrote:

[quoted post collapsed]

Well non-transitivity can occur especially if you are comparing between a non-team mate and 2 team mates.

Incontrol might be better than machine because he knows his teammate well, but machine might be better than Painuser but Painuser is better than Incontrol. (random names)

So it's not necessarily clear in reality. =/

nzb

United States, 41 Posts

November 12 2010 01:34 GMT

#19

On November 12 2010 10:32 paralleluniverse wrote:

But these *are* random fluctuations in real life. If A > B > C, we would expect that A will beat B will beat C most of the time, and on some random occasions for this not to hold. I think your model captures this fact well.

Although I wonder why you used such an archaic setup to simulate player performance instead of just simulating from a normal distribution, which can be done in 1 line in any statistical package, and would probably be more correct.

Haha, touché. The reason is that I did this in order to have something fun to code in Go, which I've wanted to learn for a while, so doing it in Mathematica or R or something would have defeated my purpose.

MannerMan

371 Posts

November 12 2010 01:34 GMT

#20

Here's a blog I wrote on the same subject the other day.
http://www.teamliquid.net/blogs/viewblog.php?id=168168

It is a bit shorter and less in depth, and the scope is only the difference between separate Bo3s vs an extended series Bo7.
