On the most recent State of the Game podcast, there was discussion of

MLG's extended series rule in their double elimination

tournament. This post explores the effects of the extended series rule

on tournament outcomes, using a simplified model of players and

tournaments. Several tournament formats are explored: round robin,

single elimination, double elimination, and double elimination with

extended series. Performance is measured by averaging over many

simulations, using several distance metrics from the 'ideal ranking'

of players. Results show a small but measurable improvement in

performance when using the extended series rule; with 64 players in a

best-of-three format, the 'best player' wins 1% more often (25%

compared to 24%) using the extended series rule than with simple

double elimination. However, the improvement from the extended series

rule is marginal compared to the overall tournament format; in

single-elimination, the best player wins 19% of the time, and in

round-robin the best player wins 47% of the time.

1. INTRODUCTION

*Skip this section if you are familiar with the debate about MLG's extended series rule.*

MLG is the largest Starcraft II tournament in North America, and

consequently its tournament format has a large impact on the

competitive scene. MLG employs a fairly standard double-elimination

tournament format, with each round determined by a best-of-three

series. However, MLG has an additional wrinkle called 'extended

series', which many people find counter-intuitive. To explain these

complexities, let's start with an overview of different tournament

formats.

A single elimination tournament is the simplest format that most

people are familiar with. Play proceeds in rounds, with all players

starting in the same round. Players are then paired in each round and

play a series. The winner proceeds to the next round, and the loser is

eliminated from the tournament. This format has the advantage of

determining a champion in very few games (O(log(# of players))), but

the disadvantage that bad luck can knock out good players at an early

stage.

To help with this problem, double elimination tournaments ensure that

any player must lose twice in order to knocked out of the

tournament. This is done by having two brackets: 'winners' and

'losers'. All players begin in the winners' bracket, and after losing

once are sent to the losers'. Players in the loser's bracket play each

other, as well as all players who join the losers' bracket from the

winner's bracket. Therefore, players in the losers' bracket play

twice as many series as those in the winners'.

MLG has extended the double elimination format with an 'extended

series' rule that is invoked when players meet twice in a single

tournament. If players meet in the winners' bracket, and later again

in the losers' bracket, then instead of playing a new best-of-three

series, their series from the winners' bracket is resumed as a

best-of-seven series. Example: If Alice beats Bob 2-1 in the winners'

bracket, and they meet again in the losers' bracket, then they will

play a best-of-seven series to determine the winner with a starting

score 2-1 in favor of Alice. Alice has to win two games to proceed,

and Bob has to win three.

This rule is intended to avoid some paradoxical outcomes, as well as

statistically increase the likelihood that the 'better player'

continues in the tournament. It is possible in standard double

elimination for Alice to defeat Bob 2-0 in the winners', and Bob to

defeat Alice in the losers' 2-1. The "overall series" between Alice

and Bob is 3-2 in Alice's favor, but Bob continues and Alice does

not.

Similarly, another argument is that double elimination exists in order

to give better players a 'second chance' to continue in the tournament

when defeated by inferior players, but this logic does not apply when

the same players meet again. In this case, it makes more sense (so the

argument goes) to extend the series to determine the 'better player'.

Despite these arguments, the extended series has generated

controversy because in many instances the tournament setting is very

different when the series resumes, and many people find it

unentertaining and counter-intuitive.

In particular, the extended series between Liquid`Tyler and PainUser

at MLG Dallas demonstrates some of the problems. In their series in

the winners' bracket, Liquid`Tyler fell victim to a mistake of the

tournament organizers, and was forced to restart a game that he had a

clear advantage. Liquid`Tyler subsequently lost the series 2-0, which

some have argued was due to the psychological effect of the game

restart. When they later met in the losers' bracket, Liquid`Tyler was

at a significant disadvantage, and lost the extended series 2-4, but

would have won a best-of-three.

This post is organized into several sections. Section 2 describes how

these results were gathered, and the various models used. Section 3

describes the experimental setup. Section 4 presents the

results. Section 5 concludes, and Section 6 shows where to follow up

on this if you are interested.

1.1 SCOPE

This post is an in-depth analysis of the statistical performance of

different tournament formats. It is not concerned with many other

important questions, for example:

* What is the purpose of tournaments, beyond determining skill of

players?

* Is the extended series rule entertaining?

* Is the extended series rule morally justified?

* Players aren't strictly 'better' or 'worse' than each other -- or,

at least, this relationship isn't transitive between players.

* The tournament setting can change when an extended series resumes.

These questions will and have been addressed elsewhere.

2. DESCRIPTION

This post explores the accuracy of several tournament formats,

focusing on the impact of the extended series rule. This is done using

simulation, running through many thousands of tournaments and

comparing the average results. This section describes the player

model, tournament model, and accuracy metrics used in the results.

2.1 PLAYER MODEL

Players are modeled using a simple randomized model. The goal is to

have players of greater or lesser skill, but have each player vary

somewhat in their performance. Players therefore consist of two

numbers: mean performance and deviation. Performance for a single

player is randomly generated each game, and lies in the range

[mean - dev, mean + dev].

The mean performance lies between 0 and 2, and the deviation is always

1. This ensures that the worst player can always beat the best player,

however at the extremes this is unlikely.

A players performance is calculated as follows:

performance = mean + dev * rand^2 * plusminus

Where rand is a uniformly-distributed number in [0,1] and plusminus is

seleted from {-1,1} with even probability. This formula makes the mass

of the probability distributed concentrated around the mean, making

the better players win more often.

To generate a set of players for a tournament, each player's mean is

selected uniformly from [0,2]. This is probably inaccurate -- player's

mean performance is likely distributed on a normal curve. The player

model is probably the biggest weakness in this study, however I still

believe the first-order effects are well captured in the analysis.

2.1 TOURNAMENT MODEL

The rules for each tournament are faithfully replicated in the

simulation, however there are some modelling choices here as well. The

most significant is the seeding of players in each tournament. I have

chosen to use the "ideal seeding", as determined by players' mean

performance, as the initial seeding for players. This removes a source

of inaccuracy from elimination tournaments, and so the results should

be taken as an upper bound for their performance.

Four tournament types are considered: single elimination, double

elimination, double elimination with extended series, and round

robin. The focus of this post is on the effect of extended series, but

single elimination and round robin are included in order to give some

context for these results.

A round robin tournament is one where every player plays every

other. Players are then ranked according to their number of wins. This

tournament produces a complete ranking, first through last, and

because everyone plays everyone, it is very accurate. The down side is

that it requires a lot of games (O(# players)) and is less exciting

than other tournament formats. However, because it is so accurate, it

can be used to calibrate the accuracy of elimination tournaments by

showing a "speed of light" for tournament efficacy.

Similarly, single elimination tournaments show the other end of the

spectrum. They are very fickle in their results, and show relatively

how much of an improvement the extended series rule makes over

standard double elimination.

2.2 MEASURING ACCURACY

One of the principle challenges is determining how to measure

performance of a tournament -- how can we say that one tournament is

"better" than another? The approach taken is to have each tournament

produce a ranking of players, first through last, and compare this

ranking to the ideal ranking, as determined by players' mean

performance.

This produces its own challenges, as elimination tournaments do not

strictly produce a ranking. However, taking seeding into account, an

elimination tournament does sort players into categories based on how

far they made it through the tournament. The ranking of players is

determined as players are eliminated from the tournament -- first

eliminated places last, and so on.

Three metrics are used to measure performance: winner, depth, and

2^depth.

* The 'winner' metric determines performance based on a very simple,

intuitive rule: Did the best player win? This metric is simple,

but unfortunately not very useful, because for even

moderately-sized tournaments, the best player rarely wins.

* The 'depth' metric determines performance based on how deep each

player made it in the tournament. Specifically, the player ranking

is divided into groups according to a single-elimination bracket

(first, second, top four, top eight, top sixteen, etc..). Then

each player's expected placement is calculated based on which

group they fall into within the ideal ranking -- the

fifteenth-best player should place into the top sixteen. These

results are compared against the actual placement from simulation,

and the difference from depth for all players is added to produce

the final "distance from ideal".

* The '2^depth' metric is similar to the depth metric, however

before adding up all of the depth-differences, we first calculate

2^(delta)-1. This is done because, intuitively, it is more

significant if the first player is eliminated in the round of 64

than if the 33rd, 34th, 35th, and 36th players make it to the

round of 32, but the 'depth' metric calculates these as being

equally bad. Essentially, this metric exaggerates says that big

differences in depth are more important than many small

differences.

3. METHODOLOGY

Results are gathered by running simulations of one million tournaments

and averaging the results for each tournament. It is generally found

that the trends in each metric are reflected in the others, except for

the 'winner' metric, which is very sensitive to random factors and

sometimes fluctuates independently.

4. RESULTS

4.1 OVERVIEW

Because this discussion was inspired by MLG Dallas, the first result

to consider the overall performance of each tournament format in a

128-player, best-of-three tournament:

Format | Winner | Depth | 2^Depth

---------------+--------+-------+--------

Single | 0.91 | 52.09 | 110.07

Double | 0.88 | 48.31 | 89.83

DoubleExtended | 0.88 | 46.01 | 87.42

RoundRobin | 0.72 | 22.29 | 28.85

Note that these are distance metrics, so lower is always better. For

the 'winner' metric, this number indicates the fraction of the time

that the best player did not win. So, 1 - 'winner' is the chance of

the best player winning the entire tournament.

A slight improvement can be seen from using the extended series in the

depth metrics, however it is marginal compared to the large difference

between single elimination and round robin tournaments. These results

also indicate that double elimination does perform significantly

better than single elimination, however neither come close the

performance a round robin tournament.

4.2 VARYING NUMBER OF GAMES

We can also explore the effect on tournament outcomes when the number

of games in each series is varied. (In this case, the extended series

is also varied.) These results are graphed below.

These results all show pretty much what one would expect -- using more

games in each series improves the accuracy of the tournament

format. However, this also visually show that the elimination

tournaments all perform similarly, and none approach the accuracy of a

round robin tournament. The ordering of performance is very

consistent, however: round robin is best, followed by double

elimination with extended series, double elimination, and single

elimination.

The depth metric doesn't show much separation between the different

elimination tournament formats, but the winner and 2^depth metrics

both show significant separation between single and double elimination

formats. This indicates that the single elimination format produces

more big differences in outcome than the double elimination tournament.

That is, more often the best player does not win, and more often good

players don't make it as far as they should. In this respect, the

extended series seems to make very little difference.

4.3 VARYING NUMBER OF PLAYERS

In this section, we compare the effect on accuracy when changing the

number of players in the tournament. I have to break methodology here

a bit, because I don't have the time to wait for a million simulations

of a 512-player round robin tournament to finish. So instead, I

simulated fifty thousand simulations. Consequently, there is a little

more noise in these results.

These graphs don't show anything particularly revealing compared with

the last section, but they do confirm that the trends hold over a

variety of tournament sizes. Single elimination does worse than double

elimination formats, and round robin is much better than the

elimination formats. This is particularly true with large numbers of

players -- but in this range, it is an unfair comparison, because

round robin plays many more games. Most relevant to this post,

extended series seems to have minimal effect on results for large

numbers of players, particularly when considering 2^depth.

4.4 EFFECT OF EXTENDED SERIES

We now consider the effect of the extended series in

isolation. Specifically, how often is the extended series used, and

how often does is "correct an injustice" from the winners' bracket?

In this case, we consider a 64-player tournament in double elimination

with extended series format. In a standard double-elimination format,

127 matches will be played.

Simulation shows that, on average, 18.8 extended series will be played

in a 64-player tournament. This means that 15% of matches, on average,

will be rematches of players.

Similarly, of these 18.8 matches, 3.03 of them will result in

"corrections". A correction is when the better player loses in the

winners' bracket and wins the extended series to continue in the

tournament. In 2.17 of the matches, the worse player won in the

winners' bracket and won the extended series, meaning the extended

series failed to "correct" the result from the winners'

bracket.

The worst possible outcome is when the better play wins in the

winners' bracket and loses the extended series. The extended series

does well here, only introducing 0.55 such results per tournament, or

4% of the extended series.

Considering the disadvantage that the better player has when entering

the extended series, it does surprisingly well at correcting these

results, succeeding 58% of the time. At the same time, it only

introduces bad results 4% of the time.

I am tempted to conlude that extended series is successful at letting

the better player continue in the tournament, however data is missing

to compare against a standard double elimination tournament. A good

area of extension for this study would be measuring the outcome if a

regular best-of-three were done, and comparing its

correction/injustice rate to the extended series. The ratio from the

extended series (58%/4%) seems pretty hard to beat -- I would expect a

best-of-three to allow the better play to proceed more often, but have

a much higher injustice rate.

5. CONCLUSION

Whe considering individual matches, the extended series appears to

perform well to make sure the better player continues in the

tournament. In this sense, it fulfills its purpose.

But when looking at the larger picture, it appears that the extended

series has little effect on the outcome. While the extended series

rule does slightly improve outcomes, these differences are not

particularly significant compared to the overall double elimination

format.

What is clear from these results is that both elimination formats

leave much to be desired when compared to a round robin

tournament. Although round-robin is impractical due its large number

of games, other tournament formats such as swiss-style or those with

rounds play deserve further consideration.

Another future area of work is considering the performance of a

points-based system of several double elimination tournaments, like

MLG employs for its full Starcraft II season.

6. SEE ALSO

Wikipedia on tournament formats:

http://en.wikipedia.org/wiki/Single-elimination_tournament

http://en.wikipedia.org/wiki/Swiss_style_tournament

6.1 SOURCE CODE

The source code is available via git at:

git://github.com/nathanbeckmann/Tournament.git

It is written in Go. Have fun!

EDIT 1: Corrected problem with injustice rate. It is 4%, not 3%.

EDIT 2: Fix example in intro (corrected by Cyber_Cheese).