For those of you unfamiliar with ELO, it's a rating system used widely in chess and occasionally in other sports. Basically, it works pretty much like the SC2 ladder, in that each player gets a skill rating (like ladder points), and when you win/lose your rating goes up or down based on the rating of your opponent. If you're interested in how this works, check out this article for the technical details.
One of ELO's main features is the limited amount of information needed to calculate it. All you need to update ELO scores are the players' current ratings and the outcomes of the new games. This has some nice mathematical properties (see Markov chain), but mainly its advantage is logistical. Chess has no easily measurable quantities within games (unlike most sports); instead, the only data available are wins and losses. In addition, the calculations are fairly simple, and can be done on any basic calculator.
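To make the "basic calculator" point concrete, here's a minimal sketch of a single-game update in R, using the standard logistic expected-score formula (the function name and the K value of 20 are just placeholders for illustration; the full rating code I actually use is in the spoiler at the end of the post):

#One ELO update for a single game: each rating moves by K times (actual score - expected score)
updateELO <- function(winner_rating, loser_rating, K = 20){
  qw <- 10 ^ (winner_rating / 400)
  ql <- 10 ^ (loser_rating / 400)
  expected_w <- qw / (qw + ql)   #expected score (win probability) for the eventual winner
  expected_l <- ql / (qw + ql)   #and for the eventual loser
  c(winner = winner_rating + K * (1 - expected_w),
    loser  = loser_rating  + K * (0 - expected_l))
}
updateELO(2000, 2000)   #an even match: the winner gains 10 points, the loser drops 10
updateELO(2200, 2000)   #a heavy favourite gains far less for the same win (about 5 points)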
This advantage, which was so helpful for chess in 1960, has become somewhat of a hindrance in today's environment. The memoryless properties of ELO mean that every time you update ratings, you end up throwing away a good amount of information. With modern computing, we have the power to examine not just the most recent games, but the entire history of games, should it prove valuable. Also, over the last few decades (coinciding with the growth in computing), more advanced statistical techniques using the Bayesian paradigm have been developed to utilize information on that scale. I'll be talking more about these approaches in one of my next posts, as I build an alternative rating system from a Bayesian approach (hopefully either this weekend or sometime next week).
For the time being, I'd like to show you a few main weaknesses of ELO as a predictive tool in the current Starcraft scene through some simple examples.
1) ELO only works in one direction
ELO is a great example of what's known as a "real-time" filter. It works in one direction, updating ratings as each new batch of results comes in. That's fine if the latest results are all you have, but when the whole history of games is available, it's not a good method. When Losira lost to Polt 1-2 in the opening round of the Super Tournament, it looked like Losira was essentially equivalent to the other players exiting in the round of 64. However, now that we know Polt went on to win the whole tournament, it would be great to rethink our evaluation of Losira. ELO has no mechanism for doing this.
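As a quick toy illustration (the ids and series here are made up, and it reuses the calculateELO function from the R code spoiler at the end of the post): run the ratings once with just the Polt vs. Losira series, and once with a string of later Polt wins tacked on. Losira's final rating is identical in both runs, because nothing Polt does afterwards ever flows back into the games he already played.

#Toy example: what we learn about Polt later never changes Losira's rating
#(calculateELO is defined in the R code spoiler at the end of the post)
early <- data.frame(WinnerID = c(2, 1, 2),    #Polt (id 2) takes the series 2-1 over Losira (id 1)
                    LoserID  = c(1, 2, 1))
later <- data.frame(WinnerID = c(2, 2, 2, 2), #Polt then beats four other players (ids 3-6)
                    LoserID  = c(3, 4, 5, 6))
elo_early <- calculateELO(early)
elo_full  <- calculateELO(rbind(early, later))
elo_early[elo_early[ ,1] == 1, 2]   #Losira's rating after just the series
elo_full[elo_full[ ,1] == 1, 2]     #...and with Polt's later wins included: exactly the same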
2) ELO does poorly with small amounts of games
Here's a quick example. I looked at only the games played in the Homestory Cup #3, pretending that those were the only games of Starcraft ever played. Here is a table showing the top 16 players by ELO:
While the top rating makes sense, there are several interesting anomalies. Stephano lost to Thorzain in the first round of the Winner's bracket, but here Stephano is rated 5 spots higher (with a substantially higher ELO). Similarly, White-Ra lost to Stephano in the Loser's bracket, but ends up with a higher ELO. Remember, I started each player off at the same ELO rating (2000), so this effect can't be attributed to prior ELO ratings. Naniwa beat HuK in 5 out of 11 games, and MC in 2 of 3, but here he comes in 5th, behind MC, White-Ra, and Stephano. Transitivity is a somewhat tricky tool to use, since player A beating player B and player B beating player C doesn't guarantee that player A will beat player C. However, here we have some pretty strong signs that ELO rankings aren't reflecting who is likely to win in a hypothetical match. I think this is probably related to my next point:
3) ELO measures dominance
ELO measures how often you win against players who have similar win histories. If I play in a small community and beat every other player repeatedly, I'll eventually have the same ELO as Flash. This isn't because my skill equals Flash's, but because I dominate my small community just as thoroughly as he dominates Korean progamers. If we had to play against the same pool of opponents, the difference in skill would show up in our ELOs. To the extent that players choose different pools of opponents, ELO will just reflect dominance instead of skill, which brings me to my final point. But first, a quick toy example:
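This sketch is entirely made up (ten hypothetical players split into two isolated five-player pools), and it leans on the calculateELO function from the R code spoiler at the end of the post. Player 1 and player 6 each farm wins inside their own pool, never play each other, and come out with essentially the same rating, even though one pool could be far weaker than the other:

#Toy example: two isolated pools, each dominated by one player
#(calculateELO is defined in the R code spoiler at the end of the post)
set.seed(42)
poolA <- data.frame(WinnerID = rep(1, 40), LoserID = sample(2:5,  40, replace = TRUE))
poolB <- data.frame(WinnerID = rep(6, 40), LoserID = sample(7:10, 40, replace = TRUE))
isolated <- calculateELO(rbind(poolA, poolB))
#Players 1 and 6 end up with nearly identical ratings, despite never meeting
isolated[isolated[ ,1] %in% c(1, 6), ]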
4) We don't live in an ideal world
In an ideal world, where every player is playing hundreds of games on a regular basis against randomly assigned opponents, ELO could be very accurate. This is absolutely not the case. Starcraft 2 has three fairly distinct communities, as I showed in my last post. The European, American, and Korean communities tend to play the majority of their games within their own groups. That's not to say there are no connections; there are, but the number of games played between members of different communities tends to be smaller and non-random. This is especially true of Korea, which has less crossover than the other two groups, with many of the connections concentrated among a few individuals (Moon, MC, Jinro, July, Ace, etc.). In theory, even a few crossover games should provide some important information as to the relative strength of these groups. However, as we've seen above, ELO doesn't do a great job of teasing information out of a relatively small number of games. If crossover were more frequent and randomly assigned across all group members, perhaps ELO could reflect global dominance.
Instead, what we're left with is a world where ELO is somewhat out of line with each player's "true" skill level. TLPD calculates the ELO for International events and Korean tournaments separately, which is likely the most appropriate thing to do, but it means we can't rank players on a worldwide basis. It also means that players who split their time between Korean and International events (e.g. MC) will be ranked lower on each list. For those who are curious, I've calculated the worldwide ELO for the 12-core of players (a central group, see my last post for more info) and I've put the top 100 players in the spoiler below. Feel free to comment on any and all oddities you notice:
+ Show Spoiler [Worldwide ELO] +
Note: This excludes games against players not in the 252-person 12-core. When I included them, the results were more biased in favor of the Europeans.
Also, I've been working on a way to simulate tournament results in R based on these ratings, so that I can measure how well my method performs compared to standard ELO. I've been testing it out by simulating outcomes for the upcoming NASL finals tournament. Since I know some of you out there might be interested, here are the predicted outcomes based on each player's worldwide ELO:
+ Show Spoiler [NASL Predictions] +
Based on 10,000 simulations, here are the predicted winners of each round, plus a pie chart of the outcomes for the overall winner:
Round of 16:
Ret
Squirtle
Morrow
Moon
Sen
aLive
White-Ra
MC
Round of 8:
MC
Ret
Moon
Sen
Semifinals:
MC
Moon
Finals:
MC, who was the winner in over 45% of the simulations
Note: these are based on ELO, so if you've learned anything today, don't trust these results!
Still to come: My alternative ranking method, based on a maximum likelihood approach using the Metropolis-Hastings algorithm. Also, I'm thinking of looking at transitivity, to see how often A > B and B > C implies A > C (a rough first pass at this is sketched below). If you guys have any suggestions/requests for procedures, let me know and I'll try and put together some R code for it. Also, comments/questions/haikus about the statistics used in the article are always welcome!
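For anyone who wants to poke at the transitivity question in the meantime, here's a rough, brute-force sketch of how it could be checked. It assumes a data frame of games with Winner and Loser name columns (like the tlpd object built in the spoiler below), the function name is just a placeholder, and because of the triple loop it's only practical on a smallish player pool like the 12-core:

#Rough sketch: among (A,B,C) triples where A has a winning head-to-head record over B
#and B over C, how often does A also have a winning record over C?
checkTransitivity <- function(games){
  players <- unique(c(games$Winner, games$Loser))
  #wins[i,j] = games player i won against player j; net[i,j] > 0 means a winning record
  wins <- table(factor(games$Winner, levels = players),
                factor(games$Loser,  levels = players))
  net <- wins - t(wins)
  n <- length(players)
  held <- 0
  total <- 0
  for (a in 1:n) for (b in 1:n) for (c in 1:n){
    if (a != b && b != c && a != c &&
        net[a,b] > 0 && net[b,c] > 0 && net[a,c] != 0){
      total <- total + 1
      if (net[a,c] > 0) held <- held + 1
    }
  }
  held / total   #share of comparable triples where transitivity holds
}
#Example (using the 12-core games built in the spoiler below):
#checkTransitivity(tlpd30core[ ,c("Winner","Loser")])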
+ Show Spoiler [R Code used in today's analysis] +
rm(list=ls())
options("stringsAsFactors"=FALSE)
int <- read.csv("tlpd_international.csv")
int <- int[ ,1:12]
int$edition <- "International"
kor <- read.csv("tlpd_korean.csv")
kor <- kor[ ,1:12]
kor$edition <- "Korean"
beta <- read.csv("tlpd_beta.csv")
beta <- beta[ ,1:12]
beta$edition <- "Beta"
tlpd <- rbind(int,kor,beta)
write.csv(tlpd, file = "tlpd.csv") #save the combined game list
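#Build the win/loss network from the winner and loser names (columns 7 and 10) and extract the 12-core of players (see my last post)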
library(igraph)
tlpdcondensed <- tlpd[ ,c(7,10)]
tlpdgraph <- graph.data.frame(tlpdcondensed,directed=FALSE)
V(tlpdgraph)$label <- unique(c(tlpd$Winner,tlpd$Loser))
V(tlpdgraph)$size <- 0
V(tlpdgraph)$label.cex <- 0.75
stlpdgraph <- simplify(tlpdgraph)
cores <- graph.coreness(stlpdgraph)
tlpdgraph2 <- subgraph(stlpdgraph,as.vector(which(cores>12))-1)
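#Build a lookup table mapping player ids to names (winner name/id in columns 7 and 9, loser name/id in columns 10 and 12)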
tlpdplayers <- rbind(as.matrix(tlpd[ ,c(7,9)]), as.matrix(tlpd[ ,c(10,12)]))
tlpdplayers <- tlpdplayers[!duplicated(tlpdplayers),]
tlpdplayers <- tlpdplayers[order(tlpdplayers[ ,2]), ]
tlpdplayers[ ,2] <- as.numeric(tlpdplayers[ ,2])
#Define a function that calculates the ELO based on a provided data frame of games
calculateELO <- function(games){
  #Prepare a matrix of players: id, current ELO, games played so far
  players <- unique(c(games$WinnerID,games$LoserID))
  elo <- rep(2000, length(players))       #Starting ELO is 2000 for all players
  numgames <- rep(0, length(players))     #Games played so far, used to pick the K-factor
  playmat <- cbind(players,elo,numgames)
  #Prepare the game matrix (winner id, loser id)
  gamemat <- cbind(games$WinnerID,games$LoserID)
  #Loop through the games, updating the player matrix with new ELO values
  for (i in 1:dim(gamemat)[1]){
    game <- gamemat[i, ]
    #Get the winner and loser rows
    wid <- which(playmat[ ,1]==game[1])
    lid <- which(playmat[ ,1]==game[2])
    #Current ratings and the logistic terms for the expected scores
    xw <- playmat[wid,2]
    xl <- playmat[lid,2]
    qw <- 10 ** (xw / 400)
    ql <- 10 ** (xl / 400)
    #Use K=40 for a player's first 20 games and K=20 thereafter
    kw <- if (playmat[wid,3] >= 20) 20 else 40
    kl <- if (playmat[lid,3] >= 20) 20 else 40
    #Update ELO scores (the winner scored 1, the loser scored 0)
    playmat[wid,2] <- xw + kw * (1 - qw / (qw + ql))
    playmat[lid,2] <- xl + kl * (0 - ql / (qw + ql))
    #Update games played
    playmat[wid,3] <- playmat[wid,3] + 1
    playmat[lid,3] <- playmat[lid,3] + 1
  }
  return(playmat)
}
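#hsc is the subset of tlpd containing only the Homestory Cup #3 games (that filtering step isn't shown here);
#the [nrow(hsc):1, ] indexing reverses the rows, presumably so the games run in chronological order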
hscelo <- calculateELO(hsc[nrow(hsc):1,])
#Attach player names to a results matrix, using the id-to-name lookup table built above
mergePlayerNames <- function(playmat,tlpdplayers){
  playerNames <- c()
  for (i in 1:dim(playmat)[1]){
    pid <- playmat[i,1]
    pname <- tlpdplayers[tlpdplayers[ ,2]==pid,1][[1]]
    playerNames <- c(playerNames,pname)
  }
  return(cbind(playerNames,playmat))
}
hscelo <- mergePlayerNames(hscelo, tlpdplayers)
View(hscelo[rev(order(hscelo[,3])),])
tlpdelo <- calculateELO(tlpd[nrow(tlpd):1,])
tlpdelo <- mergePlayerNames(tlpdelo, tlpdplayers)
View(tlpdelo[rev(order(tlpdelo[,3])),])
#Keep only games where both players are in the 12-core
tlpd30core <- tlpd[which(tlpd$Winner %in% V(tlpdgraph2)$label),]
tlpd30core <- tlpd30core[which(tlpd30core$Loser %in% V(tlpdgraph2)$label),]
#tlpd30core <- tlpd30core[tlpd30core$Date > "2011-02-01",]
tl30kelo <- calculateELO(tlpd30core[nrow(tlpd30core):1,])
tl30kelo <- mergePlayerNames(tl30kelo, tlpdplayers)
View(tl30kelo[rev(order(tl30kelo[,3])),c(1,3,4)][1:100,])
#Simulate the NASL finals
#Build a matrix describing the tournament structure
#Each row represents one match
#First two columns are the two player ids
#Third column is the number of games played (ie. Bo5, Bo3, etc)
#Fourth and fifth columns represent where the winner goes to
naslmat <- rbind(c(980,96,3,9,1))
naslmat <- rbind(naslmat,c(1301,969,3,9,2))
naslmat <- rbind(naslmat,c(1212,117,3,10,1))
naslmat <- rbind(naslmat,c(1587,2084,3,10,2))
naslmat <- rbind(naslmat,c(1176,1903,3,11,1))
naslmat <- rbind(naslmat,c(1698,1988,3,11,2))
naslmat <- rbind(naslmat,c(1825,1148,3,12,1))
naslmat <- rbind(naslmat,c(640,225,3,12,2))
naslmat <- rbind(naslmat,c(0,0,3,13,1))
naslmat <- rbind(naslmat,c(0,0,3,13,2))
naslmat <- rbind(naslmat,c(0,0,3,14,1))
naslmat <- rbind(naslmat,c(0,0,3,14,2))
naslmat <- rbind(naslmat,c(0,0,5,15,1))
naslmat <- rbind(naslmat,c(0,0,5,15,2))
naslmat <- rbind(naslmat,c(0,0,7,16,1))
naslmat <- rbind(naslmat,c(0,0,0,0,0))
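#winners[[i]] will tally, across all simulations, how many times each player (by id) won match i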
winners <- list()
for (i in 1:15){
  winners[[i]] <- matrix(nrow=0,ncol=2)
}
#Run the simulation X times
for (n in 1:10000){
  tempmat <- naslmat
  for (i in 1:15){
    awins <- 0
    for (j in 1:tempmat[i,3]){
      #Win probability for the player in column 1, from the ELO difference
      prob <- 1 / (1 + 10 ** ((as.numeric(tl30kelo[tl30kelo[,2]==tempmat[i,2],3]) -
                               as.numeric(tl30kelo[tl30kelo[,2]==tempmat[i,1],3])) / 400))
      rand <- runif(1)
      if (rand < prob){
        awins <- awins + 1
      }
    }
    if (awins / tempmat[i,3] > 0.5){
      winner <- tempmat[i,1]
    } else {
      winner <- tempmat[i,2]
    }
    #Tally the winner of match i and advance them to their next match
    if (any(winners[[i]][,1]==winner)){
      winners[[i]][which(winners[[i]][,1]==winner),2] <-
        winners[[i]][which(winners[[i]][,1]==winner),2] + 1
    } else {
      winners[[i]] <- rbind(winners[[i]],c(winner,1))
    }
    tempmat[tempmat[i,4],tempmat[i,5]] <- winner
  }
}
for (i in 1:15){
  print(paste("The most common winner of game",i,"was",
              mergePlayerNames(winners[[i]][rev(order(winners[[i]][,2])),],
                               tlpdplayers)[1,1]))
}
library(ggplot2)
grandfinals <- data.frame(mergePlayerNames(winners[[15]],tlpdplayers))
dimnames(grandfinals)[[2]] <- c("Player","ID","Wins")
grandfinals$Wins <- as.numeric(grandfinals$Wins)/10000
grandfinals <- grandfinals[order(grandfinals$Player),]
#Work out the label positions: the midpoint of each cumulative slice of the pie
cumwins <- c()
places <- c()
for (i in 1:dim(grandfinals)[1]){
  cumwins <- c(cumwins,sum(grandfinals$Wins[1:i]))
  places <- c(places,sum(grandfinals$Wins[1:i],cumwins[i-1])/2)
}
grandfinals$places <- places
pie <- ggplot(data=grandfinals,aes(x=factor(1),y=Wins,fill=Player)) +
  geom_bar(width=1,colour="black",stat="identity") +
  geom_text(aes(label=Player,y=places))
pie + coord_polar(theta="y") + xlab("") + ylab("") +
  ggtitle("Simulated NASL Victories\n(Out of 10,000 runs)")