For those of you unfamiliar with ELO, it's a rating system used widely in chess and occasionally in other sports. Basically, it works pretty much like the SC2 ladder, in that each player gets a skill rating (like ladder points), and when you win/lose your rating goes up or down based on the rating of your opponent. If you're interested in how this works, check out this article for the technical details.
One of ELO's main features is the limited amount of information needed to calculate it. All you need to update ELO scores are the players' current ratings and the outcomes of the new games. This has some nice mathematical properties (see Markov chain), but mainly its advantage is logistical. Chess has no easily measurable quantities within games (unlike most sports); instead, the only data available are wins and losses. In addition, the calculations are fairly simple, and can be done on any basic calculator.
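To make the "basic calculator" point concrete, here's a minimal sketch of a single-game update in R, using the standard logistic expected-score formula (the function name and the K value of 20 are just placeholders for illustration; the full rating code I actually use is in the spoiler at the end of the post):

#One ELO update for a single game: each rating moves by K times (actual score - expected score)
updateELO <- function(winner_rating, loser_rating, K = 20){
  qw <- 10 ^ (winner_rating / 400)
  ql <- 10 ^ (loser_rating / 400)
  expected_w <- qw / (qw + ql)   #expected score (win probability) for the eventual winner
  expected_l <- ql / (qw + ql)   #and for the eventual loser
  c(winner = winner_rating + K * (1 - expected_w),
    loser  = loser_rating  + K * (0 - expected_l))
}
updateELO(2000, 2000)   #an even match: the winner gains 10 points, the loser drops 10
updateELO(2200, 2000)   #a heavy favourite gains far less for the same win (about 5 points)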
This advantage, which was so helpful for chess in 1960, has become somewhat of a hindrance in today's environment. The memoryless properties of ELO mean that every time you update ratings, you end up throwing away a good amount of information. With modern computing, we have the power to examine not just the most recent games, but the entire history of games, should it prove valuable. Also, over the last few decades (coinciding with the growth in computing), more advanced statistical techniques using the Bayesian paradigm have been developed to utilize information on that scale. I'll be talking more about these approaches in one of my next posts, as I build an alternative rating system from a Bayesian approach (hopefully either this weekend or sometime next week).
For the time being, I'd like to show you a few main weaknesses of ELO as a predictive tool in the current Starcraft scene through some simple examples.
1) ELO only works in one direction
ELO is a great example of what's known as a "real-time" filter. It works in one direction, updating ratings as each new batch of results comes in. That's fine if the latest results are all you have, but when the whole history of games is available, it's not a good method. When Losira lost to Polt 1-2 in the opening round of the Super Tournament, it looked like Losira was essentially equivalent to the other players exiting in the round of 64. However, now that we know Polt went on to win the whole tournament, it would be great to rethink our evaluation of Losira. ELO has no mechanism for doing this.
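As a quick toy illustration (the ids and series here are made up, and it reuses the calculateELO function from the R code spoiler at the end of the post): run the ratings once with just the Polt vs. Losira series, and once with a string of later Polt wins tacked on. Losira's final rating is identical in both runs, because nothing Polt does afterwards ever flows back into the games he already played.

#Toy example: what we learn about Polt later never changes Losira's rating
#(calculateELO is defined in the R code spoiler at the end of the post)
early <- data.frame(WinnerID = c(2, 1, 2),    #Polt (id 2) takes the series 2-1 over Losira (id 1)
                    LoserID  = c(1, 2, 1))
later <- data.frame(WinnerID = c(2, 2, 2, 2), #Polt then beats four other players (ids 3-6)
                    LoserID  = c(3, 4, 5, 6))
elo_early <- calculateELO(early)
elo_full  <- calculateELO(rbind(early, later))
elo_early[elo_early[ ,1] == 1, 2]   #Losira's rating after just the series
elo_full[elo_full[ ,1] == 1, 2]     #...and with Polt's later wins included: exactly the same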
2) ELO does poorly with small amounts of games
Here's a quick example. I looked at only the games played in the Homestory Cup #3, pretending that those were the only games of Starcraft ever played. Here is a table showing the top 16 players by ELO:
While the top rating makes sense, there are several interesting anomalies. Stephano lost to Thorzain in the first round of the Winner's bracket, but here Stephano is rated 5 spots higher (with a substantially higher ELO). Similarly, White-Ra lost to Stephano in the Loser's bracket, but ends up with a higher ELO. Remember, I started each player off at the same ELO rating (2000), so this effect can't be attributed to prior ELO ratings. Naniwa beat HuK in 5 out of 11 games, and MC in 2 of 3, but here he comes in 5th, behind MC, White-Ra, and Stephano. Transitivity is a somewhat tricky tool to use, since player A beating player B and player B beating player C doesn't guarantee that player A will beat player C. However, here we have some pretty strong signs that ELO rankings aren't reflecting who is likely to win in a hypothetical match. I think this is probably related to my next point:
3) ELO measures dominance
ELO measures how often you win against players who have similar win histories. If I play in a small community and beat every other player repeatedly, I'll eventually have the same ELO as Flash. This isn't because my skill equals Flash's, but because I dominate my small community just as thoroughly as he dominates Korean progamers. If we had to play against the same pool of opponents, the difference in skill would show up in our ELOs. To the extent that players choose different pools of opponents, ELO will just reflect dominance instead of skill, which brings me to my final point. But first, a quick toy example:
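This sketch is entirely made up (ten hypothetical players split into two isolated five-player pools), and it leans on the calculateELO function from the R code spoiler at the end of the post. Player 1 and player 6 each farm wins inside their own pool, never play each other, and come out with essentially the same rating, even though one pool could be far weaker than the other:

#Toy example: two isolated pools, each dominated by one player
#(calculateELO is defined in the R code spoiler at the end of the post)
set.seed(42)
poolA <- data.frame(WinnerID = rep(1, 40), LoserID = sample(2:5,  40, replace = TRUE))
poolB <- data.frame(WinnerID = rep(6, 40), LoserID = sample(7:10, 40, replace = TRUE))
isolated <- calculateELO(rbind(poolA, poolB))
#Players 1 and 6 end up with nearly identical ratings, despite never meeting
isolated[isolated[ ,1] %in% c(1, 6), ]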
4) We don't live in an ideal world
In an ideal world, where every player is playing hundreds of games on a regular basis against randomly assigned opponents, ELO could be very accurate. This is absolutely not the case. Starcraft 2 has three fairly distinct communities, as I showed in my last post. The European, American, and Korean communities tend to play the majority of their games within their own groups. That's not to say there are no connections; there are, but the number of games played between members of different communities tends to be smaller and non-random. This is especially true of Korea, which has less crossover than the other two groups, with many of the connections concentrated among a few individuals (Moon, MC, Jinro, July, Ace, etc.). In theory, even a few crossover games should provide some important information as to the relative strength of these groups. However, as we've seen above, ELO doesn't do a great job of teasing information out of a relatively small number of games. If crossover were more frequent and randomly assigned across all group members, perhaps ELO could reflect global dominance.
Instead, what we're left with is a world where ELO is somewhat out of line with each player's "true" skill level. TLPD calculates the ELO for International events and Korean tournaments separately, which is likely the most appropriate thing to do, but it means we can't rank players on a worldwide basis. It also means that players who split their time between Korean and International events (e.g. MC) will be ranked lower on each list. For those who are curious, I've calculated the worldwide ELO for the 12-core of players (a central group, see my last post for more info) and I've put the top 100 players in the spoiler below. Feel free to comment on any and all oddities you notice:
+ Show Spoiler [Worldwide ELO] +
Note: This excludes games against players not in the 252-person 12-core. When I included them, the results were more biased in favor of the Europeans.
Also, I've been working on a way to simulate tournament results in R based on these ratings, so that I can measure how well my method performs compared to standard ELO. I've been testing it out by simulating outcomes for the upcoming NASL finals tournament. Since I know some of you out there might be interested, here are the predicted outcomes based on each player's worldwide ELO:
+ Show Spoiler [NASL Predictions] +
Based on 10,000 simulations, here are the predicted winners of each round, plus a pie chart of the outcomes for the overall winner:
Round of 16:
Ret
Squirtle
Morrow
Moon
Sen
aLive
White-Ra
MC
Round of 8:
MC
Ret
Moon
Sen
Semifinals:
MC
Moon
Finals:
MC, who was the winner in over 45% of the simulations
Note: these are based on ELO, so if you've learned anything today, don't trust these results!
Still to come: My alternative ranking method, based on a maximum likelihood approach using the Metropolis-Hastings algorithm. Also, I'm thinking of looking at transitivity, to see how often A > B and B > C implies A > C (a rough first pass at this is sketched below). If you guys have any suggestions/requests for procedures, let me know and I'll try and put together some R code for it. Also, comments/questions/haikus about the statistics used in the article are always welcome!
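For anyone who wants to poke at the transitivity question in the meantime, here's a rough, brute-force sketch of how it could be checked. It assumes a data frame of games with Winner and Loser name columns (like the tlpd object built in the spoiler below), the function name is just a placeholder, and because of the triple loop it's only practical on a smallish player pool like the 12-core:

#Rough sketch: among (A,B,C) triples where A has a winning head-to-head record over B
#and B over C, how often does A also have a winning record over C?
checkTransitivity <- function(games){
  players <- unique(c(games$Winner, games$Loser))
  #wins[i,j] = games player i won against player j; net[i,j] > 0 means a winning record
  wins <- table(factor(games$Winner, levels = players),
                factor(games$Loser,  levels = players))
  net <- wins - t(wins)
  n <- length(players)
  held <- 0
  total <- 0
  for (a in 1:n) for (b in 1:n) for (c in 1:n){
    if (a != b && b != c && a != c &&
        net[a,b] > 0 && net[b,c] > 0 && net[a,c] != 0){
      total <- total + 1
      if (net[a,c] > 0) held <- held + 1
    }
  }
  held / total   #share of comparable triples where transitivity holds
}
#Example (using the 12-core games built in the spoiler below):
#checkTransitivity(tlpd30core[ ,c("Winner","Loser")])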
+ Show Spoiler [R Code used in today's analysis] +
rm(list=ls())
options("stringsAsFactors"=FALSE)
int <- read.csv("tlpd_international.csv")
int <- int[ ,1:12]
int$edition <- "International"
kor <- read.csv("tlpd_korean.csv")
kor <- kor[ ,1:12]
kor$edition <- "Korean"
beta <- read.csv("tlpd_beta.csv")
beta <- beta[ ,1:12]
beta$edition <- "Beta"
tlpd <- rbind(int,kor,beta)
write.csv(tlpd, file = "tlpd.csv") #save the combined game list
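#Build the win/loss network from the winner and loser names (columns 7 and 10) and extract the 12-core of players (see my last post)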
library(igraph)
tlpdcondensed <- tlpd[ ,c(7,10)]
tlpdgraph <- graph.data.frame(tlpdcondensed,directed=FALSE)
V(tlpdgraph)$label <- unique(c(tlpd$Winner,tlpd$Loser))
V(tlpdgraph)$size <- 0
V(tlpdgraph)$label.cex <- 0.75
stlpdgraph <- simplify(tlpdgraph)
cores <- graph.coreness(stlpdgraph)
tlpdgraph2 <- subgraph(stlpdgraph,as.vector(which(cores>12))-1)
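#Build a lookup table mapping player ids to names (winner name/id in columns 7 and 9, loser name/id in columns 10 and 12)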
tlpdplayers <- rbind(as.matrix(tlpd[ ,c(7,9)]), as.matrix(tlpd[ ,c(10,12)]))
tlpdplayers <- tlpdplayers[!duplicated(tlpdplayers),]
tlpdplayers <- tlpdplayers[order(tlpdplayers[ ,2]), ]
tlpdplayers[ ,2] <- as.numeric(tlpdplayers[ ,2])
#Define a function that calculates the ELO based on a provided data frame of games
calculateELO <- function(games){
  #Prepare a matrix of players: id, current ELO, games played so far
  players <- unique(c(games$WinnerID,games$LoserID))
  elo <- rep(2000, length(players))       #Starting ELO is 2000 for all players
  numgames <- rep(0, length(players))     #Games played so far, used to pick the K-factor
  playmat <- cbind(players,elo,numgames)
  #Prepare the game matrix (winner id, loser id)
  gamemat <- cbind(games$WinnerID,games$LoserID)
  #Loop through the games, updating the player matrix with new ELO values
  for (i in 1:dim(gamemat)[1]){
    game <- gamemat[i, ]
    #Get the winner and loser rows
    wid <- which(playmat[ ,1]==game[1])
    lid <- which(playmat[ ,1]==game[2])
    #Current ratings and the logistic terms for the expected scores
    xw <- playmat[wid,2]
    xl <- playmat[lid,2]
    qw <- 10 ** (xw / 400)
    ql <- 10 ** (xl / 400)
    #Use K=40 for a player's first 20 games and K=20 thereafter
    kw <- if (playmat[wid,3] >= 20) 20 else 40
    kl <- if (playmat[lid,3] >= 20) 20 else 40
    #Update ELO scores (the winner scored 1, the loser scored 0)
    playmat[wid,2] <- xw + kw * (1 - qw / (qw + ql))
    playmat[lid,2] <- xl + kl * (0 - ql / (qw + ql))
    #Update games played
    playmat[wid,3] <- playmat[wid,3] + 1
    playmat[lid,3] <- playmat[lid,3] + 1
  }
  return(playmat)
}
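#hsc is the subset of tlpd containing only the Homestory Cup #3 games (that filtering step isn't shown here);
#the [nrow(hsc):1, ] indexing reverses the rows, presumably so the games run in chronological order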
hscelo <- calculateELO(hsc[nrow(hsc):1,])
#Attach player names to a results matrix, using the id-to-name lookup table built above
mergePlayerNames <- function(playmat,tlpdplayers){
  playerNames <- c()
  for (i in 1:dim(playmat)[1]){
    pid <- playmat[i,1]
    pname <- tlpdplayers[tlpdplayers[ ,2]==pid,1][[1]]
    playerNames <- c(playerNames,pname)
  }
  return(cbind(playerNames,playmat))
}
hscelo <- mergePlayerNames(hscelo, tlpdplayers)
View(hscelo[rev(order(hscelo[,3])),])
tlpdelo <- calculateELO(tlpd[nrow(tlpd):1,])
tlpdelo <- mergePlayerNames(tlpdelo, tlpdplayers)
View(tlpdelo[rev(order(tlpdelo[,3])),])
#Keep only games where both players are in the 12-core
tlpd30core <- tlpd[which(tlpd$Winner %in% V(tlpdgraph2)$label),]
tlpd30core <- tlpd30core[which(tlpd30core$Loser %in% V(tlpdgraph2)$label),]
#tlpd30core <- tlpd30core[tlpd30core$Date > "2011-02-01",]
tl30kelo <- calculateELO(tlpd30core[nrow(tlpd30core):1,])
tl30kelo <- mergePlayerNames(tl30kelo, tlpdplayers)
View(tl30kelo[rev(order(tl30kelo[,3])),c(1,3,4)][1:100,])
#Simulate the NASL finals
#Build a matrix describing the tournament structure
#Each row represents one match
#First two columns are the two player ids
#Third column is the number of games played (ie. Bo5, Bo3, etc)
#Fourth and fifth columns represent where the winner goes to
naslmat <- rbind(c(980,96,3,9,1))
naslmat <- rbind(naslmat,c(1301,969,3,9,2))
naslmat <- rbind(naslmat,c(1212,117,3,10,1))
naslmat <- rbind(naslmat,c(1587,2084,3,10,2))
naslmat <- rbind(naslmat,c(1176,1903,3,11,1))
naslmat <- rbind(naslmat,c(1698,1988,3,11,2))
naslmat <- rbind(naslmat,c(1825,1148,3,12,1))
naslmat <- rbind(naslmat,c(640,225,3,12,2))
naslmat <- rbind(naslmat,c(0,0,3,13,1))
naslmat <- rbind(naslmat,c(0,0,3,13,2))
naslmat <- rbind(naslmat,c(0,0,3,14,1))
naslmat <- rbind(naslmat,c(0,0,3,14,2))
naslmat <- rbind(naslmat,c(0,0,5,15,1))
naslmat <- rbind(naslmat,c(0,0,5,15,2))
naslmat <- rbind(naslmat,c(0,0,7,16,1))
naslmat <- rbind(naslmat,c(0,0,0,0,0))
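#winners[[i]] will tally, across all simulations, how many times each player (by id) won match i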
winners <- list()
for (i in 1:15){
  winners[[i]] <- matrix(nrow=0,ncol=2)
}
#Run the simulation X times
for (n in 1:10000){
  tempmat <- naslmat
  for (i in 1:15){
    awins <- 0
    for (j in 1:tempmat[i,3]){
      #Win probability for the player in column 1, from the ELO difference
      prob <- 1 / (1 + 10 ** ((as.numeric(tl30kelo[tl30kelo[,2]==tempmat[i,2],3]) -
                               as.numeric(tl30kelo[tl30kelo[,2]==tempmat[i,1],3])) / 400))
      rand <- runif(1)
      if (rand < prob){
        awins <- awins + 1
      }
    }
    if (awins / tempmat[i,3] > 0.5){
      winner <- tempmat[i,1]
    } else {
      winner <- tempmat[i,2]
    }
    #Tally the winner of match i and advance them to their next match
    if (any(winners[[i]][,1]==winner)){
      winners[[i]][which(winners[[i]][,1]==winner),2] <-
        winners[[i]][which(winners[[i]][,1]==winner),2] + 1
    } else {
      winners[[i]] <- rbind(winners[[i]],c(winner,1))
    }
    tempmat[tempmat[i,4],tempmat[i,5]] <- winner
  }
}
for (i in 1:15){
  print(paste("The most common winner of game",i,"was",
              mergePlayerNames(winners[[i]][rev(order(winners[[i]][,2])),],
                               tlpdplayers)[1,1]))
}
library(ggplot2)
grandfinals <- data.frame(mergePlayerNames(winners[[15]],tlpdplayers))
dimnames(grandfinals)[[2]] <- c("Player","ID","Wins")
grandfinals$Wins <- as.numeric(grandfinals$Wins)/10000
grandfinals <- grandfinals[order(grandfinals$Player),]
#Work out the label positions: the midpoint of each cumulative slice of the pie
cumwins <- c()
places <- c()
for (i in 1:dim(grandfinals)[1]){
  cumwins <- c(cumwins,sum(grandfinals$Wins[1:i]))
  places <- c(places,sum(grandfinals$Wins[1:i],cumwins[i-1])/2)
}
grandfinals$places <- places
pie <- ggplot(data=grandfinals,aes(x=factor(1),y=Wins,fill=Player)) +
  geom_bar(width=1,colour="black",stat="identity") +
  geom_text(aes(label=Player,y=places))
pie + coord_polar(theta="y") + xlab("") + ylab("") +
  ggtitle("Simulated NASL Victories\n(Out of 10,000 runs)")