Note: This post is rather long and heavy on the statistical theory. There's no shame in skipping down to the results at the bottom.
Ranking Starcraft Players: A Bayesian Approach
Fundamentally, every system for evaluating players has three main elements: a model of individual games, a set of information to be estimated, and a computational mechanism to tie it all together. The model of individual games defines which attributes are thought to matter in determining a winner. In most cases, the set of information corresponds to a series of ratings, one for each player, which are the "optimal" estimates of each player's true skill level. Many estimation strategies are available, ranging from simple formulas to complex multivariate regressions.
In ELO, individual games are modeled with a logistic curve, and simple formulas update player ratings after each game is played. The only information ELO produces is a single number for each player, which will accurately measure true skill in the long run, but perhaps not over finite samples. If this system produced consistent results, we would have no need to modify any of these elements. However, I believe that ELO does not work consistently when applied to Starcraft at the top levels (and if you've read part 2, then I hope you agree with me).
With that in mind, I would like to propose a Bayesian framework which will allow us to improve on ELO with respect to all three elements. This approach, which can be applied to practically any model of individual games, is based around a Bayesian view of player performance using the Metropolis-Hastings algorithm as our estimation tool.
Some of you may be wondering what I mean when I use the term Bayesian to describe this framework. In statistics, there are two main paradigms when it comes to inference: frequentist inference and Bayesian inference. The frequentist method views the model that produces our data as having certain fixed parameters, and uses estimation as a means of finding the most likely values for those parameters. In contrast, the Bayesian view treats the parameters as variables and takes the data to be fixed, estimating the probability of each possible value of a parameter being the "true" value, based on the data.
The Bayesian paradigm is named after a theorem discovered by Thomas Bayes, known as "Bayes' rule". Its form is quite simple:

P(A|B) = P(B|A) * P(A) / P(B)
The rule provides a way of turning one conditional probability into its opposite, using the ratio of the unconditional probabilities. We will use this rule to calculate the probability of a set of possible parameter values given the data using the probability of the data given the values. The probability which we are calculating, P(A|B), is known as the posterior probability, since it represents the probability of a certain set of values after we have taken into account the data. In contrast, the probability P(A) is known as the prior probability, since it represents the probability before we take the data into account. The quantity P(B|A) is referred to as the likelihood since it reflects the chances of the data being produced by the set of values in question. Lastly, the term P(B) is nameless and we will essentially ignore it in our analysis (the reason for which will be explained later).
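To make that concrete, here's a toy illustration in R. Let A be "this player is truly top-tier" and B be "the player won 8 of their last 10 games". All of the numbers here (the 10% prior, the 70% and 50% win rates) are invented purely for illustration:

#Toy Bayes' rule calculation; every number is hypothetical
prior <- 0.10 #P(A): say 10% of players are truly top-tier
like <- dbinom(8, 10, 0.7) #P(B|A): a top-tier player wins 70% of games
like0 <- dbinom(8, 10, 0.5) #P(B|not A): an average player wins 50%
pB <- like*prior + like0*(1 - prior) #P(B), by total probability
posterior <- like*prior/pB #P(A|B), roughly 0.37 here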
Now that we understand Bayes' rule, let's observe a few things. First, the end result we obtain is not a single value, nor even a finite set of values. Instead, the Bayesian framework allows us to look at the entire distribution of possible ratings. For each player we can estimate the most likely ratings, the variance of possible ratings, and even compute arbitrary quantiles. Not only that, we can also examine the correlations across player ratings, since we are viewing the joint distribution, not just the marginals.
In addition, we have that prior term in the calculation. This allows our estimation to take into account not only the evidence, but also our prior beliefs about what the results might look like. Or not, if we choose. With ELO, there was no way to take this into account. Perhaps we believe that skill ratings are approximately normally distributed; in that case, we could use a bell curve as our prior. These priors are inherently subjective, so it is often useful to examine the results under multiple different priors.
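For example, under a bell-curve prior the posterior is proportional to likelihood times prior, which on the log scale is just an additive term. A minimal sketch, where the mean of 0 and standard deviation of 1 are arbitrary choices of mine:

#Hypothetical normal (bell curve) prior over the vector of ratings
logprior <- function(ratings){
sum(dnorm(ratings, mean=0, sd=1, log=TRUE))
}
#A candidate's log-posterior is then (up to a constant) log-likelihood + logprior(candidate)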
And finally, we can observe that the only restriction on modeling individual games is that they produce a likelihood equation. The model could be of a logistic form, or perhaps a normal form, or something else entirely. It can include just ratings for the players, or take into account a wide variety of different effects, such as racial advantages, map effects and more. Admittedly, this is the area of this framework that I have least explored. As you will see below, I keep the logistic curve from ELO in my analysis, but I am hoping to extend this in the near future. With my new computational mechanism I can now move beyond the model limitations of an ELO-like approach.
So now we have a framework where we can use a wide variety of game models and obtain a large amount of information about our model parameters. The final component is a computational tool to bridge the other two elements. Even though we can calculate our likelihood and prior, we cannot calculate the posterior yet because of that pesky P(B) term. This term represents the unconditional probability of the actual data occurring, taking into account all possible values of the parameters. An astute observer might notice that when we compute the posterior, this P(B) value does not change as we vary the values (the A term), so it cancels out whenever we compare two candidate sets of values. This fact is what lets us use the Metropolis-Hastings algorithm.
The Metropolis-Hastings algorithm is a method for sampling from a distribution using only a factor proportional to its density. In our case, likelihood * prior is proportional to the posterior, because the P(B) value is constant. As a quick overview: we start with an arbitrary set of parameters (setting all player ratings to 0). Then we randomly propose new candidate sets of values and decide whether to accept them based on the ratio of their posterior probability to that of the current values; candidates that increase the probability are always accepted, while those that decrease it are accepted only with probability equal to that ratio. When a candidate is accepted, it becomes the new set of values, and the process repeats.
As we repeat the process thousands of times, it stops mattering what the initial set of values was (this is called reaching the steady state). Once the algorithm is in this state, continued sampling is effectively sampling from the posterior, allowing us to estimate the distribution. Typically, we sample several thousand values initially and throw them out, in order to ensure that we have reached the steady state before estimating the posterior. This is known as the "burn-in" process. In my analysis, I use a burn-in period of ~200,000 samples, followed by 10,000 samples for estimating the posterior. (This is probably more than is necessary, but better to err on the side of caution.)
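For the mechanics, here is a stripped-down, single-parameter Metropolis sampler in R. The target density is just a stand-in for illustration; my actual script below samples the full vector of player ratings at once, and uses a variant that keeps drawing candidates until one is accepted:

#Textbook random-walk Metropolis for one parameter
logpost <- function(theta) dnorm(theta, mean=1, sd=0.5, log=TRUE) #stand-in target density
n <- 20000
burnin <- 5000
theta <- 0 #arbitrary starting value
oldlp <- logpost(theta)
samples <- numeric(n - burnin)
for (i in 1:n){
can <- rnorm(1, theta, 0.1) #propose a nearby candidate
canlp <- logpost(can)
if(log(runif(1)) < (canlp - oldlp)){ #accept with probability min(1, ratio)
theta <- can
oldlp <- canlp
}
if(i > burnin){
samples[i - burnin] <- theta #record draws only after burn-in
}
}
mean(samples)
quantile(samples, c(0.05, 0.95))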
-------------------------
Implementation
-------------------------
Now that we have covered the theory, let's turn to how I actually implemented this framework. Today I will be using the logistic curve from ELO to model individual games, mostly because I spent all of my time working on the other elements of the framework. I think this is still a reasonable model, although it no doubt can (and will) be improved upon in the coming weeks.
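To make the model concrete, here is a minimal sketch of how it turns a rating gap into a win probability (the scale of 2 matches the likeli() function in the script below; ELO's own curve uses different constants):

#Win probability under the logistic model depends only on the rating gap
winprob <- function(ratingA, ratingB, scale=2){
plogis(ratingA - ratingB, location=0, scale=scale)
}
winprob(1.5, 0.5) #a player rated 1 point higher wins ~62% of the time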
The other big choice I had to make was deciding which priors to use in the calculations. Today I will be using two different priors to answer two slightly different questions about Starcraft 2 players. The first is a perfectly naive, uniform prior, which presumes nothing about the players. This is essentially the same as not using a prior at all, so the results are more comparable to ELO. I like to think of this prior as asking the question "Who could be the best Starcraft 2 player?", since it gives all players the benefit of the doubt.
I also wanted to estimate a model that incorporated some skepticism about players who are newer to the scene or haven't really had their abilities tested at the top level of Starcraft gaming. For this model I did two things. First, I used a skew normal distribution whose density drops off as the ratings increase, so that the level of skepticism grows with a player's rating. Second, I varied the thinness of the upper tail according to how many games each player had played in Premier tournaments. The end result is that someone who has played a huge number of games in top tournaments (e.g. MC, Sjow, July, and others) would face a prior that looks like this:
Meanwhile, a player who has never played in a Premier tournament would face a prior that looks like this (note the difference in the axes):
Note that I'm not directly assuming that players with fewer games in Premier tournaments are worse. Instead, I'm decreasing the probability in the upper tail, which is equivalent to viewing a high rating for someone with few games in such tournaments with greater skepticism. Also note that this is just my prior, and it can be overruled by the likelihood function. If a player has proved that he can beat the best players in the world, he will be ranked highly regardless of how many games he has played in Premier tournaments.
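For reference, the sn package's skew normal density can roughly reproduce the shapes in those figures. The location, scale, and shape values below mirror the dsn() call in my script (the scale widens with a player's share of Premier games), though the exact figures above may differ:

#Sketch of the two priors: location -3, shape 8, scale widening with Premier games
library(sn)
x <- seq(-8, 8, length.out=400)
#Veteran (many Premier games): wide scale, permissive upper tail
plot(x, dsn(x, -3, 1 + 3*1, 8), type="l", xlab="rating", ylab="prior density")
#Newcomer (no Premier games): narrow scale, skeptical upper tail
lines(x, dsn(x, -3, 1 + 3*0, 8), lty=2)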
The full script I created to run the algorithm and produce the results can be found in the spoiler below:
+ Show Spoiler +
rm(list=ls())
options("stringsAsFactors"=FALSE)
dir <- "/home/ubuntu/"
setwd(dir)
tlpd <- read.csv(file = "tlpd.csv")
library(igraph)
tlpd$Date <- as.Date(tlpd$Date,format="%m/%d/%Y")
#Build an undirected game network and keep only its well-connected core
tlpdcondensed <- cbind(tlpd$WinnerID,tlpd$LoserID)
tlpdgraph <- graph.data.frame(tlpdcondensed,directed=FALSE)
stlpdgraph <- simplify(tlpdgraph)
cores <- graph.coreness(stlpdgraph)
tlpdgraph2 <- subgraph(stlpdgraph,as.vector(which(cores>10))-1) #-1 because this igraph version indexes vertices from 0
tlpd <- tlpd[tlpd$Date > "2011-03-22",]
tlpdplayer <- read.csv("tlpdplayers.csv")
tlpdplayer <- tlpdplayer[!duplicated(tlpdplayer$PlayerID),]
tlpdm <- merge(tlpd,tlpdplayer,by.x="WinnerID",by.y="PlayerID",suffixes=c("",".w"),sort=FALSE)
tlpdm <- merge(tlpdm,tlpdplayer,by.x="LoserID",by.y="PlayerID",suffixes=c("",".l"),sort=FALSE)
tlpdkcore <- tlpdm[which(tlpdm$WinnerID %in% V(tlpdgraph2)$name),]
tlpdkcore <- tlpdkcore[which(tlpdkcore$LoserID %in% V(tlpdgraph2)$name),]
tlpdcondensed2 <- cbind(tlpdkcore$WinnerID,tlpdkcore$LoserID)
tlpdgraph3 <- graph.data.frame(tlpdcondensed2,directed=FALSE)
V(tlpdgraph3)$label <- unique(c(tlpdkcore$Winner,tlpdkcore$Loser))
V(tlpdgraph3)$size <- 0
V(tlpdgraph3)$label.cex <- 0.75
stlpdgraph2 <- simplify(tlpdgraph3)
cores2 <- graph.coreness(stlpdgraph2)
tlpdgraph4 <- subgraph(stlpdgraph2,as.vector(which(cores2>5))-1)
plot(tlpdgraph4,layout=layout.fruchterman.reingold(tlpdgraph4),main="TLPD Game Network",
sub="Only players with games against 12+ top opponents are shown (12-core)",margin=c(-1,-1,-1,-1))
tlpdkcore <- tlpdkcore[which(tlpdkcore$WinnerID %in% V(tlpdgraph4)$name),]
tlpdkcore <- tlpdkcore[which(tlpdkcore$LoserID %in% V(tlpdgraph4)$name),]
#Likelihood function: vec holds (loser rating - winner rating) for each game;
#plogis(0,vec,2,log.p=TRUE) is the log-probability that the actual winner won
likeli <- function(vec){
return(sum(plogis(0,vec,2,log.p=TRUE)))
}
#Get number of matches against Koreans
getKoreaNumber <- function(player,data){
wins <- data[data$Winner==player,]
losses <- data[data$Loser==player,]
return(length(which(wins$Country.l=="ks"))+length(which(losses$Country=="ks")))
}
#Get number of matches against players in the top 50 by likelihood
getGoodPlayerNumber <- function(player,data,players){
wins <- data[data$Winner==player,]
losses <- data[data$Loser==player,]
return(length(which(wins$Loser %in% players))+length(which(losses$Winner %in% players)))
}
#Get number of matches in premier tournaments
getPremierNumber <- function(player,data,tournaments){
wins <- data[data$Winner==player,]
losses <- data[data$Loser==player,]
return(length(which(wins$Tournament %in% tournaments))+length(which(losses$Tournament %in% tournaments)))
}
premierTournaments <- c("2011 NASL Season 1 Finals","2011 Sapphire AMD Dreamhack Summer","2011 Pepsi Global StarCraft 2 League Season 4: Code S",
"2011 Pepsi Global StarCraft 2 League Season 4: Code A","2011 LG Cinema 3D StarCraft 2 3D Special League",
"2011 Major League Gaming Pro Circuit: Columbus","2011 Major League Gaming Pro Circuit: Dallas",
"2010 Major League Gaming Pro Circuit: Dallas","2010 Major League Gaming Pro Circuit: DC",
"2010 Major League Gaming Pro Circuit: Raleigh","2011 LG Cinema 3D GSL Super Tournament","2011 Gigabyte StarsWar Killer 6",
"2011 Copenhagen Games Razer Domination","2011 North American Star League Season 1",
"2011 LG Cinema 3D GSL Sponsorship League Season 3: Code S","2011 LG Cinema 3D GSL Sponsorship League Season 3: Code A",
"2011 Dreamhack Stockholm Invitational","2011 LG Cinema 3D GSL World Championship Seoul",
"2011 PokerStrategy.com TeamLiquid StarLeague","2011 Intel Extreme Masters World Championship",
"2011 Intel Extreme Masters European Championship Finals Seas","2011 Intel GSL Sponsorship League Season 2: Code S",
"2011 Intel GSL Sponsorship League Season 2: Code A","2011 ASSEMBLY Winter: SteelSeries Challenge",
"2011 Gainward PlayXP.com StarCraft II Tournament","2011 Sony Ericsson GSL Sponsorship League Season 1: Code S",
"2011 Sony Ericsson GSL Sponsorship League Season 1: Code A","2010 Winter DreamHack SteelSeries LAN Tournament",
"2010 Sony Ericsson Global StarCraft 2 League Open Season 3","2010 Sony Ericsson Global StarCraft 2 League Open Season 2",
"2010 TG Sambo-Intel Global StarCraft 2 League Open Season 1","2010 GSTAR StarCraft 2 All-Star Tournament",
"2010 Blizzcon Invitational","2010 Intel Extreme Masters American Championship Season 5","2010 IEM Global Challenge Gamescom")
premierPlayers <- c("Hack","DongRaeGu","PuMa","Bomber","Polt","sC","MC","MMA","NesTea","MVP","MarineKing","SuperNoVa","GuMiho",
"TOP","Byun","Alicia","Line","Ryung","Taeja","Sen","Min","Leenock","LosirA","aLive","Younghwa","oGsJ","Curious",
"Zenio","July","GanZi","Keen","TheStC","HuK","NaNiwa","Nerchio","ThorZaIN","Moon","Yoda","NaDa","Ace",
"GuineaPig","Rain","Maka","Strelok","Squirtle","Stephano","MorroW","IdrA","KiWiKaKi","Puzzle")
kn <- c()
pn <- c()
pln <- c()
for (i in tlpdplayer$Player){
kn <- c(kn,getKoreaNumber(i,tlpdkcore))
pn <- c(pn,getPremierNumber(i,tlpdkcore,premierTournaments))
pln <- c(pln,getGoodPlayerNumber(i,tlpdkcore,premierPlayers))
}
tlpdplayer$KoreaNumber <- kn
tlpdplayer$PremierNumber <- pn
tlpdplayer$PremierPlayers <- pln
library(sn)
mh <- function(data,n=1000,v=0.1,tlpdplayer=tlpdplayer){
#Map TLPD player IDs onto consecutive indices 1..numplayers
numplayers <- length(unique(c(data$WinnerID,data$LoserID)))
tranvec <- vector()
pids <- unique(c(data$WinnerID,data$LoserID))
tranvec[pids] <- 1:numplayers
#Squash each player's game counts into (0,1) for use in the prior
kn <- tlpdplayer$KoreaNumber[match(pids,tlpdplayer$PlayerID)]
pn <- tlpdplayer$PremierNumber[match(pids,tlpdplayer$PlayerID)]
pln <- tlpdplayer$PremierPlayers[match(pids,tlpdplayer$PlayerID)]
kn <- 1/(1+exp(-(kn-50)/15))
pn <- 1/(1+exp(-(pn-30)/10))
pln <- 1/(1+exp(-(pln-15)/5))
#All ratings start at 0; mhdist collects the post-burn-in samples
ratings <- numeric(numplayers)
mhdist <- matrix(nrow=0,ncol=numplayers)
games <- cbind(tranvec[data$WinnerID],tranvec[data$LoserID])
#Uncomment the dsn() term here and below to use the skew normal prior instead of the uniform one
oldll <- likeli(ratings[games[,2]]-ratings[games[,1]]) #+ sum(dsn(ratings, -3, 1+3*pn, 8, log = TRUE))
#Burn-in scales with the number of players (~230,000 draws for 230 players)
burnin <- 1000*numplayers
n <- n + burnin
for (i in 1:n){
canFound <- FALSE
#Keep drawing candidates until one is accepted
while(canFound==FALSE){
can <- rnorm(numplayers,ratings,v) #random-walk proposal around the current ratings
canll <- likeli(can[games[,2]]-can[games[,1]]) #+ sum(dsn(can, -3, 1+3*pn, 8, log = TRUE))
#Accept with probability min(1, posterior ratio); P(B) cancels in the ratio
if(log(runif(1)) < (canll-oldll)){
ratings <- can
oldll <- canll
canFound <- TRUE
#Record samples only after the burn-in period
if(i > burnin){
mhdist <- rbind(mhdist,ratings)
}
}
}
#Progress indicator
if(i %% 1000 == 0){
print(i)
}
}
return(mhdist)
}
fivepercentci <- function(vec){
return(quantile(vec,0.05))
}
ninetyfivepercentci <- function(vec){
return(quantile(vec,0.95))
}
#Run the sampler, then summarize the posterior for each player
system.time(mhraw <- mh(tlpdkcore,10000,0.04,tlpdplayer))
mr <- apply(mhraw,2,mean)
vr <- apply(mhraw,2,var)
fpci <- apply(mhraw,2,fivepercentci)
nfpci <- apply(mhraw,2,ninetyfivepercentci)
quantr <- apply(mhraw,2,quantile)
pids <- unique(c(tlpdkcore$WinnerID,tlpdkcore$LoserID))
mhsummary <- cbind(pids,mr,vr,fpci,nfpci,t(quantr))
mhtlpd <- merge(mhsummary,tlpdplayer,by.x="pids",by.y="PlayerID")
write.csv(mhtlpd, "methast_tlpd_logis_patch13_july.csv")
write.csv(mhraw,"methast_raw_logis_patch13_july.csv")
Just as a quick aside before I get to the results, I'd like to mention that I used RStudio Server running on Amazon EC2 to run the program, and it worked excellently. The ability to run computationally intensive programs on a powerful server for dirt-cheap continues to impress me. If any of you have questions about setting this up, let me know. I used one of Amazon's standard-sized machines and it produced >225,000 samples in about 25 minutes. While it's not as fast to compute as ELO, my hope is that the added predictive ability will more than compensate for the extra time.
-------------------------
Results
-------------------------
I'll be showing you the rankings for the top 50 players out of the sample of 230 under each prior (that's all I could fit on-screen in Excel - if anyone knows how to print a spreadsheet to one long jpeg, let me know). For each player I have computed the mean rating, the variance in ratings, and their highest possible rating. This last value is obtained by calculating the upper 95% confidence bound on the rating and seeing where the player would rank if that were their mean rating. The variance has been color-coded to help you spot players whose ratings carry more uncertainty. Also, don't pay too much attention to the magnitudes of the ratings; they are on a standard logistic curve instead of the modified ELO curve.
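For the curious, the "highest possible rating" column falls out of the posterior samples roughly like this (mhraw is the sample matrix produced by the script above):

#Upper 95% bound on each player's rating, from the posterior samples
upper <- apply(mhraw, 2, quantile, probs=0.95)
means <- apply(mhraw, 2, mean)
#Best-case rank: where a player's upper bound would sit among everyone's mean ratings
bestcaserank <- sapply(upper, function(u) sum(means > u) + 1)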
First, let's examine the rankings based on the completely naive prior:
As I said before, these ratings incorporate no skepticism, so a player who has played only 25 games in the sample, but has won all 25, would be rated infinitely high. No player falls into that category, but there are several players who have won most of their small history of games. My sense is that the lack of skepticism in this prior means the system overrates many of the high-variance players. These players have beaten some of the consensus top players, but it's unclear to me whether that reflects luck or skill, given the small number of games played.
The somewhat less-than-realistic results with the naive prior prompted me to introduce more skepticism into the computation. I did that by means of a skew normal prior whose upper tail varies according to how many games that player has played in Premier tournaments (see above for a detailed explanation). The results are below:
These results strike me as substantially more reasonable: the system doesn't overreact as much to the players with higher variance, but it also doesn't dramatically punish players who haven't played as much (as ELO does).
I'd be curious to see what you think about how these two priors compare. Do you agree that the second rankings are "better"? If you have ideas for other priors that could be tested, let me know.
------------------------
Going Forward
------------------------
From now on I will be sticking with this system for measuring skill, but expanding the model of individual games to account for the effects of race and map. I rushed to get these results done before MLG Anaheim, so that this weekend can be a test of the predictive abilities of the ranking method. Let me know your thoughts on this system and its results below.