Starcraft Statistics in R: A New Ranking Algorithm

zoniusalexandr (United States, 39 posts; joined August 2010)
July 28 2011 04:32 GMT
#1
In this ongoing series, I am using my years of statistical experience to analyze the game of Starcraft 2. Previously, I've looked at visualizing the different scenes in pro competition (Part 1), evaluating ELO as a tool for measuring player performance (Part 2), and calculating to what extent skill can be considered transitive (Part 3). Today I'll be outlining a general framework for modeling Starcraft 2 player performance and explaining how I've used it to rank professional players.

Note: This post is rather long and heavy on the statistical theory. There's no shame in skipping down to the results at the bottom.

Ranking Starcraft Players: A Bayesian Approach

Fundamentally, every system for evaluating players has three main elements: a model of individual games, a set of information to be estimated, and a computational mechanism to tie it all together. The model of individual games defines which attributes are thought to matter in determining a winner. In most cases, the set of information corresponds to a series of ratings, one for each player, which are the "optimal" estimates of each player's true skill level. There are many possible estimation strategies, from simple formulas to complex multivariate regressions.

In ELO, individual games are modeled along the logistic curve, using simple formulas to update player ratings after each game is played. Additionally, the only information obtained from ELO is a single number for each player, which will accurately measure true skill in the long run, but perhaps not over finite samples. If this system produces consistent results, then we have no need to modify any of these elements. However, I believe that ELO does not work consistently when applied to Starcraft at the top levels (and if you've read part 2, then I hope you agree with me).
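For reference, here is a minimal sketch of the standard ELO formulas described above, using the conventional chess constants (scale 400, K = 32); TLPD's variant may use different constants:

# Expected score for player A under the logistic curve
elo_expected <- function(r_a, r_b) {
  1 / (1 + 10^((r_b - r_a) / 400))
}

# Simple post-game rating update: score_a is 1 for a win, 0 for a loss
elo_update <- function(r_a, r_b, score_a, k = 32) {
  r_a + k * (score_a - elo_expected(r_a, r_b))
}

elo_update(1600, 1500, 1)  # the higher-rated winner gains about 11.5 points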

With that in mind, I would like to propose a Bayesian framework which will allow us to improve on ELO with respect to all three elements. This approach, which can be applied to practically any model of individual games, is based around a Bayesian view of player performance using the Metropolis-Hastings algorithm as our estimation tool.

Some of you may be wondering what I mean when I use the term Bayesian to describe this framework. In statistics, there are two main paradigms when it comes to inference: frequentist inference and Bayesian inference. The frequentist method views the model that produces our data as having certain fixed parameters, and uses estimation as a means of finding the most likely values for those parameters. In contrast, the Bayesian view treats the parameters as variables and takes the data to be fixed, estimating the probability of each possible value of a parameter being the "true" value, based on the data.

The Bayesian paradigm is named after a theorem discovered by Thomas Bayes, which is called "Bayes' rule". Its form is quite simple:
P(A|B) = P(B|A) · P(A) / P(B)
The rule provides a way of turning one conditional probability into its opposite, using the ratio of the unconditional probabilities. We will use this rule to calculate the probability of a set of possible parameter values given the data, using the probability of the data given those values. The probability we are calculating, P(A|B), is known as the posterior probability, since it represents the probability of a certain set of values after we have taken the data into account. In contrast, the probability P(A) is known as the prior probability, since it represents the probability before we take the data into account. The quantity P(B|A) is referred to as the likelihood, since it reflects the chances of the data being produced by the set of values in question. Lastly, the term P(B) (sometimes called the evidence) is one we will essentially ignore in our analysis, for reasons explained later.
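To make these terms concrete, here is a toy R illustration (not the ratings model, just Bayes' rule on a grid): the posterior over a single player's win probability p after observing 7 wins in 10 games, under a Beta(2, 2) prior.

# Grid approximation of a posterior for a win probability p
p_grid     <- seq(0.01, 0.99, by = 0.01)
prior      <- dbeta(p_grid, 2, 2)                  # P(A): mild preference for p near 0.5
likelihood <- dbinom(7, size = 10, prob = p_grid)  # P(B|A): chance of 7 wins in 10 games
posterior  <- prior * likelihood
posterior  <- posterior / sum(posterior)           # normalizing plays the role of P(B)
p_grid[which.max(posterior)]                       # most probable value of p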

Now that we understand Bayes' rule, let's observe a few things. First, the end result we obtain is not a single value, nor a finite set of values. Instead, the Bayesian framework lets us look at the entire distribution of possible ratings. For each player we can estimate the most likely rating, the variance of possible ratings, and even compute arbitrary quantiles. Not only that, we can also examine the correlations across player ratings, since we are viewing the joint distribution, not just the marginals.

In addition, we have that prior term in the calculation. This allows our estimation to take into account not only the evidence but also our prior beliefs about what the results might look like. Or not, if we so choose. ELO offers no way to incorporate this. Perhaps we believe that skill ratings are approximately normally distributed; in that case, we could use a bell curve as our prior. These priors are inherently subjective, so it is often useful to examine the results under several different priors.

And finally, observe that the only restriction on the model of individual games is that it produce a likelihood function. The model could be of a logistic form, or a normal form, or something else entirely. It can include just ratings for the players, or take into account a wide variety of other effects, such as racial advantages, map effects and more. Admittedly, this is the area of the framework I have explored least. As you will see below, I keep the logistic curve from ELO in my analysis, but I am hoping to extend this in the near future. With my new computational mechanism I can now move beyond the model limitations of an ELO-like approach.

So now we have a framework in which we can use a wide variety of game models and obtain a large amount of information about our model parameters. The final component is a computational tool to bridge the other two elements. Even though we can calculate our likelihood and prior, we cannot calculate the posterior yet because of that pesky P(B) term. This term represents the unconditional probability of the actual data occurring, taking into account all possible values of the parameters. An astute observer might notice that when we compute the posterior, this P(B) value does not change as we vary the parameter values (the A term). This fact is what lets us use the Metropolis-Hastings algorithm.

The Metropolis-Hastings algorithm is a method for sampling from a distribution using only a factor proportional to its density. In our case, likelihood × prior is proportional to the posterior, because the P(B) value is constant. As a quick overview, we start with an arbitrary set of parameters (setting all player ratings to 0). We then propose random candidate sets of values and accept or reject them probabilistically: a candidate that increases the posterior probability is always accepted, while one that decreases it is accepted with probability equal to the ratio of the new probability to the old. Accepted candidates become the current values, and the process repeats.

As we repeat the process thousands of times, it stops mattering what the initial set of values was (this is called the steady state). Once the algorithm reaches this state, continued sampling is effectively sampling from the posterior, allowing us to estimate the distribution. Typically, we sample several thousand values initially and throw them out, to ensure we have reached the steady state before estimating the posterior. This is known as the "burn-in" process. In my analysis, I use a burn-in period of ~200,000 samples, followed by 10,000 samples for estimating the posterior. (This is probably more than is necessary, but better to err on the side of caution.)
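Here is a minimal, generic Metropolis-Hastings sketch on a toy one-dimensional target, just to show the propose/accept/burn-in loop in isolation (the actual ratings sampler appears in the script further down):

# Sample from a density known only up to a constant factor
log_target <- function(x) -abs(x)^3 / 3    # unnormalized log density

mh_sample <- function(n, burnin = 1000, step = 1) {
  x <- 0                        # arbitrary starting value
  out <- numeric(n)
  for (i in 1:(n + burnin)) {
    cand <- rnorm(1, x, step)   # symmetric random-walk proposal
    # Accept with probability min(1, target(cand)/target(x)):
    # a decrease in density can still be accepted sometimes
    if (log(runif(1)) < log_target(cand) - log_target(x)) x <- cand
    if (i > burnin) out[i - burnin] <- x   # record only post-burn-in draws
  }
  out
}

draws <- mh_sample(10000)
quantile(draws, c(0.05, 0.5, 0.95))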

-------------------------
Implementation
-------------------------

Now that we have covered the theory, let's turn to the actual ways I have implemented this framework. Today I will be using the logistic curve from ELO to model individual games, mostly because I spent all of my time working on other elements of the framework. I think this is still a reasonable model, although it no doubt can (and will) be improved upon in the coming weeks.

The other big choice I had to make was deciding which priors to use in the calculations. Today I will be using two different priors to answer two slightly different questions about Starcraft 2 players. The first is a perfectly naive, uniform prior, which presumes nothing about the players. This is essentially the same as not using a prior at all, so the results are more comparable to ELO. I like to think of this prior as asking "Who could be the best Starcraft 2 player?", since it gives every player the benefit of the doubt.

I also wanted to estimate a model that incorporated some skepticism about players who are newer to the scene or haven't really had their abilities tested at the top level of Starcraft gaming. For this model I did two things. First, I used a skew normal distribution whose probability drops off as ratings increase, which says that skepticism should grow with a player's rating. Second, I varied the thinness of the upper tail according to how many games each player had played in Premier tournaments. The end result is that someone who has played a huge number of games in top tournaments (e.g. MC, Sjow, July, and others) would face a prior that looks like this:
[Image: prior density for a player with many Premier tournament games]
Meanwhile, a player who has never played in a Premier tournament would face a prior that looks like this (note the difference in the axes):
[Image: prior density for a player with no Premier tournament games]
Note that I'm not directly assuming that players with fewer games in Premier tournaments are worse. Instead I'm decreasing the probability in the upper tail, which is equivalent to viewing a higher ranking by someone with few games in such tournaments with a greater amount of skepticism. Also note that this is just my prior, which can be overruled by the likelihood function. If a player has proved that he can beat the best players in the world, he will be ranked highly regardless of how many games he has played in Premier tournaments.
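This prior corresponds to the (commented-out) dsn() term in the script below. As a sketch, assuming the sn package's skew normal density and the same squashing of Premier game counts used in the script:

library(sn)

# Skepticism prior: location -3, right skew (alpha = 8); the scale, and
# hence the upper tail, widens as Premier experience grows
prior_density <- function(rating, premier_games) {
  pn <- 1 / (1 + exp(-(premier_games - 30) / 10))  # squash count into (0, 1)
  dsn(rating, xi = -3, omega = 1 + 3 * pn, alpha = 8)
}

r <- seq(-4, 6, by = 0.1)
plot(r, prior_density(r, 100), type = "l")  # Premier veteran: wide upper tail
lines(r, prior_density(r, 0), lty = 2)      # newcomer: thin upper tail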

The full script I created to run the algorithm and produce the results is below:

rm(list = ls())
options("stringsAsFactors" = FALSE)
dir <- "/home/ubuntu/"
setwd(dir)

# Load the TLPD game results
tlpd <- read.csv(file = "tlpd.csv")

library(igraph)

# Build an undirected game graph and keep the k-core of players who have
# faced many top opponents (this igraph version indexes vertices from 0,
# hence the -1)
tlpd$Date <- as.Date(tlpd$Date, format = "%m/%d/%Y")
tlpdcondensed <- cbind(tlpd$WinnerID, tlpd$LoserID)
tlpdgraph <- graph.data.frame(tlpdcondensed, directed = FALSE)
stlpdgraph <- simplify(tlpdgraph)
cores <- graph.coreness(stlpdgraph)
tlpdgraph2 <- subgraph(stlpdgraph, as.vector(which(cores > 10)) - 1)
tlpd <- tlpd[tlpd$Date > "2011-03-22",]  # patch 1.3 era only

# Attach player metadata to each game
tlpdplayer <- read.csv("tlpdplayers.csv")
tlpdplayer <- tlpdplayer[!duplicated(tlpdplayer$PlayerID),]

tlpdm <- merge(tlpd, tlpdplayer, by.x = "WinnerID", by.y = "PlayerID", suffixes = c("", ".w"), sort = FALSE)
tlpdm <- merge(tlpdm, tlpdplayer, by.x = "LoserID", by.y = "PlayerID", suffixes = c("", ".l"), sort = FALSE)

tlpdkcore <- tlpdm[which(tlpdm$WinnerID %in% V(tlpdgraph2)$name),]
tlpdkcore <- tlpdkcore[which(tlpdkcore$LoserID %in% V(tlpdgraph2)$name),]

# Take a second, stricter k-core and plot the resulting game network
tlpdcondensed2 <- cbind(tlpdkcore$WinnerID, tlpdkcore$LoserID)
tlpdgraph3 <- graph.data.frame(tlpdcondensed2, directed = FALSE)
V(tlpdgraph3)$label <- unique(c(tlpdkcore$Winner, tlpdkcore$Loser))
V(tlpdgraph3)$size <- 0
V(tlpdgraph3)$label.cex <- 0.75
stlpdgraph2 <- simplify(tlpdgraph3)
cores2 <- graph.coreness(stlpdgraph2)
tlpdgraph4 <- subgraph(stlpdgraph2, as.vector(which(cores2 > 5)) - 1)
plot(tlpdgraph4, layout = layout.fruchterman.reingold(tlpdgraph4), main = "TLPD Game Network",
     sub = "Only players with games against 12+ top opponents are shown (12-core)", margin = c(-1, -1, -1, -1))

tlpdkcore <- tlpdkcore[which(tlpdkcore$WinnerID %in% V(tlpdgraph4)$name),]
tlpdkcore <- tlpdkcore[which(tlpdkcore$LoserID %in% V(tlpdgraph4)$name),]

# Log-likelihood of the observed results under the logistic model;
# vec is (loser rating - winner rating) for each game
likeli <- function(vec){
  return(sum(plogis(0, vec, 2, log.p = TRUE)))
}

# Number of matches a player has against Koreans
getKoreaNumber <- function(player, data){
  wins <- data[data$Winner == player,]
  losses <- data[data$Loser == player,]
  return(length(which(wins$Country.l == "ks")) + length(which(losses$Country == "ks")))
}

# Number of matches against players in the top 50 by likelihood
getGoodPlayerNumber <- function(player, data, players){
  wins <- data[data$Winner == player,]
  losses <- data[data$Loser == player,]
  return(length(which(wins$Loser %in% players)) + length(which(losses$Winner %in% players)))
}

# Number of matches played in premier tournaments
getPremierNumber <- function(player, data, tournaments){
  wins <- data[data$Winner == player,]
  losses <- data[data$Loser == player,]
  return(length(which(wins$Tournament %in% tournaments)) + length(which(losses$Tournament %in% tournaments)))
}

premierTournaments <- c("2011 NASL Season 1 Finals","2011 Sapphire AMD Dreamhack Summer","2011 Pepsi Global StarCraft 2 League Season 4: Code S",
"2011 Pepsi Global StarCraft 2 League Season 4: Code A","2011 LG Cinema 3D StarCraft 2 3D Special League",
"2011 Major League Gaming Pro Circuit: Columbus","2011 Major League Gaming Pro Circuit: Dallas",
"2010 Major League Gaming Pro Circuit: Dallas","2010 Major League Gaming Pro Circuit: DC",
"2010 Major League Gaming Pro Circuit: Raleigh","2011 LG Cinema 3D GSL Super Tournament","2011 Gigabyte StarsWar Killer 6",
"2011 Copenhagen Games Razer Domination","2011 North American Star League Season 1",
"2011 LG Cinema 3D GSL Sponsorship League Season 3: Code S","2011 LG Cinema 3D GSL Sponsorship League Season 3: Code A",
"2011 Dreamhack Stockholm Invitational","2011 LG Cinema 3D GSL World Championship Seoul",
"2011 PokerStrategy.com TeamLiquid StarLeague","2011 Intel Extreme Masters World Championship",
"2011 Intel Extreme Masters European Championship Finals Seas","2011 Intel GSL Sponsorship League Season 2: Code S",
"2011 Intel GSL Sponsorship League Season 2: Code A","2011 ASSEMBLY Winter: SteelSeries Challenge",
"2011 Gainward PlayXP.com StarCraft II Tournament","2011 Sony Ericsson GSL Sponsorship League Season 1: Code S",
"2011 Sony Ericsson GSL Sponsorship League Season 1: Code A","2010 Winter DreamHack SteelSeries LAN Tournament",
"2010 Sony Ericsson Global StarCraft 2 League Open Season 3","2010 Sony Ericsson Global StarCraft 2 League Open Season 2",
"2010 TG Sambo-Intel Global StarCraft 2 League Open Season 1","2010 GSTAR StarCraft 2 All-Star Tournament",
"2010 Blizzcon Invitational","2010 Intel Extreme Masters American Championship Season 5","2010 IEM Global Challenge Gamescom")

premierPlayers <- c("Hack","DongRaeGu","PuMa","Bomber","Polt","sC","MC","MMA","NesTea","MVP","MarineKing","SuperNoVa","GuMiho",
"TOP","Byun","Alicia","Line","Ryung","Taeja","Sen","Min","Leenock","LosirA","aLive","Younghwa","oGsJ","Curious",
"Zenio","July","GanZi","Keen","TheStC","HuK","NaNiwa","Nerchio","ThorZaIN","Moon","Yoda","NaDa","Ace",
"GuineaPig","Rain","Maka","Strelok","Squirtle","Stephano","MorroW","IdrA","KiWiKaKi","Puzzle")

# Compute the three experience counts for every player
kn <- c()
pn <- c()
pln <- c()
for (i in tlpdplayer$Player){
  kn <- c(kn, getKoreaNumber(i, tlpdkcore))
  pn <- c(pn, getPremierNumber(i, tlpdkcore, premierTournaments))
  pln <- c(pln, getGoodPlayerNumber(i, tlpdkcore, premierPlayers))
}
tlpdplayer$KoreaNumber <- kn
tlpdplayer$PremierNumber <- pn
tlpdplayer$PremierPlayers <- pln
library(sn)

mh <- function(data, n = 1000, v = 0.1, tlpdplayer = tlpdplayer){
  numplayers <- length(unique(c(data$WinnerID, data$LoserID)))
  tranvec <- vector()
  pids <- unique(c(data$WinnerID, data$LoserID))
  tranvec[pids] <- 1:numplayers  # map player IDs to rating indices
  kn <- tlpdplayer$KoreaNumber[match(pids, tlpdplayer$PlayerID)]
  pn <- tlpdplayer$PremierNumber[match(pids, tlpdplayer$PlayerID)]
  pln <- tlpdplayer$PremierPlayers[match(pids, tlpdplayer$PlayerID)]
  # Squash experience counts into (0, 1); pn feeds the skew normal
  # prior's scale (kn and pln are available for alternative priors)
  kn <- 1/(1 + exp(-(kn - 50)/15))
  pn <- 1/(1 + exp(-(pn - 30)/10))
  pln <- 1/(1 + exp(-(pln - 15)/5))
  ratings <- numeric(numplayers)  # start all ratings at 0
  mhdist <- matrix(nrow = 0, ncol = numplayers)
  games <- cbind(tranvec[data$WinnerID], tranvec[data$LoserID])
  # Uncomment the dsn() terms to add the skew normal skepticism prior
  oldll <- likeli(ratings[games[,2]] - ratings[games[,1]]) #+ sum(dsn(ratings, -3, 1+3*pn, 8, log = TRUE))
  burnin <- 1000*numplayers
  n <- n + burnin

  for (i in 1:n){
    canFound <- FALSE
    # Propose until a candidate is accepted; only accepted states are recorded
    while (canFound == FALSE){
      can <- rnorm(numplayers, ratings, v)  # random-walk proposal
      canll <- likeli(can[games[,2]] - can[games[,1]]) #+ sum(dsn(can, -3, 1+3*pn, 8, log = TRUE))
      # Accept if log(U) is below the log acceptance ratio
      if (log(runif(1)) < (canll - oldll)){
        ratings <- can
        oldll <- canll
        canFound <- TRUE
        if (i > burnin){
          mhdist <- rbind(mhdist, ratings)  # record post-burn-in samples
        }
      }
    }
    if (i %% 1000 == 0){
      print(i)  # progress indicator
    }
  }
  return(mhdist)
}

fivepercentci <- function(vec){
  return(quantile(vec, 0.05))
}

ninetyfivepercentci <- function(vec){
  return(quantile(vec, 0.95))
}

# Run the sampler (burn-in of 1000*numplayers iterations, then 10,000
# recorded draws) and summarize the posterior per player
system.time(mhraw <- mh(tlpdkcore, 10000, 0.04, tlpdplayer))
mr <- apply(mhraw, 2, mean)  # posterior mean rating
vr <- apply(mhraw, 2, var)   # posterior variance
fpci <- apply(mhraw, 2, fivepercentci)
nfpci <- apply(mhraw, 2, ninetyfivepercentci)
quantr <- apply(mhraw, 2, quantile)
pids <- unique(c(tlpdkcore$WinnerID, tlpdkcore$LoserID))
mhsummary <- cbind(pids, mr, vr, fpci, nfpci, t(quantr))
mhtlpd <- merge(mhsummary, tlpdplayer, by.x = "pids", by.y = "PlayerID")

write.csv(mhtlpd, "methast_tlpd_logis_patch13_july.csv")
write.csv(mhraw, "methast_raw_logis_patch13_july.csv")


Just a quick aside before I get to the results: I'd like to mention that I used RStudio Server running on Amazon EC2 to run the program, and it worked excellently. The ability to run computationally intensive programs on a powerful server for dirt cheap continues to impress me. If any of you have questions about setting this up, let me know. I used one of Amazon's ordinary-sized machines and it produced >225,000 samples in about 25 minutes. While it's not as fast to compute as ELO, my hope is that the added predictive ability will more than compensate for the extra time.

-------------------------
Results
-------------------------

I'll be showing you the rankings for the top 50 players out of the sample of 230 under each prior (that's all I could fit on-screen in Excel; if anyone knows how to print a spreadsheet to one long jpeg, let me know). For each player I have computed the mean rating, the variance in ratings, and their highest possible rating. This last value is obtained by calculating the upper 95% confidence bound on the rating and seeing where the player would rank if this were their mean rating. The variance has been color-coded to help you notice players whose ratings carry more uncertainty. Also, don't pay too much attention to the magnitudes of the ratings; they are on a standard logistic curve rather than the modified ELO curve.
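Concretely, the "highest possible rating" re-ranking can be computed from the sampler output like this (a sketch using mhraw and the posterior means mr from the script above):

# Upper 95% quantile of each player's posterior rating draws
upper95 <- apply(mhraw, 2, quantile, probs = 0.95)
# Rank each player's upper bound against everyone's mean rating
highest_possible_rank <- sapply(upper95, function(u) sum(mr > u) + 1)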

First, let's examine the rankings based on the completely naive prior:
[Image: top 50 rankings under the naive uniform prior]
As I said before, these ratings incorporate no skepticism, so a player who had played only 25 games in the sample but won all 25 would be rated infinitely high. No player falls into that category, but several players have won most of their small history of games. My sense is that the lack of skepticism in this prior causes the system to overrate many of the high variance players. These players have beaten some of the consensus top players, but with so few games played it's unclear whether that reflects luck or skill.

The somewhat unrealistic results under the naive prior prompted me to introduce more skepticism into the computation. I did that by means of a skew normal prior that varies in the upper tail according to how many Premier tournament games each player has played. See above for a detailed explanation, and below for the results:
[Image: top 50 rankings under the skeptical skew normal prior]
These results strike me as substantially more reasonable: the system isn't overreacting as much to the high variance players, but it also isn't dramatically punishing players who haven't played much (as ELO does).

I'd be curious to see what you think about how these two priors compare. Do you agree that the second rankings are "better"? If you have ideas for other priors that could be tested, let me know.

------------------------
Going Forward
------------------------

From now on I will be sticking with this system for measuring skill, but expanding the model of individual games to account for the effects of race and map. I rushed to get these results done before MLG Anaheim, so that this weekend can be a test of the predictive abilities of the ranking method. Let me know your thoughts on this system and its results below.
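As a sketch of what that extension might look like (the advantage terms here are hypothetical, not the final model), the likelihood from the script would simply gain shift terms:

# Extended logistic likelihood: a positive advantage for the winner
# (race matchup + map, estimated as extra parameters) makes the
# observed result more likely
likeli_extended <- function(r_winner, r_loser, adv_winner) {
  sum(plogis(0, r_loser - r_winner - adv_winner, 2, log.p = TRUE))
}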

*****
TheAmazombie (United States, 3714 posts; joined September 2010)
July 28 2011 04:48 GMT
#2
First of all, I love the effort you put into all of this.

I do think the second list looks more reasonable. I do find a few weird things in there, like HuK being so low even after all of his recent success. Also, is that IM.Happy or the European Happy? I would be surprised to see the European Happy that high, even though I love him as a player. And I love Kas, but it seems strange to see him over players like SaSe, White-Ra, MaNa, Socke, and others.
zoniusalexandr (United States, 39 posts; joined August 2010)
July 28 2011 05:21 GMT
#3
Thanks!

First, that's Empire.Happy. Second, I cannot emphasize enough that this is still a work in progress. I'll be tinkering with the prior based on what happens in Anaheim. One other thing I've thought about doing is maybe weighting the games differently depending on how premier the tournament is. Right now a win in the round of 1024 in a random TL Open counts the same towards skill as Game 7 of the NASL finals (as long as it's the same players playing against each other).

For the record, Happy has won 6 of 8 against Socke, but has lost 6 of 9 against MaNa and 4 of 4 against SaSe. He hasn't played against White-Ra in the patch 1.3 era.
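(A sketch of the weighting idea mentioned above: scale each game's log-likelihood contribution by a tournament weight. The weights here are purely hypothetical.)

# Weighted version of the likelihood from the main script
likeli_weighted <- function(vec, w) {
  sum(w * plogis(0, vec, 2, log.p = TRUE))
}

# e.g. w <- ifelse(data$Tournament %in% premierTournaments, 1.0, 0.5)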
RoninShogun (United States, 315 posts; joined November 2010)
Last Edited: 2011-07-28 05:25:14
July 28 2011 05:24 GMT
#4
I understand what you're doing and all, but it's just weird to look at the second list and see NesTea at 17th, though that is a major improvement over him not appearing on the first one at all.

edit: oops, he's actually 11th on the first one. hmmmmmmmmmmmmm...
TheAmazombie (United States, 3714 posts; joined September 2010)
July 28 2011 06:05 GMT
#5
On July 28 2011 14:21 zoniusalexandr wrote:
[quoting post #3 above in full]


I see. That seems to be where the most work is needed, then: the weighting of tourney results and placements, because there has to be a difference in skill between the NASL finals and the round of 1024. I dig it though.

I said once on one of your other blogs that I would like to see a system that could take situations into account, like the final game of a 3-3 tie in a Bo7 championship. Not to mention even more in-depth intangibles like clutch performance and map performance, let alone in-game situations like coming back to win from huge deficits. I know that goes way deeper than what you are working on, but someday I would like to see it; as a huge baseball fan, I love player and team statistics. That brings me to another point: what about team situations? You could weight a player with their team, so that team ratings, or every person on a team, could get some kind of ranking based on their teammates' rankings. I know it is not a "team" sport the way soccer or football are, but in time that could become a factor.
rabidch (United States, 20289 posts; joined January 2010)
July 28 2011 06:21 GMT
#6
Interesting; some players jumped up insanely in ranking, like NaNiwa, Moon, and Sen. I kind of want to test it on BW, since it's a very simple modification to the script and BW has very long histories, though you can't go from BW -> SC2. Don't have the time now though.
Primadog (United States, 4411 posts; joined April 2010)
July 28 2011 20:12 GMT
#7
holy shit, time to brush off my stats book. This will take some time.


Again, nice work. I am your biggest fan.
Erik.TheRed (United States, 1655 posts; joined May 2010)
July 28 2011 20:20 GMT
#8
Hooray for graphs and spreadsheets!!

No, seriously, this looks really interesting.

I'm going to try my best to get through all of this.
"See you space cowboy"
Primadog (United States, 4411 posts; joined April 2010)
July 28 2011 20:22 GMT
#9
Re: spreadsheet

I personally found that [code] [/code] tags are the best way of displaying tables on TL. Copy-pasting from the spreadsheet into [code] retains the format reasonably well with minimal editing needed. See here for a sample.

Any chance you can put this dataset on google docs/spreadsheet? I can probably make use of your research in the future.
Primadog (United States, 4411 posts; joined April 2010)
July 28 2011 20:30 GMT
#10
TrueSkill uses a conservative skill estimate, μ - k·σ, for its displayed skill level; can you output a similar set for your rankings (as opposed to the 95% ranking)?

Breezing through the data, what strikes me right away is that your 95% ranking is a good indicator of the "skill tiers" that have long been asserted subjectively on the forums but never really researched.
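(Such a conservative estimate is easy to produce from the sampler output; a sketch using mr and vr from the script above, with TrueSkill's conventional k = 3:)

# TrueSkill-style conservative display rating: mean minus 3 standard deviations
conservative <- mr - 3 * sqrt(vr)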

alexhard (Sweden, 317 posts; joined May 2010)
Last Edited: 2011-07-29 00:37:14
July 29 2011 00:35 GMT
#11
Well, depending on the actual forecast results, this writeup may have been in vain... looking forward to seeing how your model performs.

If you're interested in individual game prediction, I've had good results with a neural network ensemble using ELOs, matchup winrates, and winrate over the last 20 games as inputs.
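(A sketch of that ensemble idea using R's nnet package; the games data frame and its feature columns are hypothetical:)

library(nnet)

# Fit several small networks on the suggested features; the ensemble
# prediction is the average of the individual networks' outputs
fit_ensemble <- function(games, n_nets = 10) {
  lapply(1:n_nets, function(i) {
    nnet(win ~ elo_diff + matchup_wr + recent_wr, data = games,
         size = 5, decay = 0.1, maxit = 500, trace = FALSE)
  })
}

predict_ensemble <- function(nets, newdata) {
  rowMeans(sapply(nets, predict, newdata = newdata))
}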
Heyoka (Katowice, 25012 posts; joined March 2008)
July 29 2011 21:50 GMT
#12
This is really awesome. I was just thinking about your transitivity problem earlier and had some thoughts, but they should wait until I can digest this. Unfortunately, I've never really taken the time to properly grasp Bayesian models, but I see them referenced all the time in more advanced prediction simulations.
alexhard (Sweden, 317 posts; joined May 2010)
August 01 2011 14:19 GMT
#13
So how did it go? :D
Cassel_Castle (United States, 820 posts; joined July 2011)
August 02 2011 18:18 GMT
#14
I'm taking a 6-week mathematical statistics class starting today. In 6 weeks or less I will comment on this post and tell you what I think of your ranking system.