Fun with TLPD and R

zoniusalexandr

United States39 Posts

July 07 2011 06:37 GMT

I've been working on a way to statistically evaluate player performances that improves upon ELO, but before I can do any calculations I need data!

Now that we've got all the data in one place and in a common format, we can start having fun. I've been working on network problems recently for my research, so I thought I'd whip up a few graphs.

Here's a quick graph of the game network for the Homestory Cup #3, just to make sure everything's working before we look at the whole dataset. Here the nodes represent players, and each line connecting two players represents one game.
[image loading]

Everything looks good, nothing out of the ordinary. You can see the groups of 4 pretty clearly, with the players who advanced out of group stages in the middle.

Now let's try something more difficult. I thought it might be really cool to look at the whole game network of the TLPD. However, there are a huge number of players who have played only a handful of games, and so plotting all the players becomes overwhelming. Instead, I've limited the sample to a k-core or a k-degenerate graph, which is basically the largest subset of players who have played games against at least k other opponents. I tried out a bunch of values, but k=12 seemed to work the best (you're welcome to experiment on your own). The nodes are represented by the player name, which is colored to represent their most played race:
(edit: this graph originally included redundant games between pairs of players. I've removed the duplicates, which biased the network against the Koreans, who don't have as many games on record)
[image loading]

This graph shows quite clearly the three different scenes, the Koreans on the right, Europeans on bottom-left, and Americans on bottom-right. Even though these three are fairly distinct, there's a lot of cross-over, especially for players in the middle.

That's all I've got for today, getting the data into R took longer than I thought. For my next analysis, I'm going to be working on a new algorithm to rank players based on their performance and the difficulty of their opponents that might replace ELO. I've almost finished working out the math, but my plan is to use Metropolis-Hastings to maximize the likelihood of the realized game outcomes based on the players skill levels. I'm still a little new to Bayesian methods, so ideas/comments would be much appreciated!

PS: Here's the R code I used to consolidate the data and generate the network graphs.
+ Show Spoiler [R Code] +

rm(list=ls())
options("stringsAsFactors"=FALSE)
setwd(dir)

int <- read.csv("tlpd_international.csv")
int <- int[ ,1:12]
int$edition <- "International"

kor <- read.csv("tlpd_korean.csv")
kor <- kor[ ,1:12]
kor$edition <- "Korean"

beta <- read.csv("tlpd_beta.csv")
beta <- beta[ ,1:12]
beta$edition <- "Beta"

tlpd <- rbind(int,kor,beta)

write.csv(tlpd, file = "tlpd.csv", append = FALSE)

library(igraph)

getRaceColors <- function(players,tlpd){
races <- c()
for (player in players){
games <- c(tlpd[tlpd$Winner==player,8],tlpd[tlpd$Loser==player,11])
t <- length(games[games=="T"])
z <- length(games[games=="Z"])
p <- length(games[games=="P"])
if (t > z){
races <- c(races,"#00005d")
}
else if (z > p){
races <- c(races,"#890000")
}
else {
races <- c(races,"#006e2f")
}
}
return(races)
}

event = "2011 Homestory Cup #3"
hsc = tlpd[tlpd$Tournament == event,]
hsccondensed = hsc[ ,c(7,10)]
hscgraph <- graph.data.frame(hsccondensed,directed=FALSE)
plot(hscgraph,vertex.label=unique(c(hsc$Winner,hsc$Loser)),layout=layout.kamada.kawai(hscgraph),
vertex.size=20,vertex.label.cex=0.7,main="Homestory Cup #3 Game Network")

tlpdcondensed = tlpd[ ,c(7,10)]
tlpdgraph <- graph.data.frame(tlpdcondensed,directed=FALSE)
V(tlpdgraph)$label <- unique(c(tlpd$Winner,tlpd$Loser))
V(tlpdgraph)$size <- 0
V(tlpdgraph)$label.cex <- 0.75
cores <- graph.coreness(tlpdgraph)
tlpdgraph2 <- subgraph(tlpdgraph,as.vector(which(cores>30))-1)
V(tlpdgraph2)$label.color <- getRaceColors(V(tlpdgraph2)$label,tlpd)
plot(tlpdgraph2,layout=layout.fruchterman.reingold(tlpdgraph2),main="TLPD Game Network",
sub="Only players with 30+ games against top opponents are shown (30-core)",margin=c(-1,-1,-1,-1))

R1CH

Netherlands10340 Posts

July 07 2011 06:46 GMT

Scraping TLPD (and other parts of the site) is against our terms of service as it creates undue load on our servers. I've edited the offending portion of your post out, I would appreciate it if you didn't distribute such tools.

sinani206

United States1959 Posts

July 07 2011 06:48 GMT

You'd think Tyler would be doing better considering all of the games he's played...

Anyway, really nice work. It was very interesting to see the analysis of HSC and of the TLPD in general.

zoniusalexandr

United States39 Posts

July 07 2011 07:03 GMT

Sorry about that! I've taken down the hosted script file. Would it be acceptable if I still distributed the csv file? I know a number of people have been looking to get access to that data, and distributing the compiled data file won't affect your servers.

R1CH

Netherlands10340 Posts

July 07 2011 07:10 GMT

We'd prefer if you didn't. Our volunteers have put years of work into building up the TLPD database and making it one of TL's top resources and we'd like to keep that data on TL; it represents a lot of time and effort by our TLPD team.

If you have ideas for more statistics and features however, you can definitely post them in the TLPD feedback thread and we'll see if we can make it available through a more official means