|
Hey TeamLiquid! I decided to show you guys the new ELO system I have created and ask for your input. The program takes text files as input, each listing all the games of a particular tournament, then replays those games in order and recalculates ELO after each result. I know there are ELO systems out there already, but I wanted to look into a system that could, with some degree of accuracy, find out who the best players are. Here is a screenshot of the tournament files; throughout this post I will show results based on different settings.
![[image loading]](http://i.imgur.com/k54x3.png)
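To give an idea of the mechanics, here is a rough sketch of the replay loop. I'm assuming a made-up format of one winner,loser pair per line; my real format is a bit different, so treat this as illustrative only:

```python
import glob

BASE_RATING = 1200
ratings = {}

def update_elo(winner, loser):
    ...  # the rating update itself is shown further down

# Replay every tournament file in chronological order so the
# ratings evolve the same way the events actually happened.
for path in sorted(glob.glob("tournaments/*.txt")):
    with open(path) as f:
        for line in f:
            winner, loser = line.strip().split(",")
            ratings.setdefault(winner, BASE_RATING)
            ratings.setdefault(loser, BASE_RATING)
            update_elo(winner, loser)
```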
The reason I actually came here is to get some input on how the system should change ratings. The system starts by taking a game, in this case, let's say MarineKing vs. Zenio as shown in the picture. Before I describe the system, it should be noted that I stole this whole system straight from the ELO rating system Wikipedia page. The system first calculates an expected score, described by
![[image loading]](http://i.imgur.com/wIaV7.png)
Where Ea is the expected score for player A, Rating B is the ELO of player B, and Rating A is the ELO of player A (written out: Ea = 1 / (1 + 10^((RatingB - RatingA) / 400))). It then calculates the new ELO based on
![[image loading]](http://i.imgur.com/UhCd6.png) Where Ra' is the new ELO of player A, Ra is the old ELO of player A, K is the weighting factor, Sa is the score of player A (1 for a win, 0 for a loss), and Ea is the expected score for player A (written out: Ra' = Ra + K * (Sa - Ea)). The issue I am having right now is that the traditional chess ELO setup is almost impossible to use because of how StarCraft tournaments are structured. In the chess system, players start out with a high weighting factor for their first X games to catapult them to where they should be; this weighting factor is then lowered so that ratings are not extremely volatile. To me this does not work, because a player's skill relative to every other player's IS volatile. To test different weighting factors, I set up the system with a base rating of 1200 and a K factor of 20, 40, and 80 respectively, inputting only premier StarCraft events (MLG, GSL, DreamHack, IEM, etc.). I get the following top 10 players:
![[image loading]](http://i.imgur.com/8JvNC.png)
![[image loading]](http://i.imgur.com/2Xvyw.png)
![[image loading]](http://i.imgur.com/zmJP5.png)
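Filling in the update_elo placeholder from the sketch above, the whole update is only a few lines; K here is the weighting factor being compared in the screenshots:

```python
K = 20  # try 20, 40, or 80 to reproduce the three tables above

def expected_score(ra, rb):
    # Ea = 1 / (1 + 10^((Rb - Ra) / 400)), straight from the
    # Wikipedia formula described earlier.
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def update_elo(winner, loser):
    # Ra' = Ra + K * (Sa - Ea); the winner scored 1, the loser 0.
    ew = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1.0 - ew)
    ratings[loser] -= K * (1.0 - ew)
```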
It seems that a very high weighting factor makes it so that the people doing well most recently have the highest ELO. I am not sure if this is desirable, but it seems like NesTea should be leaps and bounds above everybody else in terms of ELO, and this certainly does that. A higher weighting factor also stretches out the gap, so that the best players have a much larger ELO than average or new players. I think this is desirable too, but I really haven't thought through the consequences of a higher weighting factor for people with a lower ELO.
Another point is how I calculate the change in ELO. Right now every individual game has the same effect no matter what, so a best of one carries the same weight as a best of seven. I was thinking there should be a multiplier on the weight based on how long the series is: a best of seven might have a multiplier of four, whereas a best of one has a multiplier of one (a sketch of this, together with the event weights below, follows the next paragraph).
Some other issues that have come up are whether or not to add in the open bracket from MLG, small tournaments and cups, team leagues, show matches, etc. I could add them all and do something similar to how longer matches might be weighted: premier events get a weight of 1, major events a weight of 0.9, small events a weight of 0.7, and I could figure out how those feed into a player's actual skill. I was thinking of weighting based on prize money, but events like MLG do not have a direct correlation between prize money and the overall talent at the event.
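To make both of the weighting ideas above concrete, here is a rough sketch of how they could scale the K factor. The Bo3/Bo5 multipliers are interpolated guesses on my part; only Bo1 = 1 and Bo7 = 4 come from the proposal above:

```python
# Hypothetical multipliers, not tested values.
SERIES_MULTIPLIER = {1: 1.0, 3: 2.0, 5: 3.0, 7: 4.0}
EVENT_WEIGHT = {"premier": 1.0, "major": 0.9, "small": 0.7}

def effective_k(base_k, best_of, event_tier):
    """Scale the weighting factor by series length and event tier."""
    return base_k * SERIES_MULTIPLIER[best_of] * EVENT_WEIGHT[event_tier]
```

So effective_k(20, 7, "premier") would give 80 for a GSL best of seven, while effective_k(20, 1, "small") would give 14 for a Bo1 in a weekly cup.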
Another issue is that some events run over a long period of time, such as NASL; right now I group them based on when they started. This is something I could probably fix for the tournaments that have already happened, and it shouldn't be a problem for events from now on. The last issue I can think of right now is that some ladder systems have some type of protection for players at the very top, meaning that if a highly rated player loses to a very lowly rated player, their ELO doesn't tank. HoN comes to mind, but I guess that has to do with team rating. Does anyone know how this works? Do you think it should be implemented? A problem with it, though, is that higher rated players could tend to stick at the top even if they lose to lower rated players often.
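If protection turns out to be worth trying, one simple way to sketch it would be capping how many points a highly rated player can lose in a single game. This is purely my guess at the mechanism, not how HoN or anyone else actually implements it:

```python
PROTECTED_RATING = 1500  # hypothetical threshold for "top" players
MAX_LOSS = 10            # hypothetical cap on points lost per game

def protected_delta(rating, delta):
    """Limit the single-game rating loss for highly rated players."""
    if rating >= PROTECTED_RATING and delta < -MAX_LOSS:
        return -MAX_LOSS
    return delta
```

Note that this breaks the zero-sum property: points leave the top slower than they arrive, which is exactly the stick-at-the-top problem mentioned above.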
The last thing I want to say before I conclude is how this differs from the TLPD ELO system. Based on what I know, the base rating for TLPD is 2000, with a weighting factor of 40 for a player's first however-many games (I think someone in IRC said 20) and 20 from then on. I thought I would state that so there is something to compare against. For right now, there is no weighting change based on games played in what I am doing.
Those are all the things I could think of for now. The reason I wanted to talk to TL about this is that I want to end up with a system that is fairly accurate for SC2, and I don't think I can do it by myself because I tend to not think of everything. If you guys want me to test something out, want something clarified, have questions, etc., feel free.
For a thread on how the TLPD ELO works:
http://www.teamliquid.net/forum/viewmessage.php?topic_id=59138
And for a thread that has some ideas on why ELO doesn't work:
http://www.teamliquid.net/blogs/viewblog.php?topic_id=241535
|
This is a subject that is close to my heart. I think a good ELO-based system is something SC2 can gain a lot from: for example, comparing any two players, checking who the world's top players are, or estimating a tournament's favorites.
I've read the threads you linked, and I must say I am not convinced by the arguments for why an ELO system doesn't work for SC2. I'll briefly address my thoughts on them (his original points quoted first):
1) ELO only works in one direction
2) ELO does poorly with small amounts of games
- We don't have to start our ranking from today with everyone having X amount of ELO rating and go from there. With over a year of SC2 behind us, we can begin our ranking from any historical date we choose and update the players' ratings from there. This helps mitigate both points. For players who aren't well known and have played only a small number of 'ranked' tournament games, their rating will obviously reflect their actual skill poorly, but this problem seems unavoidable to me no matter what system you use (and yes, they should begin with a higher K factor to mitigate it).
3) ELO measures dominance
- I don't think this is as big a concern as the post makes it out to be. In his example, if he is dominating his local community and Flash is dominating the Korean scene, then Flash is dominating a group of higher ranked players, so his rating will end up higher. There are only so many points to be gained from having even a 90%+ win ratio vs lower skilled opponents, as can be observed by looking at the top players on the ladder, who regularly face lower skilled opponents than themselves and whose points stay in check (in other words, gaining few points per victory and losing many points per loss eventually catches up to you and keeps you within a reasonable range of your opponents).
The only thing to worry about is whether there is such a great disconnect between the Korean scene and the international one that the two rating pools act almost separately. For example, if only 1% of games are between Koreans and non-Koreans, then the ratings will be out of sync. The first solution I came up with was to attach greater significance to games between Koreans and non-Koreans (a higher K factor), but if the number of such games is very small, I don't think that will be enough. Perhaps there is no option other than manually shifting the ratings of one region to better reflect actual skill: for example, if we look at the data and see that 2000-rated Koreans are performing like 2300-rated players when facing non-Koreans, then perhaps Korean ratings need to be shifted by 300 points as a whole. This is obviously very crude, but it's the best I can think of.
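Here is a crude sketch of how that offset could be estimated from the data. The 400 and the log10 come straight from inverting the ELO expected-score formula; the aggregate averaging is my own simplification:

```python
import math

def estimated_shift(cross_games):
    """cross_games: list of (korean_rating, foreign_rating, korean_won).
    Returns a rough offset to add to the Korean pool, assuming the
    observed cross-region score is strictly between 0 and 1."""
    actual = sum(1.0 if won else 0.0
                 for _, _, won in cross_games) / len(cross_games)
    avg_diff = sum(kr - fr for kr, fr, _ in cross_games) / len(cross_games)
    # Rating gap that would produce the observed score
    # (inverse of Ea = 1 / (1 + 10^(-gap/400)))...
    implied_diff = 400.0 * math.log10(actual / (1.0 - actual))
    # ...minus the gap the current ratings already claim.
    return implied_diff - avg_diff
```

Averaging over the whole pool glosses over the nonlinearity of the curve, but as a sanity check for a 300-point-style shift it would probably do.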
4) We don't live in an ideal world...
- Obviously we have to make do with what we have, but I think it's enough. Top players have histories of dozens if not hundreds of games against various opponents. If anything, it's the tier-2 players who have only participated in one or a few tournaments that represent the greatest challenge to the system, since for them the data is scarce.
A few more comments / questions regarding your post:
I am not sure how many games you included in your example analysis (K 20, 40, 80), but I suspect that if you include more / go further back in time, there will be less variance in your results. I think part of the discrepancy, besides the obvious point that a greater K factor gives more importance to later results, is that initially all players begin with a 1200 rating, which is obviously not representative of their skill. As you include more and more games, the effect of this initial error gets mitigated.
Regarding different factors for best of 1/3/5/7, I think it's wrong to give them different factors, at least the ones you mentioned. Consider the following: you and I both have a rating of 1000. We face player X, who has a rating of 1500. I play him in a best of 1, and you play him in a best of 7. In any single game you and I each have a 5% chance to beat him (a made-up number for 1000 vs 1500). That means I have a 5% chance to take my series from him, while you have closer to a 0.02% chance to take yours (if I still remember my statistics). Since some tournaments use Bo1, some use other formats, and some/most tournaments even use different series lengths at different stages, I suggest we only look at individual games and give each game the same factor. I'm not sure how other ELO systems handle this, for example chess tournaments; I'd be curious to know, so if anyone knows, please say.
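The series math above checks out, and it's quick to verify; the 0.05 is the per-game probability from the example:

```python
from math import comb  # Python 3.8+

def series_win_prob(p, best_of):
    """Chance of winning a best-of-N series given per-game win
    probability p (no draws, games treated as independent)."""
    wins_needed = best_of // 2 + 1
    # Sum over how many losses occur before the clinching win.
    return sum(comb(wins_needed - 1 + losses, losses)
               * p ** wins_needed * (1 - p) ** losses
               for losses in range(wins_needed))

print(series_win_prob(0.05, 1))  # 0.05
print(series_win_prob(0.05, 7))  # ~0.0002, i.e. roughly 0.02%
```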
While on this subject, I'd be very careful about assigning different factors for different types of events. It seems unnecessary to me: either an event is ranked or it isn't. A show match, for example (such as the Boxer vs. Yellow show match), should be unranked because it is obviously not entirely competitive. Other than that, if a player is playing in a tournament, you can only assume he's playing seriously and adjust his rating according to his performance, no matter whether it is a GSL final or the round of 32 in a small international tournament.
I didn't find it in your post: could you perhaps elaborate on what you find inadequate about the TLPD system? What is it that you want to fix? You stated that the difference between your system and the TLPD one is that your K factor doesn't change, but I'm not sure why you find the changing-K system less desirable.
As for what K factor to choose, I think the standard approach of new players starting with a high K factor that drops after a certain number of games should work decently (a varying K factor). I would tweak the starting K factor so that a newcomer with an amazing performance doesn't rise to the top of the list too quickly; beyond that, the best number of games and the proper K factor can probably be found by running simulations and seeing which values bring most players near their final ratings fastest.
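For reference, a varying-K schedule is tiny to implement. Something like the following, where the 40/20 split and the 20-game cutoff are just the TLPD-style numbers mentioned in the original post, not tuned values:

```python
PROVISIONAL_GAMES = 20      # cutoff reportedly used by TLPD (unverified)
K_NEW, K_ESTABLISHED = 40, 20

def k_factor(games_played):
    """Use a higher K while a player's rating is still provisional."""
    return K_NEW if games_played < PROVISIONAL_GAMES else K_ESTABLISHED
```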
I wish you success with your system; this is something I would really want to see work properly. The ladder is obviously bad at comparing the top players to one another, and anything is better than what we have now, where the only ways to compare two players are the last time(s) they played each other and which player won which tournament in recent / not-so-recent history.
|
To answer your question as to why I might think the TLPD system is inadequate: I don't think their system is necessarily wrong. I am just trying to test different systems to see if I can understand how an ELO should change based on how the structure of a sport works. One thing I want to do is combine the foreign ELO with the Korean ELO, which TLPD keeps separate right now. This opens a whole new can of worms, including foreigners boosting each other's ELO when they never have to test their skill against the Koreans, while the Koreans are "beating each other up", so some very good players may end up with a lower ELO than their foreign counterparts. As you said, though, these things tend to work themselves out to a degree the longer the system is kept in place. As for the changing-K system that TLPD uses, it just doesn't seem necessary to me. You are trying to ease new players into the system while keeping the pros from losing too many points to a player who is actually better than his rating says. It just seems to me that as time goes on, these figures smooth each other out, and the only real consequence of not having a changing-K system is that ratings will move a bit more slowly overall.
I was thinking a ton about the K factor and why it should or should not change, and the only rationale I came up with is that if a game has higher volatility of skill (in terms of change in who is arguably the best player), then a higher K factor will reflect this more quickly. There is danger in setting it too high, though: if we put the K factor at, say, 10000, then the winner of the most recent tournament is always number 1 in ELO. I did some testing where I added more and more tournaments with a K value of 80, and it is EXTREMELY unforgiving. A player like Jinro, even after his success in reaching the RO4 of two GSLs, would lose 60-70 points for a loss against very good players. The thing that seems right about it, though, is that the people who have been on a tear are at the top and the people who have been struggling are not. Someone like Jinro who has had very good results isn't completely screwed by a high-K system, but unless you put up results you're not going to be in the top tier. So I guess that brings up the question: what is the system used for? Is it to determine who is favored in an upcoming match? Is it to determine the best player over the body of his work? Is it to determine some type of "skill rating"? I guess that is a question I need to answer.
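To put those K = 80 numbers in perspective using the update formula from the first post: a loss costs K * Ea points, so a dead-even game (Ea = 0.5) costs 40 points, and a game where you were an 80% favorite (Ea = 0.8) costs 64. Losing 60-70 points therefore means the system considered Jinro a heavy favorite in those games.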
I agree with you about the best-of system, where you basically just take it a game at a time. This gives people who dominate somebody in a Bo7 a larger point boost than before, and somebody who loses a Bo1 (like NesTea a couple of seasons ago) doesn't lose a bunch of points because of it. As for the leagues and such, it seems I just need to decide what matters and what doesn't. I would like some opinions on things like money show matches, such as the ones Destiny is doing, or The V, etc. I feel like those matter and should be counted, whereas things like Yellow vs. Boxer should not be added.
The reason I did this in the first place is that I am trying to learn GUI programming, and I like to spend time programming as a way to relax after the day, so I felt I should make something I might enjoy making. I have a hard time motivating myself to do something I don't enjoy, so I figured I would try to make something that interests me.