|
I was wondering if you had the p-value for the chi-square test. I didn't see a null hypothesis or whether the results were statistically significant.
p-values are nearly zero for all leagues except silver, where it is above 0.2 or 0.3 if memory serves. That's with an n of ~50 million.
Stats minor here. I redid (I hope) your results with the numbers you put in here, using a chi-square GOF test for homogeneity. In my findings, the only league that is statistically imbalanced is bronze, based solely on the W/L you reported. (The p-value at the end was about .12, higher than the .05 I set as the requirement for the null hypothesis to be overruled.) From Silver to Diamond it didn't go above .03, which essentially means that after bronze, no race has a W/L record that is significantly different from any other race.
Unless I'm misunderstanding something, that's the test I've done, but with a different sample size. If you use percentages as your counts then your sample size is 100, unless you've adjusted them for the actual sample size (as I have). For a chi-square test it's important that the correct sample size be used.
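To make the sample-size point concrete, here is a minimal Python sketch of a chi-square test of homogeneity on a 3x2 table (three races, wins and losses). The counts are invented for illustration and are not the numbers from this thread, and `chi_square_homogeneity` is a name made up for this sketch. It relies on the fact that with 2 degrees of freedom the chi-square tail probability is exactly exp(-x/2), so only the standard library is needed.

```python
import math

def chi_square_homogeneity(table):
    """Pearson chi-square test of homogeneity for a 3x2 table.

    `table` is a list of [wins, losses] rows, one row per race.
    Returns (statistic, p_value). The formula exp(-x/2) for the p-value
    is exact only for 2 degrees of freedom, i.e. a 3x2 table.
    """
    if len(table) != 3 or any(len(row) != 2 for row in table):
        raise ValueError("this sketch only handles a 3x2 table (df = 2)")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat, math.exp(-stat / 2)

# Hypothetical win rates of 51%, 50% and 49% for the three races.
as_percentages = [[51, 49], [50, 50], [49, 51]]   # percentages treated as counts
as_counts = [[w * 10_000, l * 10_000] for w, l in as_percentages]  # 1,000,000 games per race

stat_pct, p_pct = chi_square_homogeneity(as_percentages)
stat_cnt, p_cnt = chi_square_homogeneity(as_counts)
print(f"percentages as counts: chi2 = {stat_pct:.3f}, p = {p_pct:.3f}")
print(f"actual game counts:    chi2 = {stat_cnt:.1f}, p = {p_cnt:.3g}")
```

The win rates are identical in both tables; only the sample size differs, and the statistic grows in direct proportion to it.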
@ all the people who think my results are meaningless due to matchmaking:
I've said this a lot by now, but I'll state it again for the last time here:
I have reasoned that if the game is imbalanced, that imbalance must manifest as either 1) a difference in win ratios for the different races or 2) a difference in race prevalence as you increase player skill level, except under one of several unlikely scenarios and one likely one.
The degree to which it shifts from condition 1) to condition 2) depends on the strength of the matchmaking system. Since we don't see 1) (as my data show), and we don't see 2) (as I have said and escapeartist has shown), we can conclude that the game is balanced, at least for regular league play.
The unlikely scenarios:
a. Blizzard's matchmaking system is wise to racial imbalances, and chooses lower level opponents for a player of a given ranking if they play as a weak race vs. a strong race. The only reason they would do this would be to 'hide' racial imbalances from the player and/or the community.
b. Blizzard's matchmaking system does nothing, and each league is a random sample of the regional player population.
c. People have no race loyalty, and randomly pick their race before each match.
The likely scenario:
d. The races are balanced overall but matchups are imbalanced, in a rock-paper-scissors fashion. I favor protoss, and I really feel that I struggle against Terran.
@ all the people who think the test is inappropriate because I haven't modeled enough variables that affect win rate
I don't have access to data that will allow me to do that. I'd like to, but I can't. In science when that's the case, you have to look for other ways to test a question. In my case I initially reasoned that an imbalance would lead to a difference in win rates among races. People immediately pointed out that that wasn't the case, due to matchmaking. However, I then realized that the matchmaking system would force weak races into the lower leagues.
I checked to see if that happened and amended my analysis with a graph showing that it doesn't. Escapeartist has since analyzed this in more detail and come to the same conclusion, although nobody to my knowledge has done an analysis for lower league play. People have shown, however, that it's not true for the top hundred or so players in each region.
@ the people who think stats are useless
It's been shown before that stats are a much better way of assessing the truth than anecdotal knowledge. Even experts often have misperceptions, and misperceptions often produce feedback loops. Stats are at least partially resistant to this.
That said, I think opinions and impressions of top-level players (IdrA's thoughts on high level ZvT matchups, e.g.) still warrant attention, and consistently held beliefs warrant scientific investigation. In fact, that's what I did with respect to win rates for league play!
Finally, thanks everyone for your interest! I'll keep trying to answer questions but I know I'll miss some and for that I apologize.
|
Ugh... stats are good for evidence of phenomena. They aren't proof of phenomena. Stats are better at disproving BS theories than they are at proving theories. They are also better at capturing the empirical outcome of an unknown phenomenon without explaining why.
All your stats have shown is that the populations of zerg, protoss, and terran favorite players have similar win rates at various levels, and there is an interesting distribution of races across all the leagues. These populations of players aren't the same, so it doesn't show anything about the underlying skill level of players in those leagues nor does it eliminate selection biases for picking races or the possibility of varying levels of imbalances at different skill levels.
For example: It might take a certain type of mind to play zerg well and not all players are suited to it, or the average zerg might have to be better, or the learning curve for zerg is easier but current skill ceiling is lower.
Everything in statistics is retrospective and describes only the past state of affairs. It only captures up to the current state of the SC2 metagame. In fact, the statistics could be hopelessly outdated if there is a sharp change in the metagame out there. Moreover, it doesn't say anything about the imbalances should players figure out how to play it optimally.
|
On August 17 2010 07:18 StarcraftGuy4U wrote: None of these stats are worthwhile because the matchmaking system does not assign people like they would in a blind study, instead it is actively adjusting the matches so that every player reaches 50%. The numbers you are pulling are worthless for this reason.
It's been said before, but due to faulty assumptions made by the author this study must be redone to be relevant.
|
For those who say that the matchmaking system making players have 50-50 is causing the data the guy is using to be incorrect, READ what he wrote:
On August 17 2010 07:13 GagnarTheUnruly wrote: People have pointed out that matchmaking would cause this to happen, because it strives to set each player's win rate at 50%. That in turn would cause the win rate of each race to trend towards 50%. That being the case, poor balance would tend to result in 'weak' races getting pushed into the lower tiers of play. Because we don't see that happening either within or among leagues (data not shown), my data suggest both that the matchmaking system works well and that SC2 is inherently pretty well balanced.
Agreeing with this.
|
On August 19 2010 08:48 TanGeng wrote: Ugh... stats are good for evidence of phenomena. They aren't proof of phenomena. Stats are better at disproving BS theories than they are at proving theories. They are also better at capturing the empirical outcome of an unknown phenomenon without explaining why.
Phenomena are proof of themselves. Mechanisms require validation. The only thing you need to do to prove that different races have different win rates is to show that they do (and they do... 56.06% is different from 55.56%). The question the statistics answer is: is the difference due to chance? Specifically, they estimate the chance that the observed test statistic (chi-square value representing the standardized difference between observed and expected win rates in this case) could occur if it was picked at random. In this case, we take a p-value of something on the order of 1e-12 to be sufficient evidence that the difference is not due to chance, that we can feel comfortable saying it with certainty. The difficulty comes with interpretation and generalization. These require reasoning and logic, and careful consideration of all possible explanations. Such analysis often leads to additional questions, as we've witnessed in this thread. It's the scientific process in a nutshell.
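As a rough illustration of how a half-point gap can produce a p-value that small, the sketch below runs a two-proportion z-test on the 56.06% and 55.56% win rates quoted above. The game counts are assumptions chosen for illustration (the per-race totals are not given in this post), the function name is made up, and the z-test is a simpler stand-in for the chi-square test used in the original analysis.

```python
import math

def two_proportion_z_test(wins_a, games_a, wins_b, games_b):
    """Two-sided z-test for equal win rates. Returns (z, p_value)."""
    rate_a = wins_a / games_a
    rate_b = wins_b / games_b
    # Pooled win rate under the null hypothesis of no difference.
    pooled = (wins_a + wins_b) / (games_a + games_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / games_a + 1 / games_b))
    z = (rate_a - rate_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# The same two win rates at two hypothetical sample sizes.
for games in (10_000, 1_000_000):
    wins_a = round(0.5606 * games)
    wins_b = round(0.5556 * games)
    z, p = two_proportion_z_test(wins_a, games, wins_b, games)
    print(f"{games:>9} games per race: z = {z:.2f}, p = {p:.3g}")
```

The size of the difference is the same in both runs. What changes with the sample size is only how confidently chance can be ruled out as its source, which is a separate question from whether a gap that small matters for balance.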
It's a common misconception that science and statistics can't prove anything. I like a statement I just read in an intro plant ecology textbook:
"The popular image of the scientific method portrays it as a process of falsifying hypotheses. This approach was codified by... Karl Popper (1959). In this framework we are taught that we can never prove a scientific hypothesis or theory. Rather, we propose a hypothesis and test it; the outcome of this test either falsifies or fails to falsify the hypothesis. While hypothesis testing and falsification is an important part of theory testing, it is not the whole story, for two reasons."
First, the approach "fails to recognize knowledge accumulation." The author goes on to say that although in a strictly philosophical sense scientific knowledge can never be known with absolute certainty, "we also recognize that some knowledge is so firmly established and bolstered by so many facts that the chance that we are wrong is very much less than the chance of winning the lottery several times in a row." It's important to note that we can estimate with some accuracy what our chance is of being correct.
I would also add that even when we use statistics to 'disprove' a hypothesis, that falsification is associated with its own probability level. In actual practice, it's much easier to show that patterns exist and processes happen than it is to show the opposite, because if a pattern is not found it's generally impossible to know if it's because it shouldn't be found or because the research approach was inadequate.
Second, science often isn't concerned with falsification, and instead asks questions about the relative importance of processes, and this doesn't fit the Popperian framework. This is generally better science anyways.
- from Gurevitch et al. 2006. The Ecology of Plants
|
I just don't think 0.5% is enough for me to even consider that the game is imbalanced, if it's only .5% it seems pretty balanced to me.
It would be cool if you showed the winning percentages with all the matchups.
|
I can't believe how many people are still using the 'matchmaking explains imbalance' argument when the OP fucking explained it (eventually).
|
What exactly is the point of analyzing the overall win percentage of the races? It would be much more interesting to see win percentages for each matchup (TvZ, TvP, PvZ). Maybe Terrans win 60% of their matches against Zerg, but less than 50% against Protoss.
Or am I missing something?
|
Considering the game was released only 2 weeks ago, the game is damn well balanced. But some pro gamers think Terran is a little bit too strong and what happens? Hordes of fanboys and noobs think this is the ultimate truth and start to go on an imbalance crusade.
Just today I met a Protoss player who said "Terran is so fucking imba". I asked him why. His answer: "Dunno, they all say it in the forums"........
Again: No RTS will ever be 100% balanced. NEVER. But considering the game is only 2 weeks in stores the balancing is DAMN GOOD. Now stop whining and enjoy the game.
|
Ugh...
Science can prove things with observation and experiment. With science you can even observe the processes and phenomena at work. With statistics, you have none of this. It's merely input states and observed states. All phenomena and mechanisms are ignored.
With statistics, you can reach near certainty on certain specific statements, and those statements are very very specific. In general, science has massively abused statistics by jumping to conclusions that do not match the very specific statements being shown to be near certain.
When there is a p score of 1e-12, then it's with near certainty that it's not by pure chance. By assuming the negative then showing the negative to have a low p score, the conclusion is merely that a false positive is unlikely given your assumption. But it doesn't say anything about the probability of the false positives given a positive test result. Nor does it factor in all false positives - some of which have plausible alternate explanations.
Based off of your data, you haven't even shown that there is imbalance. It is merely that the population of zerg, protoss, and terran favorite players have observably different winning percentages across the various leagues, and that if it were a given that the game was perfectly balanced and populations at each level possess the exact same skill sets, it'd be hard to produce the observed winning percentage rates purely by chance. That is all the statistics tells you.
This is an extremely minimalistic conclusion and of no real value at all. There are some conclusions you can deduce from that by looking at it logically, but there isn't much there.
|
How many of you have watched G2 of TLO vs MadFrog in the IEM tournament? Madfrog played perfectly and still got steamrolled. TLO was even behind by a significant margin economically due to MF's counter, yet by the grace of MULEs managed to completely own MF.
EDIT: I'm even somewhat of a TLO fanboy, and have a problem with this. Both games of the series are really telling IMO.
|
I'm sorry, but I'm having a lot of trouble understanding what you're trying to say. I'll give it my best shot, though. Also, please stop prefacing all your posts with 'ugh...' If you're frustrated by what I'm saying, it's possible that you are the one who's missing something, and not me.
I think you're creating a little bit of a false dichotomy between science and statistics. Science is a method of obtaining understanding, and statistics is an important mathematical tool that scientists use. You're right that statistical methods ask and answer very specific questions, and that assumptions of certain tests can limit inference, but statistics isn't the only means of scientific inference. We also use logic and theoretical understanding to interpret statistical results. In fact, doing this is necessary in order to achieve scientific progress. I also think it's very unfair to state that science in general has jumped to conclusions and abused statistics. If you're going to make such sweeping statements you should present examples.
When there is a p score of 1e-12, then it's with near certainty that it's not by pure chance. By assuming the negative then showing the negative to have a low p score, the conclusion is merely that a false positive is unlikely given your assumption. But it doesn't say anything about the probability of the false positives given a positive test result. Nor does it factor in all false positives - some of which have plausible alternate explanations.
Here I'm not sure what you're saying. In this case the null hypothesis was no difference in win rates. My data suggest a significant but very small departure from the null hypothesis (a positive result). The chance of a false positive is the type 1 error, and is equivalent to the p-value. The 'negative' doesn't have a low p-score... a negative result would have a high p-score. I have not calculated the chance of a false negative, and I'm not interested in that question because I haven't seen a negative result. Also, I'm not sure what you mean by there being other false positives. There's only one test statistic and the chance of a false positive is almost zero. Are you talking about other explanations?
Also, the results don't involve the population of zerg, protoss and terran players -- it's the results of zerg, terran, and protoss games that I measured. Players aren't accounted for and aren't important given my reasoning described in other posts. You're right about the results, they show with near certainty that there's a small departure from the null hypothesis (which assumes the first condition you cite but not the second one). And you're right that there's no more that the statistics tells us. Which is why we move from statistical inference to logical inference. Then we learn that the results mean that the game is well balanced.
|
On August 19 2010 11:17 mierin wrote: How many of you have watched G2 of TLO vs MadFrog in the IEM tournament? Madfrog played perfectly and still got steamrolled. TLO was even behind by a significant margin economically due to MF's counter, yet by the grace of MULEs managed to completely own MF.
EDIT: I'm even somewhat of a TLO fanboy, and have a problem with this. Both games of the series are really telling IMO.
You totally freaked me out! I had a different TLO v MadFrog game going in a diff. window, and I thought you were sending some freaky evil post letting me know you were hacking me. Then I realized you were talking about a different game and why you brought it up...
It does seem that at high level play a lot of players feel zerg is too weak. But why is it doing so well in Asia?
|
On August 19 2010 09:44 GagnarTheUnruly wrote:
On August 19 2010 08:48 TanGeng wrote: Ugh... stats are good for evidence of phenomena. They aren't proof of phenomena. Stats are better at disproving BS theories than they are at proving theories. They are also better at capturing the empirical outcome of an unknown phenomenon without explaining why.
Phenomena are proof of themselves. Mechanisms require validation ... In actual practice, it's much easier to show that patterns exist and processes happen than it is to show the opposite, because if a pattern is not found it's generally impossible to know if it's because it shouldn't be found or because the research approach was inadequate.
TanGeng got owned
Nice work OP
|
On August 19 2010 11:47 GagnarTheUnruly wrote: Also, the results don't involve the population of zerg, protoss and terran players -- it's the results of zerg, terran, and protoss games that I measured. Players aren't accounted for and aren't important given my reasoning described in other posts. You're right about the results, they show with near certainty that there's a small departure from the null hypothesis (which assumes the first condition you cite but not the second one). And you're right that there's no more that the statistics tells us. Which is why we move from statistical inference to logical inference. Then we learn that the results mean that the game is well balanced.
It's about the games? I don't see game statistics by race at sc2ranks. It's grouped by player and only their favorite race is shown. Anyone with a favorite race can play a lesser number of games as any of the other races (including random). There is no insight into the exact number of games won or lost by any of the race selections. You would have to assume that they played their favorite race exclusively.
You would also have to assume perfectly even skill distribution among player populations if you want perfectly matching win rates, unless you want to assume that players didn't actually pick their favorite races and were assigned one randomly by battlenet.
When there is a p score of 1e-12, then it's with near certainty that it's not by pure chance. By assuming the negative then showing the negative to have a low p score, the conclusion is merely that a false positive is unlikely given your assumption. But it doesn't say anything about the probability of the false positives given a positive test result. Nor does it factor in all false positives - some of which have plausible alternate explanations.
This is basic Bayesian logic. Let's call perfect balance with equally skilled players your negative condition, and imbalanced game your positive condition. Your positive test is a significant difference in win rates among the populations of players with favorite races. The scenarios covered by your assumption of perfect balance with equally skilled players overlap a bit with the scenario where you get differences in win rates.
How likely do you think it is that you will have a game with perfect balance and equally skilled players anyway? If your estimate is close to 1, then the overall chance that you got a false positive increases while your chance of a true positive decreases. Your probability of a false positive given a positive test result is high. (In the case of rare diseases, doctors often ask for a confirmation test since the false positive rate is nearly the same as the true positive rate.)
If it's close to 0, then your false positive chances arising from a perfectly balanced, equally skilled players scenario were nearly nil anyway, so why would you care about it at all to begin with? You want to eliminate other reasons why you might register a false positive for imbalance instead.
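The Bayesian point above can be written out as a short calculation. This is only a sketch: the function name and all the input probabilities are assumptions picked for illustration, not estimates for SC2.

```python
def prob_null_given_positive(prior_null, false_positive_rate, true_positive_rate):
    """P(negative condition is true | test came back positive), by Bayes' theorem.

    prior_null          -- prior probability of the negative condition
                           (here: perfect balance with equally skilled players)
    false_positive_rate -- P(positive test | negative condition)
    true_positive_rate  -- P(positive test | positive condition)
    """
    positive_and_null = prior_null * false_positive_rate
    positive_and_alt = (1.0 - prior_null) * true_positive_rate
    return positive_and_null / (positive_and_null + positive_and_alt)

# Hypothetical test that is wrong 1% of the time in either direction.
for prior in (0.99, 0.50, 0.01):
    posterior = prob_null_given_positive(prior, 0.01, 0.99)
    print(f"prior P(balanced) = {prior:.2f} -> P(balanced | positive test) = {posterior:.4f}")
```

With the same test, the chance that a positive result is a false alarm depends heavily on the prior, which is the rare-disease situation described above.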
|
On August 19 2010 09:10 cocosoft wrote: For those who say that the matchmaking system making players have 50-50 is causing the data the guy is using to be incorrect, READ what he wrote:
On August 17 2010 07:13 GagnarTheUnruly wrote: People have pointed out that matchmaking would cause this to happen, because it strives to set each player's win rate at 50%. That in turn would cause the win rate of each race to trend towards 50%. That being the case, poor balance would tend to result in 'weak' races getting pushed into the lower tiers of play. Because we don't see that happening either within or among leagues (data not shown), my data suggest both that the matchmaking system works well and that SC2 is inherently pretty well balanced.
Agreeing with this.
Really? The OP is wrong, there are no two ways about it.
I'll repost here what I wrote back on page 8... which, by the way, the OP conveniently ignores. Pay careful attention to the points regarding MMR and how they are used by the AMM to produce the exact result that Blizzard want you to see, which is the exact result that the data shows. It is a SELF-FULFILLING PROPHECY! All the math in the world, no matter how fancy you try to be, won't be of any use since you are using faulty data/metrics to try and prove a point. It would be like me trying to show that the data is clean, by using only "cleaned" data and filtering out the "dirty" data. Blizzard have "cleaned" the data, it's plain and simple to see.
The problem is not in your methodology as much as it is in the assumptions you are making about the system.
What do we know about the system?
Anyone who wishes to play on the ladder is accepted
* You can have zero knowledge of the game and yet you will be placed in Bronze (in sports, that makes you 3rd place!); the only prerequisite for acceptance into a league is to just show up 5 times (you can even disconnect 5 times once the game starts and still get in).
* This is an automatic invalidation of any results emerging from the Bronze league as the range of skill present is astronomical!
You only play matches against people in your region
* Comparisons across different battle.net servers are worthless.
The hidden MatchMakingRating number is based only on wins/losses with respect to the current MMR of yourself and your opponent
* What was required to score that win is not considered at all, nor should it be. Unfortunately, this makes it all the more important that the game be as balanced as possible, or else the MMR is worthless.
* This point also explains why the data is "practically worthless".
The AMM will attempt to pair you up with a person with a similar MMR
* Note that I say similar MMR and not similar skill/ability.
* If a race imbalance existed, the MMR would not reveal it since it would simply consider the person using the weaker race as a "poorer" player, hence a lower MMR, and the person using an OP race as a better player, rewarding them with a higher MMR.
* Thus if you were to compare even between players of similar MMR but across the three races, it would reveal nothing of significance since the reason they were given that MMR is due to their win/loss performance against the same people they are being compared against.
Not everyone will play the same number of games
* You may say "well duh" but it bears repeating.
* I think in the original top 200 list one of the players that made it had only played a handful of games, I think it was 7 all up, yet 7 games was enough for the system to determine their MMR to be one of the top 200 on the server.
* I honestly believe that a more stringent prerequisite for diamond league is needed, e.g. 100 games played.
There are many more other points to make, but let's just start with the above for now.
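The MMR argument above can be explored with a toy simulation. Everything in it is an assumption made for illustration: the Elo-style update, the 200-point race bonus, the neighbour-pairing rule and all names are hypothetical, and Blizzard's actual MMR and AMM are not public. Both races get the same list of player skills, one race gets a free strength bonus, and each round players are paired with whoever is next to them in rating.

```python
import random

def simulate_ladder(race_bonus=200.0, players_per_race=100, rounds=400, k=32.0, seed=1):
    """Toy Elo ladder with two races; race A gets `race_bonus` points of strength for free.

    Returns (win_rate_a, win_rate_b, mean_rating_a, mean_rating_b), with win
    rates measured over the second half of the rounds only.
    """
    rng = random.Random(seed)
    # Identical skill lists for both races, so any gap in final ratings
    # can only come from the race bonus.
    skills = [rng.gauss(0.0, 200.0) for _ in range(players_per_race)]
    races = ["A"] * players_per_race + ["B"] * players_per_race
    strength = [s + race_bonus for s in skills] + list(skills)
    rating = [0.0] * len(races)
    wins = {"A": 0, "B": 0}
    games = {"A": 0, "B": 0}

    for rnd in range(rounds):
        order = list(range(len(races)))
        rng.shuffle(order)                    # random tie-breaking
        order.sort(key=lambda i: rating[i])   # neighbours have similar ratings
        for a, b in zip(order[0::2], order[1::2]):
            # The true outcome depends on hidden strength, not on rating.
            p_a_wins = 1.0 / (1.0 + 10 ** ((strength[b] - strength[a]) / 400.0))
            a_won = rng.random() < p_a_wins
            # The Elo update only sees the two ratings and the result.
            expected_a = 1.0 / (1.0 + 10 ** ((rating[b] - rating[a]) / 400.0))
            delta = k * ((1.0 if a_won else 0.0) - expected_a)
            rating[a] += delta
            rating[b] -= delta
            if rnd >= rounds // 2:            # skip the warm-up half
                wins[races[a if a_won else b]] += 1
                games[races[a]] += 1
                games[races[b]] += 1

    def mean_rating(race):
        values = [r for r, rc in zip(rating, races) if rc == race]
        return sum(values) / len(values)

    return (wins["A"] / games["A"], wins["B"] / games["B"],
            mean_rating("A"), mean_rating("B"))

win_a, win_b, mean_a, mean_b = simulate_ladder()
print(f"win rate A: {win_a:.3f}, win rate B: {win_b:.3f}")
print(f"mean rating A: {mean_a:.0f}, mean rating B: {mean_b:.0f}")
```

In this toy model both races should end up winning close to half of their games, while the favoured race settles at a clearly higher average rating. The imbalance shows up in where players are placed on the ladder, not in their win rates.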
|
On August 19 2010 06:18 Hidden_MotiveS wrote:
On August 19 2010 04:59 texmix wrote: As others have stated, the OP is based on a flawed methodology. If trying to use stats to figure out which race is overpowered, 4 items need to be controlled:
1. Homogeneous skill in race choice (maybe old BW semi-pros just gravitate towards Terran in sc2)
2. The matchmaking system instead of random opponents
3. Player MU difference (one player may, in the long run, win 60% pvt, another lose 60% pvt)
4. Player MU skill changes over time (maybe a day9 video will change pvt win stats by several bps in a single week)
To control all 4 of these I suggest mining for at least 1,000 players that:
1. Have played over 200 games
2. Played at least 30 games in the last 72 hours
3. Are in the diamond league
From this list, throw out all games involving a random player (less consistent MU performance) and everything older than the most recent 30 games, and calculate the group's median win ratio using the most recent 30 games. Keep the 100 players of each race with win ratios closest to the median win ratio and throw out the other 700 players' games. For instance if the 1,000 players have win ratios ranging from 35% to 90% (in most recent 30 games), with median of 55%, then pick the 100 zerg, protoss, and terran players who are closest to 55%. From the remaining 300*30 games, a simple win/loss record for each MU will be about the best possible indication of imbalance I believe data mining can come up with (short of using the same methodology with more games or tweaked ratios).
I wanted to say this, but feared the backlash of "NO We has psience we is wright". The methodology of the observational study is flawed in a few ways. For one, I don't think you are considering any confounding variables such as how the ranking system comes into effect. If one race is overpowered then it's simple to assume that it will be overrepresented in relation to its total population within the top of diamond rank only. But this could also be confounded by how people think Terran is the strongest race, so the more serious players switch over to that race thinking this is true. In addition, the sample sizes here are very small. I would like to hear what a statistician, or Blizzard statistician, has to say about the data. edit: Oh I see, the OP understands that the matchmaking system kind of voids his analysis. I'm sorry if I sounded harsh. Great effort put into this.
I am a statistician and stand by that methodology as a reasonable indicator of racial balance.
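The selection step of that methodology could be sketched as follows. The record layout (`race`, `recent_results`) and the function names are hypothetical, and the sketch assumes the earlier filters (over 200 games played, at least 30 games in the last 72 hours, diamond league, no games against random players) have already been applied.

```python
from statistics import median

def win_ratio(player):
    """Fraction of wins in a player's recent games (True = win)."""
    results = player["recent_results"]
    return sum(results) / len(results)

def select_comparable_players(players, per_race=100):
    """For each race, keep the players whose win ratio is closest to the group median.

    `players` is a list of dicts with two assumed keys:
      "race"           -- "T", "P" or "Z"
      "recent_results" -- list of booleans for the most recent games
    Returns (group_median, {race: [selected players]}).
    """
    group_median = median(win_ratio(p) for p in players)
    selected = {}
    for race in ("T", "P", "Z"):
        same_race = [p for p in players if p["race"] == race]
        same_race.sort(key=lambda p: abs(win_ratio(p) - group_median))
        selected[race] = same_race[:per_race]
    return group_median, selected

# Tiny made-up sample: two players per race, four recent games each.
sample = [
    {"race": "T", "recent_results": [True, True, True, False]},
    {"race": "T", "recent_results": [True, True, False, False]},
    {"race": "P", "recent_results": [True, False, False, False]},
    {"race": "P", "recent_results": [True, True, False, False]},
    {"race": "Z", "recent_results": [True, True, True, True]},
    {"race": "Z", "recent_results": [True, True, False, False]},
]
group_median, chosen = select_comparable_players(sample, per_race=1)
print(group_median, {race: win_ratio(ps[0]) for race, ps in chosen.items()})
```

The per-matchup win/loss records would then be tallied from the games of the selected players only.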
The win rate does not prove imbalance. Would a "perfectly balanced" Starcraft 2 have perfect 50% win rates? No. It absolutely would assuming a control for skill and the matchmaking system which can be approximated in the study.
|
|
One question that I always wanted to ask, what is your definition of imbalance?
On August 19 2010 08:12 GagnarTheUnruly wrote: I have reasoned that if the game is imbalanced, that imbalance must manifest as either 1) a difference in win ratios for the different races or 2) a difference in race prevalence as you increase player skill level, except under one of several unlikely scenarios and one likely one.
The degree to which it shifts from condition 1) to condition 2) depends on the strength of the matchmaking system. Since we don't see 1) (as my data show), and we don't see 2) (as I have said and escapeartist has shown), we can conclude that the game is balanced, at least for regular league play.
I'm not sure how you can be confident of 2.
Win ratios should be controlled by point levels in a nice random matchmaking system. In your study, you see evidence of higher win rates for players of higher skill levels. Skill level is continuous (not all players in the same league are of the same level). Average win ratios will not match in a league unless the skill distribution of the races in that league allows for that.
First element in figuring out the skill distribution is selection biases. You have to figure out who chooses certain races and why. The population that picks zerg and the population that picks terran as their favorite are different unless you can prove otherwise.
The second element in figuring out the skill distribution is the learning curve. A difference in racial prevalence at any particular skill level is more a function of how steep the learning curve is relative to normal at that particular skill level.
A skill ceiling is any point where skill curve is really steep.
|
On August 20 2010 00:02 TanGeng wrote:
First element in figuring out the skill distribution is selection biases. You have to figure out who chooses certain races and why. The population that picks zerg and the population that picks terran as their favorite are different unless you can prove otherwise.
A skill ceiling is any point where skill curve is really steep.
It may be true that different people pick different races, but this is not necessarily relevant to the imbalance question. Unless you prove that personality equals skill or that skill equals race, I don't see why this is relevant.
I also don't see how skill curve equals imbalance. I do agree that zerg needs higher apm than say terrans, but if they perform at the same level then I still don't see any imbalance issues, since unless proven otherwise we must assume that they have hit the relative skill cap when playing in the upper diamond level. As I have shown, in the upper diamond level Zerg is gaining in population. And without any intelligent discussion we can safely assume that random takes more skill than ALL the other races. Even they are gaining in popularity in the lower diamond league. This alone strongly supports my statement that difficulty is not equal to balance.
My point here is that there are soooo many people crying imbalance, but still I have seen no evidence of this. OP has tried to find evidence of this, and I have tried to find evidence of this, but both of us came up with nothing. As a result of this both of us seem to be leaning towards thinking that the game is balanced.
I personally think that you are going about it the wrong way if you are trying to tell us that statistics is not the right way to do it. Let's just say it's your turn to try and prove the imbalance. Or at least give us some new data sources. After all, you seem to be convinced that there is imbalance, but all you seem to base it on is personal opinions.
If I understand you correctly then the race in question (Zerg) is picked by the best players since they seem to perform on all levels of play, and poor players (Terran) are performing on an equal level to them because of the difference in imbalance.
I find this very hard to swallow considering we are dealing with over 500 000 players. Also why would the best players do this? I have seen no reasoning as to why all the "best" players would pick the "worst" race.
To my experience I must say that if it smells like shit, looks like shit and tastes like shit. It's probably shit....
|
|
|
|