|
Okay, so every month stuff like this and this is released. There are two things blindingly obvious:
- They are quite far apart: 57% PvT vs 52%.
- The former doesn't list the number of games.
I'm not going to say that either of them lies, so obviously they use different sample pools, and they do: the former only uses 'major and premier' tournaments, while the latter essentially uses pretty much any pro game and therefore has a far higher sample size, but consequently does not measure only the highest level of play.
Since Aligulac does give me a sample size, let's just for shits and giggles compute the error bars for a simple binary probability experiment:
- n = 656 experiments (games)
- p = 0.527 winrate for P
So we have sqrt(p * (1-p) / n) = 0.019 = 1.9% confidence interval. Which basically says the standard deviation of error is about 2%; not bad, not bad. Meaning that we have a pretty damn large chance the 'actual' winrate is in between 47 and 57 percent.
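For concreteness, here is that naive computation as a minimal Python sketch, using the numbers above (and, as the rest of this post argues, the model behind it doesn't actually apply here):

```python
# Naive binomial standard error for the Aligulac sample quoted above.
import math

n = 656    # number of games in the sample
p = 0.527  # observed Protoss winrate

# Standard error of a sample proportion under the i.i.d. binomial model
se = math.sqrt(p * (1 - p) / n)
print(f"standard error: {se:.4f}")                          # ~0.0195, about 1.9%
print(f"+/- 2 SE range: {p - 2*se:.3f} to {p + 2*se:.3f}")  # ~0.488 to 0.566
```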
And every statistician just reading this should've banged their head into the wall. What I just did is a grave statistical fallacy which is very often repeated.
On independent probability experiments
What are independent probability experiments, you might ask? Well, mathematically it's simply defined as: the chance of them both occurring in succession is identical to the product of their chances of occurring individually. That doesn't tell you a whole lot, but in nature it comes down to 'they do not affect each other in any way and share no common causes.' For instance, the chance that I wear my favourite shirt on Monday and on Tuesday is not independent, because one affects the other: I will be less likely to wear it on Tuesday if I have worn it on Monday, because it could've gotten dirty. The chance that I and my aunt, whom I never speak to, eat soy beans on any given day can be assumed independent, because we don't influence each other. Tossing a coin is independent because the result of the last coin toss has no effect on the next.
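In symbols, sticking to the plain-text notation of this post: events A and B are independent exactly when

P(A and B) = P(A) * P(B)

and every error-bar formula used here leans on this property holding across all the games in the sample.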
Pro matches are not independent
The mere fact that these matches are played by the same players makes them not independent, let alone the mind games that go on in Bo7 series. In short, all these fantastical formulae about error bars and significance do not apply at all, because they can only be used on independent probability experiments. I point you back to the mathematical definition above: these rules are derived and proven correct mathematically by assuming that nice property of the chance of repeated occurrence being identical to the product of the singular occurrences. If that property does not hold, neither do the mathematical formulae which depend on it.
The sample size is too damn small
If the games were independent, yes, then the sample size would be meaningful. But because they are not independent, the sample size is completely meaningless and super small, which is why winrates shift around like crazy every month. This isn't some metagame shit that keeps perpetually shifting every month, this is statistical fluctuation. My intuition says the error bars are most likely more like 10 than 2. These numbers are almost completely meaningless.
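To illustrate the mechanism with a toy simulation (my own construction, not anything derived from the actual data): when a handful of players contribute most of the games and their form varies, the month-to-month spread of the winrate exceeds what the naive binomial standard error predicts.

```python
# Toy model: 20 hypothetical Terrans whose TvZ strength varies by month,
# with one standout playing ~15x more games than the rest. All numbers
# are invented for illustration.
import random

random.seed(0)
N_GAMES = 656                              # games per "month", as above
NAIVE_SE = (0.5 * 0.5 / N_GAMES) ** 0.5    # ~0.0195 if games were i.i.d.

def month_winrate() -> float:
    # Redraw each player's form for the month; the standout is stronger.
    skills = [random.gauss(0.50, 0.05) for _ in range(19)]
    skills.append(random.gauss(0.65, 0.05))
    weights = [1] * 19 + [15]              # the standout's share of games
    wins = 0
    for _ in range(N_GAMES):
        p = random.choices(skills, weights)[0]
        wins += random.random() < p
    return wins / N_GAMES

rates = [month_winrate() for _ in range(200)]
mean = sum(rates) / len(rates)
sd = (sum((r - mean) ** 2 for r in rates) / len(rates)) ** 0.5
print(f"naive binomial SE:        {NAIVE_SE:.3f}")   # ~0.020
print(f"simulated monthly spread: {sd:.3f}")         # noticeably larger
```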
Winrate is not win chance
That error bar of which I spoke a lot, what does it even mean? Well, it quantifies how likely the win rate is to be close to the win chance. But there is no 'win chance' here, because the games are not independent. Win chance assumes independence. The point about win chance is that every single ZvT game would have the same chance of being won or lost, and this defies intuition. Of course, if we have Soulkey vs ForGG, the chances are intuitively going to be extremely different than Sniper vs Innovation. Consider the actual independent probability experiment, the coin toss. Say we toss a coin 500 times; it lands on tails 239 times and on heads 261. Let's use the same formula again:
- n = 500
- p = 0.522 head rate
sqrt(p * (1-p)/n) = 2.2%. Meaning that we have about a 2/3 likelihood of the actual head chance falling in between 50% and 54.4%.
Now, this appeals to our intuition, does it not? We know it's 50% for a fair coin, by definition. The idea that every coin toss has the same chance, which might not even be exactly 50% because one side is heavier than the other, is nonetheless quite intuitive. The idea that every ZvT has the same chance of being won or lost in a certain month is not.
For fun, let's end with a riddle:
A coin is tossed 9999 times and lands on heads all 9999 times. What is the chance the coin lands on heads when we toss it the 10,000th time?
Answer (spoiler): It's a trick question. People are supposed to be baited into saying 'still 50%', which would be true under the assumption the coin is fair, which is defined by a 50% chance to land on heads in every single toss, so also the 10,000th one. But the probability that a coin which just landed 9999 times in a row on heads is fair is quite low. If a coin lands on heads that many times in a row, you can assume there's something fishy going on and that the coin is loaded in some way. The correct answer is 'the probability is not computable without further data/axioms being given.'
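For the curious, here is a sketch of the Bayesian version of that argument in Python; the prior (one coin in a million is an always-heads coin) is entirely invented for illustration.

```python
# Bayesian update for the riddle. The 1-in-a-million prior on a loaded
# (always-heads) coin is an invented assumption, not a measured fact.
import math
from fractions import Fraction  # exact arithmetic; 0.5**9999 underflows floats

prior_loaded = Fraction(1, 10**6)
prior_fair = 1 - prior_loaded

lik_fair = Fraction(1, 2) ** 9999   # chance a fair coin gives 9999 straight heads
lik_loaded = Fraction(1)            # an always-heads coin always shows heads

post_fair = (lik_fair * prior_fair) / (lik_fair * prior_fair + lik_loaded * prior_loaded)
log10_post = math.log10(post_fair.numerator) - math.log10(post_fair.denominator)
print(f"P(fair | 9999 heads) is about 10^{log10_post:.0f}")  # ~10^-3004

# So the chance of heads on toss 10,000 is essentially 1, not 50%.
print(f"P(heads next) ~ {1 - float(post_fair) / 2}")
```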
|
I always thought graphs and math meant that you can just say what you want. It looks like statistics, so it must be true!
|
I've been waiting forever for someone to post a good explanation of this concept. Another huge reason why these winrates are a crappy metric is that the best players contribute FAR MORE to the game pool. In a 256-player MLG tournament, the Korean Terran that wins it plays 8 Bo3s and the American Zerg who loses in the third round plays 1.
|
I haven't read the whole blog, only the last part with the coin flip. I can't make any sense of your answer (and maybe that's because I haven't read the whole blog?).
I think that you are wrong there. The probability of tossing a coin 9999 times and having it turn up heads all of those times is 0.5^9999, and I don't think that this makes it an unfair coin. (Where a^b denotes a to the power of b.)
|
On September 04 2013 07:35 CNSnow wrote:
I think that you are wrong there. The probability of tossing a coin 9999 times and having it turn up heads all of those times is 0.5^9999, and I don't think that this makes it an unfair coin. (Where a^b denotes a to the power of b.)

Well, 0.5^9999 is one chance in a number that is orders of magnitude larger than the number of coins on the planet. Assuming there is only one unfair coin on the planet, the chance is astronomically higher that we just happened to encounter that coin than that a fair coin lands 9999 times in a row on the same side.
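To put a rough number on 'orders of magnitude' (a quick back-of-the-envelope check, not from the thread):

```python
# Express 0.5^9999 as a power of ten: roughly one chance in 10^3010.
import math
print(9999 * math.log10(0.5))  # ~ -3010.0
```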
|
The point of the binomial assumption is merely to give an idea of the degree of accuracy of the point estimate. Of course the assumptions are not strictly true. With more or less _any_ real world data, the assumption of independent and identically distributed observations is going to be wrong. The whole idea of modelling is purely to give some sort of feeling for the accuracy of the point estimate. There's a very relevant quote by George Box (famous statistician),
"Essentially, all models are wrong, but some are useful."
Using a binomial assumption, while not correct, is still a meaningful thing to do. With a sample size of 600, I would personally be perfectly happy to take the independent and identical assumption, based on the fact that 600 should be enough points to 'smooth' out the errors. While I know it isn't correct, it is more or less the only meaningful way to make a relatively simple analysis of the data.
Your comments about win rate not being win chance are nonsensical. Obviously, for a particular match between two particular people, you shouldn't use the win rate as a rough estimate for the probability that one person will win over the other. What you would use the win rate for is something along the lines of 'all things being equal, if I chose a random Zerg player and a random Terran player, what is the likelihood that one will win over the other'. Using overall winrates to judge balance is perfectly valid; while it still has its flaws, it is the most logical simple statistic to use. Again, it's a flawed model, it fails to take into account many things, but all models are flawed; some are useful though.
You also seem to have a misconception about the underlying idea of hypothesis testing.
Consider the actual independent probability experiment, the coin toss. Say we toss a coin 500 times; it lands on tails 239 times and on heads 261. Let's use the same formula again:
- n = 500
- p = 0.522 head rate
sqrt(p * (1-p)/n) = 2.2%. Meaning that we have about a 2/3 likelihood of the actual head chance falling in between 50% and 54.4%.

Contrary to popular belief, hypothesis testing does not actually aim to 'prove' facts, but rather to 'disprove' them. Your null hypothesis is 'p = 0.5', and the alternative should be 'p does not equal 0.5'. In your example, the estimate for the probability should obviously be your point estimate 0.522; it would make no sense at all to think otherwise. Your base question is, though: is 0.5 in the 'realm of possibility'? When you construct a confidence interval, it gives you the 'realm of possibility'. Your statement 'we have a 2/3 likelihood of the actual head chance being between x and y' is false and misleading. The actual head probability is a fixed number; it has no randomness assigned to it. How you can use a confidence interval is to state, 'well, if my hypothesised value falls outside of this range, then it's probably not correct.' And that is all. The conclusion is _never_ 'accept the null hypothesis', but rather 'do not reject the null hypothesis'. You would never then say 'okay, well, I think p = 0.5 then'; that makes absolutely no sense. If you were going to use anything as your estimate, it should be 0.522.
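As a concrete rendering of that framing (a sketch of a standard two-sided proportion z-test on the coin data from the post; the test choice and code are mine, not the poster's):

```python
# Two-sided z-test for the coin data (261 heads in 500 tosses), null p = 0.5.
import math

n, heads = 500, 261
p_hat = heads / n                       # point estimate: 0.522
p0 = 0.5                                # null hypothesis value

# Standard error under the null, and the z statistic
se0 = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se0

# Two-sided p-value from the normal approximation
p_value = math.erfc(abs(z) / math.sqrt(2))
print(f"z = {z:.2f}, p-value = {p_value:.3f}")   # z ~ 0.98, p ~ 0.33

# 95% confidence interval around the point estimate (Wald interval)
se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"95% CI: ({lo:.3f}, {hi:.3f})")  # 0.5 is inside: do not reject the null
```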
Also, while I hate to nitpick, it is what I do for a living: you need to be much more careful with your usage of terminology.
So we have sqrt(p * (1-p) / n) = 0.019 = 1.9% confidence interval. Which basically says the standard deviation of error is about 2%
'1.9% confidence interval' is a terrible way of saying things; that's not even the width of the confidence interval, it's a quarter of that. You should just stick to giving the full confidence interval. Also, where you say 'standard deviation of error', what you mean is standard error, which is the estimate of the standard deviation of the sampling distribution of your statistic. There is no such thing as a standard deviation of error.
TL;DR: The binomial model, while incorrect, is still a useful model. All models are incorrect, just some are useful for giving you a feeling for the data.
|
On September 04 2013 09:33 phwoar wrote: The point of the binomial assumption is merely to give an idea of the degree of accuracy of the point estimate. Of course the assumptions are not strictly true. With more or less _any_ real world data, the assumption of independent and identically distributed observations is going to be wrong. The whole idea of modelling is purely to give some sort of feeling for the accuracy of the point estimate. There's a very relevant quote by George Box (famous statistician),
"Essentially, all models are wrong, but some are useful."
Using a binomial assumption, while not correct, is still a meaningful thing to do. With a sample size of 600, I would personally be perfectly happy to take the independent and identical assumption, based on the fact that 600 should be enough points to 'smooth' out the errors. While I know it isn't correct, it is more or less the only meaningful way to make a relatively simple analysis of the data.

Yes, the point is that in this case it is so off that it stops being useful. The individual experiments aren't close to being independent; they are so extremely dependent at this point that it becomes completely meaningless. I mean, what would the ZvT winrates look like if that one player Innovation didn't exist?
Your comments about win rate not being win chance are nonsensical. Obviously, for a particular match between two particular people, you shouldn't use the win rate as a rough estimate for the probability that one person will win over the other. What you would use the win rate for is something along the lines of 'all things being equal, if I chose a random Zerg player and a random Terran player, what is the likelihood that one will win over the other'. Using overall winrates to judge balance is perfectly valid; while it still has its flaws, it is the most logical simple statistic to use. Again, it's a flawed model, it fails to take into account many things, but all models are flawed; some are useful though.

Indeed, you shouldn't use it for that, but assuming independence means that you can; it's completely intertwined. If you can assume that the games are independent enough to compute the error bars, you can assume they are independent enough to use the overall winrate as a base for a specific instance. That is what an error bar means: it gives you the probability that the win chance in an individual instance is within a certain margin of error of the win rate. Drawing error bars implies you can make that jump, simple as that. And you can't draw error bars, because they are meaningless, for the same reason you can't make this jump. And for another, unrelated reason: Aligulac drawing a smooth curve between the points is ridiculous; it should be a histogram, the smooth curve is an entirely baseless assumption in this case.
You also seem to have a misconception about the underlying idea of hypothesis testing.

Consider the actual independent probability experiment, the coin toss. Say we toss a coin 500 times; it lands on tails 239 times and on heads 261. Let's use the same formula again:
- n = 500
- p = 0.522 head rate
sqrt(p * (1-p)/n) = 2.2%. Meaning that we have about a 2/3 likelihood of the actual head chance falling in between 50% and 54.4%.

Contrary to popular belief, hypothesis testing does not actually aim to 'prove' facts, but rather to 'disprove' them. Your null hypothesis is 'p = 0.5', and the alternative should be 'p does not equal 0.5'. In your example, the estimate for the probability should obviously be your point estimate 0.522; it would make no sense at all to think otherwise. Your base question is, though: is 0.5 in the 'realm of possibility'? When you construct a confidence interval, it gives you the 'realm of possibility'. Your statement 'we have a 2/3 likelihood of the actual head chance being between x and y' is false and misleading. The actual head probability is a fixed number; it has no randomness assigned to it. How you can use a confidence interval is to state, 'well, if my hypothesised value falls outside of this range, then it's probably not correct.' And that is all. The conclusion is _never_ 'accept the null hypothesis', but rather 'do not reject the null hypothesis'. You would never then say 'okay, well, I think p = 0.5 then'; that makes absolutely no sense. If you were going to use anything as your estimate, it should be 0.522.

It's not misleading at all. Yes, the actual head chance is a fixed number, but based on the empirical data of throwing it 500 times we can therefore say with 2/3 confidence that the actual chance lies between 50% and 54.4%, which is the claim stated. Maybe the term 'probability' was slightly inaccurate and 'confidence' would've been a better term, yes.
'1.9% confidence interval' is a terrible way of saying things; that's not even the width of the confidence interval, it's a quarter of that. You should just stick to giving the full confidence interval. Also, where you say 'standard deviation of error', what you mean is standard error, which is the estimate of the standard deviation of the sampling distribution of your statistic. There is no such thing as a standard deviation of error.

I didn't call it a 1.9% confidence interval, I said the standard deviation of it was 1.9%, which is something entirely different: it means that, on average with repeated experiments, the expectancy value of the difference between the head rate and the head probability is 1.9%.
TL;DR: The binomial model, while incorrect, is still a useful model. All models are incorrect, just some are useful for giving you a feeling for the data.

Agreed, except that in this case it stops being useful, because it's not even independent by approximation any more. The fact that drawing samples from two different pools gives you 52% vs 57% winrate, and that the winrates fluctuate in every random direction every month, is an indication that these numbers are basically meaningless; the fluctuations are so high and the actual confidence so low that it doesn't mean a lot. Whether Innovation goes to a single extra foreign tournament or not this month adds like 2% extra to ZvT.
|
I agree with what phwoar said. It's as good as any other statistic you will read about somewhere. You won't get perfect independence for anything, ever, and therefore you will never get a perfect model, but it can still be (very) useful.
Of course it is also volatile due to sample size, but 300+ games per matchup is not too little to be significant. Many scientific studies, as well as the statistics you can read in papers, are based upon smaller sample sizes (like asking 100 people). That doesn't mean those are bad; you just have to factor in, when interpreting them, that small differences will occur if you do them again. Which is why in Starcraft monthly winrates aren't a big deal when they are off once, yet when the problem keeps occurring month after month we can very safely assume that there is a problem.
|
The difference is that if you ask 100 random people, that is close to independent. This isn't even close to independent; this is the inverse, these games are extremely dependent on one another.
|
I think I might be more interested in exactly what you're trying to measure with those statistics. Your critiques of using those numbers and statistics as anything more than an average of many factors are correct and important, but I do notice that people often point out many "exceptions" that give some basis on where those numbers came from.
Things can be correlated from one time period to another and we can still work with that, but you're right, just not in a binomial form; you have to know exactly what is being correlated to build a more formal model to work with it. But making that more formal model might be fairly difficult, since there are a lot of unobservables that affect a win rate.
At the same time, if you have large, large samples, I think in this case you will eventually approximate a pretty good mean that's reflective of the "true mean". But the job of statisticians isn't to produce numbers and "statistics" but to put meaning behind the numbers, which is what I think many people who do "statistics" often forget to do. Which is why it's important to put down what you want to measure out of those statistics. So what if the 55% winrate is a true mean -- what does it mean? Does it mean we have better Terran players? Or more creative Terran players that are more random shocks? A map pool problem? A balance problem? Or did Flash play more that month? Etc. etc. etc. Once you start putting player skill into it, it's hard to know what those winrates mean, so those huge averages generally are always taken with a grain of salt no matter what kind of statistics you try to put into it. There are just too many unobservables and the model is quite complex.
So yeah, binomial isn't the most accurate way of representing the situation, but I think it's simple enough that people can understand it just by throwing in the caveats.
|
On September 05 2013 02:37 SiskosGoatee wrote: The difference is that if you ask 100 random people, that is close to independent. This isn't even close to independent; this is the inverse, these games are extremely dependent on one another.
That's simply something we cannot know. Just because games are played in a series doesn't make them dependent on each other. We can talk about mindgames and rivalries all day long, but in the end, in the world of statistics, these things could just turn out to be very insignificant in the greater scale of things. As phwoar said:
With a sample size of 600, I would personally be perfectly happy to take the independent and identical assumption based on the fact that 600 should be enough points to 'smooth' out the errors.
|
On September 05 2013 04:26 Big J wrote:
On September 05 2013 02:37 SiskosGoatee wrote: The difference is that if you ask 100 random people, that is close to independent. This isn't even close to independent; this is the inverse, these games are extremely dependent on one another.
That's simply something we cannot know. Just because games are played in a series doesn't make them dependent on each other. We can talk about mindgames and rivalries all day long, but in the end, in the world of statistics, these things could just turn out to be very insignificant in the greater scale of things.

Nope, you can actually empirically justify just how dependent it is. Like I said before, if you remove a single player from the stats, the winrates can differ by as much as 2-4% in some months.
Dependency isn't only things like mindgames in a BoX. Dependency is also that the games are played by the same players.
I mean, let's reduce it to an extreme example. Say we let Soulkey and Innovation play a thousand games and Innovation wins 710 of them; this is not unrealistic, nay? Should we then say, given our sample size of 1000, that ZvT at the very highest level of play is 29%? Certainly not. Most people agree that Innovation is simply better than Soulkey, and that TvZ is Inno's best matchup also heavily contributes to this very slanted statistic. This is what is going on in winrates, to a lesser extent.
To put it simpler: while the number of games might be significant, the number of players isn't. If you end up with a situation where 8 players contribute, say, 80% of the games played, because those 8 players are the best and keep going far in tournaments, you can't really derive a lot of meaning from those winrates. If one of those players is T and that one player is just that much better than the other 7, it gives the illusion that Terran is overpowered, while in reality we just have one really good Terran who contributes, say, 15% of all TvX games played because he keeps getting super far in tournaments.
|
On September 05 2013 11:24 SiskosGoatee wrote:
On September 05 2013 04:26 Big J wrote:
On September 05 2013 02:37 SiskosGoatee wrote: The difference is that if you ask 100 random people, that is close to independent. This isn't even close to independent; this is the inverse, these games are extremely dependent on one another.
That's simply something we cannot know. Just because games are played in a series doesn't make them dependent on each other. We can talk about mindgames and rivalries all day long, but in the end, in the world of statistics, these things could just turn out to be very insignificant in the greater scale of things.
Nope, you can actually empirically justify just how dependent it is. Like I said before, if you remove a single player from the stats, the winrates can differ by as much as 2-4% in some months. Dependency isn't only things like mindgames in a BoX. Dependency is also that the games are played by the same players. I mean, let's reduce it to an extreme example. Say we let Soulkey and Innovation play a thousand games and Innovation wins 710 of them; this is not unrealistic, nay? Should we then say, given our sample size of 1000, that ZvT at the very highest level of play is 29%? Certainly not. Most people agree that Innovation is simply better than Soulkey, and that TvZ is Inno's best matchup also heavily contributes to this very slanted statistic. This is what is going on in winrates, to a lesser extent. To put it simpler: while the number of games might be significant, the number of players isn't. If you end up with a situation where 8 players contribute, say, 80% of the games played, because those 8 players are the best and keep going far in tournaments, you can't really derive a lot of meaning from those winrates. If one of those players is T and that one player is just that much better than the other 7, it gives the illusion that Terran is overpowered, while in reality we just have one really good Terran who contributes, say, 15% of all TvX games played because he keeps getting super far in tournaments.
The problem you are describing is not dependency; that's statistical stability. If you are dissatisfied with the stability, you can use a different way than the simple average (= winrate) to estimate the win chance. E.g. you could cut the top X TvZ and top X ZvT players. Or you could cut all the players with a winrate > X%. Or you could cut all players with "more than X games played". And after you cut those outliers, you can make your statistics. Or you could use the median instead of the average.
If you are dissatisfied with how volatile the data is towards single outliers, then there are methods to improve that data for you.
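A rough sketch of what one of those cuts could look like in Python (the player records and the 25% cutoff are invented for illustration; this is not Aligulac's method):

```python
# Recompute a matchup winrate after cutting the heaviest contributors,
# as suggested above. All records and the threshold are invented.
records = [
    # (terran_player, games_played, games_won) in TvZ this "month"
    ("Innovation", 90, 63),
    ("Flash", 40, 24),
    ("Bomber", 25, 13),
    ("TaeJa", 30, 17),
]

def winrate(rows):
    games = sum(g for _, g, _ in rows)
    wins = sum(w for _, _, w in rows)
    return wins / games

MAX_SHARE = 0.25  # cut players contributing more than 25% of all games
total = sum(g for _, g, _ in records)
trimmed = [r for r in records if r[1] / total <= MAX_SHARE]

print(f"raw winrate:     {winrate(records):.3f}")   # 0.632, standout included
print(f"trimmed winrate: {winrate(trimmed):.3f}")   # 0.568, standout cut
```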
|
I told myself I wouldn't post anymore, because it was clear that you didn't really know what you were talking about, but hey, here I am again. I am somewhat apologetic if this post comes off as being aggressive. Let's go through some of the fallacies here.
On September 05 2013 02:37 SiskosGoatee wrote: I mean, let's reduce it to an extreme example. Say we let Soulkey and Innovation play a thousand games and Innovation wins 710 of them; this is not unrealistic, nay? Should we then say, given our sample size of 1000, that ZvT at the very highest level of play is 29%? Certainly not. Most people agree that Innovation is simply better than Soulkey, and that TvZ is Inno's best matchup also heavily contributes to this very slanted statistic. This is what is going on in winrates, to a lesser extent.
To put it simpler: while the number of games might be significant, the number of players isn't. If you end up with a situation where 8 players contribute, say, 80% of the games played, because those 8 players are the best and keep going far in tournaments, you can't really derive a lot of meaning from those winrates. If one of those players is T and that one player is just that much better than the other 7, it gives the illusion that Terran is overpowered, while in reality we just have one really good Terran who contributes, say, 15% of all TvX games played because he keeps getting super far in tournaments.
None of this makes any statistical sense at all. Here is the basic idea of statistics: I have a population that I am interested in. In this case, my population would be 'pro' players. And now I want to get some information about this population. Well, to get the perfectly correct answer, what would I need? Hmm, I'd need every player to play every other player infinitely often, at the same time, under exactly the same conditions. You know what? That's probably not going to happen. I guess I'll just have to take a sample instead. But I have to be careful what sample I take; I have to make sure it is representative of the whole population. If it isn't representative of the target population I want, I should only be able to draw conclusions about the sub-population I have, then.
This is where your post makes absolutely no sense at all. Your 'extreme' example purely studies the sub-population of Soulkey and Innovation. Yes, you have a large sample size, but _obviously_ it cannot be extrapolated to make conclusions about the whole population. Why you would even consider this example baffles me. Your second example with 8 players is also totally moot; our data isn't 8 players. Even if it were, you could still derive plenty of information about your sample; the only problem is your data can only draw conclusions about your population of these 8 players.
Is the sample of WCS + premier tournaments, or whatever it is, going to be a reasonable representation? I would be more than happy to accept that as a reasonable assumption. In fact, I would find it difficult to argue otherwise.
The argument that 'oh look, Innovation always goes deep into tournaments, so it messes up the data' is also wrong. Just from basic inspection we can see it isn't just all Terrans at the top of tournaments. For every run made by Innovation, there is a deep run made by a player of some other race. So if we should remove Innovation's runs, we should probably remove all those other deep runs too? But if we remove all the deep runs... we have no data left. This leads into the next paragraph.
Nope, you can actually empirically justify just how dependent it is. Like I said before, if you remove a single player from the stats, the winrates can differ by as much as 2-4% in some months.
This is absolute nonsense. Like, it is correct, but it is statistical nonsense. If I pick and choose bits of data to remove, then I can make the results look like anything I like. Saying something like 'well, the winrates look very different if you take out Innovation' is so statistically wrong it's not funny. You cannot just go 'oh, those data points are weird, I better get rid of them.' As a rule of thumb, you should only remove data points if they are _errors_, not just because they're 'unusual'. Innovation is part of your population; removing him means your sample is now 'everyone except Innovation'. You could argue that this is an interesting population; I would argue this is a stupid population. Let's give another example of how ridiculous removing data can be. Jaedong was terrible at ZvP for a long time, then he came out at the WCS finals and annihilated every P player he came across. So say I'm looking at his data. Should I then go, 'well, the data from these last two days clearly buck the trend by a mile, and it stuffs up my results, you know what, I'm going to remove all those games from my sample so I have some more "stability" or something stupid like that'? Do you see how ridiculous that is? How do you think Jaedong would feel if you did that? Or maybe I'm analysing all the long term data, and I know JvP was terrible, equally terribly bad as Innovation's TvZ was terribly good, so okay, I should cut all JvP out of my data. Should I then cut the next worst player I can find too? Cos that one bucks the trend as well...? Where should I stop cutting then...
Let's do some maths now then shall we?
On September 04 2013 06:00 SiskosGoatee wrote: If the games were independent, yes, then the sample size would be meaningful. But because they are not independent, the sample size is completely meaningless and super small, which is why winrates shift around like crazy every month. This isn't some metagame shit that keeps perpetually shifting every month, this is statistical fluctuation. My intuition says the error bars are most likely more like 10 than 2. These numbers are almost completely meaningless.
Your intuition is completely off. If you are increasing your 'error bars' from 2 to 10, that's an increase of order 5. Basic statistics will tell you that the standard error shrinks at order 1/sqrt(n). To increase by an order of 5, you are effectively saying that in your data set only 1/25th of the data is useful, or '96% of my data is useless'. You're saying, in my data of 600 points, only 24 are useful? You appear to have no gut feeling for how powerful a sample size of 600 is. It is very large and very useful. You may have heard that you can use a normal approximation with a reasonable degree of accuracy on binomials if n*p > 5. If p is a half, then with just a sample size of 10 you can already do good approximations. Sure, this doesn't apply to what we're talking about at all, but hopefully it gives you a feeling for how fast approximations can become useful even with tiny sample sizes. As BigJ and I have been trying to tell you, 600 is a huge sample size. If you are unwilling to take this sample size of 600 as being useful, you might as well ignore a massive proportion of scientific studies.
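A quick check of that scaling argument (my own arithmetic, following the 1/sqrt(n) rule the poster cites):

```python
# The standard error goes as 1/sqrt(n), so multiplying it by 5
# means dividing the effective sample size by 25.
import math

n, p = 600, 0.5
se = math.sqrt(p * (1 - p) / n)
print(f"SE at n=600: {se:.4f}")                   # ~0.0204, about 2%

n_eff = n / 5**2                                  # sample size giving 5x the SE
print(f"n giving 5x the SE: {n_eff:.0f}")         # 24 points
print(f"check: {math.sqrt(p * (1 - p) / n_eff):.4f}")  # ~0.102, about 10%
```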
And just to nitpick some more,
On September 04 2013 10:25 SiskosGoatee wrote:
On September 04 2013 09:33 phwoar wrote: '1.9% confidence interval' is a terrible way of saying things; that's not even the width of the confidence interval, it's a quarter of that. You should just stick to giving the full confidence interval. Also, where you say 'standard deviation of error', what you mean is standard error, which is the estimate of the standard deviation of the sampling distribution of your statistic. There is no such thing as a standard deviation of error.
I didn't call it a 1.9% confidence interval, I said the standard deviation of it was 1.9%, which is something entirely different: it means that, on average with repeated experiments, the expectancy value of the difference between the head rate and the head probability is 1.9%.
a) Yes you did. Read your original post. b) That's actually not what standard deviation is, another common fallacy. Standard deviation is the square root of the average squared distance from the mean. Quite different from the average distance.
|
On September 05 2013 14:46 Big J wrote:
On September 05 2013 11:24 SiskosGoatee wrote:
On September 05 2013 04:26 Big J wrote:
On September 05 2013 02:37 SiskosGoatee wrote: The difference is that if you ask 100 random people, that is close to independent. This isn't even close to independent; this is the inverse, these games are extremely dependent on one another.
That's simply something we cannot know. Just because games are played in a series doesn't make them dependent on each other. We can talk about mindgames and rivalries all day long, but in the end, in the world of statistics, these things could just turn out to be very insignificant in the greater scale of things.
Nope, you can actually empirically justify just how dependent it is. Like I said before, if you remove a single player from the stats, the winrates can differ by as much as 2-4% in some months. Dependency isn't only things like mindgames in a BoX. Dependency is also that the games are played by the same players. I mean, let's reduce it to an extreme example. Say we let Soulkey and Innovation play a thousand games and Innovation wins 710 of them; this is not unrealistic, nay? Should we then say, given our sample size of 1000, that ZvT at the very highest level of play is 29%? Certainly not. Most people agree that Innovation is simply better than Soulkey, and that TvZ is Inno's best matchup also heavily contributes to this very slanted statistic. This is what is going on in winrates, to a lesser extent. To put it simpler: while the number of games might be significant, the number of players isn't. If you end up with a situation where 8 players contribute, say, 80% of the games played, because those 8 players are the best and keep going far in tournaments, you can't really derive a lot of meaning from those winrates. If one of those players is T and that one player is just that much better than the other 7, it gives the illusion that Terran is overpowered, while in reality we just have one really good Terran who contributes, say, 15% of all TvX games played because he keeps getting super far in tournaments.
The problem you are describing is not dependency; that's statistical stability. If you are dissatisfied with the stability, you can use a different way than the simple average (= winrate) to estimate the win chance. E.g. you could cut the top X TvZ and top X ZvT players. Or you could cut all the players with a winrate > X%. Or you could cut all players with "more than X games played". And after you cut those outliers, you can make your statistics. Or you could use the median instead of the average. If you are dissatisfied with how volatile the data is towards single outliers, then there are methods to improve that data for you.

Call it what you like; the point is that the assumptions from which the formulae for the confidence intervals are derived no longer apply. The confidence interval is meaningless at this point; it doesn't say anything any more, and it's unclear what it would even say.
Apart from that, the statistics themselves become meaningless. Surely you agree that if the ZvT winrate shifts from 54% to 56% when we remove a single player's games from the totality, those stats become meaningless to claim anything with? So if Innovation had never been born, that would suddenly change ZvT from balanced to imbalanced?
On September 06 2013 08:45 phwoar wrote:
None of this makes any statistical sense at all. Here is the basic idea of statistics: I have a population that I am interested in. In this case, my population would be 'pro' players. And now I want to get some information about this population. Well, to get the perfectly correct answer, what would I need? Hmm, I'd need every player to play every other player infinitely often, at the same time, under exactly the same conditions. You know what? That's probably not going to happen. I guess I'll just have to take a sample instead. But I have to be careful what sample I take; I have to make sure it is representative of the whole population. If it isn't representative of the target population I want, I should only be able to draw conclusions about the sub-population I have, then.

And that is exactly the problem, and simultaneously the point. The winrates each month are what they are: the win rate of games between pro players. But many people extend this to mean that it is some reflection of the 'balance' at the pro level. This is simply a jump you cannot make, because the number of pro players contributing to these games is too low. Furthermore, they don't all contribute to these games in similar measure; a select few elite players contribute to about half of the games.
This is where your post makes absolutely no sense at all. Your 'extreme' example purely studies the sub-population of Soulkey and Innovation. Yes, you have a large sample size, but _obviously_ it cannot be extrapolated to make conclusions about the whole population. Why you would even consider this example baffles me. Your second example with 8 players is also totally moot; our data isn't 8 players. Even if it were, you could still derive plenty of information about your sample; the only problem is your data can only draw conclusions about your population of these 8 players.

Please don't tell me you have never encountered an argument from continuum? I'm merely providing an extreme example of what happens in a situation where very few players contribute to very many games.
Is the sample of WCS + premier tournaments, or whatever it is, going to be a reasonable representation? I would be more than happy to accept that as a reasonable assumption. In fact, I would find it difficult to argue otherwise.

Yes, and where is the line? Where do you draw the line? Wherever you draw it, it is arbitrary. That's the point.
The argument that 'oh look, Innovation always goes deep into tournaments, so it messes up the data' is also wrong. Just from basic inspection we can see it isn't just all Terrans at the top of tournaments. For every run made by Innovation, there is a deep run made by a player of some other race. So if we should remove Innovation's runs, we should probably remove all those other deep runs too? But if we remove all the deep runs... we have no data left. This leads into the next paragraph.

Yeah, so what you are telling me is that if we have one Protoss, one Terran and one Zerg player who together have won every single tournament between the three of them, and always two of them meet in the finals and the third ends up in second place, then we can call this sample pool, in which 3 players contribute 80% of the games, a good pool as long as it's large enough, because there's one of each race in it?
This is absolute nonsense. Like, it is correct, but it is statistical nonsense. If I pick and choose bits of data to remove, then I can make the results look like anything I like. Saying something like 'well, the winrates look very different if you take out Innovation' is so statistically wrong it's not funny.

The point is that whether Innovation was born or not, or even chose to have a progaming career, is completely within the realm of chance, and this makes these statistics meaningless for saying anything about balance.
Do you really not understand that if one variable which is left completely within the realm of chance can drastically alter the statistical outcome, then the entire statistic is next to useless to conclude anything from?
Take it at face value if you want, but if you want to measure anything about a large group of players and one of those players contributes disproportionately to the sample pool, then the statistic, if you weigh every game the same, is not going to say anything meaningful. This is why you want a randomized sample. If you want to say something about TvZ in general, but 50% of the games you use happen to have the same player in them, then your sample doesn't say anything about TvZ in general. At that point it stops saying anything about balance; it says things about winrate only.
Let's do some maths now then shall we?
Your intuition is completely off. If you are increasing your 'error bars' from 2 to 10, that's an increase of order 5. Basic statistics will tell you that the standard error shrinks at order 1/sqrt(n). To increase by an order of 5, you are effectively saying that in your data set only 1/25th of the data is useful, or '96% of my data is useless'. You're saying, in my data of 600 points, only 24 are useful? You appear to have no gut feeling for how powerful a sample size of 600 is. It is very large and very useful. You may have heard that you can use a normal approximation with a reasonable degree of accuracy on binomials if n*p > 5. If p is a half, then with just a sample size of 10 you can already do good approximations. Sure, this doesn't apply to what we're talking about at all, but hopefully it gives you a feeling for how fast approximations can become useful even with tiny sample sizes. As BigJ and I have been trying to tell you, 600 is a huge sample size. If you are unwilling to take this sample size of 600 as being useful, you might as well ignore a massive proportion of scientific studies.

Yes, and this is exactly what happens every month in the winrates. The winrates fluctuate around 10% left and right every month; they go from 45% this month to 55% the next month. You can say these are 'metagame shifts', but let's assume they are not, and that they are merely statistical fluctuations, because the winrates really hop around in every random direction each month. If the expectancy value is indeed a 10% difference each month, then hey, the error bars might in fact be 10.
And just to nitpick some more,
On September 06 2013 08:45 phwoar wrote: a) Yes you did. Read your original post.

I didn't.
b) That's actually not what standard deviation is, another common fallacy. Standard deviation is the square root of the average squared distance from the mean. Quite different from the average distance.
Yes, and that happens to be exactly identical to the expectancy value of the deviation from the mean, given a normal distribution. I never said it was 'average distance', I said it was the expectancy value of deviation from the mean. I'm not even sure what 'average distance' is supposed to mean in a continuous normal distribution, since in theory the population should have no upper and lower bound.
|
On September 06 2013 16:30 SiskosGoatee wrote: Apart from that, the statistics themselves become meaningless. Surely you agree that if the ZvT winrate shifts from 54% to 56% when we remove a single player's games from the totality, those stats become meaningless to claim anything with?
Wrong. Removing a random data point is not the same as removing the known maximal data point. It's like studying economies of the world and wondering why your conclusions change when you remove the USA from your data. No shit it's going to have a huge effect. Further reading: distribution of order statistics, Pareto Principle, Zipf's Law.
On September 06 2013 16:30 SiskosGoatee wrote: Yes, and where is the line? Where do you draw the line? Wherever you draw it, it is arbitrary. That's the point.
Wrong. Where you draw the line is debatable, but it is not arbitrary. There are many reasonable places to draw the line. Your position---that we are unable to make any meaningful statistical conclusion whatsoever---is not one of them. You remind me of a climate change denialist in the way you instantly jump from "there are some uncertainties and flaws in the data" to "all of the conclusions must be wrong". Further reading: logical fallacies, argument from ignorance.
On September 06 2013 16:30 SiskosGoatee wrote: The winrates fluctuate around 10% left and right every month; they go from 45% this month to 55% the next month. You can say these are 'metagame shifts', but let's assume they are not, and that they are merely statistical fluctuations, because the winrates really hop around in every random direction each month. If the expectancy value is indeed a 10% difference each month, then hey, the error bars might in fact be 10.
Wrong. If the winrates range between 45% and 55%, then the maximum fluctuation from the sample mean is 5%, not 10%. Since we have several dozen or so monthly averages, the maximum fluctuation should be around 2-3 times the standard deviation. This would suggest that the standard deviation of the empirical distribution of monthly winrates is around 2%. I suspect you are just picking out numbers and making claims about them without any idea about the mathematical underpinnings of what a standard deviation means.
On September 06 2013 16:30 SiskosGoatee wrote: Yes, and that happens to be exactly identical to the expectancy value of the deviation from the mean, given a normal distribution. I never said it was 'average distance', I said it was the expectancy value of deviation from the mean. I'm not even sure what 'average distance' is supposed to mean in a continuous normal distribution, since in theory the population should have no upper and lower bound.

This fully confirms my suspicions. You have no clue what a standard deviation is. Please go back to your textbook, or Wikipedia, and learn what the word means before coming back and making arguments about it.
|
On September 08 2013 07:20 pirsq wrote:
On September 06 2013 16:30 SiskosGoatee wrote: Apart from that, the statistics themselves become meaningless. Surely you agree that if the ZvT winrate shifts from 54% to 56% when we remove a single player's games from the totality, those stats become meaningless to claim anything with?
Wrong. Removing a random data point is not the same as removing the known maximal data point. It's like studying economies of the world and wondering why your conclusions change when you remove the USA from your data. No shit it's going to have a huge effect. Further reading: distribution of order statistics, Pareto Principle, Zipf's Law.

If you're making a statistic about the effect of capitalism and it turns out 'yeah, capitalism leads to a very strong economy and prosperity, but removing one capitalist nation suddenly changes the story to it leading to poverty', is the stat then meaningful? No, not really.
To put it like this, let's take it to a more extreme example: say that removing Innovation switched TvZ from 55% to 40%. Would Zergs then be able to say 'yeah, see, Terran is stronger than Zerg in TvZ!'? No, it doesn't mean that any more; it just means that there is one Terran player who is scarily good, while overall TvZ is Z-favoured.
On September 06 2013 16:30 SiskosGoatee wrote: Yes, and where is the line? Where do you draw the line? Wherever you draw it, it is arbitrary. That's the point.
Wrong. Where you draw the line is debatable, but it is not arbitrary.

Debatable is the same thing as arbitrary, my god. If something becomes debatable then it is meaningless. It cannot be subjective; there have to be hard lines for numbers to make sense. There's a reason mathematics is called a hard, exact science: there is no debate in mathematics, something is true or it isn't.
There are many reasonable places to draw the line.

A purely subjective, non-objective evaluation, and therefore not the domain of science any more.
Your position---that we are unable to make any meaningful statistical conclusion whatsoever---is not one of them. You remind me of a climate change denialist in the way you instantly jump from "there are some uncertainties and flaws in the data" to "all of the conclusions must be wrong". Further reading: logical fallacies, argument from ignorance.

Climate change 'science' then again is not hard science. Obviously neither is denying it, but making any conclusion based on the current climate change calculations, positive or negative, is not hard science.
That said, any reasonable person would say that the data indicates that climate change is occurring and that it is caused by human influence. Just as throwing a coin 6 times and having it land 3 times on heads and 3 times on tails indicates that the coin is fair; obviously neither holds up to hard scientific rigour.
Wrong. If the winrates range between 45% and 55%, then the maximum fluctuation from the sample mean is 5%, not 10%. Since we have several dozen or so monthly averages, the maximum fluctuation should be around 2-3 times the standard deviation. This would suggest that the standard deviation of the empirical distribution of monthly winrates is around 2%. I suspect you are just picking out numbers and making claims about them without any idea about the mathematical underpinnings of what a standard deviation means.

But they don't range between 45 and 55, that is the point; they go up and down around 7% each month in every direction. It happens that they go from 47 this month, to 55 next month, to 59 again the next month. If they went up 10% this month, they aren't per se going down the next month.
On September 08 2013 07:20 pirsq wrote: This fully confirms my suspicions. You have no clue what a standard deviation is. Please go back to your textbook, or Wikipedia, and learn what the word means before coming back and making arguments about it.

Ah yes, the one vaunted argument people always use when they themselves have no clue what they are talking about: 'You have no idea what you're talking about, but I'm not going to tell you where you are wrong, find it out for yourself.' When people say that, you always know they don't know what they're talking about themselves.
|