Ladder-Balance-Data - Page 23

VediVeci
Profile Joined October 2011
United States82 Posts
July 13 2012 09:11 GMT
#441
On July 13 2012 13:40 lolcanoe wrote:
On July 13 2012 13:03 VediVeci wrote:

I'm not arguing that your methods aren't better; they probably are (I didn't read your post very closely). Your attacks have been pretty consistently derisive, rude, and especially condescending, though, in my opinion. And I know it's not a smoking gun, but his results seem pretty consistent with yours, so he didn't do too poorly.
Edit: clarity

He had at least a 50% chance of getting it right. I'm going to ignore the rest of the post so as not to encourage further irrelevance from posters who self-admittedly don't read things carefully.




That's the sort of stuff I'm talking about. Whether or not I gave the math in your post a thorough reading was irrelevant to mine, because I have been reading through almost everything else you've posted.

You talk down to everybody, and at least 3 people have called you out on it so far. Constructive criticism is great, but don't be so damn rude about it. This was a pretty respectful discussion, no need to be so vituperative.
Alexj
Profile Blog Joined July 2010
Ukraine440 Posts
Last Edited: 2012-07-13 11:35:04
July 13 2012 11:27 GMT
#442
On July 11 2012 04:01 skeldark wrote:
There is one other method that you can use to show trends:
you look at the change of MMR of a race over time!
Do players of race Z lose MMR? Do players of race X win MMR? This will happen after a patch. But perhaps it's not imbalance; perhaps it corrects an imbalance that was there from the beginning.

I would say this is the only way your data could become useful. Right now you have aggregated some MMR stats over 42 years (just kidding, I understand it is a few months, but still quite some time). There might have been a few metagame shifts and patch changes over that time, but your data doesn't reflect them, since some of the calculated MMR values can be 3 months old while others are from last week. If you only counted the MMR values calculated in the last week, it would in fact be actual balance data, and not something averaged out over many months. And if you could do it periodically, you would be able to show trends and shifts. You would also generate a lot of discussion (and by that I mean new waves of balance whine).

Edit: also, I am not sure if EU MMR and NA MMR have the same weight. These are two groups of accounts that never play with each other. I keep facepalming at sc2ranks, who also assume that points on different ladders have the same value. At least your data doesn't mix in the KR server, which would completely break everything.
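
A minimal sketch of the weekly re-aggregation described above, in Python with pandas; the file name and column names are assumptions for illustration, not the actual format of skeldark's data:

```python
import pandas as pd

# Hypothetical export: one row per calculated MMR value, with the date it was
# calculated, the player's race, and the MMR estimate.
df = pd.read_csv("mmr_estimates.csv", parse_dates=["calculated_at"])

# Option 1: keep only estimates from the last 7 days, so stale values from
# months ago do not dilute the current average.
cutoff = df["calculated_at"].max() - pd.Timedelta(days=7)
print(df[df["calculated_at"] >= cutoff].groupby("race")["mmr"].mean())

# Option 2: a full weekly timeline of average MMR per race, which is what
# would make trends and patch effects visible over time.
weekly = (
    df.groupby([pd.Grouper(key="calculated_at", freq="W"), "race"])["mmr"]
      .agg(["mean", "count"])
)
print(weekly)
```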
More GGs, more skill
Jadoreoov
Profile Joined December 2009
United States76 Posts
July 13 2012 12:24 GMT
#443
On July 13 2012 17:31 Thrombozyt wrote:
On July 13 2012 10:23 Jadoreoov wrote:
First off I'd like to point out that the normality of the data doesn't really matter because of the Central Limit Theorem, so please stop discussing that like it matters.

Continuing with lolcanoe's analysis, I found the 99% confidence intervals for the difference in mean for each group.

US and EU:
ZvT
(51.5, 118.8)
PvT
(28.9, 99.6)
ZvP
(-11.1, 53.2)


On July 13 2012 11:20 Jadoreoov wrote:
Done:

95% confidence intervals for the EU and US combined:
ZvT:
(59.5, 110.7)
PvT
(37.3, 91.2)
ZvP
(-3.7, 45.5)

US vs EU
(28.5, 70.5)


Shouldn't the interval in which the mean can fall become larger as you lower your level of confidence?


No, the 95% confidence interval should be smaller.

It is similar to if someone asked you to guess a number between 0 and 100.
If you guessed that it was exactly 50, you wouldn't be very confident (a narrow interval, low confidence).
If you guessed that it was between 1 and 99, you would be pretty confident that you'd be correct (a wide interval, high confidence).

In each calculation the data itself gives us the same amount of uncertainty, so to be more confident in our interval we have to include a greater range of values.
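
To make the comparison concrete, here is a minimal Python sketch using the Zerg and Terran means and SDs quoted elsewhere in the thread; the per-race sample size n is a placeholder, since the interval width also depends on how many accounts are in each group:

```python
import numpy as np
from scipy import stats

def diff_ci(mean1, sd1, n1, mean2, sd2, n2, confidence):
    """Normal-approximation confidence interval for the difference of two means."""
    diff = mean1 - mean2
    se = np.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)   # standard error of the difference
    z = stats.norm.ppf(0.5 + confidence / 2)      # two-sided critical value
    return diff - z * se, diff + z * se

n = 5000  # hypothetical number of accounts per race
for conf in (0.95, 0.99):
    lo, hi = diff_ci(1672.13, 495.31, n, 1559.21, 546.13, n, conf)
    print(f"Zerg - Terran mean MMR, {conf:.0%} CI: ({lo:.1f}, {hi:.1f})")
# The 99% interval comes out wider than the 95% one, as explained above.
```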
skeldark
Profile Joined April 2010
Germany2223 Posts
July 13 2012 13:31 GMT
#444
On July 13 2012 20:27 Alexj wrote:
On July 11 2012 04:01 skeldark wrote:
There is one other method that you can use to show trends:
you look at the change of MMR of a race over time!
Do players of race Z lose MMR? Do players of race X win MMR? This will happen after a patch. But perhaps it's not imbalance; perhaps it corrects an imbalance that was there from the beginning.

I would say this is the only way your data could become useful. Right now you have aggregated some MMR stats over 42 years (just kidding, I understand it is a few months, but still quite some time). There might have been a few metagame shifts and patch changes over that time, but your data doesn't reflect them, since some of the calculated MMR values can be 3 months old while others are from last week. If you only counted the MMR values calculated in the last week, it would in fact be actual balance data, and not something averaged out over many months. And if you could do it periodically, you would be able to show trends and shifts. You would also generate a lot of discussion (and by that I mean new waves of balance whine).

Edit: also, I am not sure if EU MMR and NA MMR have the same weight. These are two groups of accounts that never play with each other. I keep facepalming at sc2ranks, who also assume that points on different ladders have the same value. At least your data doesn't mix in the KR server, which would completely break everything.


EU and NA MMR are very close to each other.
I have users who have both EU and US accounts, and they have very similar MMR on both.

The data is from 3 weeks, not more.


Save gaming: kill esport
lolcanoe
Profile Joined July 2010
United States57 Posts
Last Edited: 2012-07-13 14:33:23
July 13 2012 13:46 GMT
#445
On July 13 2012 14:34 Cascade wrote:
Ok, let me prove it for you then.
My claim is that if the set of samples is large enough, we can use the normal distribution with S/sqrt(N) width to estimate the errors. For simplicity, let me prove that the 2*S/sqrt(N) interval is close to 95%:

Let the distribution f(x) have an average 0 and standard deviation S. An average X from a sufficiently large (specified in the proof) set of N samples from f(x) will fall within 2*S/sqrt(N) of the average 0 with a probability between 0.93 and 0.97.
proof:
Calculating the average x from N samples (from many different sets, each of N samples) will give a distribution of averages A_N(x) that approaches a normal distribution as N goes to infinity, centred around 0, and with a width of S/sqrt(N). This is the CLT.

Specify "sufficiently large N" such that A_N(x) is similar to a normal distribution g(x) of width S/sqrt(N). Close enough so that the integral from -2*S/sqrt(N) to 2*S/sqrt(N) is between 0.97 and 0.93 (it is close to 0.95 for g). As A_N approaches g as N-->infty, this will happen for some N. The more similar f(x) is to a normal distribution, the lower N is required.

Now take a single average X from f(x), using N samples (this would be the OP). This average is distributed according to A_N(x), and with a sufficiently large N, the probability that X is between -2*S/sqrt(N) and 2*S/sqrt(N) is larger than 0.93, and smaller than 0.97. QED.

No, reread part A. The claim that the sample distribution approaches normality only applies when the population data itself is normal. This is extraordinarily intuitive as you watch your sample size approach the entire population. In your claim here, you used a standard (normal) distribution around a known average to describe a population. In our data, we do not know if SDs can be applied to the population, as the SDs we are calculating are really only accurate for Gaussian distributions.

It is a common misapplication of the CLT to state that a sample size of 30 guarantees approximate normality. That rule of thumb tends to hold only because populations tend to be normally distributed. To be mathematically precise, the correct statement is that with a sufficient number of samples, each of size at least ~30, the distribution of the means of these samples will begin approaching normality, with only slight regard to the original distribution.

The normality test is essential when running the two-sided t-test if you want to be thorough when dealing with an unknown population distribution. The textbook, wiki, and other websites have confirmed it. I do not understand why this question persists.

Edit: I should further add that the tendency of sample means (and of samples themselves in normal populations) to approach normality only occurs when the sample is RANDOMLY procured. In this case it is clearly NOT random (we have different population means vs. sample means), so the normality test is ABSOLUTELY a reasonable thing to be concerned about.
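
A small simulation of the general point being argued here (illustrative only: it draws from a deliberately skewed toy distribution, not from the ladder data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A clearly non-normal "population" (exponential, skewness = 2).
population = rng.exponential(scale=1.0, size=1_000_000)

for n in (5, 30, 500):
    # 10,000 independent *random* samples of size n, each reduced to its mean.
    means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    print(f"n = {n:3d}  skewness of the sample means: {stats.skew(means):+.3f}")

# The skewness of the distribution of means shrinks toward 0 (normal-like) as n
# grows, but the caveat in the post stands: this only holds when each sample is
# drawn at random from the population of interest.
```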
Treehead
Profile Blog Joined November 2010
999 Posts
Last Edited: 2012-07-13 14:37:32
July 13 2012 14:33 GMT
#446
To me, what the result here indicates is the opposite of what a lay person would think from reading the post.

I'd read your results as "this means that Terran players in general have a lower MMR". But based on your data:

Analysis

"Terran Average MMR, STD
1559.214909, 546.131097

Protoss Average MMR, STD
1620.764863, 509.5809733

Zerg Average MMR, STD
1672.129547, 495.3121321"

What the above seems to imply is that, although the average Terran player included in the study has a lower MMR, as you go higher up the distribution, MMR seems to be higher for Terran than for the other races. In particular, Mean + 2*STDev (the cutoff for the top 5% of a normal distribution) is:

T 2651.47
P 2639.92
Z 2662.75

This gives us much different-looking results. As we strive to study arbitrarily good players (as player skill increases over time), I would think we'd want to look more heavily at the implications of Terran's higher STDEV.
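
For reference, a quick check of these cutoffs under the normality assumption (note that mean + 2*SD is roughly the 97.7th percentile of a normal distribution, while the exact one-sided top-5% cutoff uses z of about 1.645):

```python
from scipy.stats import norm

# Means and SDs quoted above (Terran, Protoss, Zerg).
races = {"T": (1559.21, 546.13), "P": (1620.76, 509.58), "Z": (1672.13, 495.31)}

for race, (mean, sd) in races.items():
    plus_two_sd = mean + 2 * sd                      # cutoff used in the post (~97.7th percentile)
    top_5_pct = norm.ppf(0.95, loc=mean, scale=sd)   # exact 95th percentile
    print(f"{race}: mean + 2*SD = {plus_two_sd:.2f}, top-5% cutoff = {top_5_pct:.2f}")
```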

A question

Are you sure you can assume normality here? How well do your distributions fit a normal distribution having the same mean and St dev? The reason I ask, of course, is that a T-test can only be used meaningfully on normal distributions.

If normality doesn't fit so well, I'd recommend M. A. Stephens' article on k-sample Anderson-Darling tests, which use ranking and therefore need only continuity as an assumption to move forward.

Edit: Link to the test I'm referring to: http://www.cithep.caltech.edu/~fcp/statistics/hypothesisTest/PoissonConsistency/ScholzStephens1987.pdf.
lolcanoe
Profile Joined July 2010
United States57 Posts
Last Edited: 2012-07-13 15:00:37
July 13 2012 14:59 GMT
#447
On July 13 2012 23:33 Treehead wrote:

Analysis

"Terran Average MMR, STD
1559.214909, 546.131097

Protoss Average MMR, STD
1620.764863, 509.5809733

Zerg Average MMR, STD
1672.129547, 495.3121321"


Are you sure you can assume normality here? How well do your distributions fit a normal distribution having the same mean and St dev? The reason I ask, of course, is that a T-test can only be used meaningfully on normal distributions.

If normality doesn't fit so well, I'd recommend M. A. Stephens' article on k-sample Anderson-Darling tests, which use ranking and therefore need only continuity as an assumption to move forward.


I'd really suggest reading my post again, as it already includes the Anderson-Darling test! See the probability plot curve and the associated p-value, which was computed using the Anderson-Darling test in Minitab. Anyway, let me be a little more precise about what you are saying and address the points one at a time. Can we assume normality? No. However, in this case the Anderson-Darling test result is inconclusive. Keep in mind, Anderson-Darling tends to be OVERLY powerful with large sample sizes. Your best bet is actually looking at the fitted histogram to judge approximate normality yourself! To me, given the hugely significant p-values far under .01 and no strong evidence of non-normality, I'd say that we can put the majority of these concerns to rest.

Now what is more interesting is that we have massive standard deviations and relatively low actual differences. The two-sample t-test only tests whether or not the sample means are EXACTLY equal - the magnitude of the difference should not be inferred directly from the p-value, but rather through observation. For instance, with two samples each of size 1 billion, even a negligible actual MMR difference would result in very low p-values. It has to be up to the interpreter to decide whether the maximum 7% difference between T and Z is effectively significant (and not just statistically significant).
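
A quick simulated illustration of this sample-size effect (toy data with a similar SD, not the actual ladder dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two simulated groups whose true means differ by only 5 MMR (less than a third
# of a single ladder game at +/-16 MMR), with SDs similar to those reported.
a = rng.normal(loc=1600.0, scale=500.0, size=1_000_000)
b = rng.normal(loc=1605.0, scale=500.0, size=1_000_000)

result = stats.ttest_ind(a, b, equal_var=False)
print(f"observed difference: {b.mean() - a.mean():.2f} MMR, p-value: {result.pvalue:.2e}")
# The p-value is essentially zero even though the difference is practically
# negligible, which is why the magnitude has to be judged separately.
```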

I hope that addresses your concerns.


skeldark
Profile Joined April 2010
Germany2223 Posts
Last Edited: 2012-07-13 16:43:40
July 13 2012 16:42 GMT
#448
OK, I have a hard question for you guys.

If I want to publish the average MMR of the data as a timeline,
what is the minimum number of profiles to still be accurate?
Can someone test a weekly / monthly update?





Save gaming: kill esport
Treehead
Profile Blog Joined November 2010
999 Posts
Last Edited: 2012-07-13 18:06:50
July 13 2012 18:01 GMT
#449
On July 13 2012 23:59 lolcanoe wrote:
On July 13 2012 23:33 Treehead wrote:

Analysis

"Terran Average MMR, STD
1559.214909, 546.131097

Protoss Average MMR, STD
1620.764863, 509.5809733

Zerg Average MMR, STD
1672.129547, 495.3121321"


Are you sure you can assume normality here? How well do your distributions fit a normal distribution having the same mean and St dev? The reason I ask, of course, is that a T-test can only be used meaningfully on normal distributions.

If normality doesn't fit so well, I'd recommend M. A. Stephens' article on k-sample Anderson-Darling tests, which use ranking and therefore need only continuity as an assumption to move forward.


I'd really suggest reading my post again, as it already includes the Anderson-Darling test! See the probability plot curve and the associated p-value, which was computed using the Anderson-Darling test in Minitab. Anyway, let me be a little more precise about what you are saying and address the points one at a time. Can we assume normality? No. However, in this case the Anderson-Darling test result is inconclusive. Keep in mind, Anderson-Darling tends to be OVERLY powerful with large sample sizes. Your best bet is actually looking at the fitted histogram to judge approximate normality yourself! To me, given the hugely significant p-values far under .01 and no strong evidence of non-normality, I'd say that we can put the majority of these concerns to rest.

Now what is more interesting is that we have massive standard deviations and relatively low actual differences. The two-sample t-test only tests whether or not the sample means are EXACTLY equal - the magnitude of the difference should not be inferred directly from the p-value, but rather through observation. For instance, with two samples each of size 1 billion, even a negligible actual MMR difference would result in very low p-values. It has to be up to the interpreter to decide whether the maximum 7% difference between T and Z is effectively significant (and not just statistically significant).

I hope that addresses your concerns.




My bad - you already did some of the work I suggested. Honestly, I didn't read most of the thread terribly closely except the OP, which I read over a couple times to make sure he hadn't posted anything definitive on this.

Here's the thing though. Maybe you'll get better p-values to convince ourselves of normality. But maybe you won't. 0.05-0.1 isn't bad, and if the T-test returns as good a result as stated in the OP, I doubt you'll get worse than .05 on the Anderson-Darling test if the thing is anywhere close to normal. My suggestion (which can be ignored without any hard feelings) is that if we want this to be clear of scrutiny, we can remove normality concerns by just using Anderson-Darling to compare the races to begin with, instead of saying something like "well, you can almost reject the null at a significance value of 0.05 - so hopefully the reader is convinced..." when you can just skip that part. My suspicion is that A-D results will be just as low anyway - but in a serious study (which this doesn't have to be), you'd want to post those values, and not the T-test ones, because there's likely no downside to doing so.
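
A minimal sketch of that direct k-sample comparison, assuming the per-race MMR values were exported as plain arrays (SciPy's anderson_ksamp implements the Scholz-Stephens test; note that the significance level it reports is approximate and, in most SciPy versions, clipped to roughly the 0.001-0.25 range):

```python
import numpy as np
from scipy import stats

# Hypothetical per-race MMR exports from the dataset, one value per account.
terran = np.loadtxt("terran_mmr.txt")
protoss = np.loadtxt("protoss_mmr.txt")
zerg = np.loadtxt("zerg_mmr.txt")

# k-sample Anderson-Darling test: are the three samples drawn from the same
# (unspecified, continuous) distribution? No normality assumption is needed.
result = stats.anderson_ksamp([terran, protoss, zerg])
print("A-D statistic:", result.statistic)
print("critical values:", result.critical_values)
print("approximate significance level:", result.significance_level)
```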

I completely agree with your assertion that the differences are rather low compared to the mean and stdev values. I wish this were more clearly reflected in the OP - as it would be easier to interpret for someone with a limited numerical background.

And of course, the predominant concern I always have with using statistics to begin with is that pdfs are created with the unwritten assumption that your data (and hence, winning and losing) is analogous to a random variable, which is much harder to back up than any concerns about normality. I think that this is probably the reason for the large stdev and small differences seen in the data - because as time goes on, playstyles evolve, so we aren't looking at one set of distributions, we're looking at many sets of distributions which change over time as playstyles evolve and devolve.

For example, I'm guessing 1-1-1 is still reasonably effective in master's TvP these days. Maybe next month, though, some protoss badass comes out with a build that doesn't just beat it - it CRUSHES the 1-1-1 and puts you in a good spot against other builds as well. This might show in our data as a downswing in Terran MMR, but really what's happening is a metagame shift. The pdf for MMRs of TvPers doing 1-1-1 and the pdf for MMRs of TvPers doing other builds are almost assuredly different - especially when our new TvP strat is... new. Maybe I'm wrong, but this example was a hypothetical anyway. Point is - builds are still changing quite a bit, and combining pdfs always gives us weird looking data.

Edit: I don't mean to be dismissive here. The work done is really great (and far better than other stats workups I've seen on these boards); it deserves credit and it does have some meaning to it. I only include this in the discussion above for the sake of good bookkeeping on assumptions.

Also, if more data continues to be gathered, maybe enough will be obtained to use the data as a time series (which it is), rather than as a sample. Just some thoughts. Keep up the good analysis, though. I liked reading all this. Good to see some other quanty nerds in here.
lolcanoe
Profile Joined July 2010
United States57 Posts
Last Edited: 2012-07-13 18:15:53
July 13 2012 18:14 GMT
#450
Skeldark - the number of profiles you'd want depends on the size of the confidence interval you want around a given mean. If you wanted to make these calculations, you'd need to use Excel's Solver plugin to work back from interval size to sample size. Alternatively, you could guess and check to approximate it.
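
A closed-form alternative to the Solver approach, assuming the sampling distribution of the average is roughly normal: the half-width of the interval is z * s / sqrt(n), so n = (z * s / half-width)^2. A sketch:

```python
import math
from scipy import stats

def required_profiles(sd, half_width, confidence=0.95):
    """Profiles needed so the CI on an average MMR has the requested half-width."""
    z = stats.norm.ppf(0.5 + confidence / 2)
    return math.ceil((z * sd / half_width) ** 2)

# With per-race SDs around 500 MMR, as in the figures quoted above:
print(required_profiles(sd=500, half_width=10))   # ~9604 profiles per race and period
print(required_profiles(sd=500, half_width=25))   # ~1537
```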

On July 14 2012 03:01 Treehead wrote:
My suggestion (which can be ignored without any hard feelings) is that if we want this to be clear of scrutiny, we can remove normality concerns by just using Anderson-Darling to compare the races to begin with, instead of saying something like "well, you can almost reject the null at a significance value of 0.05 - so hopefully the reader is convinced..." when you can just skip that part. My suspicion is that A-D results will be just as low anyway - but in a serious study (which this doesn't have to be), you'd want to post those values, and not the T-test ones, because there's likely no downside to doing so.

My experience is that the A-D test is actually not as common as you think, especially given its tremendous sensitivity at high sample sizes. It's much more common to show a fitted histogram, as I've done, to show that approximate normality is fulfilled.

The purpose here is simply to show that the SDs are relevant calculations. If 1 SD covers 68% of the normalized data but in actuality 72% of the real data, it's not a terrible problem when you're making observations 3 SDs down the line, as the majority of your error is going to be somewhat centralized.

On July 14 2012 03:01 Treehead wrote:
I completely agree with your assertion that the differences are rather low compared to the mean and stdev values. I wish this were more clearly reflected in the OP - as it would be easier to interpret for someone with a limited numerical background.

Yes. But defining "effectively significant" here is difficult.

On July 14 2012 03:01 Treehead wrote:
And of course, the predominant concern I always have with using statistics to begin with is that pdfs are created with the unwritten assumption that your data (and hence, winning and losing) is analogous to a random variable, which is much harder to back up than any concerns about normality. I think that this is probably the reason for the large stdev and small differences seen in the data - because as time goes on, playstyles evolve, so we aren't looking at one set of distributions, we're looking at many sets of distributions which change over time as playstyles evolve and devolve.

The high SD values for the lower means were surprising to me too. Typically you'd expect it to be the other way around. I would be cautious about drawing any real conclusions from that, though...

On July 14 2012 03:01 Treehead wrote:
For example, I'm guessing 1-1-1 is still reasonably effective in master's TvP these days. Maybe next month, though, some protoss badass comes out with a build that doesn't just beat it - it CRUSHES the 1-1-1 and puts you in a good spot against other builds as well. This might show in our data as a downswing in Terran MMR, but really what's happening is a metagame shift. The pdf for MMRs of TvPers doing 1-1-1 and the pdf for MMRs of TvPers doing other builds are almost assuredly different - especially when our new TvP strat is... new. Maybe I'm wrong, but this example was a hypothetical anyway. Point is - builds are still changing quite a bit.

You've left the scope and purpose of this study, so I'm not sure if I should answer that.
skeldark
Profile Joined April 2010
Germany2223 Posts
Last Edited: 2012-07-13 18:24:38
July 13 2012 18:21 GMT
#451
Skeldark - the number of profiles you'd want depends on the size of the confidence interval you want around a given mean. If you wanted to make these calculations, you'd need to use Excel's Solver plugin to work back from interval size to sample size. Alternatively, you could guess and check to approximate it.

The day I install Excel, I'll buy a Mac, quit programming, and never look in the mirror again...

I will wait and split the data into timelines in the near future; if it works out, I'll just go on from there.
The problem is, I gain new users and lose old ones, so my incoming data is not as stable as I'd wish.


Save gaming: kill esport
Treehead
Profile Blog Joined November 2010
999 Posts
Last Edited: 2012-07-13 19:22:03
July 13 2012 19:21 GMT
#452
On July 14 2012 03:14 lolcanoe wrote:

The high SD values for the lower means were surprising to me too. Typically you'd expect it to be the other way around. I would be cautious about drawing any real conclusions from that, though...

...

You've left the scope and purpose of this study, so I'm not sure if I should answer that.


Of course I'll be cautious. When confidence cannot accurately be assessed, people tend to be overconfident when the idea is their own and overcritical when it isn't. I'd be foolish to ignore that and proceed as though I were right about my "multiple distributions" theory.

If I were right, though, it wouldn't be statistically provable without knowing more about each game and qualitatively sorting different types of games into different categories - which a person couldn't really do for thousands of games without a lot more work involved. You could try to place the games in some kind of pockets based on what info is known (such as time) and perform some kind of goodness-of-fit analysis, but fit and disparity never prove a theory; they only show that the data is what a theory would expect - which is less than useful. When something is not statistically provable, then, it must remain a theory. You have to admit, though, that the idea of varying MMR pdfs for varying builds in varying matchups is at least qualitatively plausible, I hope.

The paragraph you mention that has "left the scope of the study" was just a random example illustrating my theory. Don't read more into it than that.
cndaks
Profile Joined June 2012
United States95 Posts
July 14 2012 02:23 GMT
#453
Nice job taking the time to do this and informing all of us!
xelnaga_empire
Profile Joined March 2012
627 Posts
July 15 2012 04:31 GMT
#454
This data shows Blizzard needs to buff Terran to bring back balance to the game. I hope somebody at Blizzard looks at this data because they need to realize the game has balance issues at this moment.
themell
Profile Joined February 2011
43 Posts
July 15 2012 07:27 GMT
#455
Is it possible to see what average time it takes for a race to win?

For example, if the TvZ win ratio in the early game is 50%, then we can say the early game is fair. But if we then see that TvZ in the late game is a 20% win rate for Terran, we can say Terrans are having difficulty in the late game.
Crashburn
Profile Blog Joined October 2010
United States476 Posts
July 15 2012 07:29 GMT
#456
@ xelnaga_empire

ಠ_ಠ

skeldark
Profile Joined April 2010
Germany2223 Posts
July 15 2012 07:33 GMT
#457
On July 15 2012 16:27 themell wrote:
Is it possible to see what average time it takes for a race to win?

For example, if the TvZ win ratio in the early game is 50%, then we can say the early game is fair. But if we then see that TvZ in the late game is a 20% win rate for Terran, we can say Terrans are having difficulty in the late game.

Yes, and even way more accurately.
I don't have time at the moment, but the data is there.
Save gaming: kill esport
skeldark
Profile Joined April 2010
Germany2223 Posts
Last Edited: 2012-07-15 11:56:02
July 15 2012 11:53 GMT
#458
Updated the results with a lot of stats:

Result


Source Main Data

- The data is biased towards EU/US and towards higher skill ratings.

Game count: 125976
SC2 accounts: 45203

- Worst to best player: 3200 MMR
- One average win/loss on ladder: +16 / -16 MMR

TIME Filter: only between 1 Jan 1970 00:00:00 GMT - 12 Jul 2012 16:52:47 GMT


Average MMR per Race
Race account count: 15814
Data average MMR: 1539.46

Difference in average MMR per Matchup:
T-P: -62.14
T-Z: -117.03
P-Z: -54.89




Average Win-ratio per Race


TvP 50.43 Games: 6700
TvZ 46.7 Games: 8118
PvZ 51.61 Games: 9189



Win-ratio per Race over Game-Time

TvP

game length, % T win, % P win, % of games
0,44.9,55.1,3.66
5,40.71,59.29,13.9
10,58.32,41.68,24.21
15,59.7,40.3,24.78
20,45.72,54.28,18.31
25,37.79,62.21,9.16
30,35.04,64.96,3.49
35,46.71,53.29,2.49

TvZ
game length, % T win, % Z win, % of games
0,37.13,62.87,3.78
5,33.78,66.22,9.15
10,46.91,53.09,15.96
15,52.51,47.49,22.12
20,47.88,52.12,22.9
25,44.36,55.64,14.3
30,50.0,50.0,6.65
35,48.08,51.92,5.12

PvZ

game length, % P win, % Z win, % of games
0,47.38,52.62,4.57
5,38.3,61.7,11.39
10,59.72,40.28,25.07
15,50.17,49.83,25.36
20,49.97,50.03,17.34
25,53.21,46.79,9.14
30,51.0,49.0,4.37
35,58.89,41.11,2.75
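
One caveat when reading these per-bucket winrates: the shortest and longest buckets contain far fewer games, so their percentages are much noisier. A rough sketch of the binomial uncertainty for the TvZ table, approximating per-bucket game counts from the shares above (and assuming the buckets are game length in minutes):

```python
import math

# TvZ rows from the table above: (game-length bucket, % T win, % of all TvZ games).
TOTAL_TVZ_GAMES = 8118
rows = [
    (0, 37.13, 3.78), (5, 33.78, 9.15), (10, 46.91, 15.96), (15, 52.51, 22.12),
    (20, 47.88, 22.90), (25, 44.36, 14.30), (30, 50.00, 6.65), (35, 48.08, 5.12),
]

for bucket, t_win_pct, share_pct in rows:
    n = TOTAL_TVZ_GAMES * share_pct / 100               # approximate games in this bucket
    p = t_win_pct / 100
    half_width = 2 * math.sqrt(p * (1 - p) / n) * 100   # ~95% margin, in percentage points
    print(f"bucket {bucket:>2}: {t_win_pct:5.2f}% T win +/- {half_width:4.1f} pp (n ~ {n:.0f})")
```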
Save gaming: kill esport
Methy
Profile Joined November 2010
United Kingdom74 Posts
July 15 2012 13:13 GMT
#459
This is fantastic work, well done.

I'd just like to make the (obvious) point that the concept of an 'instantaneous balance' is a bad one that should be ignored. As skeldark has said many times, one of the ways to detect imbalance is to track the MMR of the player base over time - I'd argue that this is the only reasonable way to do it. A sufficiently large sample of games gathered in a small time period is rather meaningless for the 'balance' of a game, especially given the game's competitive nature and the way balance is completely tied to perception.

To give an example, if you had built a sample of games in the month following the NASL season 1 final, you probably would've seen an 'imbalance' in TvP - players of those races that had equal MMRs before Puma unveiled 1/1/1 would not have a 50% winrate once 1/1/1 became common. As such there would be a short term spike in TvP winrates, and the Protoss average MMR would drop until this winrate normalised to some extent. This would produce a corresponding rise in PvZ winrates as Protoss players are getting matched against zergs with a lower MMR than they're used to facing and nothing significant has changed in the matchup.

As such, a development in the TvP matchup influences PvZ winrates, and this happens fairly consistently at all MMR ranges (with the possible exception of the bottom-end MMR range). The only way you can distinguish the development of 1/1/1 from 'imbalance' in PvZ is by monitoring the MMRs over a sufficiently long time.

Furthermore, does this mean the game is 'imbalanced'? Not even remotely. 1/1/1 was eventually solved without significant patching (immortal range is the only really important change), but before the solution was found no one could claim to know a solution would be found, so how could we comment on balance? Well, we couldn't at the time... we needed to let games be played over a long enough period; then, only after months and months of 1/1/1 dominance could we possibly conclude that that particular 'strategy' was overpowered.

But the crucial point is that this works in the other direction as well. Let's assume that all 3 races have a player base with identical MMR distributions and all matchups have a 50-50 winrate. This doesn't mean the game is 'balanced' - someone might think up a strategy that causes one race to gain a significant advantage and is never overcome. Thus, to determine 'balance' we need to be analysing a period of years, not months - a position we are now easily able to monitor thanks to skeldark's efforts.

But the main point I'm trying to make here is that balance is actually largely tied to perception and nothing more. The root of the problem lies in the fact that we're using one word to describe multiple concepts. If we say 'players of equal skill should get to the same place regardless of race choice,' we are being utterly foolish. What is meant by skill? Sheer mechanical speed? Strategising ability? On-the-fly decision making? There are so many factors in what constitutes 'skill' that you can't possibly settle on a general, universal definition.

In fact, I'd like to explicitly make the point that it is a BAD thing if a player gets to exactly the same MMR with all three races - this is a sign of a one-dimensional game. I am a person possessed of certain abilities - those abilities happen to align with the skillset required by one particular race more than the others - hence I play that race, and accept that if I switch race I will not perform as well.

If we then ignore people using balance whine as a crutch to justify their own poor performance, we can only begin to talk about balance 'at the highest levels of the game.' The beauty of the game lies in the fact that 'balance' is inseparable from the 'distribution' of human abilities. If we genuinely cared about the game being balanced, we would have to care about 'the best possible player of StarCraft 2' - which would undoubtedly be a computer AI possessed of unlimited APM that we don't quite have the ability to code yet. All we truly care about is A) the perception that, over a sufficient period of time, all three races perform 'equally well' at the highest level of human ability (i.e. tournaments), and B) that active innovation is occurring.

I realise I've ranted on for quite some time and I must apologise, but +10 points if you managed to read this entire post.
<3 Nony
Methy
Profile Joined November 2010
United Kingdom74 Posts
Last Edited: 2012-07-15 13:24:05
July 15 2012 13:16 GMT
#460
I'd actually just like to follow up with a far simpler single statement that I believe cuts right to the point:

If you do not believe that the 'overpowered and imbalanced' race is the one you are playing, then you've chosen the wrong race - balance is a function of the skill set required by a particular race matched to the corresponding distribution of skills in the human population. Your race should always feel like the 'easiest race' for *any* player at *any* skill level, or your abilities simply do not match up with those required by the race you've chosen. As such, the best way to determine 'balance' is actually just to look at the percentage of players on each race over a long period of time: as long as there are equal numbers of each race in any given bracket, you can flat-out conclude the game is 'balanced' in the only meaningful sense of the word.
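
A minimal sketch of that check for a single bracket, assuming per-race player counts are available (the counts below are hypothetical, used only to show the mechanics):

```python
from scipy.stats import chisquare

# Hypothetical race split within one ladder bracket; under the criterion above,
# "balanced" means the three counts should be roughly equal.
counts = {"Terran": 5100, "Protoss": 5240, "Zerg": 5474}

stat, p = chisquare(list(counts.values()))  # null hypothesis: equal expected counts
print(f"chi-square = {stat:.1f}, p = {p:.4f}")
# A small p-value flags a non-uniform race split in this bracket; tracking this
# per bracket over a long period is the long-run check described above.
```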
<3 Nony