|
On July 13 2012 10:23 Jadoreoov wrote: First off I'd like to point out that the normality of the data doesn't really matter because of the Central Limit Theorem, so please stop discussing that like it matters. Continuing with lolcanoe's analysis, I found the 99% confidence intervals for the difference in mean for each group.

Race results:
For US: ZvT (62.0, 164.6), PvT (8.9, 115.0), ZvP (3.3, 99.4)
For EU: ZvT (19.6, 108.6), PvT (18.3, 113.2), ZvP (-45.3, 42.0)
US and EU: ZvT (51.5, 118.8), PvT (28.9, 99.6), ZvP (-11.1, 53.2)

As for US vs EU, the 99% confidence interval for the mean difference in MMR is (21.9, 77.1). For each interval a positive difference indicates the mean of the first population is higher than the second, so for US vs EU it reads: 99% of such samplings will yield a result such that the mean MMR of the US player base is between 21.9 and 77.1 MMR higher than that of the EU player base.

The meaning of a 99% confidence interval for the mean is as follows: if we were to randomly pick samples of the same size* from each population and found the difference of the means between the groups, 99% of such samplings would result in a difference of means within the given interval.

*By same size I mean the same sizes as were sampled to construct the interval, so if the interval were constructed by sampling 10 Zergs and 15 Protosses, it would be random samples of 10 and 15, respectively.

I've provided the MATLAB code I used for the analysis if anyone can run it and wants to do analysis on future data:

Helper function:

function [lower,upper] = findInterval(pop1,pop2,confidence)
mu1 = mean(pop1);
mu2 = mean(pop2);
s1 = std(pop1,1);
s2 = std(pop2,1);
n1 = length(pop1);
n2 = length(pop2);
diff = mu1-mu2;
df = (s1^2/n1 + s2^2/n2)^2/((s1^2/n1)^2/(n1-1)+(s2^2/n2)^2/(n2-1));
tcrit = tinv(1-(1-confidence)/2,df);
s = sqrt(s1^2/n1 + s2^2/n2);
halfrange = tcrit*s;
lower = diff-halfrange;
upper = diff+halfrange;
end

Main script:

%script for calculating balance
%get data from file (would be ez if OP hadn't put quotes in the .csv, BAD!)
fid = fopen('balance.csv');
str = char(fread(fid))';
fclose(fid);

%strip first line
omitFirstLine = '(?<=\n).*';
stripped = str( regexp(str,omitFirstLine):end );
rawdata = textscan(stripped, '%s %s %d', 'delimiter', ' \t\n,"', 'MultipleDelimsAsOne', 1);

%define some constants (not saying protoss #1)
protoss = 1; zerg = 2; terran = 3;
US = 1; EU = 2;

%combine into one big array
col = length(rawdata{3});
data = zeros(col, 3);
data(:,3) = rawdata{3};
for i = 1:col
    if ( rawdata{1}{i}(1) == 'U' )
        data(i,1) = US;
    else
        data(i,1) = EU;
    end
    if ( rawdata{2}{i} == 'z' )
        data(i,2) = zerg;
    elseif ( rawdata{2}{i} == 'p' )
        data(i,2) = protoss;
    else
        data(i,2) = terran;
    end
end

%define filters
tF = data(:,2) == terran;
pF = data(:,2) == protoss;
zF = data(:,2) == zerg;
uF = data(:,1) == US;
eF = data(:,1) == EU;

%construct the 99% confidence intervals based on a two-sided t-test
%zerg vs protoss
confidence = 0.99;
place = eF | uF; %lets you quickly change if US, EU, or both (uF | eF)
[zpLower,zpUpper] = findInterval( data(zF & place,3), data(pF & place,3), confidence );
[ztLower,ztUpper] = findInterval( data(zF & place,3), data(tF & place,3), confidence );
[tpLower,tpUpper] = findInterval( data(tF & place,3), data(pF & place,3), confidence );
[UsEuLower,UsEuUpper] = findInterval( data(uF,3), data(eF,3), confidence );
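For anyone re-running this, a small optional addition (not part of the original script) to print the intervals at the end of the main script, using the variable names defined above:

%print the intervals computed above (optional addition)
fprintf('ZvP %2.0f%% CI: (%.1f, %.1f)\n', confidence*100, zpLower, zpUpper);
fprintf('ZvT %2.0f%% CI: (%.1f, %.1f)\n', confidence*100, ztLower, ztUpper);
fprintf('TvP %2.0f%% CI: (%.1f, %.1f)\n', confidence*100, tpLower, tpUpper);
fprintf('US vs EU %2.0f%% CI: (%.1f, %.1f)\n', confidence*100, UsEuLower, UsEuUpper);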
Nice work, though it might be nice to narrow it down to a 95% CI to get a slightly better measurement I think. I'm too lazy to do it though :D
|
Done:
95% confidence intervals for the EU and US combined: ZvT: (59.5, 110.7) PvT (37.3, 91.2) ZvP (-3.7, 45.5)
US vs EU (28.5, 70.5)
|
On July 13 2012 10:23 Jadoreoov wrote: First off I'd like to point out that the normality of the data doesn't really matter because of the Central Limit Theorem, so please stop discussing that like it matters.
No. No. No. No... More misinformation. Normal distributions are indeed pretty prevalent in the real world, and the central limit theorem is a good rule of thumb, but it's these sorts of assumptions that have lost certain financial entities billions as well.
Take stock price returns: approximately normal, but with a fat left tail. If you used a normal distribution you would severely undervalue the possibility of total disaster and hence under-price risk. Returns are therefore best modeled with a modified distribution to account for the extremes. Or waiting times in a queue, where you have a very long right tail but a distinctly left-weighted distribution (think about it: you have a minimum of 0 but a maximum of infinity, with a peak that sits much closer to the left than the right).
Most of all, we are dealing with an entirely man-made distribution here. If you counted by league only, you'd have 20/20/20/20/20, EVENLY distributed. For MMR, the way the curve is shaped is ENTIRELY determined by the modeling software. If Blizzard wanted to, they could create a distribution of any type. With our data we can only guess the distribution and approximate our statistics under reasonable normality guidelines (after establishing that normality is a possible model).
Hope this makes sense, and I really encourage you to keep this in mind, especially if you ever plan to work on Wall Street in your lifetime.
|
Yes, the MMR cap exists. A floor likely also exists.
Don't get defensive when other community members demand more thorough data or a stronger analysis. Understanding the ladder is a communal effort. lolcanoe and Lysenko bring up salient points that should be addressed in order to produce more concrete hypotheses, even if this means refuting existing hypotheses.
We call the reverse-engineered values (points -> adj.pts -> adj.pts with offsets removed) "MMR" because that's the closest representation of MMR we have. We know that the "actual" hidden MMR factors in an uncertainty value when determining the degree of change after a match, but it's unlikely that will ever be deciphered.
The league and division offsets used by the MMR tool are not exact, but they're somewhat close. Still, this introduces a margin of error. This is probably mitigated by the volume of data, and even the relatively arbitrary values that are calculated can be used when compared to each other for the purposes of gauging race balance, because the margin of error applies universally to each race and matchup.
One thing I want to be very careful about is considering any part of this interpretation as "final" data. Every other person who has posted theories about how the ladder works in the past has fallen into the same trap of interpreting his data incorrectly until it fits his conclusions, so it's important we don't repeat that mistake. The data must remain impartial. The only additional information we have about the ladder comes from Josh himself.
Also a special side note: the ladder isn't 20/20/20/20/18/2 anymore. There were some offset corrections and I don't know the new targeted distribution, but I would say conservatively it's closer to 20/20/20/20/16/4. I don't expect Blizzard to release the new target values.
|
@lolcanoe
The issue wasn't whether the distribution itself was close to normal at all. It can be the most skewed thing in the world. The issue is that the sample size is very large, so the distribution of the SAMPLING MEAN is approximately normal.
In probability theory, the central limit theorem (CLT) states that, given certain conditions, the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed.
The Student's t-test assumes that the distribution of the sampling mean is approximately normal, but makes no assumptions regarding the underlying distribution of the data itself.
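A quick way to see this in MATLAB (a sketch added for illustration, not from the original post): draw many samples from a clearly skewed distribution and look at how their means are distributed.

%sketch: the distribution of SAMPLE MEANS from a skewed distribution is ~normal
n = 2000;                    %sample size, same order as the MMR samples
reps = 5000;                 %number of repeated samplings
means = zeros(reps,1);
for k = 1:reps
    x = -log(rand(n,1));     %exponential draws: heavily right-skewed, true mean 1
    means(k) = mean(x);
end
hist(means, 50);             %bell-shaped around 1, width about 1/sqrt(n)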
|
Oh, it's nice that you guys are redoing what I did back at page 10, but now with more statistics.
- Yes, I think we have enough statistics, and the distribution is well behaved enough so that central limit theorem will give a sufficiently accurate estimate of the statistical error.
- However, it does assume that the samples are uncorrelated. OP, you said that you removed duplicates from the list, but do you think there can be other correlations in the list of samples? You probably know best exactly what is in the list. If there are still correlations, it means that the error should be larger than what you get from a central limit analysis. But it seems like the (small) signal will still be significant, even if the error is increased a bit. Hopefully there shouldn't be large correlations in there?
|
On July 13 2012 12:15 Jadoreoov wrote: @lolcanoe
The issue wasn't whether the distribution itself was close to normal at all. It can be the most skewed thing in the world. The issue is that the sample size is very large, so the distribution of the SAMPLING MEAN is approximately normal.
You should scroll down the page you quoted.
"In a specific type of t-test, these conditions are consequences of the population being studied, and of the way in which the data are sampled. For example, in the t-test comparing the means of two independent samples, the following assumptions should be met: Each of the two populations being compared should follow a normal distribution. This can be tested using a normality test, such as the Shapiro-Wilk or Kolmogorov–Smirnov test, or it can be assessed graphically using a normal quantile plot. If using Student's original definition of the t-test, the two populations being compared should have the same variance (testable using F test, Levene's test, Bartlett's test, or the Brown–Forsythe test; or assessable graphically using a Q-Q plot). If the sample sizes in the two groups being compared are equal, Student's original t-test is highly robust to the presence of unequal variances.[7] Welch's t-test is insensitive to equality of the variances regardless of whether the sample sizes are similar. The data used to carry out the test should be sampled independently from the two populations being compared. This is in general not testable from the data, but if the data are known to be dependently sampled (i.e. if they were sampled in clusters), then the classical t-tests discussed here may give misleading results."
(http://en.wikipedia.org/wiki/Student's_t-test#Assumptions) Keep in mind we are using a two-sample t-test here... you did scroll down right?
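For what it's worth, if the Statistics Toolbox is available, the hand-rolled interval above can be cross-checked against MATLAB's built-in two-sample test with unequal variances (the Welch variant). A sketch for ZvT, reusing the data array and filters from the earlier script:

%cross-check with the built-in Welch-type two-sample t-test (Statistics Toolbox)
[h,p,ci] = ttest2( data(zF & place,3), data(tF & place,3), 0.01, 'both', 'unequal' );
%ci should agree with the findInterval output for ZvT at 99% confidence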
|
On July 11 2012 14:39 Not_That wrote: MMR distribution by races. Click for full version. Amount of players: 2014 Zerg, 1784 Protoss, 1516 Terran. The server does matter, as MMR is not comparable across servers. I've decided to remove KR and SEA and keep EU and NA, as they are closest to each other in terms of MMRs, and that's where most of our data comes from.
On July 11 2012 14:53 Cascade wrote: Cool! Can you do 100 or even 200 granularity to make it easier to read? :o) We are not trying to see any structure smaller than 200 MMR anyway.
On July 11 2012 15:13 Not_That wrote: Here you go: We tried having % of total players on the y axis. The problem with that is that it doesn't carry information about the amount of players. The dots at the edges of the graph look very strange, for example 100% of players above 3200 are Protoss. Obviously it's not very useful. We could snip the edges of the graph, but where? How many players are enough? Are 21 players between 2700 and 2750 enough? etc.
On July 11 2012 15:35 Cascade wrote: Thanks! I mean % of the zerg players in that bin. That is, (number of zergs in that bin)/(number of zergs total). Just like you have plotted now, only divide all zerg entries by the number of zerg players, etc. Now the zerg plot is higher in mid-range, but it is not clear if that is because a larger fraction of zergs have mid-range MMR, or if there are just more zergs.
On July 11 2012 16:05 Not_That wrote: Good thinking. Same graph normalized, each bar representing the percentage of players of each race in the bin:
On July 11 2012 16:15 Cascade wrote: Nice! Now just put the error bars back on that plot, and it's perfect! *leaving*
On July 11 2012 16:27 Not_That wrote: How do I figure out error margins for a graph with granularity? Fixed colors btw.
Sorry, missed this post... The error is sqrt(N) in each bin, before normalisation. Then when you rescale, just scale the error with the same factor. Equivalently, the relative error in each bin is 1/sqrt(N). N is the number of entries in that bin, btw.
That way, when you group up bins, you can expect the error to go down by a factor of 2 if you go from 50 to 200 granularity.
When N gets too low (rule of thumb: it is ok down to N = 20), this error estimate starts becoming a bit shaky, but for a plot like this, it is good enough. Below N = 20 we won't be able to see much anyway I think, so the bin will just say that there are not enough statistics.
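In MATLAB that recipe looks roughly like this (a sketch, reusing the data array and race filters from the earlier script; the bin range is a guess):

%normalised per-race histogram with sqrt(N) error bars
mmr    = data(zF & place,3);           %e.g. zerg MMRs for EU+US
edges  = (0:200:4000)';                %200-MMR granularity, range assumed
counts = histc(mmr, edges);            %N per bin
frac   = counts / sum(counts);         %fraction of the race in each bin
err    = sqrt(counts) / sum(counts);   %sqrt(N) error, scaled by the same factor
errorbar(edges + 100, frac, err, '.'); %plotted at approximate bin centres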
|
On July 13 2012 12:40 lolcanoe wrote:
On July 13 2012 12:15 Jadoreoov wrote: @lolcanoe
The issue wasn't whether the distribution itself was close to normal at all. It can be the most skewed thing in the world. The issue is that the sample size is very large, so the distribution of the SAMPLING MEAN is approximately normal.
You should scroll down the page you quoted. [the t-test assumptions passage from Wikipedia, quoted in full above] (http://en.wikipedia.org/wiki/Student's_t-test#Assumptions) Keep in mind we are using a two-sample t-test here... you did scroll down right?
No need for that tone imo. We are all working together here as far as I know.
Yes, for these probability calculations to be mathematically accurate, you need normal distributions. But according to central limit theorem, the more you sample any distribution, the more it will look like a normal distribution. The better behaved (ie, normal distribution-like) the distribution is, the faster the convergence. So while these errors are not 100% mathematically accurate, with a distribution that is well behaved like this (no strong tails), and with sample sizes of thousands, they are close enough.
|
On July 13 2012 08:21 lolcanoe wrote:Show nested quote +On July 13 2012 08:13 VediVeci wrote: Requiring someone to have a college education is a bit of an ivory tower buddy. I'm not requiring anyone to have anything. My criticisms are objectively based on the analysis and not the source. There is no ivory tower here. I've proven that my methods can be applied in a statistically coherent and easily understandable way, so your accusations that my suggestions are impractical (or "ivory tower") are pretty moot.
I'm not arguing that your methods aren't better; they probably are (I didn't read your post very closely). Your attacks have been pretty consistently derisive, rude, and especially condescending though, in my opinion. And I know it's not a smoking gun, but his results seem pretty consistent with yours, so he didn't do too poorly.
And I'm glad you have such good insight into how the financial crisis happened and can tell us about it. Now that you're on the case we can rest assured it won't happen again!!
And skeldark, when I say you "manipulated" the data, I don't mean you did anything negative, I just mean you performed a series of calculations or "manipulations" on the data.
Edit: clarity
|
All this talk just to deny the simple truth that terran is in rough shape. Sc2 WOL is abandonware to Blizzard now.
User was temp banned for this post.
|
On July 13 2012 13:03 VediVeci wrote:
I'm not arguing that your methods aren't better; they probably are (I didn't read your post very closely). Your attacks have been pretty consistently derisive, rude, and especially condescending though, in my opinion. And I know it's not a smoking gun, but his results seem pretty consistent with yours, so he didn't do too poorly. Edit: clarity
He had at least a 50% chance of getting it right. I'm going to ignore the rest of the post so as to not encourage further irrelevance from posters who self-admittedly don't read things carefully.
On July 13 2012 12:59 Cascade wrote: Yes, for these probability calculations to be mathematically accurate, you need normal distributions. But according to central limit theorem, the more you sample any distribution, the more it will look like a normal distribution. The better behaved (ie, normal distribution-like) the distribution is, the faster the convergence. So while these errors are not 100% mathematically accurate, with a distribution that is well behaved like this (no strong tails), and with sample sizes of thousands, they are close enough. Ok, let's separate the statements clearly so I can explain why your explanation is inaccurate and why his is pretty much entirely misplaced. I understand the confusion here because my high school math teacher needed to be corrected on the same misunderstanding.
Imagine a population with a distribution that is skewed in one way or another (not normally distributed). If you take a sample and increase the sample size n in an orderly fashion, what happens? Eventually your sample size is the entire population, and your sample distribution and population distribution are unsurprisingly identical! So in this one-sample situation, the shape of the distribution is dependent on the population being sampled. If the population is normal, and only if it is, the sampling distribution will become increasingly normal as n grows. This idea is pretty intuitive once you imagine a sample size equal to that of your population (that's exactly what's going on here). This is why a normality test is important!
The central limit theorem specifically relates to the distribution of sampling means and infinite random samples (which isn't exactly what we have here). The distribution of sampling means does NOT equal the sample distributions themselves, as you have incorrectly equated! It refers to the distribution of the AVERAGE values in each sample, and this distribution becomes increasingly normal, not as the number of samples increases but rather as n, the sampling size, increases. In this regard it makes complete sense (with a formal mathematical proof) why the population distribution tends to be irrespective of the distribution of sampling means! Please look into http://www.wadsworth.com/psychology_d/templates/student_resources/workshops/stat_workshp/cnt_lim_therm/cnt_lim_therm_02.html to understand why neither of your posts is accurate and how a completely non-normal distribution can have normally distributed sample means as n increases.
Hopefully, you'll begin to understand how you guys are misapplying CLT!
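If anyone wants to actually run such a normality check on the MMR data, a minimal sketch (assuming the Statistics Toolbox and the data/zF variables from the earlier script):

%normality check on one race's MMR values
x = data(zF,3);
z = (x - mean(x)) / std(x);   %standardise before comparing to N(0,1)
[h,p] = kstest(z);            %Kolmogorov-Smirnov test against the standard normal
qqplot(x);                    %a normal quantile plot is often more informative than p alone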
|
On July 13 2012 13:23 DwindleFlip wrote: All this talk just to deny the simple truth that terran is in rough shape. Sc2 WOL is abandonware to Blizzard now.
User was temp banned for this post. ahaha, ok guys, we are busted. We can stop all this statistics BS now. You know, the one we make up out of thin air as we type, completely baseless. We got called on the bluff, nothing more to say. Was fun while it lasted. No point in trying to pretend that analyzing data is of any use when we have people like DwindleFlip laying down the simple truth like a B40UwwwwzzZZZzz!!!11oneone
|
It's always easier to rip something apart than it is to build something... kinda like what I just did
|
On July 13 2012 13:40 lolcanoe wrote: [quoting VediVeci and the CLT explanation from the posts above]
Ok, let me prove it for you then. My claim is that if the set of samples is large enough, we can use the normal distribution with S/sqrt(N) width to estimate the errors. For simplicity, let me prove that the 2*S/sqrt(N) interval is close to 95%:
Claim: Let the distribution f(x) have an average 0 and standard deviation S. An average X from a sufficiently large (specified in the proof) set of N samples from f(x) will fall within 2*S/sqrt(N) of the average 0 with a probability between 0.93 and 0.97.
Proof: Calculating the average x from N samples (from many different sets, each of N samples) will give a distribution of averages A_N(x) that approaches a normal distribution as N goes to infinity, centred around 0, and with a width of S/sqrt(N). This is the CLT.
Specify "sufficiently large N" such that A_N(x) is similar to a normal distribution g(x) of width S/sqrt(N). Close enough so that the integral from -2*S/sqrt(N) to 2*S/sqrt(N) is between 0.97 and 0.93 (it is close to 0.95 for g). As A_N approaches g as N-->infty, this will happen for some N. The more similar f(x) is to a normal distribution, the lower N is required.
Now take a single average X from f(x), using N samples (this would be the OP). This average is distributed according to A_N(x), and with a sufficiently large N, the probability that X is between -2*S/sqrt(N) and 2*S/sqrt(N) is larger than 0.93, and smaller than 0.97. QED.
Then at what N it reaches "sufficiently large" is a trickier matter. But I am personally convinced (from experience) that with the well behaved distribution of MMR we see, and with thousands of samples, the errors are accurate enough so that the conclusion stands. Ie, that there is a significant signal that the terran MMR is lower than the zerg MMR. Due to the finite (aawwwww ) sample size there is little point in claiming confidence levels of exactly 0.99957353526452, but if this method gives a confidence level of 99.9% I think it is safe to say that you are more than 99% sure. This would also include other errors, such as correlations in the sample (as I was nagging about earlier).
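A quick numerical sanity check of that claim (a sketch, using an exponential distribution as a stand-in for some non-normal f(x) with true mean 1 and S = 1):

%how often does the sample mean land within 2*S/sqrt(N) of the true mean?
N = 2000; reps = 10000; hits = 0;
for k = 1:reps
    x = -log(rand(N,1));                %N exponential(1) draws
    if abs(mean(x) - 1) < 2/sqrt(N)     %2*S/sqrt(N) with true S = 1
        hits = hits + 1;
    end
end
coverage = hits / reps                  %lands near 0.95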
|
Discussion: I think it's time to forget the past and start fresh. Most of us did not behave as we should have (me included). Once we all agree on the main points we can set the personal stuff aside.
On July 13 2012 12:38 Cascade wrote: - However, it does assume that the samples are uncorrelated. OP, you said that you removed duplicates from the list, but do you think there can be other correlations in the list of samples? You probably know best exactly what is in the list. If there are still correlations, it means that the error should be larger than what you get from a central limit analysis. But it seems like the (small) signal will still be significant, even if the error is increased a bit. Hopefully there shouldn't be large correlations in there?
Duplicates: I can 100% guarantee that there are no duplicated accounts.
The profile list is generated backwards (last uploaded game first) and filtered by:
- The MMR of the account is valid
- The race of the player is known
- The player is not a random player
- The account is not already in the list
In fact there is a mistake that makes me exclude data unnecessarily: I forgot that the id is only unique per server, and I only check for id, not for server+id.
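For anyone post-processing the exported list, a hypothetical sketch of deduplicating on server+id rather than id alone (the variable names servers and ids are assumptions, not actual fields of the published file):

%dedupe on the server+id pair instead of id alone (hypothetical field names)
keys = cellfun(@(s,i) sprintf('%s#%d', s, i), servers, num2cell(ids), 'UniformOutput', false);
[~, keep] = unique(keys, 'stable');   %first occurrence of each server+id pair
keep = sort(keep);                    %row indices of the deduplicated list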
Other correlations: The only thing I can think of is that the user's MMR and the opponent's MMR are analysed in totally different ways, and the analyser for the opponent takes the result of the player into account. I can mark which data value is user data and which is opponent data. Also, all opponents of one player are obviously not far away from each other. I can also mark which opponent values were submitted by the same user.
Beside this, the analysis and collection of the MMR is very complicated. I cannot guarantee that I don't have any structural mistakes somewhere that could create correlations, but at the moment I don't see such a factor.
Data: I can add some useful information to the profile list and publish it again. What I'm thinking of is:
- Time the game was played (this is sadly user time, not server time; I should fix this in the long term)
- An id of the user that submitted the data
- An id of the account that is shown
- A mark if the data comes from a user or an opponent
- Main race of the account + the race of the account in the last game it played
Anything else?
High MMR cap: I have some more arguments but it's off-topic and I just woke up. Let us leave this topic for now and perhaps catch up on it later.
Also a special side note: the ladder isn't 20/20/20/20/18/2 anymore. There were some offset corrections and I don't know the new targeted distribution, but I would say conservatively it's closer to 20/20/20/20/16/4. I don't expect Blizzard to release the new target values.
Totally agree with this. The data slowly moves away from normal and they try to correct for it with offsets. However, I have the feeling they decided not to do so anymore because they don't want to create demotion/promotion waves. On the other hand, they could do so at season start and obviously did not with the start of season 8. For example, the Platinum offsets are not equal to the Silver ones, which should be the case if the data were normal. So they have already corrected toward 20/20/... with these offsets.
|
Sure, add all the data you can think of.
I think a more interesting analysis can be made from the list of games though. Although there we will REALLY have to think of the systematics, as each player submits many games, and what if a player that is really good at say PvZ submits 30 games? That is for another thread though.
Do you think it is a problem that the samples are weighted by activity? Ie, if (X level) terrans feel frustrated and play less, they will face your users less often, and be less represented in the statistics (at X level). What we measure is actually not only MMR as a flat average over all players, but an average weighted by their current activity.
Otherwise I'm not sure there is much more I have to say. Doing measurement of single leagues (intervals in MMR) doesn't really make sense, as it would only measure the difference in slope of the distribution for the different races. Also I won't have much access to internet over the weekend.
cheers
|
On July 13 2012 16:23 Cascade wrote:Sure, add all the data you can think off. I think a more interesting analysis can be made from the list of games though. Although there we will REALLY have to think of the systematics, as each player submits many games, and what if a player that is really good at say PvZ submits 30 games? That is for another thread though. Do you think it is a problem that the samples are weighted by activity? Ie, if (X level) terrans feel frustrated and play less, they will face your users less often, and be less represented in the statistics (at X level). What we measure is actually not only MMR as a flat average over all players, but an average weighted by their current activity. cheers
That is true. I already noticed, when I tried to collect division data, that I see the same divisions all the time, because the first players of a new season create them and these are the guys who play all the time. The active userbase is way smaller than the total userbase, and the very small, very active userbase alone creates most of the games. It could become a problem if you make the time interval shorter. But I have a feeling this is again a question of how you define balance: if good players of one race stop playing, is this a balance indicator?
Otherwise I'm not sure there is much more I have to say. Doing measurement of single leagues (intervals in MMR) doesn't really make sense, as it would only measure the difference in slope of the distribution for the different races. Also I won't have much access to internet over the weekend.
But the difference in slope of the distribution for the different races in different MMR intervals is an interesting fact too.
The total gamedata is published in my MMR-Tool thread. I will update it soon with the race data and the game length.
|
On July 13 2012 10:23 Jadoreoov wrote: First off I'd like to point out that the normality of the data doesn't really matter because of the Central Limit Theorem, so please stop discussing that like it matters.
Continuing with lolcanoe's analysis, I found the 99% confidence intervals for the difference in mean for each group.
US and EU: ZvT (51.5, 118.8) PvT (28.9, 99.6) ZvP (-11.1, 53.2)
On July 13 2012 11:20 Jadoreoov wrote: Done:
95% confidence intervals for the EU and US combined: ZvT: (59.5, 110.7) PvT (37.3, 91.2) ZvP (-3.7, 45.5)
US vs EU (28.5, 70.5)
Shouldn't the interval in which the mean can fall become larger as you lower your level of confidence?
|