Blizzard's "skill-adjusted-win-percentages" - Page 6

Lysenko

Iceland 2128 Posts

August 05 2011 12:09 GMT

#101

On August 05 2011 20:29 bmn wrote:
Now you're just muddying the waters by saying that we can't possibly know what they do, so we have no reason to criticize anything they say, because maybe they're doing something brilliant and we'll never know.
That's a cop-out :-)

I'm not saying that it's invalid to criticize anything they say or any change they make.

I am saying that it's not possible, on the outside looking in, to criticize their statistical analysis on technical grounds, because we can't possibly know exactly what they're doing in terms of data collection or analysis. Importantly, though, that's a very small part of how they balance the game (as that video points out.)

In fact, they use statistics that appear out of whack as hints to suggest where to follow up in other areas, looking at player feedback, playing the game themselves as David Kim does, using their own testing tools, etc. Then, they make choices about how to change game rules, if they should think it's warranted, based on the totality of all of that combined with their own personal judgements as game designers and what they think feels right when played.

This isn't chess, with two identical sides and a turn-based system where you can instantly tell who has the sole advantage the game offers. It's a game with three asymmetric race choices and complex mechanics where any change, large or small, may have unintended, unanticipated consequences down the line. Every change to the game is completely subjective -- it has to be. You can't simulate what a change's impact is going to be, because the impact may rely on tactics or strategies that have not yet been thought up. You have to try it and hope, which is one reason that they put balance changes on the PTR.

This is why they look at a 55% win ratio in a matchup and say they're reasonably happy with that. It's not that that isn't statistically significant, it's that you can know nearly absolutely that the balance isn't perfect in a matchup and still have no way to tell in advance that a change you think might help won't make things worse.

And that, by the way, goes double when they see a matchup leaning one way in one region and another way in another region.

I don't see anyone suggesting that the match-making system is designed to balance the races.

There are several comments that allude to this, but here's the one that particularly set me off about it, because it's the best-expressed version of an idea that's simply not true, that striving for a 50% win/loss for players somehow obscures racial differences in the aggregate:

A system that ensures a 50% win rating not only in general, but race to race will hide imbalance by virtue of actively seeking that 50% regardless of skill level. This means two players of identical skill with two different races will both be at 50%, but will have very different MMRs if their respective races are imbalanced against one another.

bmn

886 Posts

August 05 2011 12:23 GMT

#102

On August 05 2011 21:09 Lysenko wrote:

I'm not saying that it's invalid to criticize anything they say or any change they make.

I am saying that it's not possible to criticize their statistical analysis on technical grounds, because we can't possibly know exactly what they're doing in terms of data collection or analysis. Importantly, though, that's a very small part of how they balance the game (as that video points out.)

I think your statement is far too strong. The amount of available data and manpower Blizzard has here is very limited, and the solution space is nowhere near as large in practice as it is in theory, simply because there are only so many proven systems that are sufficiently robust.

But fair enough: If you contend that we can't even criticize the statistical analysis, there is no point in discussing any analysis based on their statistics. You cannot validly argue any further saying "they use statistics that appear out of whack as hints" if you yourself state that we cannot assume anything about the validity of their statistics.

I don't really see the point of the rest of your reply if we start from the assumption that we cannot know what their statistics do and, based on that, cannot argue technically about what they might be measuring or not.

This is why they look at a 55% win ratio in a matchup and say they're reasonably happy with that. It's not that that isn't statistically significant, it's that you can know nearly absolutely that the balance isn't perfect in a matchup and still have no way to tell in advance that a change you think might help won't make things worse.

Since we already started out by saying that we cannot know what their statistics actually mean, trying to defend their decision-making is entirely baseless here too. Yeah, they probably have an idea what they're doing -- but that's an appeal to their authority, there's nothing you or I could actually discuss meaningfully about this.

I know this sounds harsh, but starting with the assumption that we're talking about an entirely unknown black-box system (without fully knowing either input or output of it, let alone the inner workings) leaves the only conclusion that there is no merit in technically discussing how good or bad that system is.
I would not have said that we can't know anything about the match-making system, but that's a personal choice based on how confident you are about knowing what general approaches they might use to create such a ladder system.

I didn't interpret that comment the way you did. I took it as saying that the match-making system will hide imbalance at the player level by matching you up with "weaker" players who still have an even chance of winning, and that observation is entirely correct (if trivial).

If you take it as saying that it will hide imbalance in that Blizzard cannot detect this skew, then it is wrong.

But I don't think this implies that exposing the balance of races was the goal of the match-making system. As a player, the match-making system is all we see, so that's the only way we can judge the racial balance from ladder play.

IronDoc

United Kingdom 27 Posts

August 05 2011 12:24 GMT

#103

On August 05 2011 11:12 carwashguy wrote:
It may be instructive to consider how, in chess, white pieces tend to have an advantage over black pieces. Among weaker players, there's not much advantage. At higher levels, it starts to become a factor. Among top rated computers, white scores about 55%. However, in chess, players take turns playing white and black. It seems to me that the best way to do it for Starcraft would be to look at top-rated random race players. If there is a tendency for them to win with certain races, then I believe this would reveal something meaningful. To put it briefly, the best random race players should be immune to MMR's effect on their race's winnability. An obvious problem is that the random matchups are not totally analogous to the standard matchups, since the non-random player doesn't know his opponent's race beforehand.

This seems like a good point that's been glossed over a bit. Random players' win percentages should be free of any effect of race on a player's MMR. I'd be surprised if this wasn't used as a pretty significant indicator of balance for Blizzard. Actually, working through it in my head, it may be the case that you would need to only take RvR matchups, since only then will the opponent's MMR be free of influence from race.

I can see 2 main problems with this.
Firstly, it still doesn't address the issue of any systematic bias for or against playing random. The sc2ranks data shows that it's much less common in Master league than any of the lower 5.
Secondly, Terran and Protoss arguably share more similarities than either race does with Zerg. This might mean that skill is more transferable between the 2 and thus a random player's win rates for each race are not independent.
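
As a rough sketch of the RvR idea above: the function below keeps only games in which both players picked Random and tabulates a win rate per non-mirror matchup. The record layout (`winner_race`, `loser_race`, `winner_random`, `loser_random`) is hypothetical, since nothing in this thread says how Blizzard stores game results, and it does nothing about the two problems just listed.

```python
from collections import defaultdict

def rvr_matchup_win_rates(games):
    """Win rate of the alphabetically first race in each matchup,
    using only games in which BOTH players picked Random.

    `games` is an iterable of dicts with hypothetical keys:
    winner_race, loser_race, winner_random, loser_random.
    """
    wins = defaultdict(int)    # (race_a, race_b) -> wins by race_a
    totals = defaultdict(int)  # (race_a, race_b) -> games played
    for g in games:
        if not (g["winner_random"] and g["loser_random"]):
            continue  # skip games with a non-random player
        w, l = g["winner_race"], g["loser_race"]
        if w == l:
            continue  # mirror matchups say nothing about balance
        key = tuple(sorted((w, l)))  # canonical order, e.g. ("P", "Z")
        totals[key] += 1
        if w == key[0]:
            wins[key] += 1
    return {k: wins[k] / totals[k] for k in totals}
```

A result such as `{("P", "Z"): 0.53}` would read as "Protoss won 53% of random-vs-random PvZ games", which is only a hint, not proof, of imbalance.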

Sbrubbles

Brazil 5776 Posts

August 05 2011 12:27 GMT

#104

On August 03 2011 14:09 whacks wrote:
Thanks for the responses all.

A lot of people have responded with some variant of "If Blizzard sees all the Zerg players have significantly lower MMR, they'll know there's something wrong."
If Blizzard is doing this, then what they're basically doing is comparing the average zerg player's MMR with the average Terran player's MMR. This approach can break for so many reasons, which I'm not going to get into now.

It can break for so many reasons, but comparing average MMR and league placements is the only way the adjusted ELO system allows us to account for balance.

For example: if Protoss is 20% of the player base and the masters and grandmasters leagues have less than 20% of Protosses while diamond and lower have more than 20%, that's a sign (just a sign) of race (or map) imbalance.

There may be other explanations to this that don't involve race/map imbalance, but still, comparing average MMR is the best we got.
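
A minimal sketch of that league-share comparison, using made-up counts (only the 20% Protoss share comes from the example above): each race's share of a league is divided by its share of the whole player base, so a ratio below 1 means the race is under-represented there.

```python
def representation_ratio(league_counts, overall_share):
    """Share of each race in one league divided by its overall share."""
    total = sum(league_counts.values())
    return {race: (n / total) / overall_share[race]
            for race, n in league_counts.items()}

# Made-up numbers for illustration only.
overall = {"P": 0.20, "T": 0.45, "Z": 0.35}   # share of all ladder players
masters = {"P": 150, "T": 500, "Z": 350}      # player counts in one league
ratios = representation_ratio(masters, overall)
```

Here Protoss is 15% of the hypothetical Master league against 20% overall, so its ratio comes out below 1. As noted above, that is a sign and not a proof.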

ChickaChuckWally

Australia 85 Posts

August 05 2011 12:33 GMT

#105

lol at the guy talking about the muta at the end.

Fungal Growth

United States 434 Posts

August 05 2011 13:18 GMT

#106

Nice posts by bmn.

ChickaChuckWally... The kid was obviously trying to ask whether the Thor was supposed to be an answer to the mutalisk and, if it wasn't, whether that makes the mutalisk a balance problem (a good implied question, as it does force Terran to be one-dimensional in going mass marine, and Blizzard had no idea the magic box would be so effective). What's even funnier is that it appears David Kim wasn't even paying attention, because he then talked about marauders in his answer.

In fact, in that interview Blizzard had a number of interesting things to say... They strongly defended the marauder as a needed answer to zerg. They also felt the marauder wasn't even that great of a unit, and that the benefits marauders got from stimming frequently didn't outweigh the damage done. Oh boy...

Lysenko

Iceland 2128 Posts

August 05 2011 17:58 GMT

#107

On August 05 2011 21:23 bmn wrote:
I don't really see the point of the rest of your reply if we start from the assumption that we cannot know what their statistics do and, based on that, cannot argue technically about what they might be measuring or not.

You're going way too far with this. We can speculate all we want about what they might be measuring based on what they've said, and that might be an interesting discussion, but it would be a mistake to turn around and say they're idiots because they're doing some unjustifiable thing or another that we've imagined they might be doing. Also, the question of how a statistical analysis of any kind fits into their larger decision-making is perfectly reasonable to discuss even if we don't know the details.

The only thing we can't do is break down the exact mathematics of their statistical analysis and say it's valid or invalid for this or that reason, and that's what this thread appears to be trying to do.

Blizzard's operation may be small, but they absolutely have one or a couple experienced statisticians on their Battle.net team who are fully capable of performing some kind of reasonable analysis. How useful those results are in a larger sense may be difficult to say, but that's probably not the statisticians' fault. Furthermore, last I checked they didn't need our approval before balancing their game however they saw fit, so I don't see why our opinion on their statistical approach matters beyond entertaining ourselves with speculation, or alternatively to entertain ourselves by complaining just to complain.

My point is this: We can criticize the specific changes to the game based on the difference between the impact you think the change will have vs. the impact they think it will have. We can say we'd rather they have a greater focus on whatever race's issues / whatever league's issues we happen to be in. We can even slice and dice their offhand comments about this or that unit or whatever and call them ill-considered.

What we can't do is assess the accuracy of their specific statistical analysis or the mathematical underpinnings of their matching system beyond what limited information they choose to share with us. That limited info is not nearly enough to say they're doing it wrongly.

Lysenko

Iceland 2128 Posts

August 05 2011 18:04 GMT

#108

Regarding using random players as a control group, that's an interesting idea. I can think of a third problem, though:

On August 05 2011 21:24 IronDoc wrote:
I can see 2 main problems with this.
Firstly, it still doesn't address the issue of any systematic bias for or against playing random. The sc2ranks data shows that it's much less common in Master league than any of the lower 5.
Secondly, Terran and Protoss arguably share more similarities than either race does with Zerg. This might mean that skill is more transferable between the 2 and thus a random player's win rates for each race are not independent.

A third issue is that it's simply not possible for a random player to practice any one race with the depth that players who prefer one race can devote to theirs. This means that they're likely to have deficient and maybe early-game-centric play with all three races, and that may eliminate their value as a control.

dreamsmasher

816 Posts

August 05 2011 18:33 GMT

#109

On August 05 2011 15:01 kckkryptonite wrote:

No, you don't. The level of ignorance in your post is astonishing, after the first year of Calc, you are able to solve basic differential equations (in my curriculum at least). Pouring money into something must mean it's the best right (US HEALTHCARE/EDUCATION)? To top it off you cite wikipedia.

Really? WTF. There are derivatives and integrals, fractions and exponents. Seriously? 1%? Where did you get this number? How are you coming to your various conclusions?

Actually, the variables are assumed to be defined as the set of all real numbers.

W/e guys, I'm not gonna get into a math debate with people who know how to use google and wikipedia.

i've taken more than my fair share of math in college and I have no idea what that formula exactly means other than the fact that there are some normal distributions involved in the calculation, and it is used to solve some sort of conditional probability problem.

there is no way you can understand what that formula means with only one year of calculus.

i'm pretty sure differential equations aren't even involved in that formula

Wren

United States 745 Posts

August 06 2011 03:14 GMT

#110

On August 05 2011 11:55 Lysenko wrote:

The problem is that everyone who's been arguing that the data set is "flawed" somehow have been saying so without any reasoning or explanation behind it, other than to completely misunderstand or misrepresent the impact of the matchmaking system on the data set.

Nobody in this thread knows what their data set is or exactly how they're analyzing it, so all the criticism of it is fantasy based on imagined details to fill in the blanks.

They've said that their data set is every ladder game, repeatedly. My understanding of the match-making and MMR system is that it's essentially the same as every other computerized ranking system: who you played, if you won, how much the other guy wins. All Blizzard tracks is who beat who on which map.

This data is flawed because it cannot tell (just an example) if Terran is the best because the best people play Terran or if Terran is the best because the balance is skewed. All it can tell you is that Terran wins x% of their games.

Apply MMR to GSL open 3 and it will tell you that Rain was a better player than IMMvp, because Rain cheesed to the finals while Mvp was cheesed out in Ro16.

disclaimer: I'm not a math expert, just trying to understand things like everyone else. If I've made a mistake, please correct it.

----------------------------
Ok, Lysenko, I've read this thread fairly carefully, and have a question to pose to you.

On August 04 2011 10:55 Lysenko wrote:
The way you adjust for skill is to look at overall MMR distribution among each race's population. If one race, let's say Zerg, has a population distribution that's weighted toward lower MMRs, chances are it's the race that's doing it unless there's some external indication that better players systematically favor the other races for some reason.

Is this the only worthwhile balance-related statistic we can get from the ladder?

If so, maybe the OP claim is correct, and even adjusted win-rates aren't very useful.

dreamsmasher

816 Posts

August 06 2011 03:27 GMT

#111

On August 06 2011 12:14 Wren wrote:

Is this the only worthwhile balance-related statistic we can get from the ladder?

If so, maybe the OP claim is correct, and even adjusted win-rates aren't very useful.

statistics are about averages, things occurring in the long run. you can't really do that to GSL open 3 since it is an extremely small sample size.

if you watch the video, statistics are there to see if there is any statistical evidence (significance) for racial imbalance across leagues. they address a lot of factors beyond statistics (such as his comment about TvP: they don't want a game of defend and win, even if that led to a 50% winrate matchup average even at top leagues). they also stated that they were careful with balance changes due to the qualitatively different nature of korean ladder.

mathematics only gives positive analysis. for example, they cited statistical evidence suggesting that P was too strong against T; however, it's possible to make a plethora of changes to 'balance' the game. some 'balances' might not be fun, some might; statistics gives you no idea of those types of things. this combined with their insight that 4G was too strong in a myriad of situations and their desire to change PVP is what led to the WG nerf. it's important to balance around both aspects -- if you have statistically significant data across all leagues saying that one race is dominant against another, then it is an important issue to address because there *shouldn't* be huge skill disparity between players of each race.

for example if they found statistical evidence that P was too strong against Z they could just systematically lower the dps of the most popular P unit (the stalker) until win rates adjusted to ~50% even at the highest end, but that wouldn't exactly be what i call good game design.

Lysenko

Iceland 2128 Posts

August 06 2011 06:03 GMT

#112

On August 06 2011 12:14 Wren wrote:
All Blizzard tracks is who beat who on which map.

That's what the matchmaking system uses. I guarantee you that they store more info than that for each game -- for example, you can go back in someone's game history and look at their build orders. They store information on resources collected over time and units produced over time. Do they use any of that additional information in their analysis? Do they filter their matchmaking data in a way that provides greater insight than just looking at the whole population? You don't know the answer to those questions, and neither do I, so it makes no sense to criticize their quantitative analysis.

Ultimately, how that information gets fed back into changes to the game is a fuzzy process anyway.

paralleluniverse

4065 Posts

August 06 2011 07:56 GMT

#113

On August 03 2011 13:07 whacks wrote:
Disclaimer: I’m not concerned about game balance at all. I’m hoping to have a discussion on the math & statistics behind Blizzard's adjusted-win-percentage that they rely on heavily.

Late last year, Blizzard released a bunch of ladder statistics on “skill-adjusted-win-percentages” for the different matchups. The reason I have it in quotes, is because they never really explained how they did the skill-adjustments. I’ve always been skeptical about whether such a “skill-adjustment” is really possible.

Well recently, I found the following video where Blizzard partially explains how they calculate the “skill-adjusted-win-percentages.” Watch the first 5 minutes of the following video:

Gist of what they said: Raw league matchup numbers aren’t very meaningful because of the matchmaking system’s ability to match up players with equally challenging opponents. The math guy mentions specifically: Not only does the system put players in 50-50 matches, it also tries to keep the race matchups at 50-50 as well. Because of this, we have to adjust for player skill to calculate the true matchup win rates. Example: a ZvP match is about to be played. The Zerg player’s rating (odds of winning) relative to the Protoss player is 55-45. The Zerg race’s rating relative to the Protoss race is 53-47. If the Protoss player ends up winning, the player ratings will then converge to 51-49. The race ratings will also converge to 52-48.

Their explanation just didn’t click with me. Rating systems such as ELO are great when you’re dealing with a single unknown (relative player strength). But can they really work if you’re trying to differentiate between 2 unknowns? Both relative player skill & race balance? I constructed the following scenario which seems to suggest that this is impossible.

It’s important to first establish the following: Any good rating system, including ELO & the point system, relies on the following principle:
• Give each agent (could be a player, or a race) a certain rating as an estimate for how strong the agent is
• If 2 agents play and one wins at a higher percentage, the more successful agent should eventually end up with a higher rating
• If a higher rated agent & a lower rated agent play against each other, and each wins with an equal percentage, the 2 ratings should eventually converge
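
For reference, the three principles above are what the textbook Elo update does. The sketch below is the standard Elo formula with an assumed step size `k=32`; Blizzard has not published its MMR update, so this should not be read as their implementation.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, a_won, k=32.0):
    """Return both ratings after one game; k is the step size."""
    e_a = expected_score(rating_a, rating_b)
    delta = k * ((1.0 if a_won else 0.0) - e_a)
    return rating_a + delta, rating_b - delta
```

If a higher-rated agent only wins half its games, its expected score is above 0.5, so on average `delta` is negative and the two ratings drift together, which is the third bullet.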

The ELO system that Blizzard uses for MMR is an optimized algorithm that allows ratings to stabilize much quicker, but other rating systems that utilize the above principle (including the point-system), can achieve the same results in the long run.

Now going back to the scenario, consider the case where Blizzard releases a new patch which nerfs Zerg and makes it UP relative to both Protoss & Terran (eg, drones now cost 60 min). Consider what will happen to the average Zerg player. He will start losing more than 50% of his games, and his MMR will start dropping. Because of his lower MMR, he’ll start playing against weaker opponents. Eventually, his MMR will stabilize at a level where he starts winning 50% of his future games.

Now let’s say Blizzard had assigned each race a rating as well, to track how “strong they think it is.” Suppose that before the patch, all the races were balanced & had equal rating. Immediately after the patch, because the Zerg population goes through a losing streak, the Zerg rating will drop.

But eventually, the Zerg players will have stabilized their MMR and start winning 50% of their games. At this point, because of the last bullet point in the rating system’s principles (ratings will converge at 50% win rates), the Zerg rating will start increasing again. Remember also that the stabilized Zerg players are playing against opponents of the same MMR, so there’s no way to “account for player skill.” Eventually, the zerg rating will once again converge with the other races, even though Zerg is now UP.

Based on this scenario, it seems impossible to determine whether a race is truly UP, using Blizzard’s rating system. Thoughts? Any ideas on how Blizzard could possibly be “accounting for player skill” in calculating race balance?
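
The scenario can be sketched numerically. Everything here is an assumption made for illustration: an Elo-style win curve, a nerf worth a flat 100 rating points, opponents who are correctly rated and always matched at the player's current MMR, and a deterministic update that uses the expected result instead of a random one.

```python
def win_prob(strength_a, strength_b):
    """Elo-style probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((strength_b - strength_a) / 400.0))

def simulate_nerf(true_skill=1500.0, penalty=100.0, games=2000, k=16.0):
    """Track one Zerg player's MMR after a nerf worth `penalty` points."""
    mmr = true_skill  # correctly rated before the patch
    history = []
    for _ in range(games):
        # Real strength is skill minus the nerf; opponent sits at our MMR.
        p_win = win_prob(true_skill - penalty, mmr)
        history.append((mmr, p_win))
        mmr += k * (p_win - 0.5)  # the matchmaker predicted a 50% game
    return history

history = simulate_nerf()
first_mmr, first_p = history[0]
last_mmr, last_p = history[-1]
```

In this toy run the win rate starts well below 50% and climbs back to it, while the MMR settles about `penalty` points lower. The win rate alone hides the nerf; the drop in MMR is where it shows up.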

I'll do my best to explain this.

Blizzard uses Bayesian Inference. This is usually taught as a 3rd year or honors year statistics course at most universities. I only say this to impress upon you that this is not simple stuff.

The formula that is shown in the Youtube video is this:

Firstly, notice that the fraction in the formula looks the same as the formula here: http://en.wikipedia.org/wiki/Posterior_probability#Calculation
The fraction is a posterior probability.

Now notice that this is multiplied by a function and then integrated. This gives a Bayesian estimator, it looks the same as the formula shown here:
http://en.wikipedia.org/wiki/Bayesian_estimator#Definition

So, the whole formula is for the Bayesian estimator where the posterior probability is the product of 3 normal distributions (the 3 MMR variables), multiplied over all g (probably stands for games, i.e. takes into account all games played).

Now what does this mean?

What a Bayesian estimator does is it estimates a parameter (in this case the probability of winning) given the evidence (in this case the skill of the player).

Essentially, they have a prior belief about the probability of winning (very likely the simple unadjusted win ratio), this probability is updated by the skill of the player over all games, forming a posterior distribution, and then using this, the probability of winning given the skill of the player is calculated with a Bayesian estimator.
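
For readers who don't want to follow the links, the generic textbook form of the two pieces looks like this. The symbols are my own choice (θ for the unknown quantity, x for the observed data) and this is the general definition, not a transcription of Blizzard's formula, which isn't reproduced in this thread.

```latex
% Posterior: prior times likelihood, normalised so it integrates to 1
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}
                        {\int p(x \mid \theta')\, p(\theta')\, d\theta'}

% Bayes estimator under squared-error loss: the posterior mean
\hat{\theta}(x) = \mathbb{E}[\theta \mid x]
                = \int \theta \, p(\theta \mid x)\, d\theta
```

The "fraction" is the first line, and "multiplied by a function and then integrated" is the second.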

What isn't clear is what each variable stands for, so we don't know if they take into account the map or game length or other variables. Although from the talk, the impression is that only skill (i.e. MMR) is taken into account to adjust the probabilities of winning.

paralleluniverse

4065 Posts

August 06 2011 08:05 GMT

#114

On August 04 2011 10:32 bamman1108 wrote:
I like that part where they're satisfied with 5% differences in W/L when that percent is based off millions of matches. Even a 1% difference with that many matches means that one race very, very significantly favors the other. Wtf are they talking about when a 55% win rate for a specific race matchup is just "borderline?"

Given a sufficiently large sample size, it's possible to make a 0.00001% difference statistically significant, because a 0.00001% difference is a nonzero difference.

But statistical significance doesn't imply an appreciable difference in the everyday sense of "significant".

The following example from Wikipedia (http://en.wikipedia.org/wiki/Statistical_significance) explains this concept well:
As used in statistics, significant does not mean important or meaningful, as it does in everyday speech. For example, a study that included tens of thousands of participants might be able to say with great confidence that residents of one city were more intelligent than residents of another city by 1/20 of an IQ point. This result would be statistically significant, but the difference is small enough to be utterly unimportant.
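
To put numbers on that, here is a small sketch of a standard normal-approximation test of a win rate against 50%. The game counts are invented, and the test assumes games are independent, which ladder games only roughly are.

```python
import math

def two_sided_p_value(wins, games, null_rate=0.5):
    """Normal-approximation z-test for a win rate against null_rate."""
    observed = wins / games
    std_err = math.sqrt(null_rate * (1.0 - null_rate) / games)
    z = (observed - null_rate) / std_err
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail area
    return z, p

z_small, p_small = two_sided_p_value(505, 1_000)          # 50.5% of 1,000
z_large, p_large = two_sided_p_value(505_000, 1_000_000)  # 50.5% of 1,000,000
```

The same 50.5% win rate is nowhere near significant over a thousand games and overwhelmingly significant over a million, yet in both cases it is still only half a percentage point away from even.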

whacks

25 Posts

August 29 2011 01:20 GMT

#115

I just got back from vacation, so forgive me for resurrecting this thread so late

Paralleluniverse, thanks for taking the time to clarify. It sounds like that formula is conceptually pretty similar to ELO, possibly taking into account multiple factors other than player skill, such as racial "scores." This is exactly what I suspected in my OP.

Lysenko, you mention over & over again that the ladder data can yield useful balance information by letting us compare average-MMR difference across the races. I agree completely on this. However, this is NOT what Blizzard is doing. How do I know this?

1) There are actually VERY significant MMR differences between the races. Terran is skewed very heavily towards Bronze, and Zerg is skewed very heavily towards Plat/Diamond/Master. Blizzard's numbers paint a very rosy picture, but if you compared average MMRs, you'd see very wide differences.

2) Calculating average MMRs is 7th grade math. You sum up all the MMRs of each player, and divide by number of players. You definitely won't need any complicated math like what they announced.
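
For concreteness, the average-MMR comparison really is just this. The `(race, mmr)` pairs are invented, since actual MMR values are not public.

```python
from collections import defaultdict

def mean_mmr_by_race(players):
    """players: iterable of (race, mmr) pairs; returns race -> mean MMR."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for race, mmr in players:
        sums[race] += mmr
        counts[race] += 1
    return {race: sums[race] / counts[race] for race in sums}

# Invented sample, for illustration only.
sample = [("T", 1400), ("T", 1600), ("Z", 1700), ("Z", 1900), ("P", 1500)]
means = mean_mmr_by_race(sample)
```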

Clearly, when Blizzard presented to us the ladder data, they weren't basing it off average-MMR. You haven't presented any other methods they could be using that work in our ladder system, so you might actually be in agreement with the point I'm trying to make in my OP, and with what others like lhpares & bmn have been saying.

Again, if you have "blind faith" in Blizzard's abilities... I respect that, but that's not what this thread is about.

darmousseh

United States 3437 Posts

September 29 2011 22:48 GMT

#116

Bumping this because I found a good article on a bayesian approximation method for online ranking. I have started deciphering the variables in the equation.

http://jmlr.csail.mit.edu/papers/volume12/weng11a/weng11a.pdf

So far it appears that they are analyzing the games as if the race the person has chosen is considered an additional player in the match. I'm assuming that the different sigmas represent the sigmas of the 3 races (meaning each race has an MMR). Normally I would just write this off as "impossible to calculate", but since they are using the sigmas themselves to calculate the values, it seems a lot more reasonable, as it doesn't matter what type of matchmaking system is being used. Like the above post, this is the posterior probability function.

I will provide an update once I figure out all of the variables. I'm mostly having trouble on that Psi.

Edit: In games where ties are not allowed or very infrequent, gamma is typically used for score variance. It's possible that the score at the end of the game is being used to calculate it.

Edit 2: After consideration, the 3 sigmas being used in the equation are "player skill", "matchup skill", and "overall skill". For example, a player might have an MMR of 2000/100, Protoss (vs Zerg) an MMR of 1500/50, and Protoss (vs all) an MMR of 1600/75. That's the conclusion I am coming to so far. I will attempt to verify this hypothesis.
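
To make the hypothesis concrete, here is one way three mean/sigma pairs could be combined. This is purely an illustrative assumption: it treats the three terms as independent Gaussians that add up to a side's strength, which is neither the paper's method nor anything Blizzard has confirmed. The Protoss-side numbers are the ones from the edit above; the Zerg-side numbers are invented.

```python
import math

def combined(*components):
    """Sum independent Gaussian (mean, sigma) pairs into one (mean, sigma)."""
    mean = sum(m for m, _ in components)
    var = sum(s * s for _, s in components)
    return mean, math.sqrt(var)

def p_first_wins(a, b):
    """P(A's draw exceeds B's) for independent Gaussians a and b."""
    diff_mean = a[0] - b[0]
    diff_sigma = math.sqrt(a[1] ** 2 + b[1] ** 2)
    return 0.5 * (1.0 + math.erf(diff_mean / (diff_sigma * math.sqrt(2.0))))

# player skill, matchup skill, overall race skill as (mean, sigma)
protoss_side = combined((2000, 100), (1500, 50), (1600, 75))
zerg_side = combined((2000, 100), (1450, 50), (1600, 75))  # invented values
p_protoss = p_first_wins(protoss_side, zerg_side)
```

With equally rated players, the only difference between the sides is the 50-point matchup term, so `p_protoss` lands a little above 0.5.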

CluEleSs_UK

United Kingdom 583 Posts

September 29 2011 22:55 GMT

#117

But surely this doesn't work out? Each race will have a different win percentage at each league. Zerg for instance has a high win percentage in lower leagues, because bronzies can't deal with ling runbys, but at high levels where this isn't as viable, the Zerg win rate is far lower.

Warble

137 Posts

September 30 2011 02:51 GMT

#118

Perhaps I am wrong about this, but it seems like Blizzard's approach to balancing is to assume that each race has approximately the same skill distribution.

This certainly simplifies the task.

And, personally, I think it is probably the most practical way to approach this matter. The more readily pros switch to races they consider overpowered, the more likely this approach is to yield an outcome close to objective balance.

We could argue that switching races is quite difficult, which means there will be more imbalance.

It's hard to see any other way to simplify the task, which we have already established is intractable without any simplifications.

whatthefat

United States 918 Posts

September 30 2011 03:08 GMT

#119

On August 03 2011 13:07 whacks wrote:
Their explanation just didn’t click with me. Rating systems such as ELO are great when you’re dealing with a single unknown (relative player strength). But can they really work if you’re trying to differentiate between 2 unknowns? Both relative player skill & race balance? I constructed the following scenario which seems to suggest that this is impossible.

This has come up a few times, and yes you're right, it is impossible. It's possible that on average players of one race are actually better players than those of another. Based just on game results, there is absolutely no way of distinguishing that from the race being overpowered. Somewhere along the line you have to make an assumption, and I think the assumption they have used is that the player pool for each race is equally "skilled" (another problem is that there's no formal definition of skill), and any further discrepancies in win/loss rates (once matchmaking is accounted for) are due to imbalances in the game. Is it a reasonable assumption? Maybe.
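
A tiny sketch of why results alone can't separate the two explanations. The model is assumed for illustration: an Elo-style curve where a side's effective strength is player skill plus a race bonus.

```python
def win_prob(strength_a, strength_b):
    """Elo-style probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((strength_b - strength_a) / 400.0))

def matchup_prob(skill_a, race_bonus_a, skill_b, race_bonus_b):
    """Model in which effective strength = player skill + race bonus."""
    return win_prob(skill_a + race_bonus_a, skill_b + race_bonus_b)

# World 1: races balanced, the Zerg player is simply 100 points better.
balanced = matchup_prob(1600, 0, 1500, 0)
# World 2: players equally skilled, Zerg is worth 100 extra points.
imbalanced = matchup_prob(1500, 100, 1500, 0)
```

Both worlds give exactly the same win probability, so no amount of game results can tell them apart without an extra assumption such as "each race's player pool is equally skilled".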

FieryBalrog

United States 1381 Posts

September 30 2011 06:37 GMT

#120

Very interesting thread to read, particularly Lysenko's posts.
