Statistical Analysis of StarCraft 2 Balance - Page 7
LaughingTulkas
United States 1107 Posts
Thanks for your contributions to the body of knowledge; you've set the bar, and further work will need to build on and surpass what you've done if it wants to be taken seriously.
contraSol
United States 185 Posts
You might have answered this already, but what program did you use for the analysis?
KermitTheFro
United States 25 Posts
A couple of suggestions -- it would be helpful for clarity if you explained the details of your model at some point in the paper. From what I gathered, you had a binary indicator variable for each Player and a ternary indicator variable showing the Map/Player1/Player2 combination. If you could write this out formally, it would greatly simplify your description of the logit equation you are using to calculate the odds of winning.

Second, I would be very curious to see how many data points (games) you had for each Player1/Player2/Map combination in your data set. You mention the obvious concern that you don't have enough data for some games to be included, but you say you fixed these players' skill parameters to 0...won't this just skew your data? For instance, if MC had relatively few data points, you would set him to zero, which would artificially make PvX look a lot better, since his wins are forced to be explained by the PvX regression term increasing.

Finally, what is your reasoning behind using L1 regularization? Since it applies equally across all parameters, you are forcing all your regression terms toward zero. This will be effective in making the regression terms for players who have only a couple of games in the data go to zero, since they can't (by definition) have a large effect on the final accuracy of your regression, but the resulting effect on all of your other parameters seems unintended and hard to justify. In reality, it seems that you expect very few of your regression terms to be zero a priori.

Like I said...very cool, and props to you for writing this all up. It would be very cool to make this model slightly more complicated (this could easily be done just by factoring some basic time-series information into the racial balance) and see if you can capture meaningful shifts in the metagame.
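To make the L1 point concrete, here is a minimal sketch of an L1-penalised logit on toy data. Everything in it is an assumption for illustration: the ±1 player encoding, the single map/matchup column, and the use of scikit-learn are guesses at the general shape of such a model, not Stra's actual data or code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_games, n_players = 300, 40

# One column per player: +1 if they are player 1 in that game, -1 if player 2.
X_players = np.zeros((n_games, n_players))
for i in range(n_games):
    p1, p2 = rng.choice(n_players, size=2, replace=False)
    X_players[i, p1], X_players[i, p2] = 1.0, -1.0

# A single stand-in column for a map/matchup imbalance term.
X_map = rng.choice([1.0, -1.0], size=(n_games, 1))
X = np.hstack([X_players, X_map])

# Generate outcomes where only a handful of players are genuinely strong.
true_skill = np.zeros(n_players)
true_skill[:5] = 1.5
p_win = 1.0 / (1.0 + np.exp(-(X_players @ true_skill)))
y = (rng.random(n_games) < p_win).astype(int)

# L1-penalised logit: smaller C means stronger shrinkage towards zero.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5, fit_intercept=False)
model.fit(X, y)
print("coefficients shrunk exactly to zero:", int(np.sum(model.coef_ == 0)))
```

With players who appear in only a few games, the penalty tends to zero out exactly the coefficients described above, while also shrinking every other term, which is the side effect being questioned.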
Warble
137 Posts
He wouldn't have used a ternary variable. He would have had separate binary variables for each map and matchup - one for XNC(TvZ), one for XNC(TvP), one for XNC(PvZ), and so on for each map.

I was reluctant to post technical comments earlier because it's so easy to destroy and so hard to create, and I think this sort of thing is a good step up from what we've been seeing. I really saw that you put a lot of work into this. Since more people are interested in this now and the threads have been cross-referenced, I think I'll write up some proper feedback.

In the other thread I mostly just talked about the lack of data you showed us - no ANOVA tables, for instance. Particularly because others like Shai have expressed interest in the data and we may see more people do tests, it would be good to improve things a bit before people put a lot of effort into analysing poor models.

I've finished reading your article, footnotes, and this thread. I don't think the references require checking, so I'll start writing it up when I have some spare time. I'll probably have it ready in a fortnight.

In the meantime I think it's important to emphasise a point before others get the wrong idea: your model concludes that there is no evidence for racial imbalance. I got the feeling when reading it that you were trying to show that the game was actually imbalanced, since all the steps you took were geared in that direction and you didn't talk about doing other checks against concluding that there is imbalance.

I think it's an excellent start and certainly on the right track. We'll need a different model, though. More on that in 2 weeks.
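To make that encoding concrete, a rough pandas sketch follows. The tiny table and its column names are invented for illustration and are not the GSL data set.

```python
import pandas as pd

# A hypothetical game table; the rows and column names are invented.
games = pd.DataFrame({
    "map":     ["Xel'Naga Caverns", "Xel'Naga Caverns", "Metalopolis"],
    "matchup": ["TvZ", "TvP", "PvZ"],
})

# One indicator column per map-matchup pair, e.g. "Xel'Naga Caverns_TvZ".
games["map_matchup"] = games["map"] + "_" + games["matchup"]
dummies = pd.get_dummies(games["map_matchup"])
print(dummies)
```

Each game gets a 1 in exactly one map-matchup column, and those columns would sit alongside the player-skill indicators in the design matrix.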
starcraft911
Korea (South) 1263 Posts
On May 05 2011 09:52 awesomoecalypse wrote:
Very interesting that everyone cries about Protoss being too strong, yet not one statistical analysis backs it up in any way. Thanks very much for posting this, and welcome to TL.

That's because the sample regression is based on a pool that you're not part of. Hypothetically, say that SC2 is a balanced game at the highest tier; then it is almost certain that it will be imbalanced at the lowest tier. This is where the concept of skill cap comes in, and this statistic doesn't address that at all, which is understandable, as trying to do so would require a statistic on each player under various conditions, i.e. map, opponent, game in the series, previous experience, known patterns, etc.

tl;dr: protoss deathball owns noobs, hence the qq.
KermitTheFro
United States 25 Posts
On May 24 2011 18:57 Warble wrote:
He wouldn't have used a ternary variable. He would have had separate binary variables for each map and matchup - one for XNC(TvZ), one for XNC(TvP), one for XNC(PvZ), and so on for each map.

Ah, of course...laziness on my brain's part =)

On May 24 2011 18:57 Warble wrote:
I was reluctant to post technical comments earlier because it's so easy to destroy and so hard to create, and I think this sort of thing is a good step up from what we've been seeing. I really saw that you put a lot of work into this. Since more people are interested in this now and the threads have been cross-referenced, I think I'll write up some proper feedback. In the other thread I mostly just talked about the lack of data you showed us - no ANOVA tables, for instance. Particularly because others like Shai have expressed interest in the data and we may see more people do tests, it would be good to improve things a bit before people put a lot of effort into analysing poor models. I've finished reading your article, footnotes, and this thread. I don't think the references require checking, so I'll start writing it up when I have some spare time. I'll probably have it ready in a fortnight.

We will all appreciate the feedback, I'm sure. It would be really exciting to start getting some good, well-explained methodology into these types of questions. Given the sheer amount of data that can be collected from something like SCGears, it seems that the only thing stopping deeper analysis is the availability of deep data sets on SC2 games.
Warble
137 Posts
Since Stra hasn't replied for a while, I'll assume he's gone on a prolonged absence, so this won't be directed at him but at others interested in conducting an analysis.

Stra's work was definitely a step up from everything else we've seen. With some more work we might get meaningful analysis of publicly available game data. I think it was an excellent effort in the right direction and the sort of thing we should see more of.

One thing that bugged me about his article was how keen he was to show there was imbalance. The proper conclusion from his findings is that there is no evidence of racial imbalance at the GSL level, but he only talked about how there are signs of imbalance, and everything was geared towards showing that. It's very easy to use statistics to show whatever you want to show; it just depends on how deeply you are willing to bury the deception. I think in this case we can put it down to eagerness - after all, all of us here know the emotions that are stirred when thinking about imbalance. A side effect of his eagerness would have been haste - since he's the first to do this, I find it completely understandable that he'd focus more on just getting something done to present to the community, even if it wasn't done in the most robust way. In terms of pioneering this sort of work, I think he's done well and has possibly spared us from a lot of bad "analysis" others may have posted in the meantime, since he's raised the bar.

My point here is that I don't want this to be seen as criticism of him, but as ideas on how to improve on this in future work by anyone in the community willing to do it. Since this can get rather involved, I'll focus on what I think is most important for others to keep in mind when conducting an analysis in the future.

Model Specification

I'll admit that when I had a look at the model I was puzzled: "How did he get that to solve?" Sure, he'd applied lasso - but he implied that he'd run the regression without lasso, since he said that he'd compared the two results. More on that in a moment. Let's take a look at the model first to see why I was puzzled. As we shall see, specifying models is hard.

This model is misspecified because the variables are linear combinations of each other, and so he should have perfect collinearity. Which means it shouldn't solve. At all. You would get an error if you tried to solve it. If you keep insisting, your computer will literally grow a leg and kick you in the groin. I suspect this is the real reason he hasn't been revisiting his thread: I was in hospital for 3 months when I tried it for the first (and last!) time, so he'll probably be gone for another 8 weeks.

In this instance there are quite a few linear combinations present and they should be quite obvious when you know to look for them:
You get the idea. There's a bunch of them.

The first 2 are instances of the dummy variable trap. He eliminated the constant, which can compensate for 1 instance of the dummy variable trap, but cannot compensate for 2 instances. The proof:

It is easier to show with a simpler model. Consider the classic male-female black-white model. That gives us 2 unique sets of linear combinations:

M + F = 1
B + W = 1

Note that M + F = B + W is just a combination of the unique sets. Consider the properly specified model:

y = b0* + b1*M + b2*B

Note that the stars denote our well-specified parameters (they're not multiplication signs). Consider a model that compensates for the dummy variable trap by removing the constant:

y = a1M + a2F + a3B

In this case we still have a closed form solution:

a1 = b0* + b1*, a2 = b0*, a3 = b2*

This was only possible because we'd dropped the constant. Otherwise we wouldn't have a closed form solution:

y = a0 + a1M + a2F + a3B gives a0 + a1 = b0* + b1*, a0 + a2 = b0*, a3 = b2*: three equations in four unknowns, so the parameters cannot be recovered uniquely.

Now consider the properly specified model:

y = b0* + b1*M + b2*B

Consider a model that tries to compensate for the 2 linear combinations by removing the constant:

y = a1M + a2F + a3B + a4W

In this case we don't have a closed form solution:

a1 + a3 = b0* + b1* + b2*
a1 + a4 = b0* + b1*
a2 + a3 = b0* + b2*
a2 + a4 = b0*

Four equations in four unknowns, but only three are independent (the first minus the second minus the third plus the fourth reduces to 0 = 0), so the parameters are not identified.
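A quick numerical check of the proof above, using invented 0/1 data and only numpy: dropping the constant absorbs one dummy-variable trap, but with two pairs of dummies the design matrix is still rank deficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
M = rng.integers(0, 2, n)
F = 1 - M                      # male/female pair of dummies
B = rng.integers(0, 2, n)
W = 1 - B                      # black/white pair of dummies
const = np.ones(n)

designs = {
    "one pair, no constant (solves)":         np.column_stack([M, F, B]),
    "one pair, with constant (trapped)":      np.column_stack([const, M, F, B]),
    "two pairs, no constant (still trapped)": np.column_stack([M, F, B, W]),
}
for name, X in designs.items():
    print(name, "-> columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))
```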
Removing the constant only allows us to compensate for one set of linear combinations. The dummy variable trap is very easy to avoid, so I won't go into the details here. The primary trouble comes from the third unique linear combination I presented above. The 2 sets of dummy variables are actually related to each other by several sets of linear combinations, and I don't think there's an easy way around this. I think this model must be abandoned.

And even if the data allowed us to solve it, consider what that actually means. Aside from mistakes in the data, it means that some players switched races. How reliable do you think it would be using the same skill variable for the player in both races? That would make us unable to trust our estimates, right? So to account for this, we'd create a new variable for them in their off race - and end up back with perfect collinearity.

Another thing to avoid is dropping the constant, like in this model. This is generally bad practice and biases our estimators. Sure, it can help us avoid the dummy variable trap - but that doesn't mean we should do it that way. Before removing the constant, we must always consider the consequences. In this case there was no justification for setting the intercept to 0. Even without all the other problems compounding it, consider the logic behind the idea: it's because we're forcing the unmodified win rate to be 50%. And, sure, that's what we would expect from our data - except some of the observations were removed, and so we no longer expect a 50% win rate. And even if we had kept those observations, it would still be desirable to leave the constant in and let it solve to 0 by itself, with the benefit that if it doesn't solve to 0 then we know to look deeper at the data. There's no reason to remove it, since all that will come from it is biased estimators, which means we cannot trust our results.

So how did he get it to solve? He got it to solve after all, didn't he? And he did it without using lasso as well. The problem is that the way he did it basically imitated lasso and resulted in biased estimators. The intuition of Elean and others was correct in that it didn't make sense to use lasso to solve this, even if they couldn't quite explain why. What he did was set some of the parameters to 0 - which is what lasso does - and that immediately eliminates the perfect collinearity problem. The only problem is that it makes the results meaningless, since it undermines the logic of the model. No matter how insignificant a parameter looks, if the logic behind the model dictates that a variable must be present, then we must keep it in our model. Setting its parameter to 0 eliminates the variable from the model and biases our other estimators.

In this model, he wanted to control for map and racial imbalance and player skill, so setting a player's parameter to 0 basically says, "This player has a base 50% chance of winning based on their skill." Considering that he only removed the variables for players who played few games, we can generally say that many such players would have been eliminated early and some may even have dropped out of the GSL, so their base win rate would likely have been lower than 50%. However, we cannot say this for all players (maybe some were upstarts who had only recently entered the GSL). What this means is that we cannot use lasso on this model. Doing so artificially deflates the variances, which makes the results look more significant than they really are, and introduces bias.
That's a very bad combination: it makes the results biased and makes them look significant at the same time. I would extend this by saying that we cannot use regularization techniques to solve this model at all. My impression is that Stra was overly worried about overspecification and introduced bias into the model as a result of his fears. Risk of overspecification is preferable to bias. The irony is that he was right, because the model was also overspecified. It's just that the only real solution is to create a new model, but he tried to salvage the model instead.

Analysis

My biggest complaint is the lack of results that were presented. There were no ANOVA tables, no tables summarising the estimates and standard errors, nor anything else to help us evaluate the results for ourselves. So future analyses should publish their tables. Just put them in the appendices.

We've already discussed Stra's concerns regarding overspecification. This issue was compounded by the fact that his tests showed that overfitting was not a big problem. (Although I would question the validity of those tests in this instance, I don't think it's an important topic to discuss here.) I can't comment much more on this due to the lack of tables summarising his estimates.

For now I'll proceed under the assumption that the estimates for player skills were significant and in accordance with his tests showing that the model wasn't overspecified. In that case, the lack of significance in the racial imbalance parameters means there's no evidence of racial imbalance, while there is evidence that player skill plays a role in GSL results. Interestingly, this lack of significance for the racial imbalance parameters is despite the estimates being biased and having inflated significance. It may be possible that the estimates for the player skills were also not significant, which is quite plausible considering the high level of multicollinearity we expect from this model. Our inability to assess this goes back to my primary complaint: the lack of presented results in the report.

He displayed graphs showing that the estimates for player skill centred above 0 but didn't talk about the primary cause of this, which was that he had set the parameters for players with few games to 0. If he hadn't done that, those players would probably have had negative estimates and the estimates for player skill would have centred closer to 0. This centring near 0 would not necessarily have been the case if he'd had a properly specified model with a constant.

I would advise downplaying imbalance. A lot of the tests in this analysis seemed geared to show imbalance, and he didn't highlight the point that the results showed no evidence of imbalance. Since he should have known that most of those reading his article would not have much of an understanding of statistics and would thus jump immediately onto the numbers for imbalance that appear non-zero while ignoring standard errors, it would have been prudent if he'd downplayed his numbers and placed more emphasis on the fact that they don't show imbalance. This is something that's too easy to forget, and I urge those who publicly release the results of statistical tests on imbalance in the future to keep it in mind. The problem with hypothesis testing is that we can never prove the null hypothesis; we can only fail to disprove it.
Considering the community's propensity towards assuming imbalance, they will likely misinterpret any statistical conclusions by saying, "But the possibility still exists…" or, "But it almost looks significant," or even the reverse, "The data shows no imbalance," without realising it's a moot point. So care must be taken when presenting the conclusions of these tests, and maybe we can find a way to present any conclusions that minimises these misunderstandings (I will be interested in hearing what methods others come up with).

I was quite surprised not to see a test on whether the variables for each matchup were jointly significant - that is to say, all of the maps jointly for TvZ, then for TvP, then PvZ, or even all 3 combined (a rough sketch of such a test appears further down). If they all come back jointly insignificant, we have more weight to declare that there is no racial imbalance. With that said, even if we were as concerned as he was about overspecification, I wouldn't respecify the model without them even if they were jointly insignificant, since that would bias our other variables. The benefit of such a test is that it also provides much better conclusive proof if imbalance does exist. Even if the estimates for imbalance on each map were insignificant, if they were jointly significant then we would know that racial imbalances can affect the matchups, i.e. that they do exist in some form, and that it's just difficult to pinpoint where.

I'm not sure why he did bootstrap tests, as I couldn't see any rationale for them. The bootstrap tests essentially found (1-p) and they were in line with what we would have found calculating the p-value using just the estimates and standard errors. Hence they also supported our conclusion that there is no evidence of imbalance. I'm not quite convinced by his reasoning that bootstrap tests are necessary just because the logit model has no closed-form solution. I'm not too big on the maths, but I think the lack of a closed-form solution is due to transforming the observations for the dependent variable via log(pi/(1-pi)), which for a binary variable is undefined. So I think the values are just adjusted a little so they're not precisely 0 and 1 and converted that way, and this obviously has no closed-form solution. If there's anyone here studying statistics who is familiar with the process, I would love clarification from you. In any case, assuming I'm right, while this means we can scale the estimates, it doesn't actually affect their significance nor introduce any relative bias, so I believe we can just use basic inference methods. So bootstrapping is probably unnecessary for our purposes.

Further Design

I think it's important to make the data available to others. This will allow others to verify the work. So I would encourage anybody publishing their analysis also to publish their data sets. The lack of tables in Stra's article was also troublesome, so I think it is a good idea to make them available in the appendices in the future.

I think he did well to identify the other drawbacks and uses of the analysis. I agree with his conclusion that this sort of analysis will be useful in identifying imbalanced maps. That sort of information would be useful to players and map makers, and in balance discourse. I think imbalance on individual maps would provide an easy channel to help balance the game, and if there is also imbalance in aggregate then we could start thinking about tweaking the races themselves.
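The joint test suggested earlier can be run as an ordinary likelihood-ratio test. Here is a hedged sketch with statsmodels on simulated data; the split into "skill" and "imbalance" stand-in columns is purely illustrative and is not the GSL data.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 400
skill_cols = rng.normal(size=(n, 5))                            # stand-ins for player-skill terms
imbalance_cols = rng.integers(0, 2, size=(n, 3)).astype(float)  # stand-ins for map-matchup terms

X_restricted = sm.add_constant(skill_cols)
X_full = sm.add_constant(np.column_stack([skill_cols, imbalance_cols]))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-skill_cols[:, 0]))).astype(int)

full = sm.Logit(y, X_full).fit(disp=0)
restricted = sm.Logit(y, X_restricted).fit(disp=0)

# Likelihood-ratio statistic for "all imbalance terms are jointly zero".
lr = 2 * (full.llf - restricted.llf)
df = X_full.shape[1] - X_restricted.shape[1]
print("LR statistic:", lr, "p-value:", stats.chi2.sf(lr, df))
```

A large p-value would mean the imbalance terms are jointly insignificant; a small one would say imbalance exists somewhere, even if no single map's estimate is individually significant.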
As an extension, I think it's important to consider spawn positions. For example, I believe that although Metalopolis looks balanced overall, it is heavily imbalanced based on spawn positions. Suppose the overall statistics for TvZ show that both races have a 50% win rate on Metalopolis, and ignore close air spawns. It's commonly accepted that close spawns favour T. If T has a 70% win rate on close spawns, then for Metalopolis to have a 50% win rate overall, far spawns must necessarily favour Z by 70% (assuming close and far spawns occur about equally often). This essentially introduces luck into the TvZ matchup on the map, with spawns determining which race is favoured, and the map feels horrible to play as a result. Hence it is desirable to account for spawn positions if possible. This is particularly salient considering that many tournaments now exclude close spawns on this map. This represents a significant change in the map, and so statistics for Metalopolis under old policies allowing close spawns will not be applicable to Metalopolis under current policies.

Stra identified a major difficulty with the data: many players only had 2 observed games, and almost half of the players had 5 or fewer games recorded. This makes it difficult to use a model that specifies player skill as a key variable, and we would expect high variances. We either need a model that doesn't specify player skill, or we need to transform the data in some way. As discussed earlier, we cannot set any of the parameters to 0. So we may be better off just removing observations for players with few games from the data so that we can remove their variables from the model. I believe Stra considered this, since he discussed the need for data reduction. That's right - we can improve our analysis by using less data. Let's see if that gets quoted out of context, shall we? :-) We may also consider looking for data in round robin tournaments, if any are frequently held.

With regard to formulating a new model, I have a few ideas but am hesitant to post them without having fully analysed them myself, since I don't want others to do a lot of work based on something I post only for me to later say, "Oh, but I found a drawback." However, I have nothing against suggesting a few likely directions and letting you run with the ideas and do your own models, since then it's all on you and I have already provided this caution. :-)

The challenge is that we can only use publicly available data, and I'm assuming that we only want to balance for the top level, so that means we can use tournament results. Since we are mostly interested in racial imbalance, we will need to retain those variables (while avoiding the dummy variable trap). Paradoxically, this means we cannot have variables for player skills. As we have seen here, that would just lead to problems with perfect collinearity. This does not necessarily pose an intractable problem, since the variables for imbalance will capture the effects of imbalance so long as we are able to capture the effects of player skill in such a way that bias is not a big problem. In my opinion, the most likely avenues to explore at the moment are the use of proxy and instrumental variables. In particular, I have been looking at possible proxy variables that can stand in for player skill. I'll leave it at that.
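One possible shape of the proxy-variable idea, sketched with statsmodels on invented data: skill enters through a single observable stand-in (here a made-up prior rating difference) instead of per-player dummies, while the imbalance indicators stay in the model. This is only a structural illustration, not a worked-out model, and the proxy itself is hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
rating_diff = rng.normal(size=n)                             # hypothetical proxy: P1 rating minus P2 rating
map_matchup = rng.integers(0, 2, size=(n, 3)).astype(float)  # stand-in map-matchup imbalance indicators

X = sm.add_constant(np.column_stack([rating_diff, map_matchup]))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-0.8 * rating_diff))).astype(int)

fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)  # skill is captured only through the proxy column
```

The collinearity problems disappear because there is no longer one column per player; whether a good proxy for skill actually exists is the hard part.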
Final Words

I think Stra's effort is definitely much better than the other statistical "analyses" we have been seeing, although those shouldn't be discouraged altogether (from my other thread: if a current graph of win rates shows a rock-paper-scissors situation, that is a strong indication that racial imbalance exists somewhere). I think it should be possible to get a meaningful model, although it is harder than it seems at first, as we've seen here. I think we can get some meaningful results and that it's a matter of getting people with the right knowledge and time together to do it.

There will need to be some caution when it comes to publishing our findings, though, since we will need to keep in mind how the results will appear to those without training in statistics. I say this because I believe there is a good reason Blizzard has stopped releasing many statistics from the game, and the community is apt to get overexcited.

There is also a question of motivation. Even if we do find imbalance, it won't matter for the majority of players, since we're only examining the dynamics at the very top level. This could serve as an interesting exercise, may have applications in improving the game as a spectator sport, and may be of interest to players considering going pro. It could also be useful for map designers. However, all we'll be able to find are the balances for the game in its current state. Further strategic development by the races, without any balance changes by Blizzard, could just as easily change the estimated imbalances in the future. In common parlance: the metagame may still evolve. So there is a risk that any results could be used to push for unwarranted balance changes. This probably isn't a big enough concern to stop further analysis from being conducted, since curiosity is a powerful force and analysis will be conducted anyway, so perhaps things will still turn out well if those conducting the analyses are moderate in their conclusions.