At any rate, I have, on and off for the past couple of years, tracked results in some amount of detail; after my first two flailing efforts I was immensely surprised and gratified when Kiante contacted me and asked if I would be willing to write something after Round Three had finished. (Much to my chagrin, I did not manage to complete the write-up before playoffs started, but it is in this week's news post - by far the least professional piece there, alas.)
However, I imagine the number one question my numbers raise is, "Where did the numbers come from?" That is what I want to attempt to answer here. First of all - which I forgot to mention - an unexpected result "swings" the number by at least 4 - from +2 to -2, or maybe more. So the number I gave can be read as the chance that the underdog (whichever race it is) will win a game: many were around 1 (positive or negative). A PvT score of -1, for example, would mean that on that map about 1 out of 4 Terrans (or a little fewer: maybe 1 out of 5 or 6) would win as an underdog. A ZvP balance of +0.60 would mean a Zerg underdog would win about 1 out of 6 or more times.
As for why those numbers: judging the actual balance of a map is difficult in all but extreme cases (and even in this case, there were fascinating ZvP results in the minor league which the necessities of making playoffs forbade the A-teams from attempting to duplicate). Another well-known example of the difficulty is Nostalgia: by pure numerical results, one of the most balanced maps ever made, but the complaints extended for years and resurface with any mention of the map.
But interpreting how much of any existing balance issue is actually the map's fault is difficult. In the first place, any map is used in the context of the existing game (which for Brood War is at this point fixed) and metagame (which, as in any competitive endeavor, is constantly shifting a little bit at a time). Furthermore, the habit of the Brood War leagues has been to cycle maps fairly quickly - probably too quickly - so that for all but those few maps which receive instant acclaim, the sample size for judging their actual quality is incredibly small - and the results, in large part, depend on the players involved. If we take a map like Battle Royal, which I mentioned above, and by some fluke Flash vs Hyuk is the only matchup played for seven games, Flash will probably win at least four, and leave us with a false impression of the map's balance - at least until we remember how godawful Hyuk is at ZvT when he is not in the OSL Round of 16 (and how that ever happened once, let alone multiple times, is one of the great mysteries of the universe - but I digress).
I want to illustrate the difficulties further by considering the map Electric Circuit, which I nominated as the best map so far of the 2011-2012 SK Planet Proleague (Season 1), in its PvT matchup.
The matchups in the regular season were as follows:
Jaehoon vs Flash
BeSt vs Light
Movie vs fantasy
Dear vs Sea
Dear vs Light
Tyson vs Leta
Bisu vs Canata
M18M vs Flash
With no research and no "context", I would favor (in order) Flash, BeSt, fantasy, Sea, Dear, Tyson, Bisu, Flash and expect a final record of 4-4. Taking Sea's slump into account, I would favor Dear in both games: 5-3. As it happened, M18M was able to pull off an upset (by any account), while Flash and fantasy won their "obviously favored" games, for a final record of 6-2.
And yet, my (current) attempt at statistical analysis suggests this map is Terran-favored. The only clear evidence I have for this is that Movie "should have" won his game (in my opinion): did he "throw it away" or did the map prevent him from finishing off his advantage? Add in the fact that Light, of all Terrans, with remarkably solid if uninspired play, almost beat BeSt. What my current evaluation turns on is an (over-?)reliance on individual game probabilities - and a few quirks of results along the way.
I guessed going in that the method which would most simply yield accurate results would be to base predictions on each player's record over the previous year. The result - for this map - was that, to my surprise, the three "obvious favorites" (Flash, fantasy, Flash) were in fact predicted to lose their individual games (to Jaehoon, Movie, and M18M respectively). Jaehoon going in was 12-5 over the past year against Terran; compared to Flash's 19-11, this made him numerically the favorite. There is no way to account for the fact that Jaehoon beats bad and average Terrans while his losses are mainly to the best players (say, Flash), whereas Flash's losses are due largely to inherent game imbalance and are scattered sort of randomly - or that half of them were to BeSt and Jangbi. A potential solution would be to use ELO in some form instead, but I do not have access to what we might call "game-time ELO" - and ELO would present the problem of underrating newer players (like Dear). For the other two: Movie is simply much better at PvT than we usually remember; M18M suffered from over-prediction due to a small sample size of games in the last year. The result is that going game-by-game, the predicted result of the 8 games was 8-0 for Protoss. When Terran won two - never mind that it was Flash and fantasy winning the games, or that Flash lost to M18M - this generated a "Terran favored" review for the map.
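The record-based method can be sketched in a few lines - a minimal reconstruction, not my actual tooling, using only the two records quoted above (the `win_rate` helper is mine, for illustration):

```python
def win_rate(wins, losses):
    """Past-year win rate against the relevant race."""
    return wins / (wins + losses)

# Records quoted above: Jaehoon 12-5 vs Terran, Flash 19-11 vs Protoss
jaehoon = win_rate(12, 5)   # ~0.706
flash = win_rate(19, 11)    # ~0.633

# The method simply favors whoever has the better record
predicted = "Jaehoon" if jaehoon > flash else "Flash"
print(f"{jaehoon:.3f} vs {flash:.3f} -> predicted: {predicted}")
```

This is exactly how the method ends up calling Jaehoon the numerical favorite over Flash: it sees only the rates, not who the wins and losses came against.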
Which brings us back to my example above: a score of -1.00 came from "underdog" Terrans winning 2 out of 8, or 1 out of 4, games on the map. What to do about it? I am not entirely sure: it is partly a problem with my current modeling which I am going to make an effort to correct: individual predictions are clearly carrying too much weight.
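For what it is worth, the arithmetic behind that -1.00 can be sketched as follows - a speculative reconstruction, assuming (as this example suggests) that each upset swings the total by 4 in the underdog race's direction and the total is then averaged over the games played; the sign convention (negative meaning Terran-favored in PvT) is likewise an assumption:

```python
def map_score(games, upset_wins, sign=-1):
    # Assumption: each unexpected result swings the score by 4
    # toward the underdog's race; sign=-1 marks Terran in PvT.
    return sign * 4 * upset_wins / games

# Electric Circuit PvT: "underdog" Terrans won 2 of 8 games
print(map_score(8, 2))
```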
However, there was an additional problem which would be corrected over a larger sample size: the fact is that Electric Circuit has so far seen either excellent TvP players (Flash and fantasy) or terrible TvP players (Light, Sea-in-slump, Leta, Canata). There have not been any "average" or "merely good" TvP Terrans on the map - not that there are many left in the current Proleague. Sea if he un-slumps; firebathero and his new-found skill; maybe TurN or Mind?
At any rate, I hope this gives the curious some further idea of the things I am looking at. One thing I want to look at incorporating is average win probability: on Electric Circuit it came out at 59% for Protoss, which would lead us to expect a 5-3 result over 8 games, and 1 off of that is clearly acceptable.
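The expectation there is just simple arithmetic - a quick sketch, with the 59% figure taken from above:

```python
avg_protoss_p = 0.59   # average per-game Protoss win probability on the map
games = 8
expected_wins = avg_protoss_p * games  # ~4.7, i.e. roughly a 5-3 series
print(f"expected Protoss wins: {expected_wins:.2f} of {games}")
```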
* I am not ignoring the fact that trying to predict team league results has turned out to be largely a fool's errand so far. It is probably possible, but the amount of work it would take even to narrow down which are the critical areas where we need data to evaluate is immense.