http://www.teamliquid.net/forum/viewmessage.php?topic_id=429661
Please check that out before reading this!
That control group configurations allow skilled individuals to identify smurfs
has been known since the days of Brood War and was notably demonstrated by roMAD. I
present a small collection of heuristics for systematically evaluating the
similarity between different hotkey setups and subsequently a method for
replay-based player identification. Some of these techniques are currently
implemented in vroMAD. These techniques will hopefully allow for the automatic
identification of players from large repositories of replay data.
Similarity Measures
Similarity measures are used in a wide variety of disciplines ranging from
applied mathematics to bioinformatics. They are commonly used in similarity
matrices, which can be thought of as a graph describing how close a given data
point is to another. We can apply the idea of a similarity measure to hotkey
setups used by various players. The relevant question posed is: given two
players, how can we quantify how similar their hotkey setups are?
If we have a quantification of similarity between hotkey setups, we can do
several things:
- Generate a ranking of similarity between players from an anonymous/"unknown"
replay and a database of replays with confirmed player identifications
- Cluster players from replays into groups with similar setups
- From the clusters, classify players
At the moment, vroMAD only does the first of those things.
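To make the ranking idea concrete, here is a minimal sketch in Python. It is not vroMAD's code: the function names, the player names and the placeholder similarity (1 / (1 + L1 distance)) are all assumptions made for illustration. Any similarity function that takes two vectors can be plugged in.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def placeholder_similarity(a: Vector, b: Vector) -> float:
    # Stand-in measure only (assumption): 1 / (1 + L1 distance).
    return 1.0 / (1.0 + sum(abs(x - y) for x, y in zip(a, b)))


def rank_candidates(
    unknown: Vector,
    known: Dict[str, Vector],
    similarity: Callable[[Vector, Vector], float],
) -> List[Tuple[str, float]]:
    # Score every confirmed player against the unknown replay,
    # then sort from most similar to least similar.
    scored = [(name, similarity(unknown, vec)) for name, vec in known.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical database of confirmed players and one unknown replay.
known_players = {
    "PlayerA": [0, 0.5, 0.2, 0.2, 0.3, 0.1, 0, 0, 0, 0],
    "PlayerB": [0, 0.1, 0.1, 0.0, 0.6, 0.4, 0.2, 0, 0, 0],
}
unknown_replay = [0, 0.45, 0.25, 0.2, 0.3, 0.1, 0, 0, 0, 0]

ranking = rank_candidates(unknown_replay, known_players, placeholder_similarity)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The top entry of the ranking is the best guess for who played the unknown replay; the scores themselves only matter relative to each other.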
Now, we move into the juicy details and introduce three similarity models:
Frequency-Distribution Based Similarity
This "low-hanging fruit" (I know people hate this phrase) is the similarity
measure currently implemented in vroMAD because of its extreme simplicity.
Frequency-Distributions of hotkey selection are generated from replay data. That
is, given the player data from a replay, we proceed as follows:
1. Extract all hotkey selections
2. Bin each of these selections according to their number {0,1,2,3,4,5,6,7,8,9}
3. Generate a frequency distribution vector in R^10 space where each element
corresponds to the frequency of selection of a specific hotkey.
Example vector: [0 0.5 0.2 0.2 0.3 0.1 0 0 0 0] (selections/second)
4. Calculate a similarity using a Gaussian function: given player 1 with
frequency distribution x1 and player 2 with frequency distribution x2, we
compute exp(-(x1-x2)^2 / (2*sigma^2)). In this case, we take sigma as the
standard deviation of the data with respect to each of the dimensions.
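The four steps above can be sketched roughly as follows. This is one possible reading, not vroMAD's actual implementation: I assume the hotkey selections have already been extracted from the replay as a list of numbers, that sigma is applied per dimension, and that a dimension with zero spread falls back to sigma = 1. All function names are made up for the example.

```python
import math
from collections import Counter
from statistics import pstdev


def frequency_vector(selections, duration_seconds):
    # Steps 1-3: bin selections by hotkey number 0-9 and divide by game
    # length to get selections/second for each hotkey.
    counts = Counter(selections)
    return [counts.get(key, 0) / duration_seconds for key in range(10)]


def per_dimension_sigma(vectors, fallback=1.0):
    # Standard deviation of the data along each of the 10 dimensions.
    # The fallback for zero spread is an assumption, not from the post.
    sigmas = []
    for dimension in zip(*vectors):
        spread = pstdev(dimension)
        sigmas.append(spread if spread > 0 else fallback)
    return sigmas


def gaussian_similarity(x1, x2, sigmas):
    # Step 4: exp(-sum_i (x1_i - x2_i)^2 / (2 * sigma_i^2)).
    exponent = sum(
        (a - b) ** 2 / (2 * s ** 2) for a, b, s in zip(x1, x2, sigmas)
    )
    return math.exp(-exponent)


# Hypothetical selection logs from three 10-second snippets.
replay_a = frequency_vector([1, 1, 2, 3, 4, 1, 2, 1], 10.0)
replay_b = frequency_vector([1, 1, 2, 3, 4, 1, 2, 5], 10.0)
replay_c = frequency_vector([5, 5, 5, 6, 6, 7, 0, 0], 10.0)

sigmas = per_dimension_sigma([replay_a, replay_b, replay_c])
sim_ab = gaussian_similarity(replay_a, replay_b, sigmas)
sim_ac = gaussian_similarity(replay_a, replay_c, sigmas)
print(sim_ab, sim_ac)
```

Here replay_a and replay_b lean on the same keys, so they should score much closer to each other than replay_a and replay_c do.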
Note that this measure is theoretically race-agnostic. That is, it is not
directly influenced by a player's race, as it is not mapped to any race-specific
unit or buildings. This is what I refer to as a "roMAD-complete" similarity
measure, as it can be used to inform on players suspected of off-racing. (roMAD
was famously able to identify off-racing progamers just from their hotkey
setups.)
Fixed Unit Mapping Based Similarity
This is the "most-obvious" model for identification, and works as follows: given
a player and a race, we generate a vector in R^10 where the value of each
dimension corresponds to a race-specific unit e.g. Drone/Hatchery/Queen/Roach
for Zerg. For hotkeys with multiple types of units bound, we simply choose the
most frequent unit or adopt a similar technique. This technique is not
"roMAD-complete" unless we choose a very general mapping of unit types to
numbers. With a hotkey vector for each player, we apply the Gaussian function as
described previously.
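For the sake of example, say we have a Zerg player and 1 maps to Roach, 2 to
Hydra, 4 to Hatchery, 5 to Queen, and 7 to Infestor. -1 maps to no selection.
Each position in the vector is a control group (0 through 9) and each value is
the code of the unit bound to it.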
Example vector: [-1 1 2 7 4 5 -1 -1 -1 -1] (unitless)
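A small Python sketch of how such a vector could be built, assuming the replay has already been reduced to a list of unit names seen in each control group (that input format and the function name are my assumptions). The unit codes are the ones from the example, and ties between units are broken arbitrarily by most_common.

```python
from collections import Counter

# Unit codes taken from the example in the text.
UNIT_CODES = {"Roach": 1, "Hydra": 2, "Hatchery": 4, "Queen": 5, "Infestor": 7}
NO_SELECTION = -1


def fixed_mapping_vector(observations):
    # observations: dict of control group number -> list of unit names
    # observed in that group over the replay (hypothetical input format).
    vector = []
    for hotkey in range(10):
        units = observations.get(hotkey, [])
        if not units:
            vector.append(NO_SELECTION)
            continue
        # For groups with several unit types, keep the most frequent one.
        most_common_unit, _ = Counter(units).most_common(1)[0]
        vector.append(UNIT_CODES[most_common_unit])
    return vector


observations = {
    1: ["Roach", "Roach", "Hydra"],
    2: ["Hydra"],
    3: ["Infestor"],
    4: ["Hatchery"],
    5: ["Queen", "Queen"],
}
zerg_vector = fixed_mapping_vector(observations)
print(zerg_vector)
```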
Floating Unit Mapping Based Similarity
We can improve on the formulation of "Fixed Unit Mapping Based Similarity".
Each of these techniques attempts to map a hotkey setup into some
vector space and compute a similarity based on distance. However, it can be
seen that "Fixed Unit Mapping Based Similarity" doesn't generalize well to the
concept of distance. That is, (given two Zerg players) if one binds
control-group 1 to Roaches only, and the other to Zerglings only, what is the
distance between their setups? Even if we say ground units are closer to other
ground units and further from air units and even further from buildings, "Fixed
Unit Mapping Based Similarity" remains an awkward model. To address this
problem, I introduce the "floating" version of this model. This model switches
the organization of the vector: that is, we instead define classes or types of
units a priori as the dimensions of our vector and assign values based on the
control-group number. Here, "floating" refers to the dimension of the vector.
This model generalizes better to the idea of a distance: we can say hotkey
setups where a given type of unit is mapped 1 key apart are closer than hotkey
setups where the same type of unit is mapped 4 keys apart. To compute a
similarity from this model, we again apply the Gaussian function described
previously. Note that the "roMAD-completeness" of this model depends on whether
we choose classes to be abstract such as "air units/ground units/buildings" or
race-specific units.
For the sake of example, we define the first dimension as
Marine/Marauder/Medivac, the second as Viking/Banshee/Raven, the third as
Spellcasters, the fourth as Command Centers, the fifth as ground production, the
sixth as air production, and the seventh as upgrades.
Example vector: [1 3 2 6 4 5 6] (unitless coordinates, but generalizes to
distance)
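The following Python sketch builds the example vector and shows why distance behaves sensibly in this model. The class names are my own shorthand for the seven dimensions above, and the sketch assumes every class is bound to some control group, since how to encode an unbound class is left open here.

```python
import math

# Dimensions fixed a priori; names are shorthand for the classes above.
CLASSES = [
    "bio",                # Marine/Marauder/Medivac
    "air",                # Viking/Banshee/Raven
    "spellcasters",
    "command_centers",
    "ground_production",
    "air_production",
    "upgrades",
]


def floating_mapping_vector(bindings):
    # bindings: dict of class name -> control group number.
    # Assumes every class is bound to some control group.
    return [bindings[unit_class] for unit_class in CLASSES]


base = {
    "bio": 1, "air": 3, "spellcasters": 2, "command_centers": 6,
    "ground_production": 4, "air_production": 5, "upgrades": 6,
}
terran_vector = floating_mapping_vector(base)

# Same setup, but with the bio army moved 1 key away and 4 keys away.
one_key_apart = floating_mapping_vector({**base, "bio": 2})
four_keys_apart = floating_mapping_vector({**base, "bio": 5})

dist_near = math.dist(terran_vector, one_key_apart)
dist_far = math.dist(terran_vector, four_keys_apart)
print(terran_vector, dist_near, dist_far)
```

Moving one class by one key gives a distance of 1, and moving it by four keys gives a distance of 4, which is the ordering we want from the model.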
Note on the Gaussian function used:
The Gaussian function used has a range of (0, 1], and essentially operates
on the raw Euclidean distance of the vectors. Identical vectors have similarity
1, whereas very dissimilar vectors will have a similarity close to 0. For
experimental purposes, vroMAD also includes the ability to rank based on the raw
Euclidean distance. A high similarity corresponds to a low Euclidean
distance, and vice versa.
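A quick Python sketch of that relationship, simplified to a single scalar sigma (my assumption, chosen arbitrarily as 3.0) and using hypothetical player names and vectors. Ranking by descending similarity and ranking by ascending Euclidean distance should give the same order.

```python
import math


def gaussian_similarity(x1, x2, sigma=3.0):
    # exp(-d^2 / (2 * sigma^2)), where d is the Euclidean distance.
    # A single scalar sigma is a simplification for this sketch.
    distance = math.dist(x1, x2)
    return math.exp(-(distance ** 2) / (2 * sigma ** 2))


unknown = [1, 3, 2, 6, 4, 5, 6]
known = {
    "PlayerA": [1, 3, 2, 6, 4, 5, 5],
    "PlayerB": [2, 3, 1, 6, 4, 5, 0],
    "PlayerC": [4, 1, 2, 5, 3, 6, 0],
}

# Highest similarity first.
by_similarity = sorted(
    known, key=lambda name: gaussian_similarity(unknown, known[name]), reverse=True
)
# Lowest distance first.
by_distance = sorted(known, key=lambda name: math.dist(unknown, known[name]))

print(by_similarity)
print(by_distance)
```

The two orderings only differ in edge cases, such as ties or distances so large that the similarity underflows to 0.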