For the time being, I am not going to pursue the Support Vector Machines approach to classifying replays, because we already have an instance (romad) that can learn a replay signature just by looking at one sample. I take that as proof that a much simpler replay classifier is possible, and that feature selection is by far the hardest part of the task.
Support Vector Machines might be more accurate in the long run, after I amass hundreds of samples per player, but realistically, the main problem with identifying replays is their scarcity.
I'm using the Nearest-Neighbors algorithm to classify replays, with a slight modification. I am currently designing it to be half-human, half-machine.
There are currently five categories, the fifth being the combination of the previous four.
1) Building hotkeys
2) Hotkey actions
3) Hotkey spam
4) Hotkey usage
5) Total
The program produces a "closeness" rating for a given replay in each category, and looks for replays in the database that are closest to the given replay.
For the database, the program uses replays from TSL (there are a ton of replays from TSL!).
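A minimal sketch of this kind of per-category lookup in Python, under stated assumptions: the `Replay` class, the category keys, the numeric feature vectors and the Manhattan distance are all hypothetical stand-ins, since the actual features and closeness measure are not described here.

```python
from dataclasses import dataclass

# Hypothetical keys for the four base categories.
CATEGORIES = ("building_hotkeys", "hotkey_actions", "hotkey_spam", "hotkey_usage")


@dataclass
class Replay:
    player: str
    features: dict  # category name -> list of numbers (assumed representation)


def distance(a, b):
    """Manhattan distance between two equal-length feature vectors (an assumed metric)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def closeness(query, candidate, category):
    """Closeness of a database replay to the query in one category; smaller is closer."""
    return distance(query.features[category], candidate.features[category])


def nearest(query, database, category, k=50):
    """Return the k closest database replays as (player, distance) pairs, closest first."""
    scored = [(r.player, closeness(query, r, category)) for r in database]
    scored.sort(key=lambda pair: pair[1])  # stable sort: ties keep database order
    return scored[:k]
```

Calling `nearest(query, database, "building_hotkeys")` would give the top fifty for the first category; the same call with another category key gives that category's list.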
Example
I put in the recent ret vs. IdrA replay. Here are the results:
1) Building hotkeys
Top fifty consists entirely of IdrA.
2) Hotkey actions
The top four are all IdrA.
The top fifty is mostly IdrA.
The top fifty also has many ret replays.
The top fifty also has some BRAT_OK replays and some Haypro replays.
3) Hotkey spam
EDIT: Almost entirely IdrA. My previous program had a bug.
4) Hotkey usage
Lots and lots of Horror, and some IdrA. The thing is, every replay in the top fifty has NO difference from the given replay, so they are all tied. That is, hotkey usage is a pretty poor discriminant. I'm sure many IdrA replays have the same hotkey usage, and if I listed the top 100, IdrA would appear.
5) Total
Top fifty consists entirely of IdrA.
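The per-category summaries above come down to counting player labels among the nearest neighbors and checking how many of them tie with the best match. A small Python sketch of that bookkeeping, assuming each category's result is a list of `(player, distance)` pairs sorted closest first (the function name and the input format are illustrative, not the program's actual interface):

```python
from collections import Counter


def summarize_top(neighbors, k=50):
    """Summarize the top k of a ranked list of (player, distance) pairs.

    Returns each player's share of the top k and the number of entries
    that tie with the closest one. A large tie count suggests the
    category does little to separate players.
    """
    top = neighbors[:k]
    if not top:
        return {"shares": {}, "tied_at_best": 0}
    counts = Counter(player for player, _ in top)
    best = top[0][1]  # the list is sorted, so the first entry is the closest
    tied = sum(1 for _, d in top if d == best)
    return {
        "shares": {player: n / len(top) for player, n in counts.items()},
        "tied_at_best": tied,
    }
```

If `tied_at_best` equals `k`, as in the hotkey usage case, the ordering inside the top k is arbitrary and the label shares say little.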
This example shows that the player often will not come out on top in every category. Put together, however, the categories are likely to place the player at the top.
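One way to picture the combined category is as a sum of the per-category distances. The Python sketch below assumes exactly that; the unweighted sum, the function names and the toy numbers are assumptions for illustration, and a real combination might weight or normalize the categories differently. It also assumes every database replay has a distance in every category.

```python
def combine(per_category):
    """Sum distances across categories.

    per_category: {category: {replay_id: distance}}
    Returns {replay_id: total distance}.
    """
    totals = {}
    for distances in per_category.values():
        for replay_id, d in distances.items():
            totals[replay_id] = totals.get(replay_id, 0) + d
    return totals


def rank(distances):
    """Return replay ids ordered from closest to farthest."""
    return sorted(distances, key=distances.get)


# Toy numbers only: replay "x1" loses one category but wins overall.
toy = {
    "building_hotkeys": {"x1": 0, "y1": 2},
    "hotkey_actions": {"x1": 3, "y1": 1},
    "hotkey_spam": {"x1": 1, "y1": 4},
}
overall = rank(combine(toy))
```

Here `y1` ranks first in hotkey actions, yet `x1` has the smaller summed distance and ends up first in `overall`.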
I will be adding / removing / refining categories in the future.
Also, eventually, there will be an online database and an online real-time classifier. For that I need a Linux version of repasm.dll, which serves as the PHP extension on Windows (I'm using Taiche's RepASM).
Anyway, please send me replays of foreigners if you'd like to test my current program!