I'm currently working on a web scraper which would allow these lists to be updated automatically, without any manual labor.
https://github.com/eqy/autofan
The current development environment is Arch GNU/Linux, and I've ported it over to Ubuntu 12.04 with few changes (after working around some incompatibilities with Ubuntu's slightly staler libraries). If you'd like to build it, simply clone the repository and make sure you have the libraries specified in the README. If you encounter any problems trying to build, let me know.
It is trivial to build:
http://i.imgur.com/I4EmOgX.gif
And trivial to use:
http://i.imgur.com/H05aOgQ.gif
It's still in the very early stages of development, so there are a number of obvious issues:
- The rules for adding people to fan clubs are still very simple at the moment, and there are only a few of them.
- People can be added multiple times. The most obvious cause of this is quotes (see the next item). This is trivially fixable with any number of tools (e.g. sort -u, a regex pass, etc.).
- Quotes are not handled well--the parser can't distinguish between what a poster actually said and what the person they're quoting said. This is a slightly annoying issue because, despite the copious tables that TL uses, quotes are not explicitly tables--so the parser I'm using doesn't identify quotes as separate elements. The most obvious solution I see at the moment is some kind of regex matching, where posts that say "...wrote:" are checked against previous posts. Finding posts that likely contain quotes is trivial (see the sketch after this list), but the matching step would be computationally expensive.
- It still spits out mountains of debug information to stderr (actual output goes to stdout), because I need that information to add features and fix the issues above.
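To make the quote-detection idea concrete, here is a minimal sketch of the cheap first step: flagging posts whose body contains a "...wrote:" marker. It uses boost::regex since Boost is already a dependency; the posts vector and the exact marker format are assumptions for illustration, not the actual tltopic data structures.

    #include <boost/regex.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        // Hypothetical stand-in for parsed post bodies--not the actual
        // tltopic data structures.
        std::vector<std::string> posts;
        posts.push_back("First post, no quote here.");
        posts.push_back("On August 30 2013 SomeUser wrote:\n"
                        "First post, no quote here.\n"
                        "I agree!");

        // Assumed marker: a username followed by "wrote:".
        const boost::regex quote_marker("\\w+ wrote:");

        for (std::size_t i = 0; i < posts.size(); ++i) {
            if (boost::regex_search(posts[i], quote_marker)) {
                // Candidate for the expensive matching step against
                // previous posts; here we only flag it.
                std::cout << "post " << i << " likely contains a quote\n";
            }
        }
        return 0;
    }

The expensive part would then be comparing each flagged post's quoted block against every earlier post in the thread, which is why I've only sketched the cheap filtering step here.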
The plan is to eventually either port it to Windows after slapping on some sort of Qt-based interface, making it usable for the 99% of TL, or to launch it as some sort of web app. Both are equally foreign territory for me.
If you feel like contributing, feel free to send me a pull request! When development quiets down, I also plan on repackaging the core of this project (a library called tltopic) so that it can be used for other projects that want to parse TL threads.
History of the project:
(I don't have DB access; scraping is all I can do #YOLO)
I began thinking about this around two weeks ago and have been lazily hacking at it since. After my REU program ended this summer, I realized that I had been writing pretty much exclusively MATLAB for the last year or so. That year destroyed whatever tiny amount of C/C++ coding chops I had (my background is not in CS), and my little weekend side projects in Python/Arduino were not helping. The choice of C++ as the implementation language reflects my need to refresh that knowledge--seriously, who writes web scrapers in C++ these days? I think this justifies the language choice as mental masturbation, not masochism.
I had basically zero knowledge of any of the non-standard libraries used in this project (libcurl, libtidy, libxml2, Boost) before I started. Everything was inspired by this single Stack Overflow post: http://stackoverflow.com/a/834879. Because of the odd mishmash of libraries used, the implementation is about 80% C++ with 20% C-style code thrown in to play nice with curl, tidy, and boost. A sketch of the resulting fetch-tidy-parse pipeline is below.
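For the curious, here is a minimal sketch of that pipeline in the same spirit as the Stack Overflow answer: fetch with libcurl, repair the tag soup into well-formed XHTML with libtidy, then hand it to libxml2. This is not the actual autofan code; the URL and option choices are just illustrative.

    #include <curl/curl.h>
    #include <tidy.h>
    #include <buffio.h>  // tidybuffio.h in newer tidy-html5 releases
    #include <libxml/parser.h>
    #include <iostream>
    #include <string>

    // libcurl write callback: append the fetched bytes to a std::string.
    static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata) {
        static_cast<std::string *>(userdata)->append(ptr, size * nmemb);
        return size * nmemb;
    }

    int main() {
        // 1. Fetch the raw HTML (URL is just an example).
        std::string html;
        CURL *curl = curl_easy_init();
        curl_easy_setopt(curl, CURLOPT_URL, "http://www.teamliquid.net/forum/");
        curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &html);
        CURLcode res = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
        if (res != CURLE_OK) {
            std::cerr << "curl failed: " << curl_easy_strerror(res) << "\n";
            return 1;
        }

        // 2. Let tidy turn the tag soup into well-formed XHTML.
        TidyDoc tdoc = tidyCreate();
        TidyBuffer out;
        TidyBuffer errbuf;
        tidyBufInit(&out);
        tidyBufInit(&errbuf);
        tidyOptSetBool(tdoc, TidyXhtmlOut, yes);
        tidyOptSetBool(tdoc, TidyForceOutput, yes);  // emit output even on errors
        tidySetErrorBuffer(tdoc, &errbuf);
        tidyParseString(tdoc, html.c_str());
        tidyCleanAndRepair(tdoc);
        tidySaveBuffer(tdoc, &out);

        // 3. Parse the repaired document with libxml2.
        xmlDocPtr doc = xmlReadMemory(reinterpret_cast<const char *>(out.bp),
                                      out.size, "page.xml", NULL,
                                      XML_PARSE_RECOVER | XML_PARSE_NOERROR);
        if (doc) {
            xmlNodePtr root = xmlDocGetRootElement(doc);
            if (root)
                std::cout << "parsed OK; root element: " << root->name << "\n";
            xmlFreeDoc(doc);
        }

        xmlCleanupParser();
        tidyBufFree(&out);
        tidyBufFree(&errbuf);
        tidyRelease(tdoc);
        return 0;
    }

The tidy step is what makes this workable at all: real-world forum HTML is rarely well-formed enough for a strict XML parser on its own.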
The funniest part of the code at the moment is the step I call the "post-table getting step". Basically, it traverses the syntax tree generated by libxml2 to find the "juicy table" (I was kind of out of it when coming up with names at this point), which consists of all of the posts in a given thread. The method I use to find this table is hilarious and ugly at the same time: since libcurl is not "logged in" to TL, the bottom of the posts contains the "Please log in or register to reply." field. I search for the smallest syntax subtree that contains this line AND the header of a forum post (e.g. [Username] PM ... Profile Quote...). Considering that this table is buried under dozens of layers, the method is surprisingly robust. A sketch of the search is below.
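Here is a hedged sketch of that "smallest subtree containing both markers" search over the libxml2 tree. The marker strings come from the description above; the traversal itself is my reconstruction for illustration, not the code from the repo.

    #include <libxml/parser.h>
    #include <cstring>

    // True if any text node in the sibling list rooted at `node`
    // (searched recursively) contains `needle`.
    static bool subtree_contains(xmlNodePtr node, const char *needle) {
        for (xmlNodePtr cur = node; cur != NULL; cur = cur->next) {
            if (cur->type == XML_TEXT_NODE && cur->content &&
                std::strstr(reinterpret_cast<const char *>(cur->content), needle))
                return true;
            if (subtree_contains(cur->children, needle))
                return true;
        }
        return false;
    }

    // Depth-first search for the smallest element whose subtree contains
    // both markers: if a child also contains both, the child wins.
    static xmlNodePtr smallest_with_both(xmlNodePtr node,
                                         const char *a, const char *b) {
        for (xmlNodePtr cur = node; cur != NULL; cur = cur->next) {
            if (cur->type != XML_ELEMENT_NODE)
                continue;
            if (subtree_contains(cur->children, a) &&
                subtree_contains(cur->children, b)) {
                xmlNodePtr deeper = smallest_with_both(cur->children, a, b);
                return deeper ? deeper : cur;
            }
        }
        return NULL;
    }

    // Usage (doc comes from xmlReadMemory, as in the previous sketch):
    //   xmlNodePtr juicy = smallest_with_both(
    //       xmlDocGetRootElement(doc),
    //       "Please log in or register to reply.",  // logged-out footer
    //       "Profile Quote");                       // post-header text

Rescanning subtrees like this is quadratic in the worst case, but for a single thread page it's plenty fast, and it survives layout nesting that a hard-coded path would not.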
Often I find that what I learn during these projects ends up being more interesting than the results--and here's a brief selection of some things so far:
- Makefiles are amazing
- Good initial design goes a long way
- Valgrind is your salvation in C++
- Make (or rather, the linker it invokes) will search the obvious library directories by default
- Pay attention to ugly/non-intuitive include directory layouts
- Playing nice with C-style code may mean you need to break encapsulation
- Declaring a parameter with a type but no name tells the compiler you're intentionally not using it (example below)
- Header-only libraries exist
- g++ seems to dislike = {0}, but will stop complaining if you use the lower-level memset() instead (example below)
- If a library's options are poorly documented, just fucking read the source and figure out what's available
- HTML has ugly entities like &nbsp; that will break XML parsers
- TL's site layout is really complicated (seriously, look at the source for this page)
- References to single characters are not null-terminated, and will riddle your strings with garbage if you treat them as C strings (example below)
- Unprintable characters may ruin your day
- If you're getting a pointer to a data structure from a function, you've got to ask yourself one question: do you feel lucky? Was the data allocated before you called the function? If not, you better delete that shit.
- malloc and delete do not mix well, and neither do new and free. Check your API to see how you should dispose of buffers.
- If you use std::string, expect Valgrind to find "still-reachable" leaks
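A few of those are easier to see than to describe, so here is a small, self-contained illustration of the unnamed-parameter trick, the = {0} vs. memset() issue, and the single-character-reference gotcha. All the names here are made up for the example.

    #include <cstring>
    #include <iostream>

    struct Config {
        int retries;
        int timeout;
    };

    // Unnamed parameter: the int stays in the signature (e.g. to match a
    // callback type), but with no name g++ won't warn that it's unused.
    static void on_event(int /* event_code */, const char *msg) {
        std::cout << msg << "\n";
    }

    int main() {
        // Zero-initialization: "Config cfg = {0};" can draw a "missing
        // initializer" warning from g++, while memset() stays quiet.
        Config cfg;
        std::memset(&cfg, 0, sizeof(cfg));

        // A pointer or reference to a single character is NOT a C string:
        // nothing guarantees a terminating '\0' after it.
        char c = 'x';
        std::cout << c << "\n";     // fine: prints the character
        // std::cout << &c << "\n"; // NOT fine: keeps reading past c and
        //                          // riddles your output with garbage

        on_event(cfg.retries, "done");
        return 0;
    }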