autofan: A tool for updating fan clubs

Loser777
Profile Blog Joined January 2008
1931 Posts
Last Edited: 2013-09-06 01:38:33
September 06 2013 01:08 GMT
#1
The member lists of most fan clubs are, at the moment, unmanageable. Some of the larger ones haven't had their lists updated in over two years.

I'm currently working on a web scraper which would allow these lists to be updated automatically, without any manual labor.
https://github.com/eqy/autofan

The current development environment is Arch GNU/Linux, and I've ported it over to Ubuntu 12.04 with almost no trouble (after fixing a few compatibility issues with Ubuntu's slightly older libraries). If you'd like to build it, simply clone the repository and make sure you have the libraries specified in the README. If you encounter any problems trying to build, let me know.

It is trivial to build: [animated-gif]
http://i.imgur.com/I4EmOgX.gif

And trivial to use: [animated-gif]
http://i.imgur.com/H05aOgQ.gif


It's still in the very early stages of development, so there are still a number of obvious issues:
  • The rules for adding people to fan clubs are still very simple at the moment, and there are only a few of them.
  • People can be added multiple times. The most obvious cause of this is quotes (see below). This is trivially fixable with any number of tools (e.g. sort -u, regex, etc.).
  • Quotes are not handled well--the parser can't distinguish between what someone actually said and what the person they're quoting said. This is a slightly annoying issue, because despite the copious number of tables that TL uses, quotes are not explicitly tables--so the parser that I'm using doesn't identify quotes as separate elements. The most obvious solution I see at the moment is to do some kind of regex matching where posts that say "...wrote:" are checked against previous posts. Finding posts that likely contain quotes is trivial, but the matching step would be computationally expensive.
  • It still spits out mountains of debug information to stderr (actual output goes to stdout) because I still need the debug information to add features/fix the issues above.
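
As a rough illustration of the two fixes mentioned in the list above, here is a minimal C++ sketch. It is not code from autofan: the function names quoted_user and unique_members are made up for this example, and the regex assumes quote headers look like the ones visible in this thread ("On September 06 2013 10:08 Loser777 wrote:"). It only spots quote headers and removes repeated names; it does not attempt the expensive step of matching quoted text against earlier posts.

```cpp
#include <regex>
#include <set>
#include <string>
#include <vector>

// If `line` looks like a quote header ("On <month> <day> <year> <hh:mm> <user> wrote:"),
// return the quoted user's name; otherwise return an empty string.
// The header layout is an assumption based on the quotes visible in this thread.
std::string quoted_user(const std::string& line) {
    static const std::regex header(
        R"(^On [A-Za-z]+ \d{1,2} \d{4} \d{2}:\d{2} (.+) wrote:$)");
    std::smatch m;
    return std::regex_match(line, m, header) ? m[1].str() : std::string();
}

// Drop repeated names while keeping the order in which they were first seen.
std::vector<std::string> unique_members(const std::vector<std::string>& names) {
    std::set<std::string> seen;
    std::vector<std::string> out;
    for (const auto& name : names) {
        if (seen.insert(name).second) out.push_back(name);  // true only on first insert
    }
    return out;
}
```

Calling quoted_user on each line of a post would flag the lines that start a quote, and unique_members would replace the external sort -u pass when order matters.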


The plan is either to port it over to Windows after slapping on some sort of Qt-based interface, so it becomes usable for the 99% of TL, or to launch it as some sort of web app. Both are equally foreign territory for me.
If you feel like contributing, feel free to send me a pull request! When development quiets down, I also plan on repackaging the core of this project (a library called tltopic) so that it can be used for other projects that want to parse TL threads.

History of the project:
(I don't have db access scraping is all I can do #YOLO)

I began thinking about this around two weeks ago and have been lazily hacking at it since. After my REU program ended this summer, I realized that I had pretty much been writing exclusively MATLAB for the last year or so. That year or so destroyed whatever tiny amount of coding chops in C/C++ I had (my background is not in CS), and my little weekend side projects in Python/Arduino were not helping. The implementation language of C++ reflects my need to refresh my knowledge--seriously, who writes web scrapers in C++ these days? I think this justifies the language choice as mental masturbation, not masochism.

I basically had zero knowledge of any of the non-standard libraries I used in this project (libcurl, libtidy, libxml2, boost) before I started. Everything was inspired by this single Stack Overflow post: http://stackoverflow.com/a/834879. Because of the odd mishmash of libraries used, the implementation is about 80% C++ with 20% of C style code thrown in to play nice with curl, tidy, and boost.

The funniest part of the code at the moment is the step I call the "post-table getting step". Basically, it traverses the syntax tree generated by libxml2 to find the "juicy table" (I was kind of out of it when coming up with names at this point), which consists of all of the posts in a given thread. The method I used to find this table is both hilarious and ugly at the same time. Basically, as libcurl is not "logged in" to TL, the bottom of the posts contains that "Please log in or register to reply." field. I search for the smallest syntax tree that contains this line AND the header to a forum post (e.g. [Username] PM ... Profile Quote...). Considering that this table is buried beneath dozens of layers, this method is surprisingly robust.
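
The "smallest subtree containing both markers" idea can be sketched without libxml2 at all. The snippet below is only an illustration under stated assumptions: Node is a hypothetical stand-in for a parsed HTML element, not a libxml2 type, and smallest_subtree is a made-up name rather than the function autofan actually uses. It recomputes subtree text at every level, so it favors clarity over speed.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed HTML element (not a libxml2 type).
struct Node {
    std::string text;            // text directly inside this element
    std::vector<Node> children;  // nested elements
};

// Concatenate all text in the subtree rooted at `n`, one piece per line.
std::string subtree_text(const Node& n) {
    std::string out = n.text + "\n";
    for (const auto& child : n.children) out += subtree_text(child);
    return out;
}

// True if the subtree rooted at `n` contains both marker strings somewhere.
bool contains_both(const Node& n, const std::string& a, const std::string& b) {
    const std::string t = subtree_text(n);
    return t.find(a) != std::string::npos && t.find(b) != std::string::npos;
}

// Return the deepest node whose subtree still contains both markers,
// or nullptr if the markers never appear together under `n`.
const Node* smallest_subtree(const Node& n, const std::string& a, const std::string& b) {
    if (!contains_both(n, a, b)) return nullptr;
    for (const auto& child : n.children) {
        if (const Node* hit = smallest_subtree(child, a, b)) return hit;
    }
    return &n;  // no single child holds both markers, so this node is the smallest
}
```

With the markers "Profile Quote" and "Please log in or register to reply.", the returned node would be the innermost element that wraps both a post header and the login prompt, however many layers of layout tables sit above it.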

Often I find that what I learn during these projects ends up being more interesting than the results--and here's a brief selection of some things so far:
  • Makefiles are amazing
  • Good initial design goes a long way
  • Valgrind is your salvation in C++
  • Make will include obvious directories for libraries by default
  • Pay attention to ugly/non-intuitive include directory layouts
  • Playing nice with C-style code may mean you need to break encapsulation
  • Give the type of an argument that will be passed but no name to let the compiler know you're intentionally not using it
  • Header-only libraries exist
  • g++ seems to dislike = {0}, but will stop complaining if you use the lower level memset() instead
  • If the API poorly documents the options in a library, just fucking read the source and figure out what's available
  • HTML has ugly entities like &nbsp; that will break XML parsers
  • TL's site layout is really complicated (seriously, look at the source for this page)
  • References to single characters are not null-terminated, and will riddle your strings with garbage if you expect them to be
  • Unprintable characters may ruin your day
  • If you're getting a pointer to a data structure from a function, you've got to ask yourself one question. Do you feel lucky? Was the data allocated before you called the function? If not, you better delete that shit.
  • malloc and delete do not mix well, and neither do new and free. Check your API to see how you should dispose of buffers.
  • If you use std::string, expect Valgrind to find "still-reachable" leaks
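
A few of the C/C++ lessons in the list above are easier to see in code. The following is a small sketch, not taken from autofan: RawBuffer, make_empty_buffer, from_char, copy_c_string, and count_bytes are all hypothetical names invented for illustration, and RawBuffer merely stands in for the kind of plain struct a C library might hand you.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical plain C-style buffer, standing in for a C library's struct.
struct RawBuffer {
    char* data;
    std::size_t size;
};

// Empty braces value-initialize every member, so no memset() is needed.
RawBuffer make_empty_buffer() {
    RawBuffer buf = {};
    return buf;
}

// A single char is not a C string: build the std::string with an explicit
// length instead of handing its address to code that expects a terminator.
std::string from_char(char c) {
    return std::string(1, c);
}

// Memory obtained from malloc() must be released with free(), never delete.
struct FreeDeleter {
    void operator()(void* p) const { std::free(p); }
};
using CString = std::unique_ptr<char, FreeDeleter>;

// Copy a C string into malloc()'d storage that frees itself correctly.
CString copy_c_string(const char* s) {
    const std::size_t n = std::strlen(s) + 1;  // include the terminator
    CString out(static_cast<char*>(std::malloc(n)));
    if (out) std::memcpy(out.get(), s, n);
    return out;
}

// The second parameter has a type but no name, which tells the compiler
// that it is intentionally unused (handy for C callback signatures).
std::size_t count_bytes(std::size_t size, void*) {
    return size;
}
```

Wrapping the malloc()'d pointer in a smart pointer with a matching deleter is one way to make the "check how to dispose of buffers" rule hard to get wrong; for buffers owned by a C library, the deleter would call that library's own release function instead.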

GHOSTCLAW
Profile Blog Joined February 2008
United States17042 Posts
September 06 2013 01:57 GMT
#2
good luck ^^

On September 06 2013 10:08 Loser777 wrote:
(I don't have db access scraping is all I can do #YOLO)


5/5
PhotographerLiquipedia. Drop me a pm if you've got questions/need help.
tarpman
Profile Joined February 2009
Canada720 Posts
Last Edited: 2013-09-06 02:02:50
September 06 2013 02:01 GMT
#3
Nice work, and neat program! I've recently used libcurl and libxml2 at work, so I was interested in reading your code.

Is there a reason why you're searching the DOM manually instead of running an XPath query? I only played with xmllint for a few seconds, but I feel like something like
//td[@class="forumPost"]/text()
ought to match what you're looking for.

Concerning the struct initialization: as I understand it the usual way to do it in C++ is
TidyBuffer output = {};
which will initialize all members to their default values, rather than 0. When you just have data members it doesn't matter, but if your struct contains objects they might need to be initialized. See this Stack Overflow question for more.
Saving the world, one kilobyte at a time.
Loser777
Profile Blog Joined January 2008
1931 Posts
September 06 2013 02:21 GMT
#4
On September 06 2013 11:01 tarpman wrote:
Nice work, and neat program! I've recently used libcurl and libxml2 at work, so I was interested in reading your code.

Is there a reason why you're searching the DOM manually instead of running an XPath query? I only played with xmllint for a few seconds, but I feel like something like
//td[@class="forumPost"]/text()
ought to match what you're looking for.

Concerning the struct initialization: as I understand it the usual way to do it in C++ is
TidyBuffer output = {};
which will initialize all members to their default values, rather than 0. When you just have data members it doesn't matter, but if your struct contains objects they might need to be initialized. See this Stack Overflow question for more.

Hmm, I didn't know about that part of the libxml2 API yet--I'll have to check that out! It could potentially fix the quote problem I was having!

As for the variable initialization, it probably doesn't matter, as it's just a buffer that's going to get overwritten anyway. As it's not an object, I prefer using memset. Given that the tidy implementation appears to be strictly C, I don't think the struct is going to contain any objects.