autofan: A tool for updating fan clubs

Loser777
Profile Blog Joined January 2008
1931 Posts
Last Edited: 2013-09-06 01:38:33
September 06 2013 01:08 GMT
#1
The member lists of most fan clubs are, at the moment, unmanageable. Some of the larger ones haven't had their lists updated in over two years.

I'm currently working on a web scraper which would allow these lists to be updated automatically, without any manual labor.
https://github.com/eqy/autofan

The current development environment is Arch GNU/Linux, and I've ported it over to Ubuntu 12.04 with almost no issues (after fixing some compatibility issues with Ubuntu's slightly staler libraries). If you'd like to build it, simply clone the repository and make sure you have the libraries specified in the README. If you encounter any problems trying to build, let me know.

It is trivial to build: [animated-gif]
http://i.imgur.com/I4EmOgX.gif

And trivial to use: [animated-gif]
http://i.imgur.com/H05aOgQ.gif


It's still in the very early stages of development, so there are still a number of obvious issues:
  • The rules for adding people to fan clubs are still very simple at the moment, and there are only a few of them.
  • People can be added multiple times. The most obvious cause of this is quotes (see below). This is trivially fixable with any number of tools (e.g. sort -u, regex, etc.).
  • Quotes are not handled well--the parser can't distinguish what someone actually said and what the person they're quoting said. This is a slightly annoying issue, because despite the copious amounts of tables that TL uses, quotes are not explicitly tables--so the parser that I'm using doesn't identify quotes as separate elements. The most obvious solution I see at the moment is to do some kind of regex matching where posts that say "...wrote:" are checked against previous posts. Finding posts that likely contain quotes is trivial, but the matching step would be computationally expensive.
  • It still spits out mountains of debug information to stderr (actual output goes to stdout) because I still need the debug information to add features/fix the issues above.
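
A minimal sketch of the two fixes mentioned in the list above: spotting lines that look like quote headers, and dropping duplicate names. This is not code from autofan. The helper names `looks_like_quote_header` and `unique_members` are hypothetical, and the regex assumes the header format seen in this thread ("On <date> <time> <username> wrote:") with a username that contains no spaces.

```cpp
#include <regex>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: true if a line looks like a TL quote header, e.g.
// "On September 06 2013 10:08 Loser777 wrote:".
// Assumption: the username is a single token without spaces.
bool looks_like_quote_header(const std::string& line) {
    static const std::regex header(R"(^On .+ \d{2}:\d{2} \S+ wrote:\s*$)");
    return std::regex_match(line, header);
}

// Hypothetical helper: removes duplicate names while keeping first-seen order
// (unlike sort -u, which also sorts the output).
std::vector<std::string> unique_members(const std::vector<std::string>& names) {
    std::set<std::string> seen;
    std::vector<std::string> result;
    for (const std::string& name : names) {
        // insert() reports whether the name was new to the set.
        if (seen.insert(name).second) {
            result.push_back(name);
        }
    }
    return result;
}
```

Finding candidate lines this way is the cheap part. Matching the quoted text against earlier posts is the expensive step and is not attempted here.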


The plan is either to port it over to Windows eventually, after slapping on some sort of Qt-based interface so it becomes usable for the 99% of TL, or to launch it as some sort of web app. Both are equally foreign territory for me.
If you feel like contributing, feel free to send me a pull request! When development quiets down, I also plan on repackaging the core of this project (a library called tltopic) so that it can be used for other projects that want to parse TL threads.

History of the project:
(I don't have db access scraping is all I can do #YOLO)

I began thinking about this around two weeks ago and have been lazily hacking at it since. After my REU program ended this summer, I realized that I had pretty much been writing exclusively MATLAB for the last year or so. That year or so destroyed whatever tiny amount of coding chops in C/C++ I had (my background is not in CS), and my little weekend side projects in Python/Arduino were not helping. The implementation language of C++ reflects my need to refresh my knowledge--seriously, who writes web scrapers in C++ these days? I think this justifies the language choice as mental masturbation, not masochism.

I basically had zero knowledge of any of the non-standard libraries I used in this project (libcurl, libtidy, libxml2, boost) before I started. Everything was inspired by this single Stack Overflow post: http://stackoverflow.com/a/834879. Because of the odd mishmash of libraries used, the implementation is about 80% C++ with 20% of C-style code thrown in to play nice with curl, tidy, and boost.

The funniest part of the code at the moment is the step I call the "post-table getting step". Basically, it traverses the syntax tree generated by libxml2 to find the "juicy table" (I was kind of out of it when coming up with names at this point), which consists of all of the posts in a given thread. The method I used to find this table is both hilarious and ugly at the same time. Basically, as libcurl is not "logged in" to TL, the bottom of the posts contains that "Please log in or register to reply." field. I search for the smallest syntax tree that contains this line AND the header to a forum post (e.g. [Username] PM ... Profile Quote...). Considering that this table is buried beneath dozens of layers, this method is surprisingly robust.
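
The "smallest subtree containing both markers" idea can be sketched on a toy tree. This is not the autofan code and does not use libxml2. `Node`, `add_child`, `scan`, and `smallest_subtree_with` are hypothetical names, and "smallest" here means the first node found, in a children-first walk, whose subtree contains both marker strings.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for a parsed HTML node (assumption: the real project walks
// a libxml2 tree instead).
struct Node {
    std::string text;                            // text directly inside this node
    std::vector<std::unique_ptr<Node>> children;
};

// Convenience for building toy trees; returns the new child.
Node* add_child(Node& parent, const std::string& text) {
    parent.children.push_back(std::unique_ptr<Node>(new Node()));
    parent.children.back()->text = text;
    return parent.children.back().get();
}

struct Seen {
    bool a = false;
    bool b = false;
};

// Post-order walk: children are examined before their parent, so the first
// node found to contain both markers has no descendant that also does.
Seen scan(const Node& node, const std::string& a, const std::string& b,
          const Node*& best) {
    Seen s;
    s.a = node.text.find(a) != std::string::npos;
    s.b = node.text.find(b) != std::string::npos;
    for (const auto& child : node.children) {
        Seen c = scan(*child, a, b, best);
        s.a = s.a || c.a;
        s.b = s.b || c.b;
    }
    if (s.a && s.b && best == nullptr) {
        best = &node;
    }
    return s;
}

// Returns the minimal subtree containing both markers, or nullptr if none.
const Node* smallest_subtree_with(const Node& root, const std::string& a,
                                  const std::string& b) {
    const Node* best = nullptr;
    scan(root, a, b, best);
    return best;
}
```

With the login prompt as one marker and a piece of the post header as the other, the node returned is the innermost container holding both, however many layout layers wrap it.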

Often I find that what I learn during these projects ends up being more interesting than the results--and here's a brief selection of some things so far:
  • Makefiles are amazing
  • Good initial design goes a long way
  • Valgrind is your salvation in C++
  • Make will include obvious directories for libraries by default
  • Pay attention to ugly/non-intuitive include directory layouts
  • Playing nice with C-style code may mean you need to break encapsulation
  • Give the type of an argument that will be passed but no name to let the compiler know you're intentionally not using it
  • Header-only libraries exist
  • g++ seems to dislike = {0}, but will stop complaining if you use the lower level memset() instead
  • If the API poorly documents the options in a library, just fucking read the source and figure out what's available
  • HTML has ugly entities like &nbsp; that will break XML parsers
  • TL's site layout is really complicated (seriously, look at the source for this page)
  • References to single characters are not null-terminated, and will riddle your strings with garbage if you expect them to be
  • Unprintable characters may ruin your day
  • If you're getting a pointer to a data structure from a function, you've got to ask yourself one question. Do you feel lucky? Was the data allocated before you called the function? If not, you better delete that shit.
  • Malloc and delete do not mix well, and neither do new and free. Check your API to see how you should dispose of buffers.
  • If you use std::string, expect Valgrind to find "still-reachable" leaks
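
A few of the lessons above, sketched in isolation. None of this is taken from autofan. The function names are hypothetical, and `count_bytes` only imitates the general shape of a C callback rather than any specific library's signature.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Leaving a parameter unnamed (here the name is commented out) tells the
// compiler it is deliberately unused, which avoids unused-parameter warnings
// when a callback signature is fixed by someone else's API.
std::size_t count_bytes(const char* /*data*/, std::size_t size,
                        std::size_t nmemb, void* /*userdata*/) {
    return size * nmemb;
}

// The address of a single char is not a C string: nothing guarantees a '\0'
// after it. Pass an explicit length instead of relying on a terminator.
std::string char_to_string(const char& c) {
    return std::string(&c, 1);
}

// Memory obtained from malloc() goes back through free(), never delete.
std::string copy_and_release() {
    char* buffer = static_cast<char*>(std::malloc(6));
    if (buffer == nullptr) {
        return std::string();
    }
    std::memcpy(buffer, "hello", 6);  // 5 characters plus the terminator
    std::string copy(buffer);         // std::string makes its own copy
    std::free(buffer);                // release with the matching function
    return copy;
}
```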





GHOSTCLAW
Profile Blog Joined February 2008
United States17042 Posts
September 06 2013 01:57 GMT
#2
good luck ^^

On September 06 2013 10:08 Loser777 wrote:
(I don't have db access scraping is all I can do #YOLO)


5/5
PhotographerLiquipedia. Drop me a pm if you've got questions/need help.
tarpman
Profile Joined February 2009
Canada719 Posts
Last Edited: 2013-09-06 02:02:50
September 06 2013 02:01 GMT
#3
Nice work, and neat program! I've recently used libcurl and libxml2 at work, so I was interested in reading your code.

Is there a reason why you're searching the DOM manually instead of running an XPath query? I only played with xmllint for a few seconds, but I feel like something like
//td[@class="forumPost"]/text()
ought to match what you're looking for.

Concerning the struct initialization: as I understand it the usual way to do it in C++ is
TidyBuffer output = {};
which will initialize all members to their default values, rather than 0. When you just have data members it doesn't matter, but if your struct contains objects they might need to be initialized. See this Stack Overflow question for more.
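
For illustration, a small sketch comparing the two initialization styles on a plain C-style struct. `PlainBuffer` is a hypothetical type, not the real TidyBuffer definition. For a struct with only scalar and pointer members, both approaches leave every member zeroed on mainstream platforms.

```cpp
#include <cstring>

// Hypothetical C-style struct with only plain data members.
struct PlainBuffer {
    unsigned char* data;
    unsigned int size;
    unsigned int allocated;
};

PlainBuffer make_with_braces() {
    PlainBuffer b = {};  // value-initialization: each member gets its zero value
    return b;
}

PlainBuffer make_with_memset() {
    PlainBuffer b;
    // Byte-wise zeroing: fine for plain data, but it would bypass constructors
    // if the struct ever gained members of class type.
    std::memset(&b, 0, sizeof b);
    return b;
}
```

The difference only shows up once the struct holds objects with constructors, which is where `= {}` remains correct and memset does not.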
Saving the world, one kilobyte at a time.
Loser777
Profile Blog Joined January 2008
1931 Posts
September 06 2013 02:21 GMT
#4
On September 06 2013 11:01 tarpman wrote:
Nice work, and neat program! I've recently used libcurl and libxml2 at work, so I was interested in reading your code.

Is there a reason why you're searching the DOM manually instead of running an XPath query? I only played with xmllint for a few seconds, but I feel like something like
//td[@class="forumPost"]/text()
ought to match what you're looking for.

Concerning the struct initialization: as I understand it the usual way to do it in C++ is
TidyBuffer output = {};
which will initialize all members to their default values, rather than 0. When you just have data members it doesn't matter, but if your struct contains objects they might need to be initialized. See this Stack Overflow question for more.

Hmm, I didn't know about that part of API for libxml2 yet--I'll have to check that out! It could potentially fix the quote problem I was having!

As for the variable initialization, it probably doesn't matter as it's just a buffer that's going to get overwritten anyway. As it's not an object, I prefer using memset. Given that the tidy implementation appears to be strictly C, I don't think the struct is going to contain any objects.