(works, but code is somewhat messy, needs tons of whitespace changes)
Tested build on Arch 3.10.10-1 and Ubuntu 12.04.2 (3.2.0-51).
Changes
- Post parsing is now done with XPath
- Quotes do not interfere with processing
- Redundant usernames do not appear in output
Following tarpman's suggestion to navigate the parsed XML with XPath instead of the unreadable, unmaintainable mess of name- and content-based approaches, I've successfully merged my experimental XPath branch into the master branch of the repo! The XPath-based approach allows for cleaner parsing and, most importantly, easy handling and removal of quotes within posts.
During the switch to XPath, there was a slightly annoying issue that led me to do this:
//We have a choice here between keeping the code simple, or keeping it pretty.
//This is because tidy does not like the use of span and div in the post header
//in the original page source, so it inserts an empty span with the same
//attribute (forummsginfo) as the offending span. This causes libxml2 to double
//count post headers and leads to a mess. We avoid this by using an uglier
//(table) XPath expression.
//Should the span-based XPath become necessary at some point, the fix would
//be to discard any post headers that are empty, as the tidy-generated spans
//are empty.
//const char * tltopic::POST_HEADER_XPATH = "//span[@class='forummsginfo']";
const char * tltopic::POST_HEADER_XPATH = "//td[@valign='top' and @class='titelbalk']";
Otherwise, most of the other changes were pretty straightforward. Quotes are handled by a flag in tltopic objects: when it is set, the nodes matched by the XPath expression for quotes are effectively nuked. Protip: always work from the end to the beginning of a set of nodes when using xmlNodeSetContent:
for (q = n_quotes - 1; q >= 0; q--)
{
    xmlNodeSetContent(quote_object->nodesetval->nodeTab[q], (xmlChar *) "");
}
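Working from beginning to end leads to nasty behavior involving nodes being freed twice due to the internal handling of the node tree. My best guess is that clearing an outer quote also frees any quotes nested inside it, and those nested nodes still sit later in the node set.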
Redundant usernames were eliminated with a lookup table (I was lazy, so it's a std::map rather than a proper hash table).
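For illustration, here is a minimal sketch of that kind of de-duplication. It assumes the poster names have already been pulled out of the page into a list of strings; unique_usernames and its parameter are hypothetical names, not code from the repo, and the real implementation may well look different.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the repo): returns each username once,
// in order of first appearance in the topic.
std::vector<std::string> unique_usernames(const std::vector<std::string>& posters)
{
    // Keys are the usernames seen so far; the bool value is unused filler.
    std::map<std::string, bool> seen;
    std::vector<std::string> result;

    for (const std::string& name : posters)
    {
        // insert() returns a pair whose .second is true only when the key
        // was not already present, so repeat posters are skipped.
        if (seen.insert(std::make_pair(name, true)).second)
        {
            result.push_back(name);
        }
    }
    return result;
}
```

Feeding it the posters alice, bob, alice, carol, bob gives back alice, bob, carol. Swapping in std::set or std::unordered_set would work just as well; the map is only there to mirror what the post describes.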
The next step is to learn basic web development to make this widely accessible (I figured porting to Windows isn't worth the investment). I currently know what HTML tags are. JavaScript + whatever else, here we come!
Who doesn't enjoy some shameless mental masturbation?