[Programming Blog] autofan v0.05

Loser777

1931 Posts

September 10 2013 02:28 GMT

https://github.com/eqy/autofan
(works, but code is somewhat messy, needs tons of whitespace changes)
Tested build on Arch 3.10.10-1 and Ubuntu 12.04.2 (3.2.0-51).

Changes

Post parsing is now done with XPath
Quotes do not interfere with processing
Redundant usernames do not appear in output

Following tarpman's suggestion to use XPath to navigate the parsed XML instead of the unreadable/unmaintainable mess of name/content-based approaches, I've successfully merged my experimental XPath branch into the master branch of the repo! The XPath-based approach allows for cleaner parsing and, most importantly, easy handling/removal of quotes within posts.

During the switch to XPath, there was a slightly annoying issue that led me to do this:


//We have a choice here between keeping the code simple and keeping it pretty.
//tidy does not like the use of span and div in the post header of the original
//page source, so it inserts an empty span with the same attribute
//(forummsginfo) as the offending span. This causes libxml2 to double count
//post headers and leads to a mess. We avoid this by using an uglier
//(table-based) XPath expression.
//Should the span-based XPath become necessary at some point, the fix would be
//to discard any post headers that are empty, since the tidy-generated spans
//are empty.
//const char * tltopic::POST_HEADER_XPATH = "//span[@class='forummsginfo']";
const char * tltopic::POST_HEADER_XPATH = "//td[@valign='top' and @class='titelbalk']";
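
The fallback mentioned in that comment (throwing away the empty headers tidy generates) could look roughly like the sketch below. This is not code from the repo: it assumes the text of each matched header has already been pulled out into a std::string, and both function names are made up for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true if the string is empty or only whitespace.
bool is_blank(const std::string& s)
{
    return s.find_first_not_of(" \t\r\n") == std::string::npos;
}

// Hypothetical helper: keeps only headers that carry text, which drops the
// empty spans that tidy inserts next to the real post headers.
std::vector<std::string> drop_empty_headers(const std::vector<std::string>& headers)
{
    std::vector<std::string> kept;
    for (const std::string& header : headers)
    {
        if (!is_blank(header))
        {
            kept.push_back(header);
        }
    }
    return kept;
}
```

In the real code the same check would presumably be applied to the libxml2 nodes themselves rather than to copied strings.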

Otherwise, most of the other changes were pretty straightforward. Quotes are handled by a flag in tltopic objects for ignoring quotes; when set, it effectively nukes the nodes matched by the XPath expression for quotes. Protip: always work from the end to the beginning of a set of nodes when using xmlNodeSetContent:


for (q = n_quotes-1; q >= 0; q--)
{
    xmlNodeSetContent(quote_object->nodesetval->nodeTab[q], (xmlChar *) "");
}

Working from beginning to end leads to nasty behavior involving nodes being freed twice due to the internal handling of the node tree.
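
My best guess at why the order matters, assuming the problem comes from quotes nested inside other quotes: the matches come back in document order, so an outer quote sits before the quotes inside it, and replacing the outer quote's content frees those inner nodes while the node set still points at them. The toy model below is not libxml2 and not code from the repo (ToyNode, clear_content and clear_all_reverse are invented names); it only mimics the ownership pattern to show why going last to first is safe.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for an XML node: a node owns its children, so clearing it
// destroys every descendant.
struct ToyNode
{
    std::string text;
    std::vector<std::unique_ptr<ToyNode>> children;
};

// Rough analogue of setting a node's content to "": drop the text and free
// all descendants.
void clear_content(ToyNode* node)
{
    node->text.clear();
    node->children.clear();
}

// `matches` plays the role of nodeTab: raw pointers in document order, so an
// outer quote always appears before any quote nested inside it.
void clear_all_reverse(const std::vector<ToyNode*>& matches)
{
    // Last to first: a nested quote is cleared while it is still alive, and
    // only afterwards is its parent cleared (which frees the nested node).
    for (std::size_t i = matches.size(); i-- > 0;)
    {
        clear_content(matches[i]);
    }
}
```

Walking the same list first to last would clear the outer quote first, leaving the later entries pointing at nodes that no longer exist.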

Redundant usernames were eliminated with a lookup table (I was lazy, so it's a std::map rather than a proper hash table).
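
For anyone curious, the idea is roughly the following. This is a minimal sketch rather than the code in the repo; unique_usernames is a made-up name, and keeping names in first-seen order is my assumption about the desired output.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: returns each username once, in order of first
// appearance. std::map is an ordered tree, not a hash table;
// std::unordered_set would be the closer match to a real hash table.
std::vector<std::string> unique_usernames(const std::vector<std::string>& names)
{
    std::map<std::string, bool> seen;
    std::vector<std::string> result;
    for (const std::string& name : names)
    {
        // emplace() reports through .second whether the key was newly added.
        if (seen.emplace(name, true).second)
        {
            result.push_back(name);
        }
    }
    return result;
}
```

Lookups in std::map cost O(log n) instead of the average O(1) of a hash table, which hardly matters for the number of posters in a single thread.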

The next step is to learn basic web development to make this widely accessible (I figured that porting to Windows isn't worth the investment). I currently know what HTML tags are. JavaScript+whatever else, here we come!
Who doesn't enjoy some shameless mental masturbation?
