Saturday, July 23, 2011

Debugging Nutch Parse Filter.

Introduction:
I heard of Apache Nutch, an open source crawler (well, actually it can do more than just crawl: it can also index and search the stored results; more info is available at link). My curiosity to try it out led me through its code internals, which I would like to share in this post. In particular, I will share the debugging experience I gained by stepping through its parse segment internals.

Background:
I downloaded the latest stable code (1.3 at the time of this post) from link. The custom plugin wiki is available at link. At the time of writing, that tutorial has not yet been updated for Nutch 1.3. One can also integrate Nutch into Eclipse using the instructions at link, which are not yet complete either. With little help available, I went the logging route: I traced from the logs what code was invoked. As of 1.3, Nutch can be run in two modes, local and deploy. I ran the code in local mode, i.e. invoking nutch commands from the command line inside the directory <NUTCH_HOME_DIR>/runtime/local.

What was I trying to achieve?
I was trying to crawl/parse an Indian recipe website. From each recipe page on this website, I wanted to extract the ingredients. There are two ways of going about this extraction: implement Parser or implement HtmlParseFilter. I chose HtmlParseFilter, since the Parser side is already taken care of by the HtmlParser plugin.

Approach:
From the wiki, I found I needed three things to get my code to work: the Java class that actually does the parsing (TagExtractorParseFilter), a plugin.xml, and a build.xml. More details on how to write a custom plugin can be found at link (I will add the link as soon as I complete that post).
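
For orientation, the skeleton of such a filter class looks roughly like the sketch below. This is my own minimal sketch, assuming the Nutch 1.3 package layout; only the filter() signature comes from the snippets later in this post, and since HtmlParseFilter is Configurable the class also has to hold on to the Configuration it is given.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class TagExtractorParseFilter implements HtmlParseFilter {

 private Configuration conf;

 // Called by HtmlParser for every page it parses; return the (possibly modified) parseResult.
 @Override
 public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) {
  // extraction logic goes here (see the snippets below)
  return parseResult;
 }

 // HtmlParseFilter is Configurable, so the plugin must expose its Configuration.
 public void setConf(Configuration conf) {
  this.conf = conf;
 }

 public Configuration getConf() {
  return conf;
 }
}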
My first and main roadblock was that my custom code was not being invoked. I saw the following in hadoop.log (located at NUTCH_HOME_DIR/runtime/local/logs/hadoop.log). I also changed the log level to DEBUG to get more log information.

2011-07-23 00:33:37,592 WARN  parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
2011-07-23 11:21:53,547 WARN  parse.ParseUtil - Unable to successfully parse content http://www.tarladalal.com/ of type application/xhtml+xml

The first log line was fairly intuitive and easy to resolve: my parse-plugins.xml mapped the content type "application/xhtml+xml" to HtmlParser, but the parse-html plugin's plugin.xml did not declare that content type. Declaring application/xhtml+xml among the content types in parse-html's plugin.xml made the first warning go away (in my case it really was an error, since I needed the xhtml content type to invoke the HtmlParser code).

For the second log line, I had to put log statements into the code, starting from ParseUtil -> HtmlParser -> HtmlParseFilters -> TagExtractorParseFilter (my custom code). My entry point was ParseUtil, simply because the log statement in hadoop.log mentioned it.

ParseUtil internally uses ParserFactory (the Factory design pattern) to get the appropriate parser based on content type (note that this content type is mapped in the parse-plugins.xml file and should match the content type in the plugin's plugin.xml). Once ParseUtil gets hold of the appropriate parser (HtmlParser in my case), it calls
parseResult = p.getParse(content)
HtmlParser.getParse does a bunch of work before calling the filters, which are invoked by
ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult, metaTags, root);
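
Putting these pieces together, the ParseUtil side of the flow looks roughly like the sketch below. This is my own simplified approximation from tracing the code, not the literal Nutch 1.3 source; the getParsers() call and the isEmpty() check are how I remember them, so treat the exact names as assumptions, and exception handling (e.g. ParserNotFound) is omitted.

// Simplified approximation of the loop inside ParseUtil.parse() (illustrative only):
Parser[] parsers = parserFactory.getParsers(content.getContentType(), content.getUrl());
for (int i = 0; i < parsers.length; i++) {
 // the kind of trace statement I added while debugging
 LOG.debug("Parsing " + content.getUrl() + " with " + parsers[i].getClass().getName());
 ParseResult parseResult = parsers[i].getParse(content);  // HtmlParser.getParse in my case
 if (parseResult != null && !parseResult.isEmpty()) {
  return parseResult;  // first successful parse wins
 }
}
// no mapped parser produced a usable result -> the second warning from hadoop.log
LOG.warn("Unable to successfully parse content " + content.getUrl() + " of type " + content.getContentType());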

Since these filters are programmed to the interface, my TagExtractorParseFilter.filter is the one that gets invoked. On further investigation, it turned out an NPE was being thrown. The following is the code snippet where the NPE was thrown.

@Override
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) {
 Document document = doc.getOwnerDocument();
 Element e = document.getElementById("rcpinglist");
 // NPE is thrown here when the page has no element with id "rcpinglist" (e is null)
 NodeList nodeList = e.getChildNodes();
 int size = nodeList.getLength();
 for (int i = 0; i < size; i++) {
  Node nodeItem = nodeList.item(i);
  LOG.warn("Node Value:" + nodeItem.getNodeValue());
 }
 return parseResult;
}

Note that the error thrown was an NPE, but its stack trace was never visible in the logs or on stdout. As it turns out, the exception was being swallowed and parseResult.isSuccess() ended up false, indicating that parsing had failed. The following code snippet from HtmlParseFilters shows how each filter is called and how parseResult is obtained from each; on a failure during a filter, the failed parse is removed via parseResult.filter() and the result is returned immediately.



/** Run all defined filters. */
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) {
 // loop on each filter
 for (int i = 0; i < this.htmlParseFilters.length; i++) {
  // call filter interface
  parseResult = htmlParseFilters[i].filter(content, parseResult, metaTags, doc);
  // any failure on parse obj, return
  if (!parseResult.isSuccess()) {
   // TODO: What happens when parseResult.isEmpty() ?
   // Maybe clone parseResult and use parseResult as backup...
   // remove failed parse before return
   parseResult.filter();
   return parseResult;
  }
 }
 return parseResult;
}

Once code execution returned to ParseUtil, the second log message was printed because of this failure. So I modified my code to handle the null case, and that resolved the second log line.

@Override
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) {
 Document document = doc.getOwnerDocument();
 Element e = document.getElementById("rcpinglist");
 if (e != null) {
  // only dereference e once we know the element exists
  NodeList nodeList = e.getChildNodes();
  int size = nodeList.getLength();
  for (int i = 0; i < size; i++) {
   Node nodeItem = nodeList.item(i);
   LOG.warn("Node Value:" + nodeItem.getNodeValue());
  }
 } else {
  LOG.warn("no rcpinglist found.");
 }
 return parseResult;
}
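
Logging node values was enough for debugging, but the actual goal was to capture the ingredients. One way to carry the extracted text along with the page (a minimal sketch, assuming the Nutch 1.3 ParseResult/ParseData/Metadata API and an org.apache.nutch.parse.Parse import; the "ingredients" key name is my own) is to append it to the parse metadata inside the same if (e != null) block:

// Sketch only: store the extracted text in the page's parse metadata.
Parse parse = parseResult.get(content.getUrl());
if (parse != null) {
 StringBuilder ingredients = new StringBuilder();
 for (int i = 0; i < nodeList.getLength(); i++) {
  String value = nodeList.item(i).getNodeValue();
  if (value != null) {
   ingredients.append(value.trim()).append(' ');
  }
 }
 // "ingredients" is just a key of my choosing; it can later be read back from the ParseData
 parse.getData().getParseMeta().add("ingredients", ingredients.toString().trim());
}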

Conclusion:
Log statements are very useful and give good insight while debugging a program. Using the logs as my starting point, I was able to understand the code much better. The only downside I see is that it takes a little extra time. I would recommend this route to anyone who needs to understand a system's internals.

4 comments:

  1. Hello Amrut,

    Pretty good article. I might need your brain !!! Using nutch 1.2 it was pretty simple to display the Tika parser's logs using the log4j.properties file with log4j.logger.org.apache.nutch.parse.tika=ALL,cmdstdout.

    Using nutch 1.3, with similar configuration nothing is being displayed.

    Can you help me with that issue? Do you have any tips ?

    Thanks in advance,

    Nicolas

  2. Thank you, Nicolas.

    Check whether your log level modification propagated to log4j.properties inside /runtime/local/conf/. I noticed that when you run ant, the log4j.properties file is copied from /conf to the runtime/local/conf directory. Also, you can see the logs at /runtime/local/logs/hadoop.log

  3. I got the Nutch 1.3 release from svn and added the conf and other settings as described in the wiki (run Nutch in Eclipse), but the classes in src/plugin don't map to their jars. I added the jars to the build path and solved that problem, but when I want to run the project the following error is displayed:
    failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
    Please help me.

  4. Hi Amrut,

    I developed a plugin which uses xpath and regex to crawl from specific url.

    I'm using nutch 1.6 and while parsing I got the following error:

    Error parsing: http://www.examples.com/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content

    What should I do to resolve it?
    Please help.
