I wanted to get familiar with the plugin-based model of Nutch, so I am sharing my experience of writing a custom plugin. In another post, I describe my experience of debugging Nutch.
I followed the tutorial, which has been updated for 1.3 (the latest release at the time of writing this post). Plugins are created inside the src/plugin directory.
A custom plugin needs the following:
1) Configuration files such as build.xml, ivy.xml, and plugin.xml.
2) Java classes that implement the business logic.
The following notes should help in setting up Nutch quickly.
1) Directory structure.
a) The main custom plugin directory should be created inside <NUTCH_HOME_DIR>/src/plugin.
b) Inside this directory, create build.xml and plugin.xml.
c) Now create the directory src/java/<package-path>, in my case: src/java/com/mycompany/parsefilter/
d) Also create a lib directory in the plugin directory to store any external jars (see the layout sketch below).
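Putting these together, the plugin directory looks roughly like the sketch below. The plugin name, package, and jsoup.jar are just the examples from this post, and placing lib at the plugin root mirrors the convention of the plugins bundled with Nutch:
src/plugin/<UR_PLUGIN_NAME>/
├── build.xml
├── ivy.xml
├── plugin.xml
├── lib/
│   └── jsoup.jar
└── src/
    └── java/
        └── com/mycompany/parsefilter/
            └── TagExtractorParseFilter.java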
2) Make sure build.xml is present. It looks like this; replace <UR_PLUGIN_NAME> with your plugin name:
<?xml version="1.0"?> <project name="<UR_PLUGIN_NAME>" default="jar-core"> <import file="../build-plugin.xml" /> </project>
3) plugin.xml: This makes the Nutch system aware of the custom code. Any other libraries that your custom code depends on can be declared inside a library tag. I wanted to extract a certain section of the crawled page, so I am extending HtmlParseFilter. Every custom class needs a declaration in the form of an extension tag; here I have defined TagExtractorParseFilter.
<?xml version="1.0" encoding="UTF-8"?> <plugin id="food" name="<UR_PLUGIN_NAME>" version="0.0.1" provider-name="mycompany"> <runtime> <library name="<UR_PLUGIN_NAME>"> <export name="*"/> </library> <library name="jsoup.jar" /> </runtime> <requires> <import plugin="nutch-extensionpoints" /> </requires> <extension id="com.mycompany.parsefilter.TagExtractorParseFilter" name="<UR_PLUGIN_NAME>" point="org.apache.nutch.parse.HtmlParseFilter"> <implementation id="TagExtractorParseFilter" class="com.mycompany.parsefilter.TagExtractorParseFilter"/> </extension> </plugin>
4) Remember to register this plugin in the build.xml defined inside <NUTCH_HOME_DIR>/src/plugin/. It can be registered with <ant dir="<UR_PLUGIN_NAME>" target="deploy" /> inside the deploy target, as shown below. Failure to do so would result in your code not being packaged into a jar file. As of 1.3, the custom code is packaged into a jar file that resides under <NUTCH_HOME_DIR>/runtime/local/plugins/.
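For reference, the registration in <NUTCH_HOME_DIR>/src/plugin/build.xml might look something like this; the parse-html entry stands for the plugin entries Nutch already ships with, and only the last line is new:
<target name="deploy">
  <!-- existing plugin entries shipped with Nutch ... -->
  <ant dir="parse-html" target="deploy"/>
  <!-- register the custom plugin so it gets built and deployed -->
  <ant dir="<UR_PLUGIN_NAME>" target="deploy"/>
</target>
Note also that Nutch only activates plugins whose ids match the plugin.includes property (conf/nutch-site.xml), so the plugin id from plugin.xml may need to be added there as well.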
5) Lastly, write your custom code:
package com.mycompany.parsefilter;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.jsoup.Jsoup;
import org.jsoup.select.Elements;
import org.w3c.dom.DocumentFragment;

public class TagExtractorParseFilter implements HtmlParseFilter {

  public static final Log LOG = LogFactory.getLog(TagExtractorParseFilter.class);

  private Configuration conf;

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public void setConf(Configuration arg0) {
    conf = arg0;
  }

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Parse the raw page bytes with jsoup.
    org.jsoup.nodes.Document document = Jsoup.parse(new String(content.getContent()));
    LOG.warn("document:" + document.text());
    // Extract the section of the page I am interested in (the element with id "rcpinglist").
    Elements e = document.select("#rcpinglist");
    if (!e.isEmpty())
      LOG.warn("ingredients:" + e.text());
    else
      LOG.warn("rcpinglist empty.");
    return parseResult;
  }
}
6) Running Nutch so that it invokes the custom code:
a) Create a seed list (the initial URLs to fetch):
mkdir urls
echo "http://www.YOURWEBSITE.com/" > urls/seed.txt
b) Inject the seed URL(s) into the Nutch crawldb (execute from <NUTCH_HOME_DIR>/runtime/local):
bin/nutch inject crawl/crawldb urls
c) Generate a fetch list:
bin/nutch generate crawl/crawldb crawl/segments
The above command generates a new segment directory under crawl/segments that, at this point, contains the files listing the URL(s) to be fetched.
d) The following commands need the latest segment directory as a parameter, so store it in an environment variable:
export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
e) Launch the fetcher that actually goes to get the content:
bin/nutch fetch $SEGMENT -noParsing
f) Next, parse the content:
bin/nutch parse $SEGMENT
g) Update the Nutch crawldb. The updatedb command will store all new URLs discovered during the fetch and parse of the previous segment into the Nutch database so they can be fetched later. Nutch also stores information about the pages that were fetched so that the same URLs won't be fetched again and again.
bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
h) A full fetch cycle is now complete. You can repeat steps (b-g) a couple of times, for example with the small script sketched below.
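A minimal shell sketch for repeating the cycle (assuming it is run from <NUTCH_HOME_DIR>/runtime/local, after the seed URLs have already been injected; it simply replays the commands above):
# repeat the generate -> fetch -> parse -> updatedb cycle a few times
for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments
  SEGMENT=crawl/segments/`ls -tr crawl/segments | tail -1`
  bin/nutch fetch $SEGMENT -noParsing
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
done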
Notice in the logs that the custom code is invoked.
7) I would suggest reading my debugging experience post to get a better understanding of the system, in case you get stuck.