parse-xml with large number of fields

Boris Lau-2
Hi all,

I am having a problem using the parse-xml plugin with Nutch 0.9 on a
5-node Hadoop cluster to process some XML documents.  It causes a huge
slowdown at the crawl reduce stage (to the point that it sometimes
causes node timeouts).

My xmlparser-conf.xml separates a large number of tags into
different fields, e.g.

------------------------------8<---------------------------

<nutchXmlParser>
  <xmlIndexerProperties type="filePerDocument"
      namespace="http://purl.org/dc/elements/1.1/">
    <field name="dctitle" xpath="//dc:title" type="Text" boost="1.4"/>
    <field name="dccreator" xpath="//dc:creator" type="keyword" boost="1.0"/>
  </xmlIndexerProperties>
  <xmlIndexerProperties type="filePerDocument" namespace="default">

    <field name="tag1" xpath="//tag1" type="Text" boost="1.0"/>
    <field name="tag2" xpath="//tag2" type="Text" boost="1.0"/>
    <field name="tag3" xpath="//tag3" type="Text" boost="1.0"/>

    <!-- ... etc., about 100 of these, where each tag represents a
         different type of data -->

  </xmlIndexerProperties>
</nutchXmlParser>

------------------------------8<---------------------------

The aim is to allow searches such as "tag1:data" with the query-more
plugin.  (Please do correct me if I am using the term "field" wrongly
here.)

I can confirm that the problem does not manifest itself when indexing
into only a small number of different fields (around 10), which I tested
by limiting the field tags in xmlparser-conf.xml.

I wonder if this is because:
1. A large number of different fields is bad in Nutch? Has anybody had
experience dealing with a large number of different fields (100+)
in the index?
2. parse-xml is inefficient at generating parse data for a large number
of fields? Would anybody who has experience with the parse-xml plugin
have any comment?
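
To look at question 2 in isolation, outside Nutch and Hadoop, something
like the rough sketch below could time the XPath evaluation on its own.
It assumes the plugin evaluates each "//tagN" expression separately
against the parsed document, which I have not verified.  The class name
XPathFieldCost, the helper methods and the synthetic document are all
made up for illustration; only the standard javax.xml APIs are used.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone test; not part of Nutch or the parse-xml plugin.
public class XPathFieldCost {

    // Build a synthetic document with one <tagN> element per field.
    static String buildXml(int fieldCount) {
        StringBuilder sb = new StringBuilder("<doc>");
        for (int i = 1; i <= fieldCount; i++) {
            sb.append("<tag").append(i).append(">data").append(i)
              .append("</tag").append(i).append(">");
        }
        return sb.append("</doc>").toString();
    }

    // Parse an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Evaluate "//tag1" .. "//tagN" one at a time (assumed to mirror one
    // <field> entry each) and return the total number of matched nodes.
    static int evaluateFields(Document doc, int fieldCount) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        int matches = 0;
        for (int i = 1; i <= fieldCount; i++) {
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//tag" + i, doc, XPathConstants.NODESET);
            matches += nodes.getLength();
        }
        return matches;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(buildXml(100));
        // Compare roughly 10 fields against roughly 100 fields.
        for (int n : new int[] {10, 100}) {
            long start = System.nanoTime();
            int matches = evaluateFields(doc, n);
            long micros = (System.nanoTime() - start) / 1000;
            System.out.println(n + " fields: " + matches
                    + " matches, " + micros + " us");
        }
    }
}
```

If the time per document grows only roughly in proportion to the number
of fields here, the slowdown is more likely to come from the indexing or
reduce side than from XPath evaluation itself, though a real document
would be a fairer test than this synthetic one.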

Many thanks in advance for the help.  Please do let me know if you
require more info.  I am relatively new to Nutch, but I am very excited
about its potential.

Cheers
boris