I am having a problem using the parse-xml plugin with Nutch 0.9 on a
5-node Hadoop cluster to process some XML documents. It is causing a
huge slowdown at the crawl-reduce stage (to the point that it
sometimes causes node timeouts).
My xmlparser-conf.xml separates a large number of tags into
different fields. The aim is to allow searches such as "tag1:data"
with the query-more plugin. (Please do correct me if I am using the
term "field" wrongly.)
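For illustration only, the mapping I mean is something like the sketch below. Note this is a hypothetical fragment with made-up tag names ("tag1", "tag2") and attribute names; the actual schema of xmlparser-conf.xml for the parse-xml plugin may well differ from this:

```
<!-- Illustrative sketch only: the real xmlparser-conf.xml schema may differ.
     The idea is one entry per XML tag, each mapped to its own index field. -->
<fields>
  <field name="tag1" tag="tag1"/>
  <field name="tag2" tag="tag2"/>
  <!-- ...one entry per distinct tag; in my config there are 100+ such entries -->
</fields>
```

With 100+ entries like this, every parsed document carries parse data for a correspondingly large number of fields.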
I can confirm that the problem does not manifest when indexing into
only a small number of different fields (around 10), which I tested by
limiting the field tags in xmlparser-conf.xml.
I wonder if this is because:
1. A large number of different fields is bad in Nutch? Has anybody had
experience dealing with a large number of different fields (100+)
in the index?
2. parse-xml is inefficient at generating parse data for a large
number of fields? Would anybody with experience with the parse-xml
plugin care to comment?
Many thanks in advance for the help. Please do let me know if you
require more info. I am relatively new to Nutch, but I am very excited
about its potential.