Nutch excludeNodes Patch

Dave Beckstrom-2
Hi Everyone!


We are running Nutch 1.15.

We are trying to implement the nutch-585-excludeNodes.patch described on:
https://issues.apache.org/jira/browse/NUTCH-585

The patch behaves as if it is not running.  The crawl completes without
errors, there are no errors in the Hadoop logs, and the content is simply
not excluded from the page.

We installed it in the plugins/parse-html directory.

We added the following to our nutch-site.xml to exclude div id=sidebar:

<property>
  <name>parser.html.NodesToExclude</name>
  <value>div;id;sidebar</value>
  <description>
  A list of nodes whose content will not be indexed, separated by "|".
  Use this to tell the HTML parser to ignore, for example, site
  navigation text.
  Each node has three elements: the first one is the tag name, the
  second one the attribute name, the third one the value of the attribute.
  Note that nodes with these attributes, and their children, will be
  silently ignored by the parser, so verify the indexed content with Luke
  to confirm results.
  </description>
</property>
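
Going by the description above, several nodes could presumably be listed in a single value by joining them with "|". A minimal sketch, assuming the patch parses the value exactly as its description says; the second entry (div;class;footer) is a hypothetical node added only for illustration:

```xml
<property>
  <name>parser.html.NodesToExclude</name>
  <!-- Each entry is tag;attribute;value, and entries are separated by "|".
       div;class;footer is a made-up second entry, not from the original post. -->
  <value>div;id;sidebar|div;class;footer</value>
</property>
```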

Here is our plugin.includes property from nutch-site.xml:

<property>
  <name>plugin.includes</name>
  <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)</value>
  <description>plugins</description>
</property>

One question I have: would having Tika configured in nutch-site.xml like
the following prevent the parse-html plugin from running?

<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or
  none.
  </description>
</property>
<!-- DMB added -->
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
  ArticleExtractor or CanolaExtractor.
  </description>
</property>

We don't have a lot to go on to debug the issue.  The patch logs each node
it strips at trace level:

if (LOG.isTraceEnabled())
  LOG.trace("Stripping " + pNode.getNodeName() + "#" + idNode.getNodeValue());

But nothing shows in the log files when we crawl. I updated
log4j.properties, setting these two loggers to TRACE, thinking I had
to enable trace before the logging would work:

 log4j.logger.org.apache.nutch.crawl.Crawl=TRACE,cmdstdout
 log4j.logger.org.apache.nutch.parse.html=TRACE,cmdstdout

I reran the crawl and no logging occurred, and of course the content we
didn't want crawled and indexed is still showing up in Solr.

I could really use some help and suggestions!

Thank you!

Dave Beckstrom

--
*Fig Leaf Software is now Collective FLS, Inc.*

*Collective FLS, Inc.*

https://www.collectivefls.com/



RE: Nutch excludeNodes Patch

Markus Jelsma-2
Hello Dave,

You have both TikaParser and HtmlParser enabled. This probably means HtmlParser is never used and TikaParser always is. Via parse-plugins.xml you can tell Nutch which Parser implementation to choose for each MIME type. If you select HtmlParser for HTML and XHTML, Nutch should use HtmlParser instead.
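
As a rough sketch, the mapping in conf/parse-plugins.xml might look like the following. This is an assumption modelled on the stock file shipped with Nutch 1.x, not a configuration taken from this thread, so check the MIME types and alias extension IDs against your own copy:

```xml
<parse-plugins>
  <!-- Route HTML and XHTML to the parse-html plugin (HtmlParser). -->
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>

  <!-- Everything else can still fall through to Tika. -->
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Aliases map plugin ids to the extension that implements the parser. -->
  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

Both plugins can then stay in plugin.includes; the MIME-type mapping decides which one handles HTML pages.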

Regards,
Markus
 

Re: Nutch excludeNodes Patch

Dave Beckstrom-2
Markus,

Thank you so much for the reply!

I made the change to parse-plugins.xml and the plug-in is being called
now.  That plug-in didn't work, so I switched to the blacklist-whitelist
plug-in, and I've got it working thanks to your help!

 Dave
