parser.html.NodesToExclude


Dave Beckstrom-2
Hi All,

I'm running Nutch 1.15.

In my nutch-site.xml I configured the parameters below. Specifically, with
parser.html.NodesToExclude I'm telling it not to index "div id=sidebar" or
"div id=footer", and yet it continues to index those regions of the page.

Does anyone have suggestions on why this isn't working and what I should do
to resolve this?

Thank you!




<property>
  <name>tika.extractor</name>
  <value>boilerpipe</value>
  <description>
    Which text extraction algorithm to use. Valid values are: boilerpipe or
    none.
  </description>
</property>
<property>
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description>
    Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor,
    ArticleExtractor or CanolaExtractor.
  </description>
</property>
<property>
  <name>parser.html.NodesToExclude</name>
  <value>div;id;sidebar|div;id;footer</value>
  <description>
    A list of nodes whose content will not be indexed, separated by "|".
    Use this to tell the HTML parser to ignore, for example, site
    navigation text.

    Each node has three elements, separated by semicolons:
    the first one is the tag name,
    the second one the attribute name,
    the third one the value of the attribute.

    Example: table;summary;header|div;id;navigation

    Note that nodes with these attributes, and their children, will be
    silently ignored by the parser, so verify the indexed content
    with Luke to confirm results.
  </description>
</property>




Regards,

Dave Beckstrom
Technical Delivery Manager / Senior Developer
em: [hidden email]
ph: 763.323.3499


Re: parser.html.NodesToExclude

Sebastian Nagel-2
Hi Dave,

The boilerplate removal (boilerpipe) works if parse-tika is used for parsing,
but the parser.html.NodesToExclude property belongs to a feature that never
made it into the code base; see
  https://issues.apache.org/jira/browse/NUTCH-585

Or do you work with a patched version?

Best,
Sebastian
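
For anyone trying the parse-tika route mentioned above, here is a minimal
nutch-site.xml sketch. It is an assumption about a typical setup, not a
tested configuration: the plugin.includes value mirrors what are believed to
be the stock 1.x defaults with parse-html dropped so that HTML falls through
to parse-tika. Keep whatever other plugins your installation already lists.
Depending on the version, you may also need to map text/html to parse-tika
in conf/parse-plugins.xml.

```xml
<!-- Sketch only: adjust the plugin list to match your own installation. -->
<property>
  <name>plugin.includes</name>
  <!-- Assumption: stock plugin list, but with parse-tika instead of
       parse-(html|tika), so HTML pages are parsed by Tika. -->
  <value>protocol-http|urlfilter-(regex|validator)|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
  <!-- Taken from the original post: enable Boilerpipe text extraction. -->
  <name>tika.extractor</name>
  <value>boilerpipe</value>
</property>
<property>
  <!-- Taken from the original post: which Boilerpipe algorithm to apply. -->
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
</property>
```

Note that Boilerpipe decides heuristically what counts as boilerplate, so it
does not target "div id=sidebar" or "div id=footer" explicitly. Check the
parsed text of a few pages to see whether those regions are actually gone.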


On 9/12/19 9:24 PM, Dave Beckstrom wrote:

> Hi All,
>
> I'm running NUTCH 1.15.
>
> In my nutch-site.xml I configured the below parameters and
> specifically under   parser.html.NodesToExclude I'm telling it not to index
> "div id=sidebar" or "div id=footer" and yet it continues to index those
> regions on the page.
>
> Does anyone have suggestions on why this isn't working and what I should do
> to resolve this?