metatags missing with parse-html

Dave Beckstrom-2
Hi Everyone,

It seems like I take one step forward and two steps back.

I was using parse-tika and I needed to change to parse-html in order to use
a plug-in for excluding content such as headers and footers.

I have the excludes working with the plug-in.  But now I see that all of
the metatags are missing from Solr.  The metatag fields are defined in Solr
but not populated.

Metatags were working prior to the change to parse-html.  What would
explain the metatags not being indexed when the configuration
parameters didn't change?  Is there some other setting for parse-html that
I need to look into?

Thanks!


<property>
  <name>plugin.includes</name>
  <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)|index-blacklist-whitelist</value>
  <description> </description>
</property>
<!-- index all metatags -->
<property>
  <name>metatags.names</name>
  <value>*</value>
  <description> </description>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.language,metatag.subject,metatag.category</value>
  <description> </description>
</property>

--
*Fig Leaf Software is now Collective FLS, Inc.*

*Collective FLS, Inc.*
https://www.collectivefls.com/

Re: metatags missing with parse-html

Sebastian Nagel-2
Hi Dave,

Could you share an example document? Which Nutch version are you using?

I tried to reproduce the problem with Nutch v1.16, without success:

- example document:

<html>
<head>
<title>Test metatags</title>
<meta name='language' content='en'>
<meta name='subject'  content='test'>
<meta name='Category' content='meta data'>
</head>
<body>
test for metatag extraction
</body>
</html>

- using parse-html (works)

> bin/nutch indexchecker -Dmetatags.names='*' \
   -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
   -Dplugin.includes='protocol-http|parse-(html|metatags)|index-(basic|metadata)' \
   http://localhost/nutch/test_metatags.html
fetching: http://localhost/nutch/test_metatags.html
robots.txt whitelist not configured.
parsing: http://localhost/nutch/test_metatags.html
contentType: text/html
tstamp :        Mon Oct 14 13:24:14 CEST 2019
metatag.language :      en
metatag.language :      en
metatag.category :      meta data
metatag.category :      meta data
digest :        50d08494ba791bb52fcdeebfc08ba640
host :  localhost
metatag.subject :       test
metatag.subject :       test
id :    http://localhost/nutch/test_metatags.html
title : Test metatags
url :   http://localhost/nutch/test_metatags.html
content :       Test metatags
test for metatag extraction

- using parse-tika (works)

> bin/nutch indexchecker -Dmetatags.names='*' \
   -Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
   -Dplugin.includes='protocol-http|parse-(tika|metatags)|index-(basic|metadata)' \
   http://localhost/nutch/test_metatags.html
fetching: http://localhost/nutch/test_metatags.html
robots.txt whitelist not configured.
parsing: http://localhost/nutch/test_metatags.html
contentType: text/html
tstamp :        Mon Oct 14 13:25:34 CEST 2019
metatag.language :      en
metatag.language :      en
metatag.category :      meta data
metatag.category :      meta data
digest :        50d08494ba791bb52fcdeebfc08ba640
host :  localhost
metatag.subject :       test
metatag.subject :       test
id :    http://localhost/nutch/test_metatags.html
title : Test metatags
url :   http://localhost/nutch/test_metatags.html
content :       Test metatags
test for metatag extraction
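
To compare runs like the two above without reading them line by line, the "field : value" lines can be collected by a small script. The following is only a sketch in Python, not part of Nutch: the helper names parse_indexchecker and missing_fields are made up for illustration, and the line format is assumed from the output quoted above (status lines such as "fetching: ..." have no space before the colon and are skipped).

```python
import re
from collections import defaultdict

# Assumed field line format: "<name> :<whitespace><value>", as in the output above.
FIELD_LINE = re.compile(r"^(\S+)\s+:\s+(.*)$")


def parse_indexchecker(output):
    """Collect field values from indexchecker output as {name: [values]}."""
    fields = defaultdict(list)
    for line in output.splitlines():
        match = FIELD_LINE.match(line)
        if match:
            # Continuation lines of multi-line values do not match and are ignored.
            fields[match.group(1)].append(match.group(2))
    return dict(fields)


def missing_fields(output, expected):
    """Return the expected field names that do not occur in the output."""
    found = parse_indexchecker(output)
    return [name for name in expected if name not in found]


# Abbreviated copy of the parse-html run quoted above.
SAMPLE_OUTPUT = """fetching: http://localhost/nutch/test_metatags.html
parsing: http://localhost/nutch/test_metatags.html
contentType: text/html
metatag.language :      en
metatag.language :      en
metatag.category :      meta data
metatag.category :      meta data
metatag.subject :       test
metatag.subject :       test
title : Test metatags
"""

EXPECTED = ["metatag.language", "metatag.subject", "metatag.category"]

if __name__ == "__main__":
    print("missing fields:", missing_fields(SAMPLE_OUTPUT, EXPECTED))
```

Feeding it the output of a run against one of the affected pages would show directly which of the three metatag fields are absent before the document reaches Solr.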


There are currently two open issues around metatags:
 https://issues.apache.org/jira/browse/NUTCH-1559
 https://issues.apache.org/jira/browse/NUTCH-2525

Maybe it's related to one of those?
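
One way to rule out the configuration itself is to check mechanically that the plugin.includes expression still activates parse-metatags and index-metadata. The following is a minimal Python sketch, not part of Nutch: the helper names read_properties and check_metatag_setup are made up for illustration, the <configuration> wrapper is added so that the properties from the original post parse as a complete nutch-site.xml, and the check assumes that a plugin is active only when its whole id matches the expression.

```python
import re
import xml.etree.ElementTree as ET

# Properties from the original post, wrapped in a <configuration> root element.
SITE_XML = """<configuration>
<property>
  <name>plugin.includes</name>
  <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)|index-blacklist-whitelist</value>
</property>
<property>
  <name>metatags.names</name>
  <value>*</value>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.language,metatag.subject,metatag.category</value>
</property>
</configuration>"""


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the XML text."""
    props = {}
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_metatag_setup(props):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    includes = props.get("plugin.includes", "")
    for plugin_id in ("parse-metatags", "index-metadata"):
        # Assumption: the whole plugin id has to match the expression.
        if not re.fullmatch(includes, plugin_id):
            problems.append(f"{plugin_id} is not matched by plugin.includes")
    for key in ("metatags.names", "index.parse.md"):
        if not props.get(key):
            problems.append(f"{key} is empty or missing")
    return problems


if __name__ == "__main__":
    found = check_metatag_setup(read_properties(SITE_XML))
    print("\n".join(found) if found else "no configuration problems found")
```

For the configuration quoted in the original post this reports no problems, which is consistent with the indexchecker runs above and suggests that the cause lies elsewhere, for example in the documents themselves or in the additional plug-in.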


Best,
Sebastian


On 11.10.19 22:38, Dave Beckstrom wrote:

> [...]