Hi Dave,
could you share an example document? Which Nutch version is used?
I tried to reproduce the problem without success using Nutch v1.16:
- example document:
<html>
<head>
<title>Test metatags</title>
<meta name='language' content='en'>
<meta name='subject' content='test'>
<meta name='Category' content='meta data'>
</head>
<body>
test for metatag extraction
</body>
</html>
- using parse-html (works)
> bin/nutch indexchecker -Dmetatags.names='*' \
-Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
-Dplugin.includes='protocol-http|parse-(html|metatags)|index-(basic|metadata)' \
http://localhost/nutch/test_metatags.html
fetching: http://localhost/nutch/test_metatags.html
robots.txt whitelist not configured.
parsing: http://localhost/nutch/test_metatags.html
contentType: text/html
tstamp : Mon Oct 14 13:24:14 CEST 2019
metatag.language : en
metatag.language : en
metatag.category : meta data
metatag.category : meta data
digest : 50d08494ba791bb52fcdeebfc08ba640
host : localhost
metatag.subject : test
metatag.subject : test
id : http://localhost/nutch/test_metatags.html
title : Test metatags
url : http://localhost/nutch/test_metatags.html
content : Test metatags
test for metatag extraction
- using parse-tika (works)
> bin/nutch indexchecker -Dmetatags.names='*' \
-Dindex.parse.md='metatag.language,metatag.subject,metatag.category' \
-Dplugin.includes='protocol-http|parse-(tika|metatags)|index-(basic|metadata)' \
http://localhost/nutch/test_metatags.html
fetching: http://localhost/nutch/test_metatags.html
robots.txt whitelist not configured.
parsing: http://localhost/nutch/test_metatags.html
contentType: text/html
tstamp : Mon Oct 14 13:25:34 CEST 2019
metatag.language : en
metatag.language : en
metatag.category : meta data
metatag.category : meta data
digest : 50d08494ba791bb52fcdeebfc08ba640
host : localhost
metatag.subject : test
metatag.subject : test
id : http://localhost/nutch/test_metatags.html
title : Test metatags
url : http://localhost/nutch/test_metatags.html
content : Test metatags
test for metatag extraction
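In both runs the meta tag named 'Category' shows up as the lowercase field metatag.category, so the names listed in index.parse.md apparently need to be lowercase. As a rough sanity check outside Nutch, the sketch below derives the expected field names from the example document. It is not part of Nutch: the variables doc and fields are only illustrative, and the sed expression assumes one meta tag per line with a quoted name attribute coming first, as in the example document above.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch: list the field names that
# index.parse.md would need for the <meta name=...> tags of an HTML file.

# Write the example document to a temporary file.
doc=$(mktemp)
cat > "$doc" <<'EOF'
<html>
<head>
<title>Test metatags</title>
<meta name='language' content='en'>
<meta name='subject' content='test'>
<meta name='Category' content='meta data'>
</head>
<body>
test for metatag extraction
</body>
</html>
EOF

# Extract each meta name, prefix it with "metatag.", lowercase it
# (assumption based on the output above), and join the names with commas.
fields=$(sed -n "s/.*<meta name=['\"]\([^'\"]*\)['\"].*/metatag.\1/p" "$doc" \
  | tr '[:upper:]' '[:lower:]' \
  | paste -s -d , -)

echo "$fields"
rm -f "$doc"
```

For the example document this should give the same list that is passed to -Dindex.parse.md in the commands above.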
There are currently two issues open around metatags:
https://issues.apache.org/jira/browse/NUTCH-1559
https://issues.apache.org/jira/browse/NUTCH-2525
Maybe it's related to one of those?
Best,
Sebastian
On 11.10.19 22:38, Dave Beckstrom wrote:
> Hi Everyone,
>
> It seems like I take 1 step forward and 2 steps backwards.
>
> I was using parse-tika and I needed to change to parse-html in order to use
> a plug-in for excluding content such as headers and footers.
>
> I have the excludes working with the plug-in. But now I see that all of
> the metatags are missing from solr. The metatag fields are defined in SOLR
> but not populated.
>
> Metatags were working prior to the change to parse-html. What would
> explain the metatags not being indexed when the configuration
> parameters didn't change? Is there some other setting for parse-html that
> I need to look into?
>
> Thanks!
>
>
> <property>
> <name>plugin.includes</name>
>
> <value>exchange-jexl|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|indexer-solr|urlnormalizer-(pass|regex|basic)|index-blacklist-whitelist</value>
> <description> </description>
> </property>
> <!-- index all metatags -->
> <property>
> <name>metatags.names</name>
> <value>*</value>
> <description> </description>
> </property>
> <property>
> <name>index.parse.md</name>
> <value>metatag.language,metatag.subject,metatag.category</value>
> <description> </description>
> </property>
>