Indexing HTML Metatags Nutch - SOLR

krauss@gds2.de
Hello,

I have been trying to index HTML metatags for several days without success
(Nutch 1.16, Solr 7.3.1).

I have followed this description:
https://cwiki.apache.org/confluence/display/nutch/IndexMetatags
My nutch-site.xml is included below.

I have created the core following this description:
https://cwiki.apache.org/confluence/display/nutch/NutchTutorial/

By the way, without the metatags everything works fine.

Before creating the core, I deleted managed-schema.xml and inserted my
metatag fields into schema.xml in the configsets directory of the core:


        <field name="metatag.SITdescription" type="text_general" stored="true" indexed="true" multiValued="true"/>
        <field name="metatag.SITkeywords" type="text_general" stored="true" indexed="true" multiValued="true"/>
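
The nutch-site.xml below lists four metatags in index.parse.md, while only two fields are declared here. A sketch of declarations for the remaining two, assuming (the post does not say so) that they should use the same text_general type and attributes as the two fields above:

```xml
<!-- Assumption: same type and attributes as metatag.SITdescription and metatag.SITkeywords -->
<field name="metatag.SITcategory" type="text_general" stored="true" indexed="true" multiValued="true"/>
<field name="metatag.SITintern" type="text_general" stored="true" indexed="true" multiValued="true"/>
```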

First question: after creating the core, I see a managed-schema.xml file and
a schema.xml.bak file in the conf directory of the core. Sorry, I am new to
this, but I believe I do not want managed-schema.xml? (See the description
above.)
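
The two files described here are consistent with Solr's managed schema mode: when the managed schema factory is in effect (the default if solrconfig.xml declares no schemaFactory) and only a schema.xml is present, Solr converts it into a managed schema file and renames the original with a .bak suffix. If a hand-edited schema.xml is wanted instead, one option is to declare the classic factory in the core's solrconfig.xml. This is a minimal sketch, assuming that file can be edited; the schemaless field-guessing update chain in the same file probably also has to be disabled, since it relies on a mutable schema.

```xml
<!-- solrconfig.xml: read schema.xml directly instead of using a managed schema -->
<schemaFactory class="ClassicIndexSchemaFactory"/>
```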

Anyway, when I run the crawl, everything is OK until the index is created.
Then I end up with this error:

org.apache.solr.common.SolrException: copyField dest :'metatag.SITdescription_str' is not an explicit field and doesn't match a dynamicField.
        at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:902)
        at org.apache.solr.schema.ManagedIndexSchema.addCopyFields(ManagedIndexSchema.java:784)

There is no copyField instruction for metatag.SITdescription in
managed-schema.xml. I even created a field "metatag.SITdescription_str" in
managed-schema.xml, which did not help.
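
The message says the copyField destination matches neither an explicit field nor a dynamicField. In Solr's default configset, the field-guessing (schemaless) update chain copies newly added text fields to a `*_str` destination, so one possible remedy is to declare a matching dynamicField in the schema that Solr actually loads. This is an untested sketch, and it assumes a `string` field type is defined in that schema:

```xml
<!-- Assumption: a "string" fieldType exists in this schema -->
<dynamicField name="*_str" type="string" stored="false" indexed="false" docValues="true" multiValued="true"/>
```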

Can you help me, please?

Best Regards

Martin

nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
<property>
<name>http.agent.name</name>
<value>SIT_NUTCH_SPIDER</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts will be
ignored. This is an effective way to limit the crawl to include only
initially injected hosts, without creating complex URLFilters.
</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  By default Nutch includes plugins to crawl HTML and various other
  document formats via HTTP/HTTPS and indexing the crawled content
  into Solr.  More plugins are available to support more indexing
  backends, to fetch ftp:// and file:// URLs, for focused crawling,
  and many other use cases.
  </description>
</property>

<property>
  <name>http.robot.rules.whitelist</name>
  <value>sitlux02.sit.de</value>
  <description>Comma separated list of hostnames or IP addresses to ignore
  robot rules parsing for.
  </description>
</property>

<property>
<name>metatags.names</name>
<value>SITdescription,SITkeywords,SITcategory,SITintern</value>
<description> Names of the metatags to extract, separated by ','.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
</description>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.SITdescription,metatag.SITkeywords,metatag.SITcategory,metatag.SITintern</value>
  <description>
  Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)
  </description>
</property>

<property>
  <name>index.metadata</name>
  <value>metatag.SITdescription,metatag.SITkeywords,metatag.SITcategory,metatag.SITintern</value>
  <description>
  Comma-separated list of keys to be taken from the metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin), and property 'metatags.names'.
  </description>
</property>

</configuration>



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html