[Nutch Wiki] Update of "NutchTutorial" by SebastianNagel

Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by SebastianNagel:

Updates for release of Nutch 1.15, fix Deduplication section

       Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize] [-addBinaryContent] [-base64]
-      Example: bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
+      Example: bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20131108063838/ -filter -normalize -deleteGone
  === Step-by-Step: Deleting Duplicates ===
- Once indexed the entire contents, it must be disposed of duplicate urls in this way ensures that the urls are unique.
+ Duplicates (identical content but different URL) are optionally marked in the CrawlDb and are deleted later in the Solr index.
- MapReduce:
+ MapReduce "dedup" job:
-  * Map: Identity map where keys are digests and values are  [[http://wiki.apache.org/nutch/SolrRecord|SolrRecord]] instances (which contain id, boost and timestamp)
-  * Reduce: After map, [[http://wiki.apache.org/nutch/SolrRecord|SolrRecord]]s with the same digest will be grouped together. Now, of these documents with the same digests, delete all of them except the one with the highest score (boost field). If two (or more) documents have the same score, then the document with the latest timestamp is kept. Again, every other is deleted from solr index.
+  * Map: Identity map where keys are digests and values are CrawlDatum records
+  * Reduce: CrawlDatums with the same digest are marked (except one of them) as duplicates. There are multiple heuristics available to choose the item which is not marked as duplicate - the one with the shortest URL, fetched most recently, or with the highest score.
+      Usage: bin/nutch dedup <crawldb> [-group <none|host|domain>] [-compareOrder <score>,<fetchTime>,<urlLength>]
-      Usage: bin/nutch dedup <solr url>
-      Example: /bin/nutch dedup http://localhost:8983/solr
+ Deletion in the index is performed by the cleaning job (see below) or if the index job is called with the command-line flag {{-deleteGone}}.
  For more information see [[https://wiki.apache.org/nutch/bin/nutch%20dedup|dedup documentation]].
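The selection rule in the reduce step can be illustrated outside of Nutch. The following is only a sketch of the heuristic, not the code of the Nutch dedup job: it assumes the comparison order `score,fetchTime,urlLength` (highest score first, then most recent fetch, then shortest URL), and the sample records, the column layout and the function name `mark_duplicates` are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample records, one per line: digest, score, fetch time (epoch seconds), URL.
records='d1 1.0 1541660000 http://example.com/a/index.html
d1 1.0 1541660000 http://example.com/a/
d2 0.5 1541650000 http://example.com/b
d1 0.2 1541670000 http://example.com/a?x=1
d2 0.9 1541640000 http://example.com/b/'

# Illustration only (not Nutch code): within each digest group, keep one record
# and mark all others as duplicates.
mark_duplicates() {
  # 1. Append the URL length as a fifth column.
  # 2. Sort by digest, then score (descending), fetch time (descending),
  #    URL length (ascending), so the record to keep comes first in its group.
  # 3. The first record of each digest group is kept, the rest are marked.
  awk '{ print $0, length($4) }' |
    LC_ALL=C sort -k1,1 -k2,2nr -k3,3nr -k5,5n |
    awk '{ status = ($1 == prev) ? "duplicate" : "keep"; prev = $1; print status, $4 }'
}

printf '%s\n' "$records" | mark_duplicates
```

For the sample data, `http://example.com/a/` is kept for digest `d1` (score and fetch time tie with `/a/index.html`, so the shorter URL wins) and `http://example.com/b/` is kept for digest `d2` (highest score). In Nutch itself the marking happens in the CrawlDb, and the marked documents are only removed from the index later, as described above.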
@@ -310, +311 @@

  Every version of Nutch is built against a specific Solr version, but you may also try a "close" version.
  || Nutch || Solr   ||
+ || 1.15  || 7.3.1  ||
  || 1.14  || 6.6.0  ||
  || 1.13  || 5.5.0  ||
  || 1.12  || 5.4.1  ||
+ To install Solr:
   * download binary file from [[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]
   * unzip to `$HOME/apache-solr`, we will now refer to this as `${APACHE_SOLR_HOME}`
   * create resources for a new nutch solr core `cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/basic_configs ${APACHE_SOLR_HOME}/server/solr/configsets/nutch`
@@ -321, +324 @@

   * make sure that there is no `managed-schema` "in the way": `rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema`
   * start the solr server `${APACHE_SOLR_HOME}/bin/solr start`
   * create the nutch core `${APACHE_SOLR_HOME}/bin/solr create -c nutch -d server/solr/configsets/nutch/conf/`
+ After that you need to point Nutch to the Solr instance:
+  * (Nutch 1.15 and later) edit the file {{conf/index-writers.xml}}, see IndexWriters
-  * add the core name to the Solr server URL: `-Dsolr.server.url=http://localhost:8983/solr/nutch`
+  * (until Nutch 1.14) add the core name to the Solr server URL: `-Dsolr.server.url=http://localhost:8983/solr/nutch`
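The Solr commands listed above could be collected in a small script. This is a sketch under assumptions, not an official installer: it covers only the commands visible in this excerpt (any steps between them that the diff does not show are not included), the `DRY_RUN` switch and the function names `run` and `setup_nutch_core` are hypothetical, and the config directory is passed to `solr create` as an absolute path, whereas the page gives it relative to the Solr home.

```shell
#!/bin/sh
# Assumption: Solr was unpacked to $HOME/apache-solr, as in the steps above.
APACHE_SOLR_HOME="${APACHE_SOLR_HOME:-$HOME/apache-solr}"
# Hypothetical safety switch: commands are only printed unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

# Print a command and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

setup_nutch_core() {
  configsets="$APACHE_SOLR_HOME/server/solr/configsets"
  # Create resources for a new nutch core from the basic configset.
  run cp -r "$configsets/basic_configs" "$configsets/nutch" &&
  # Make sure no managed-schema is "in the way".
  run rm "$configsets/nutch/conf/managed-schema" &&
  # Start the Solr server and create the nutch core.
  run "$APACHE_SOLR_HOME/bin/solr" start &&
  run "$APACHE_SOLR_HOME/bin/solr" create -c nutch -d "$configsets/nutch/conf/"
}

setup_nutch_core
```

By default the script only prints the four commands, so they can be reviewed first; setting `DRY_RUN=0` executes them. Pointing Nutch to the new core is still done as described above, via {{conf/index-writers.xml}} for Nutch 1.15 and later or via `-Dsolr.server.url` until Nutch 1.14.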
  = Verify Solr installation =
  After you started Solr admin console, you should be able to access the following links: