How to limit nutch to fetch, refetch and index just the injected URLs?


Nicolás Lichtmaier
I'd like to limit Nutch to fetching, refetching and indexing just the injected
URLs. Will setting db.max.outlinks.per.page to 0 enable me to do that?
If not... how could I achieve what I'm looking for?

Thanks!


Re: How to limit nutch to fetch, refetch and index just the injected URLs?

Andrzej Białecki-2
Nicolás Lichtmaier wrote:
> I'd like to limit Nutch to fetching, refetching and indexing just the
> injected URLs. Will setting db.max.outlinks.per.page to 0 enable me to
> do that? If not... how could I achieve what I'm looking for?

You need to run "updatedb" with the "-noAdditions" switch.
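
For example, a sketch (the crawldb path and the segment name are illustrative; substitute whatever your crawl actually produced):

   bin/nutch updatedb crawl/crawldb crawl/segments/20070115123456 -noAdditions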

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to limit nutch to fetch, refetch and index just the injected URLs?

Nicolás Lichtmaier

>> I'd like to limit Nutch to fetching, refetching and indexing just the
>> injected URLs. Will setting db.max.outlinks.per.page to 0 enable me to
>> do that? If not... how could I achieve what I'm looking for?
> You need to run "updatedb" with the "-noAdditions" switch.

That doesn't work. And in the code, in org.apache.nutch.crawl.CrawlDb's
main method there's no handling of any such parameter.
How could I achieve this?


Re: How to limit nutch to fetch, refetch and index just the injected URLs?

Andrzej Białecki-2
Nicolás Lichtmaier wrote:

>
>>> I'd like to limit Nutch to fetching, refetching and indexing just the
>>> injected URLs. Will setting db.max.outlinks.per.page to 0 enable me
>>> to do that? If not... how could I achieve what I'm looking for?
>> You need to run "updatedb" with the "-noAdditions" switch.
>
> That doesn't work. And in the code, in
> org.apache.nutch.crawl.CrawlDb's main method there's no handling of
> any such parameter.
> How could I achieve this?

Perhaps you should start by reporting which version you are using...
The version in trunk/ certainly supports this argument. The version in
0.8.1 does not support it, but it's easy to add.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to limit nutch to fetch, refetch and index just the injected URLs?

Nicolás Lichtmaier

>>> You need to run "updatedb" with the "-noAdditions" switch.
>> That doesn't work. And in the code, in
>> org.apache.nutch.crawl.CrawlDb's main method there's no handling of
>> any such parameter.
>> How could I achieve this?
> Perhaps you should start by reporting which version you are using...
> The version in trunk/ certainly supports this argument. The version
> in 0.8.1 does not support it, but it's easy to add.

Uhm... well... I was using the latest released version. Should I use
trunk? Is it OK for production use?


Re: How to limit nutch to fetch, refetch and index just the injected URLs?

Nicolás Lichtmaier
In reply to this post by Andrzej Białecki-2

> Perhaps you should start by reporting which version you are using...
> The version in trunk/ certainly supports this argument. The version
> in 0.8.1 does not support it, but it's easy to add.

I've "backported" revision 450799 to the 0.8.x branch for supporting
"-noAdditions". Perhaps you could consider committing it there... (I
haven't tested it yet whough).


Index: conf/nutch-default.xml
===================================================================
--- conf/nutch-default.xml (revision 492707)
+++ conf/nutch-default.xml (working copy)
@@ -237,6 +237,15 @@
 </property>
 
 <property>
+  <name>db.update.additions.allowed</name>
+  <value>true</value>
+  <description>If true, updatedb will add newly discovered URLs, if false
+  only already existing URLs in the CrawlDb will be updated and no new
+  URLs will be added.
+  </description>
+</property>
+
+<property>
   <name>db.ignore.internal.links</name>
   <value>true</value>
   <description>If true, when adding new links to a page, links from
Index: src/java/org/apache/nutch/crawl/CrawlDb.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDb.java (revision 492707)
+++ src/java/org/apache/nutch/crawl/CrawlDb.java (working copy)
@@ -36,6 +36,7 @@
  * crawldb accordingly.
  */
 public class CrawlDb extends Configured {
+  public static final String CRAWLDB_ADDITIONS_ALLOWED = "db.update.additions.allowed";
 
   public static final Log LOG = LogFactory.getLog(CrawlDb.class);
 
@@ -43,16 +44,22 @@
   public CrawlDb(Configuration conf) {
     super(conf);
   }
+  
+  public void update(Path crawlDb, Path segment) throws IOException {
+    boolean additionsAllowed = getConf().getBoolean(CRAWLDB_ADDITIONS_ALLOWED, true);
+    update(crawlDb, segment, additionsAllowed);
+  }
 
-  public void update(Path crawlDb, Path segment) throws IOException {
-    
+  public void update(Path crawlDb, Path segment, boolean additionsAllowed) throws IOException {    
     if (LOG.isInfoEnabled()) {
       LOG.info("CrawlDb update: starting");
       LOG.info("CrawlDb update: db: " + crawlDb);
       LOG.info("CrawlDb update: segment: " + segment);
+      LOG.info("CrawlDb update: additions allowed: " + additionsAllowed);
     }
 
     JobConf job = CrawlDb.createJob(getConf(), crawlDb);
+    job.setBoolean(CRAWLDB_ADDITIONS_ALLOWED, additionsAllowed);
     job.addInputPath(new Path(segment, CrawlDatum.FETCH_DIR_NAME));
     job.addInputPath(new Path(segment, CrawlDatum.PARSE_DIR_NAME));
 
@@ -108,13 +115,24 @@
   }
 
   public static void main(String[] args) throws Exception {
-    CrawlDb crawlDb = new CrawlDb(NutchConfiguration.create());
+    Configuration c = NutchConfiguration.create();
+    CrawlDb crawlDb = new CrawlDb(c);
     
     if (args.length < 2) {
-      System.err.println("Usage: <crawldb> <segment>");
+      System.err.println("Usage: <crawldb> <segment> [-noAdditions]");
+      System.err.println("\tcrawldb\tCrawlDb to update");
+      System.err.println("\tsegment\tsegment name to update from");
+      System.err.println("\t-noAdditions\tonly update already existing URLs, don't add any newly discovered URLs");
       return;
     }
     
+    boolean additionsAllowed = c.getBoolean(CRAWLDB_ADDITIONS_ALLOWED, true);
+    for(int i = 2 ; i < args.length; i++) {
+      if (args[i].equals("-noAdditions")) {
+        additionsAllowed = false;
+      }
+    }
+    
-    crawlDb.update(new Path(args[0]), new Path(args[1]));
+    crawlDb.update(new Path(args[0]), new Path(args[1]), additionsAllowed);
   }
 
Index: src/java/org/apache/nutch/crawl/CrawlDbReducer.java
===================================================================
--- src/java/org/apache/nutch/crawl/CrawlDbReducer.java (revision 492707)
+++ src/java/org/apache/nutch/crawl/CrawlDbReducer.java (working copy)
@@ -36,10 +36,12 @@
   private CrawlDatum result = new CrawlDatum();
   private ArrayList linked = new ArrayList();
   private ScoringFilters scfilters = null;
+  private boolean additionsAllowed;
 
   public void configure(JobConf job) {
     retryMax = job.getInt("db.fetch.retry.max", 3);
     scfilters = new ScoringFilters(job);
+    additionsAllowed = job.getBoolean(CrawlDb.CRAWLDB_ADDITIONS_ALLOWED, true);
   }
 
   public void close() {}
@@ -74,6 +76,9 @@
       }
     }
 
+    // if it doesn't already exist, skip it
+    if (old == null && !additionsAllowed) return;
+    
     // initialize with the latest version
     result.set(highest);
     if (old != null) {
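
With the patch applied, additions can also be disabled globally rather than per run, by overriding the new property (a sketch; the override would go in conf/nutch-site.xml):

  <property>
    <name>db.update.additions.allowed</name>
    <value>false</value>
  </property>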

Re: How to limit nutch to fetch, refetch and index just the injected URLs?

Sami Siren-2
Nicolás Lichtmaier wrote:

> I've "backported" revision 450799 to the 0.8.x branch to support
> "-noAdditions". Perhaps you could consider committing it there... (I
> haven't tested it yet, though.)

Can you please create a JIRA issue for this and attach the patch there?

--
 Sami Siren

Re: How to limit nutch to fetch, refetch and index just the injected URLs?

Nicolás Lichtmaier

>> I've "backported" revision 450799 to the 0.8.x branch to support
>> "-noAdditions". Perhaps you could consider committing it there... (I
>> haven't tested it yet, though.)
>
> Can you please create a JIRA issue for this and attach the patch there?
>

Done. It's NUTCH-438 (https://issues.apache.org/jira/browse/NUTCH-438).


Re: How to limit nutch to fetch, refetch and index just the injected URLs?

Rajasekar Karthik
In reply to this post by Nicolás Lichtmaier
I had this same problem - I just used depth 1 to fetch and index the injected URLs. As far as refetching goes, you might have to use "updatedb" with "-noAdditions". Another solution (not that efficient, but it works) is to restart the crawl process with the same injected URLs and discard the old dir/segment.
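
A depth-1 crawl along those lines would look something like this (a sketch; the urls/ seed directory and the crawl output directory are illustrative):

  bin/nutch crawl urls -dir crawl -depth 1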

Nicolás Lichtmaier wrote
I'd like to limit Nutch to fetching, refetching and indexing just the injected
URLs. Will setting db.max.outlinks.per.page to 0 enable me to do that?
If not... how could I achieve what I'm looking for?

Thanks!