What are the side effects of running crawl multiple times?

Paolo Castagna-2
Crawl.java:

line 84    if (fs.exists(dir)) {
line 85      throw new RuntimeException(dir + " already exists.");
line 86    }

What are the side effects of removing those lines from Crawl.java?

Re: What are the side effects of running crawl multiple times?

Andrzej Białecki-2
Paolo Castagna wrote:
> Crawl.java:
>
> line 84    if (fs.exists(dir)) {
> line 85      throw new RuntimeException(dir + " already exists.");
> line 86    }
>
> What are the side effects of removing those lines from Crawl.java?
>
>

Hi,

These lines were put there to warn users off running the crawl tool
several times on the same directory. Doing so would cause incremental
updates to the crawldb, not only at the generation stage (which is a
natural occurrence anyway), but also at the injection stage, i.e. the
tool could inject the same urls many times into the same crawldb.

Again, there is nothing strange about doing this, with one small
exception ;) In previous releases of Nutch, there was a bug in the
injector that would reset the status of urls in the crawldb if the same
url was injected again. This bug has now been fixed, so I think it's
safe to remove these lines from Crawl.java.
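
To make the difference concrete, here is a minimal sketch of the two
injection behaviours described above. It is not Nutch's actual injector
code: the crawldb is modelled as a plain in-memory map, and the class,
method, and status names (InjectSketch, injectResetting,
injectPreserving, Status) as well as the example.com urls are
hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InjectSketch {
    // Hypothetical statuses; these are not Nutch's real status constants.
    public enum Status { UNFETCHED, FETCHED }

    // Old behaviour (the bug): a re-injected url overwrites the stored
    // entry, so its status is reset to UNFETCHED.
    public static void injectResetting(Map<String, Status> crawlDb, List<String> seeds) {
        for (String url : seeds) {
            crawlDb.put(url, Status.UNFETCHED);
        }
    }

    // Fixed behaviour: an existing entry is left alone and only urls
    // that are not yet in the crawldb are added.
    public static void injectPreserving(Map<String, Status> crawlDb, List<String> seeds) {
        for (String url : seeds) {
            crawlDb.putIfAbsent(url, Status.UNFETCHED);
        }
    }

    public static void main(String[] args) {
        String url = "http://example.com/a";
        List<String> seeds = Arrays.asList(url);

        // First crawl: the url was injected and then fetched.
        Map<String, Status> crawlDb = new HashMap<>();
        crawlDb.put(url, Status.FETCHED);

        // Second crawl on the same directory injects the same seed again.
        injectPreserving(crawlDb, seeds);
        System.out.println(crawlDb.get(url)); // prints FETCHED

        injectResetting(crawlDb, seeds);
        System.out.println(crawlDb.get(url)); // prints UNFETCHED
    }
}
```

With the preserving variant, running the crawl tool again on the same
directory only adds urls that are new, which is why repeated injection
is harmless once the status reset is gone.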


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com