Absolute depth for recrawling

Alexandre
Hey all!

We have been experimenting with Nutch for a few days, and today we ran into the following issue.

Our very simple test "URL structure" looks like this:
index.html  ->  1.1.html  ->  1.1.1.html  ->  1.1.1.1.html  ->  1.1.1.1.1.html

We start a crawl on index.html (the only page in the seed list) with a depth of 3. In this case the first three pages (index.html, 1.1.html and 1.1.1.html) are crawled and indexed, which is absolutely fine.
We then start a second crawl (a recrawl) with the same depth and the same crawl db, and this time all of the pages (including 1.1.1.1.html and 1.1.1.1.1.html) are crawled. Nutch seems to also use the pages indexed in the first crawl (such as 1.1.1.html) as starting points for crawling.

In our case we'd like to force Nutch to always crawl only pages within a depth of 3 from the real seed page, which is index.html here. Is there any way to do this?

We have already tried the '-noAdditions' option of 'updatedb', as mentioned in the wiki (http://wiki.apache.org/nutch/IntranetRecrawl), but the result is that only the first URL (index.html) is crawled.
In addition, we are afraid that new URLs (for example, if we now add 1.2.html as a link on index.html) would not be crawled either.

Thanks a lot in advance!

Re: Absolute depth for recrawling

Julien Nioche-4
Salut Alexandre,

The use of the term 'depth' in the crawl tool is very misleading. What it
means is the number of rounds of generate/fetch/parse/update, and it has
nothing to do with the actual logical depth from a start seed.
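
The difference between rounds and logical depth can be reproduced with a small toy model. This is only a sketch, not Nutch code: the `crawl` function, the `LINKS` table, and the dictionary standing in for the crawl db are all hypothetical, and the model assumes that a fetched page is never due for refetching.

```python
# Toy model of the generate/fetch/parse/update cycle. NOT Nutch code.
# LINKS mirrors the test chain from the original post.
LINKS = {
    "index.html": ["1.1.html"],
    "1.1.html": ["1.1.1.html"],
    "1.1.1.html": ["1.1.1.1.html"],
    "1.1.1.1.html": ["1.1.1.1.1.html"],
    "1.1.1.1.1.html": [],
}


def crawl(crawl_db, rounds):
    """Run `rounds` cycles against crawl_db (url -> fetched flag).

    Returns the URLs fetched during this call, in fetch order.
    """
    fetched_now = []
    for _ in range(rounds):
        # generate: select every known URL that has not been fetched yet
        batch = [url for url, fetched in crawl_db.items() if not fetched]
        for url in batch:
            # fetch + parse: mark the page as fetched
            crawl_db[url] = True
            fetched_now.append(url)
            # update: add newly discovered outlinks to the crawl db
            for outlink in LINKS.get(url, []):
                crawl_db.setdefault(outlink, False)
    return fetched_now


crawl_db = {"index.html": False}     # the seed list
first = crawl(crawl_db, rounds=3)    # first crawl, "depth" 3
second = crawl(crawl_db, rounds=3)   # recrawl on the same crawl db
print(first)   # ['index.html', '1.1.html', '1.1.1.html']
print(second)  # ['1.1.1.1.html', '1.1.1.1.1.html']
```

The second call starts from whatever is still unfetched in the crawl db, not from the seed, which matches the behaviour described in the original post.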

You can limit the depth of a crawl using the patch from
https://issues.apache.org/jira/browse/NUTCH-1331.
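
One way to picture an absolute limit is to store each URL's distance from the seed and to drop outlinks that would exceed the limit. The sketch below illustrates that idea in a toy model only; this thread does not describe how the NUTCH-1331 patch works internally, and `crawl`, `LINKS`, `MAX_DEPTH`, and the convention that the seed has depth 1 are all assumptions made for the example.

```python
# Toy sketch of depth limiting. An illustration of the idea only,
# NOT the NUTCH-1331 patch and NOT Nutch code.
LINKS = {
    "index.html": ["1.1.html"],
    "1.1.html": ["1.1.1.html"],
    "1.1.1.html": ["1.1.1.1.html"],
    "1.1.1.1.html": ["1.1.1.1.1.html"],
    "1.1.1.1.1.html": [],
}
MAX_DEPTH = 3  # hypothetical limit; the seed counts as depth 1 here


def crawl(crawl_db, rounds, max_depth=MAX_DEPTH):
    """crawl_db maps url -> {"fetched": bool, "depth": int}."""
    fetched_now = []
    for _ in range(rounds):
        batch = [u for u, e in crawl_db.items() if not e["fetched"]]
        for url in batch:
            entry = crawl_db[url]
            entry["fetched"] = True
            fetched_now.append(url)
            child_depth = entry["depth"] + 1
            if child_depth > max_depth:
                continue  # outlinks would exceed the limit: drop them
            for outlink in LINKS.get(url, []):
                known = crawl_db.get(outlink)
                if known is None:
                    crawl_db[outlink] = {"fetched": False, "depth": child_depth}
                else:
                    # keep the shortest known distance from the seed
                    known["depth"] = min(known["depth"], child_depth)
    return fetched_now


crawl_db = {"index.html": {"fetched": False, "depth": 1}}
first = crawl(crawl_db, rounds=3)
second = crawl(crawl_db, rounds=3)   # recrawl: nothing beyond depth 3

# Later, a new link is added to index.html and every page is due again
# (the toy stand-in for a refetch interval having elapsed).
LINKS["index.html"].append("1.2.html")
for entry in crawl_db.values():
    entry["fetched"] = False
third = crawl(crawl_db, rounds=3)

print(first)   # ['index.html', '1.1.html', '1.1.1.html']
print(second)  # []
print(third)   # ['index.html', '1.1.html', '1.1.1.html', '1.2.html']
```

In this model the recrawl stays within three levels of the seed, while a new link such as 1.2.html is still picked up once index.html is fetched again.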

BTW I'd use the new script in the SVN trunk instead of the all-in-one crawl
command, as it gives more control and a better understanding of what happens.

HTH

Julien

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Absolute depth for recrawling

Alexandre
Salut Julien,

Thanks for your reply.
This plugin (https://issues.apache.org/jira/browse/NUTCH-1331) is exactly what I needed.

I've tested it and it's working very well.

But I still have an issue, or perhaps a misunderstanding, regarding the generated segments and recrawling.
I will create a new subject for that.



Alex.

Re: Absolute depth for recrawling

Julien Nioche-4
Please ask on the user list before opening a new JIRA issue; it is not necessarily a bug.
Thanks!


Re: Absolute depth for recrawling

Alexandre
Sorry, I meant a new subject in this forum, not a JIRA ticket.
See:
http://lucene.472066.n3.nabble.com/Recrawling-and-segment-cleanup-td4008865.html