Difficult crawling

Difficult crawling

germanbio
Hi people,

I'm fairly new to Nutch and I'm starting to use it on a production site. My
problem comes when I define the seed list with about 20 URLs. I need the
crawler to explore all of them and also to discover new ones. That part
works fine: the crawler finds new URLs and explores them. I run the crawl
with a depth of 20 and a topN of 1000. The problem is that every crawl run
hits a time or URL limit and never finishes crawling the seed list. Because
of this, the current index contains a lot of newly discovered URLs, but the
core seed sites are incomplete. If I repeat the crawl it of course picks up
the unfetched links from the previous level, but it never manages to cover
my seed list completely.

Could someone suggest a way to run the kind of crawl I need? Perhaps the
best option would be to crawl the seed list completely before going further
with the discovered URLs; is there a way to do that? Another approach might
be to run two Nutch instances: one for the seed list, to cover those sites
completely, and another for the external URLs, and afterwards merge the
segments and invert the links.
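
For reference, that two-crawl-and-merge sequence might look roughly like
this with the Nutch 1.x command line (directory names and topN values are
placeholders; each phase would run from its own Nutch install so it can
have its own crawl-urlfilter.txt):

    # Phase 1: only the ~20 seed sites, with crawl-urlfilter.txt
    # restricted to their domains, deep enough to cover them completely.
    bin/nutch crawl urls/seeds -dir crawl-seeds -depth 20 -topN 50000

    # Phase 2: a wider crawl that is allowed to leave the seed domains.
    bin/nutch crawl urls/seeds -dir crawl-external -depth 5 -topN 100000

    # Combine the results: merge the segments (segment names are
    # timestamps, so glob them), then rebuild the link database.
    bin/nutch mergesegs crawl-merged/segments crawl-seeds/segments/* crawl-external/segments/*
    bin/nutch invertlinks crawl-merged/linkdb -dir crawl-merged/segments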

Could someone give me a tip about this?
Thanks a lot,
Germán

RE: Difficult crawling

Rob Hunter
Hi Germán,

I think you've already picked up on the best ways to do what you want to do. I'd probably limit your first crawl to just your 20 sites using crawl-urlfilter.txt, and then (if you want external sites) move outside of that with a larger, unbounded crawl. The other thing you could do is increase your topN; in my experience, 1000 is pretty tiny. A depth of 20 suggests to me that your goal is to move outside of your 20 initial sites, and once you do that the frontier explodes: if the set of domains you touch grows by a factor of 20 at each level, by level 3 you're already at 8000 (20^3) domains, so you've outstripped your topN by leaps and bounds before you're a quarter of the way through.
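
To illustrate, a crawl-urlfilter.txt for the seeds-only phase could look
like the following (the domain names are placeholders, one + pattern per
seed site):

    # skip file:, ftp: and mailto: urls
    -^(file|ftp|mailto):

    # skip some common binary file types
    -\.(gif|jpg|png|ico|css|zip|gz|exe)$

    # accept only hosts in the seed domains
    +^http://([a-z0-9]*\.)*example-seed-one.com/
    +^http://([a-z0-9]*\.)*example-seed-two.org/

    # reject everything else
    -.

For the second, wider phase you'd swap in a copy with looser + rules and a
much larger topN.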

Hope this helps,
Rob

Re: Difficult crawling

Julien Nioche
In reply to this post by germanbio
Hi Germán,

The best way of doing this would be to track the depth of each URL from
injection onwards and give priority to the lowest depths during the
generate step. You can do that by writing a custom scoring filter, i.e.
implementing the ScoringFilter interface
(http://nutch.apache.org/apidocs-1.2/org/apache/nutch/scoring/ScoringFilter.html).
Not trivial, but that would definitely work.
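
A rough sketch of that idea is below. The package, class name and the
"_depth_" metadata key are invented for the example; the method signatures
follow the Nutch 1.2 ScoringFilter javadoc linked above and should be
verified against the Nutch version actually in use.

    package org.example.nutch.scoring; // hypothetical package

    import java.util.Collection;
    import java.util.List;
    import java.util.Map.Entry;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.indexer.NutchDocument;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.parse.ParseData;
    import org.apache.nutch.protocol.Content;
    import org.apache.nutch.scoring.ScoringFilter;
    import org.apache.nutch.scoring.ScoringFilterException;

    /** Tracks the depth of every URL from injection and makes the
     *  generator prefer low-depth URLs, so seed sites are fetched
     *  before newly discovered ones. */
    public class DepthPriorityScoringFilter implements ScoringFilter {

      private static final String DEPTH_KEY = "_depth_"; // hypothetical key
      private static final Text DEPTH_KEY_W = new Text(DEPTH_KEY);

      private Configuration conf;

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }

      /** Injected seeds start at depth 0. */
      public void injectedScore(Text url, CrawlDatum datum) {
        datum.getMetaData().put(DEPTH_KEY_W, new IntWritable(0));
      }

      public void initialScore(Text url, CrawlDatum datum) {
        datum.setScore(0.0f);
      }

      /** Lower depth => higher sort value => generated earlier. */
      public float generatorSortValue(Text url, CrawlDatum datum, float initSort) {
        IntWritable depth = (IntWritable) datum.getMetaData().get(DEPTH_KEY_W);
        int d = (depth == null) ? Integer.MAX_VALUE : depth.get();
        return initSort + 1.0f / (1.0f + d);
      }

      /** Carry the depth along through fetch and parse metadata. */
      public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) {
        IntWritable depth = (IntWritable) datum.getMetaData().get(DEPTH_KEY_W);
        if (depth != null) content.getMetadata().set(DEPTH_KEY, depth.toString());
      }

      public void passScoreAfterParsing(Text url, Content content, Parse parse) {
        String depth = content.getMetadata().get(DEPTH_KEY);
        if (depth != null) parse.getData().getParseMeta().set(DEPTH_KEY, depth);
      }

      /** Outlinks inherit the parent's depth plus one. */
      public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData,
          Collection<Entry<Text, CrawlDatum>> targets, CrawlDatum adjust, int allCount)
          throws ScoringFilterException {
        String parentDepth = parseData.getParseMeta().get(DEPTH_KEY);
        int childDepth = (parentDepth == null) ? 1 : Integer.parseInt(parentDepth) + 1;
        for (Entry<Text, CrawlDatum> target : targets) {
          target.getValue().getMetaData().put(DEPTH_KEY_W, new IntWritable(childDepth));
        }
        return adjust;
      }

      /** A fuller version would keep the smallest depth seen for a URL here. */
      public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
          List<CrawlDatum> inlinked) {
        // no-op in this sketch
      }

      public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
          CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) {
        return initScore;
      }
    }

The filter would be packaged as a plugin and activated via plugin.includes
in nutch-site.xml. Later Nutch 1.x releases ship a scoring-depth plugin
built around the same idea, if upgrading is an option.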

HTH

Julien

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

