Recrawling

Recrawling

Andrei Hajdukewycz
Hi,
I've crawled a site of roughly 30,000-40,000 pages using the
bin/nutch crawl command, which went quite smoothly. Now,
however, I'm trying to recrawl it using the script at
http://wiki.apache.org/nutch/IntranetRecrawl?action=show .

However, when I run the recrawl, generally I end up fetching
80-100k pages instead of 30-40k, with many pages fetched more
than once.

I assume this is due to the number of generate+fetch cycles I'm
running, which is 5. I'm looking for advice on settings to tune
this so I end up with fewer duplicate fetches but still full
coverage of the site.

"depth" as per the script is set to 5, topN unspecified, 31
days added to force refetch of everything.

My relevant settings in nutch-site.xml are as follows:
db.ignore.internal.links = false,
db.ignore.external.links = true,
fetcher.server.delay = 1.0,
fetcher.threads.fetch = 3,
fetcher.threads.per.host = 3,
db.default.fetch.interval = 1
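
(For reference, those go into conf/nutch-site.xml as standard Nutch
property entries; a partial sketch showing three of them. Note that
db.default.fetch.interval is measured in days, so a value of 1 makes
every page due for refetch after a day:)

  <configuration>
    <property>
      <name>db.ignore.external.links</name>
      <value>true</value>
      <!-- only follow links within the crawled site -->
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <value>3</value>
    </property>
    <property>
      <name>db.default.fetch.interval</name>
      <!-- measured in days: 1 means everything is due again daily -->
      <value>1</value>
    </property>
  </configuration>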

Any help would be most appreciated!
Andrei

Re: Recrawling

Raghavendra Prabhu
I am not sure, but I think this would be the reason:

When you crawl the site the first time with a specified depth, the
other URLs are discovered as the crawl goes.

But the second time you crawl, those URLs are already in the crawldb,
and the depth is applied relative to all of them.

In both cases the depth is the same, but since depth controls how many
generate/fetch rounds are run from the URLs you start with, the second
crawl will cover more pages, because so many URLs are already there.


Regards,
Prabhu

Re: Recrawling

Andrei Hajdukewycz
In reply to this post by Andrei Hajdukewycz
Another problem I've noticed is that the db grows *rapidly* with each successive recrawl. Mine started at 379MB, and it increases by roughly 350MB every time I run a recrawl, despite there not being anywhere near that many additional pages.

This seems like a pretty severe problem, honestly; there's obviously a lot of duplicated data in the segments.
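
As far as I can tell, each pass writes a brand-new segment and nothing
deletes the old ones, so segments/ keeps accumulating. If your build
includes the SegmentMerger tool, something along these lines might
collapse them (an untested sketch; verify that mergesegs exists in
your Nutch version):

  # merge all segments into one, keeping only the most recent
  # record per URL, then swap the merged output into place
  bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
  rm -rf crawl/segments
  mv crawl/MERGEDsegments crawl/segments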


Re: Recrawling

Tomi N/A
I have the same problem: my index grew from 1.5GB after the original
crawl to over 5GB(!) after the recrawl. From the looks of it, I might
as well crawl anew every time. :\
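
For what it's worth, the wiki script's dedup-and-merge stage after
indexing is what is supposed to keep duplicates out of the final
index; a sketch of that stage (Nutch 0.8-era commands, paths
illustrative), in case yours is skipping it:

  # drop duplicate documents across the per-segment indexes, then
  # merge them into the single index the searcher reads
  # (assumes crawl/index does not already exist)
  bin/nutch dedup crawl/indexes
  bin/nutch merge crawl/index crawl/indexes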

t.n.a.

Re: Recrawling

ytthet
Folks,

Have you found any solutions?

I am facing the same issue.

Thanks,

YT. Thet