Crawler Behavior (2 questions)

Crawler Behavior (2 questions)

Ian Reardon
I have been crawling rather large sites (larger than 10k pages) with
the crawl command.  It seems like it crawls all the pages twice.  Is
that normal?  I thought it was just removing the segments, but it looks
like it crawls all the pages, does some update to the DB, and then
crawls them again.  If anyone could shed some light on this I would
appreciate it.

2nd Question.  Is there a way to limit a crawl by number of pages
rather than depth?  I would like to limit a crawl to, say, 100 pages,
1000 pages, or whatever.  I could brute force it by writing a script to
look at the logs and then kill the crawler, but I'd rather not take
that approach.

Thanks.

Ian

Re: Crawler Behavior (2 questions)

Andy Liu-3
If you download the most recent version of Nutch from SVN, the newer
CrawlTool doesn't fetch pages twice.

As far as limiting the number of pages to crawl, you can use the -topN
flag when generating your segments.
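
To illustrate the idea behind -topN, here is a minimal sketch in Python.
It is not Nutch's actual code: select_top_n, the candidate list, and the
scores are all hypothetical. The assumption is that each generate round
keeps only the N highest-scoring candidate URLs, so the cap applies per
generated segment rather than to the crawl as a whole.

```python
import heapq


def select_top_n(candidates, top_n):
    """Return up to top_n (url, score) pairs, highest score first.

    candidates is a list of (url, score) tuples; both the tuple layout
    and the scores are assumptions made for this illustration.
    """
    return heapq.nlargest(top_n, candidates, key=lambda pair: pair[1])


# Hypothetical candidates that a generate step might consider.
candidates = [
    ("http://example.com/", 1.0),
    ("http://example.com/a", 0.4),
    ("http://example.com/b", 0.7),
    ("http://example.com/c", 0.1),
]

# Keep only the two best-scoring URLs for this round's fetch list.
fetch_list = select_top_n(candidates, 2)
print(fetch_list)
```

If the cap really is per round, as assumed here, a crawl that runs
several generate/fetch rounds can fetch up to roughly N pages in each
round, so the overall total also depends on the number of rounds.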

Andy

On 5/26/05, Ian Reardon <[hidden email]> wrote:

> I have been crawling rather large sites (larger than 10k pages) with
> the crawl command.  It seems like it crawls all the pages twice.  Is
> that normal?  I thought it was just removing the segments, but it looks
> like it crawls all the pages, does some update to the DB, and then
> crawls them again.  If anyone could shed some light on this I would
> appreciate it.
>
> 2nd Question.  Is there a way to limit a crawl by number of pages
> rather than depth?  I would like to limit a crawl to, say, 100 pages,
> 1000 pages, or whatever.  I could brute force it by writing a script to
> look at the logs and then kill the crawler, but I'd rather not take
> that approach.
>
> Thanks.
>
> Ian
>

Re: Crawler Behavior (2 questions)

Sundaramoorthy Kannan
Hi,
If I need to exclude some parts of a web page from being indexed, how
can I do it? As I understand it, the DOMContentUtils class of the HTML
parser plugin currently ignores only SCRIPT, STYLE and comment text. Can
I configure it to exclude some other tags too?
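
The behavior being asked for can be sketched outside of Nutch. The
Python example below is only an illustration of the technique and does
not use DOMContentUtils or any Nutch API: TextExtractor, extract_text,
and the skip_tags parameter are hypothetical names. It collects the text
of a page while ignoring everything inside a configurable set of
container tags, and it ignores comments as well.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect page text, skipping text inside the given container tags."""

    def __init__(self, skip_tags=("script", "style")):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.depth = 0      # how many skipped elements we are currently inside
        self.chunks = []    # text fragments that will be kept

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped by default.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, skip_tags=("script", "style")):
    """Return the text of html with the content of skip_tags removed."""
    parser = TextExtractor(skip_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><body><nav>Home | About</nav><p>Real content</p>"
        "<script>var x = 1;</script><!-- note --></body></html>")

# Also exclude the navigation block from the extracted text.
print(extract_text(page, skip_tags=("script", "style", "nav")))
```

The sketch assumes the skipped tags are containers with a closing tag;
a void element such as BR in skip_tags would leave the depth counter
unbalanced.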

Thanks,
Kannan
On Thu, 2005-05-26 at 15:34 -0400, Andy Liu wrote:

> If you download the most recent version of Nutch from SVN, the newer
> CrawlTool doesn't fetch pages twice.
>
> As far as limiting the number of pages to crawl, you can use the -topN
> flag when generating your segments.
>
> Andy