Stats?


Stats?

Paul Stewart-5
Hi folks...



Is there a way to retrieve stats from Nutch - meaning how many web pages
are indexed, how many are still to be indexed, etc.?



When I was working with AspSeek and Mnogosearch in the past, I could run
a command to see stats.



Thanks again,



Paul







Re: Stats?

Susam Pal
Try this command:

 bin/nutch readdb crawl/crawldb -stats

To get help, run it without arguments:

bin/nutch readdb
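
If you want to keep an eye on those numbers over time, a rough sketch of
a nightly cron job would be something like the following (the install
path, crawl directory and log file are just examples, not anything Nutch
provides itself):

 # record the crawldb stats every night at 01:00
 0 1 * * * cd /path/to/nutch && bin/nutch readdb crawl/crawldb -stats >> crawldb-stats.log 2>&1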

Regards,
Susam Pal


Limiting Crawl Time

Paul Stewart-5
Hi folks...

What is the best way to limit crawling to, say, 3-4 hours per day?
Is there a way to do this?

Right now, I have a crawl depth of 6 and a maximum of 100 pages per site.
I thought this would keep things fairly small, but during some test crawls
my last crawl took 2.5 days to complete:

Statistics for CrawlDb: crawl/crawldb
TOTAL urls:     1566612
retry 0:        1549310
retry 1:        12814
retry 2:        1601
retry 3:        2887
min score:      0.0
avg score:      0.037
max score:      429.15
status 1 (db_unfetched):        1021400
status 2 (db_fetched):  446907
status 3 (db_gone):     74420
status 4 (db_redir_temp):       13861
status 5 (db_redir_perm):       10024
CrawlDb statistics: done


What I would like to do is crawl for 3-4 hours per day at most, to
gradually fill the index. Thoughts?

Thanks very much,

Paul






Re: Limiting Crawl Time

Susam Pal
Did you try specifying a topN value? -depth 3 -topN 1000 should be
close to what you want.
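
For example, with the one-shot crawl tool that would be something like
this (a sketch; "urls" and "crawl" are the usual tutorial directory
names):

 bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

If you want to keep topping up an existing crawldb a little each day, it
is usually easier to run the step-by-step cycle with a -topN cap from a
small script (again a sketch, with assumed paths):

 # generate at most 1000 of the best-scoring unfetched URLs
 bin/nutch generate crawl/crawldb crawl/segments -topN 1000

 # pick up the segment that was just created
 segment=`ls -d crawl/segments/2* | tail -1`

 # fetch it and fold the results back into the crawldb
 bin/nutch fetch $segment
 bin/nutch updatedb crawl/crawldb $segment

Run that once a day from cron and size -topN so that one cycle finishes
within your 3-4 hour window.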


RE: Limiting Crawl Time

Paul Stewart-5
Thanks - perhaps I misunderstand the -depth and -topN options.

My understanding of the depth option is that Nutch will only go X levels
deep into the URLs to find pages - if I change that depth later, does
that mean it will go deeper at a later point in time? I thought it would
continue ignoring URLs beyond that depth, even once it was told a higher
depth. In other words, if I run a crawl with a depth of 2, then a week
later run a depth of 4, and then perhaps a couple of weeks later run a
depth of 6, will that work?

Finally, the topN option - does that mean it only selects the 1000
"best" URLs for this *particular* crawl, and in the *next* crawl picks
another 1000 to fetch?

With both of these options I was under the impression that large chunks
of the websites would never get crawled, no matter how many times I went
back and re-crawled. Is that right?

Thanks very much for the clarification...

Paul



Re: Limiting Crawl Time

Susam Pal
My replies inline.

On Feb 6, 2008 7:58 PM, Paul Stewart <[hidden email]> wrote:
> Thanks - perhaps I misunderstand the -depth and -topN options.
>
> My understanding of the depth option is that Nutch will only go X levels
> deep into the URLs to find pages - if I change that depth later, does
> that mean it will go deeper at a later point in time?

Depth here refers to how many times you generate new URLs from the
crawldb and fetch them. At depth i, Nutch generates a list of URLs from
the crawldb, based on the URLs discovered in the pages fetched in the
previous cycle (depth i - 1), and fetches them. Those pages in turn
contain URLs to new pages, which will be fetched at depth i + 1.

> I thought it would continue ignoring URLs beyond that depth, even once
> it was told a higher depth. In other words, if I run a crawl with a
> depth of 2, then a week later run a depth of 4, and then perhaps a
> couple of weeks later run a depth of 6, will that work?

Yes, it should work. The URLs that have been discovered but not yet
fetched stay in the crawldb (that is the db_unfetched count in your
stats), so a later run with a higher depth will generate and fetch them.

>
> Finally, the topN option - does that mean it only selects the 1000
> "best" URLs for this *particular* crawl, and in the *next* crawl picks
> another 1000 to fetch?

A topN value of 1000 selects the top 1000 URLs for this particular
crawl. For the next crawl, the top 1000 URLs are again generated from
whatever is in the crawldb at that point.
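So, for example, with -depth 3 -topN 1000 a single run fetches at most
3 x 1000 = 3000 pages, which puts a rough upper bound on how long each
run can take.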

Regards,
Susam Pal
