Generic Question about initial seed

Generic Question about initial seed

BBrown-2
This is kind of a generic question. Are there any stats on how many pages
will get crawled based on some initial seed?  For example, if you seed the
list from dmoz, how many pages will get indexed?  Let's say there are 4
million seed URLs: will only those 4 million get indexed?

Or let's say I have 4,000: will I get 30,000 crawled/indexed pages?

--
Berlin Brown
[berlin dot brown at gmail dot com]
http://botspiritcompany.com/botlist/?

Re: Generic Question about initial seed

BBrown-2

I am sorry, let's say I give an average depth of 3.  I am asking because I
have these article pages (blogs, news articles), about 8,000 of them, and I
want to have Nutch crawl them on a regular basis, but I would like to have an
idea of how many pages will end up in the index.

--
Berlin Brown
[berlin dot brown at gmail dot com]
http://botspiritcompany.com/botlist/?

Re: Generic Question about initial seed

Sean Dean-3
In reply to this post by BBrown-2
There is a command to show stats on your database of links. It will show what
has been fetched (if any) and what is still waiting to be fetched.

Keep in mind, though, that if a page cannot be retrieved during the fetch it
will not be indexed, so treat this number only as an estimate of the final
indexed amount.

The command is below; it can take minutes or even hours to complete, depending
on the size of your database.

"bin/nutch readdb [path to crawldb] -stats"
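As a concrete illustration (the directory names here are only an assumption;
use whatever you passed to your own crawl), if the crawl was created with
something like "bin/nutch crawl urls -dir crawl -depth 3", the crawldb sits
under crawl/crawldb and the stats command becomes:

  bin/nutch readdb crawl/crawldb -stats

The output reports the total number of URLs in the db along with the fetched
and unfetched counts, which is the number to use for the estimate.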

Re: Generic Question about initial seed

Dennis Kubes
In reply to this post by BBrown-2
In the beginning it is approximately 10 to 1.  So for every page I crawl
I will get 10 more pages to crawl that are not currently in the index.
As you move towards 50 million pages it becomes more like 6 to 1.  If
you seed the entire dmoz, your first crawl will be around 5.5 million
pages.  Your second crawl will be around 54 million pages.  And a depth
of 3 will give you over 300 million pages.  These are the numbers that
we are currently seeing.

Dennis Kubes
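Translating those ratios into a rough figure for the 8,000-seed, depth-3 case
mentioned earlier, and assuming the early ~10-to-1 ratio roughly holds at that
scale (it will not hold exactly; duplicates, filters and failed fetches all
pull it down), the back-of-envelope arithmetic is just:

  seeds=8000; ratio=10
  echo $(( seeds * ratio ))                            # ~80,000 URLs fetched in the 2nd round
  echo $(( seeds * ratio * ratio ))                    # ~800,000 URLs fetched in the 3rd round
  echo $(( seeds + seeds*ratio + seeds*ratio*ratio ))  # ~888,000 URLs total after depth 3

So the crawldb would hold on the order of several hundred thousand URLs, of
which only the successfully fetched ones end up indexed.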


Re: Generic Question about initial seed

Andrzej Białecki-2
Dennis Kubes wrote:
> In the beginning it is approximately 10 to 1.  So for every page I crawl
> I will get 10 more pages to crawl that are not currently in the index.
> As you move towards 50 million pages it becomes more like 6 to 1.  If
> you seed the entire dmoz, your first crawl will be around 5.5 million
> pages.  Your second crawl will be around 54 million pages.  And a depth
> of 3 will give you over 300 million pages.  These are the numbers that
> we are currently seeing.

Be advised, though, that any crawl run that collects more than 1 million
pages is bound to collect a LOT of utter junk and spam, unless you tightly
control the quality of URLs using URLFilters, ScoringFilters and other
means.
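The simplest of those controls is the regex URL filter. With the
urlfilter-regex plugin enabled, Nutch applies the +/- regular-expression rules
in conf/regex-urlfilter.txt (the one-step crawl tool reads
conf/crawl-urlfilter.txt instead), and the first rule that matches a URL
decides whether it is kept. The patterns below are only a sketch of the
format, not a recommended policy, and example.com stands in for whatever
sites were actually seeded:

  # skip image and other binary suffixes that only bloat the crawldb
  -\.(gif|jpg|png|zip|exe)$
  # skip URLs containing characters that usually mean session ids or query junk
  -[?*!@=]
  # stay within the seeded article sites
  +^http://([a-z0-9]*\.)*example\.com/
  # reject everything else
  -.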


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com