Pages per second on EC2?

Pages per second on EC2?

Otis Gospodnetic-2
Hi,

I'm trying to do some basic calculations to figure out what, in terms of
time, resources, and cost, it would take to crawl 500M URLs.
The obvious environment for this is EC2, so I'm wondering what people are seeing
in terms of fetch rate there these days? 50 pages/second? 100? 200?


Here's a time calculation that assumes 100 pages/second per EC2 instance:

  100*60*60*12 = 4,320,000 URLs/day per EC2 instance

The 12 means 12 hours of fetching per day: the last time I used Nutch, I recall
about half of the time being spent in updatedb, generate, and other non-fetching steps.

If I have 20 servers fetching URLs, that's:

  100*60*60*12*20 = 86,400,000 URLs/day   -- this is starting to sound too good to be true

Then to crawl 500M URLs:

  500000000/(100*60*60*12*20) = 5.78 days  -- that's less than 1 week

Suspiciously short, isn't it?
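
The back-of-the-envelope estimate above can be written as a small Python helper, which makes it easy to vary the assumptions. This is only a sketch: the function name and parameters are made up for illustration, and the 100 pages/second and 12 fetching hours per day are the guesses from this post, not measurements.

```python
def crawl_days(total_urls, pages_per_sec, fetch_hours_per_day, servers):
    """Days needed to fetch total_urls at a constant per-server fetch rate."""
    urls_per_day = pages_per_sec * 3600 * fetch_hours_per_day * servers
    return total_urls / urls_per_day


if __name__ == "__main__":
    # Assumed inputs from the post: 100 pages/s, 12 fetching hours/day, 20 servers.
    print(f"{crawl_days(500_000_000, 100, 12, 20):.2f} days")
```

Halving the per-server rate or the number of fetching hours doubles the result, so the estimate is only as good as those two guesses.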

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


Re: Pages per second on EC2?

Julien Nioche-4
Hi Otis,

> I'm trying to do some basic calculations trying to figure out what, in terms of
> time, resources, and cost, it would take to crawl 500M URLs.
> The obvious environment for this is EC2, so I'm wondering what people are seeing
> in terms of fetch rate there these days? 50 pages/second? 100? 200?

That depends mostly on the distribution of URLs per host, whether you have a DNS
cache, etc. Using large instances, you can start with a conservative
estimate of 125K URLs fetched per node per hour.


> Here's a time calculation that assumes 100 pages/second per EC2 instance:
>
>  100*60*60*12 = 4,320,000 URLs/day per EC2 instance
>
> That 12 means 12 hours because last time I used Nutch I recall about half of
> the time being spent in updatedb, generate, and other non-fetching steps.

The time spent in generate and update is proportional to the size of the
crawldb. It might take half the time at one point, but it will take more than
that as the crawldb grows. The best option would probably be to generate
multiple segments in one go (see the options for the Generator), fetch all the
segments one by one, then merge them into the crawldb with a single call to update.
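
To see why batching helps, here is a toy cost model in Python. It assumes, as stated above, that one generate or update pass costs time proportional to the crawldb size. The constant `pass_cost_per_million_urls` and all numbers are hypothetical, crawldb growth between passes is ignored, and generating several segments at once is assumed to cost the same as generating one.

```python
def overhead_minutes(crawldb_millions, segments, batched,
                     pass_cost_per_million_urls=0.5):
    """Non-fetching time for `segments` segments under a toy linear cost model.

    One generate pass or one update pass over the crawldb is assumed to cost
    pass_cost_per_million_urls minutes per million URLs (a made-up constant).
    """
    one_pass = crawldb_millions * pass_cost_per_million_urls
    if batched:
        # One generate producing all segments, then one update merging them all.
        passes = 2
    else:
        # A generate and an update around every single segment.
        passes = 2 * segments
    return passes * one_pass


if __name__ == "__main__":
    for batched in (False, True):
        print(batched, overhead_minutes(500, segments=10, batched=batched))
```

In this model the overhead of the batched variant does not depend on the number of segments, which is the whole point of the suggestion.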


> If I have 20 servers fetching URLs, that's:
>
>  100*60*60*12*20 = 86,400,000 URLs/day   -- this is starting to sound too good to be true
>
> Then to crawl 500M URLs:
>
>  500000000/(100*60*60*12*20) = 5.78 days  -- that's less than 1 week
>
> Suspiciously short, isn't it?

It also depends on the rate at which new URLs are discovered, and hence on
your seed list.

You will also inevitably hit slow servers, which will have an impact on the
fetch rate, although not as badly as before the introduction of the timeout on
fetching. The main point is that you will definitely get plenty of new
URLs to fetch, but you will need to pay attention to the *quality* of what is
fetched. Unless you are dealing with a limited number of target hosts, you
will inevitably get loads of porn if you crawl in the open, and adult URLs
(mostly redirections to other porn sites) will quickly take over your
crawldb. As a result, your crawl will just be churning through URLs generated
automatically by adult sites, and although your crawldb will
contain loads of URLs, very few of them will be useful.

Anyway, it's not just a matter of pages per second. Doing large, open crawls
brings up a lot of interesting challenges :-)

HTH

Julien


--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Pages per second on EC2?

kkrugler
In reply to this post by Otis Gospodnetic-2
Hi Otis,

> I'm trying to do some basic calculations trying to figure out what, in terms of
> time, resources, and cost, it would take to crawl 500M URLs.

I can't directly comment on Nutch, but we recently did something  
similar to this (563M pages via EC2) using Bixo.

Since we're also using a sequence file to store the crawldb, the  
update time should be comparable, but we ran only one loop (since we  
started with a large set of known URLs).

Some parameters that obviously impact crawl performance:

* A default crawl delay of 15 seconds
* Batching (via keep-alive) 50 URLs per connection per IP address.
* 500 fetch threads/server (250 per each of two reducers per server)
* Crawling 1.7M domains
* Starting with about 1.2B known links
* Running in "efficient" mode - skip batches of URLs that can't be  
fetched due to politeness.
* Fetching text, HTML, and image files
* Cluster size of 50 slaves, using m1.large instances (with spot  
pricing)

The results were:

* CPU cost was only $250
* data-in was $2100 ($0.10/GB, and we fetched 21TB)
* Major performance issue was not enough domains with lots of URLs to  
fetch (power curve for URLs/domain)
* total cluster time of 22 hours, fetch time of about 12 hours

We didn't parse the fetched pages, which would have added some  
significant CPU cost.
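
A few derived figures can be computed from the numbers above. The Python sketch below is only arithmetic on the reported values. The per-server rate and the average page size are not reported above; they are derived here, assuming "about 12 hours" means exactly 12 and that 1 TB = 1000 GB.

```python
PAGES = 563_000_000   # pages fetched
SERVERS = 50          # m1.large slaves
FETCH_HOURS = 12      # "about 12 hours" of fetch time (treated as exact)
DATA_TB = 21          # data fetched
PRICE_PER_GB = 0.10   # data-in price in USD


def per_server_rate(pages, servers, hours):
    """Average pages/second per server over the fetch phase."""
    return pages / (servers * hours * 3600)


def data_in_cost(terabytes, price_per_gb):
    """Transfer cost in USD, assuming 1 TB = 1000 GB."""
    return terabytes * 1000 * price_per_gb


def avg_bytes_per_page(terabytes, pages):
    """Mean fetched size per page, assuming 1 TB = 10**12 bytes."""
    return terabytes * 10**12 / pages


if __name__ == "__main__":
    print(f"{per_server_rate(PAGES, SERVERS, FETCH_HOURS):.0f} pages/s per server")
    print(f"${data_in_cost(DATA_TB, PRICE_PER_GB):.0f} data-in")
    print(f"{avg_bytes_per_page(DATA_TB, PAGES) / 1000:.1f} KB per page")
```

The average size includes image files as well as text and HTML, so it should not be read as a typical HTML page size.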

HTH,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Pages per second on EC2?

Otis Gospodnetic-2-2
Hi,


> Hi Otis,

Hi Ken :)

> > I'm trying to do some basic calculations trying to figure out what, in terms of
> > time, resources, and cost, it would take to crawl 500M URLs.
>
> I can't directly comment on Nutch, but we recently did something similar to
> this (563M pages via EC2) using Bixo.

Eh, I was going to go over to Bixo's list and ask if Bixo is suitable for such
wide crawls or whether it's meant more for vertical crawls.  For some reason in
my head I had it in the "for vertical crawl" bucket, but it seems I was wrong,
ha?

> Since we're also using a sequence file to store the crawldb, the update time
> should be comparable, but we ran only one loop (since we started with a large
> set of known URLs).
>
> Some parameters that obviously impact crawl performance:
>
> * A default crawl delay of 15 seconds

That's very polite.  Is a 3-second delay acceptable?

> * Batching (via keep-alive) 50 URLs per connection per IP address.

Does Nutch automatically do this?  I don't recall seeing this setting in Nutch,
but it's been a while...

> * 500 fetch threads/server (250 per each of two reducers per server)
> * Crawling 1.7M domains

Is this because you restricted it to 1.7M domains, or is that how many distinct
domains were in your seed list, or is that how many domains you've discovered
while crawling?

> * Starting with about 1.2B known links

Where did you get that many of them?
Also, if you start with 1.2B known links, how do you end up with just 563M pages
fetched?  Maybe out of 1.2B you simply got to only 563M before you stopped
crawling?

> * Running in "efficient" mode - skip batches of URLs that can't be fetched due
> to politeness.

Don't Nutch and Bixo do this automatically?

> * Fetching text, HTML, and image files
> * Cluster size of 50 slaves, using m1.large instances (with spot pricing)

I've never used spot instances.  Isn't it the case that you can keep using a spot
instance only as long as the price you are paying is adequate?  When the
price goes up due to demand, don't you get
kicked off (because you are no longer paying enough to meet the price)?  If
that's so, what happens with the cluster?  Do you keep adding new spot instances
(at new/higher prices) to keep the cluster at a more or less consistent size?

> The results were:
>
> * CPU cost was only $250
> * data-in was $2100 ($0.10/GB, and we fetched 21TB)

That's fast!

> * Major performance issue was not enough domains with lots of URLs to fetch
> (power curve for URLs/domain)

Why is this a problem?  Isn't this actually good?  Isn't it better to have 100
hosts/domains with 10 pages each than 10 hosts/domains with 100 each?  Wouldn't
fetching of the former complete faster?

> * total cluster time of 22 hours, fetch time of about 12 hours

That's fast.  Where did the delta of 10h go?

> We didn't parse the fetched pages, which would have added some significant CPU
> cost.

Yeah.  Would you dare to guess how much that would add in terms of
time/servers/cost?

Many thanks!

Thanks,
Otis



Re: Pages per second on EC2?

Otis Gospodnetic-2-2
In reply to this post by Julien Nioche-4
Hi


> > I'm trying to do some basic calculations trying to figure out what, in terms of
> > time, resources, and cost, it would take to crawl 500M URLs.
> > The obvious environment for this is EC2, so I'm wondering what people are seeing
> > in terms of fetch rate there these days? 50 pages/second? 100? 200?
>
> Depends mostly on the distributions of URLS / host, whether you have a DNS
> cache etc... Using large instances, you can start with a conservative

In my case the crawl would be wide, which is good for URL distribution, but bad
for DNS.
What's recommended for DNS caching?  I do see
http://wiki.apache.org/nutch/OptimizingCrawls -- does that mean setting up a
local DNS server (e.g. bind) or something like pdnsdr or something else?

> estimate at 125K URLs fetched per node and per hour

125K URLs per node per hour... so again assuming I'm fetching only 12h out of
24h and with 50 machines (to match Ken's example):

  125000*12*50 = 75,000,000 URLs/day

That means 525M in 7 days.  That is close to Ken's number, good. :)
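
For comparison with the pages/second figures used earlier in the thread, the same estimate can be sketched in Python. The helper names are made up for illustration; the inputs are the 125K URLs per node per hour suggested above, 12 fetching hours per day, and 50 nodes.

```python
def pages_per_second(urls_per_hour):
    """Convert an hourly fetch rate into pages/second."""
    return urls_per_hour / 3600


def urls_per_day(urls_per_node_hour, fetch_hours_per_day, nodes):
    """Cluster-wide URLs fetched per day."""
    return urls_per_node_hour * fetch_hours_per_day * nodes


if __name__ == "__main__":
    daily = urls_per_day(125_000, 12, 50)
    print(f"{pages_per_second(125_000):.1f} pages/s per node")
    print(f"{daily:,} URLs/day, {500_000_000 / daily:.1f} days for 500M URLs")
```

So the conservative estimate corresponds to roughly 35 pages/second per node, about a third of the 100 pages/second assumed in the original calculation.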

> > Here's a time calculation that assumes 100 pages/second per EC2 instance:
> >
> >  100*60*60*12 = 4,320,000 URLs/day per EC2 instance
> >
> > That 12 means 12 hours because last time I used Nutch I recall about half of
> > the time being spent in updatedb, generate, and other non-fetching steps.
>
> The time spent in generate and update is proportional to the size of the
> crawldb. Might take half the time at one point but will take more than that.
> The best option would probably be to generate multiple segments in one go
> (see options for the Generator), fetch all the segments one by one, then
> merge them with the crawldb in a single call to update

Right.
But with time (or, more precisely, as crawldb grows) this generation will start
taking more and more time, and there is no way around that, right?

How does Bixo deal with that?

> > If I have 20 servers fetching URLs, that's:
> >
> >  100*60*60*12*20 = 86,400,000 URLs/day   -- this is starting to sound too good to be true
> >
> > Then to crawl 500M URLs:
> >
> >  500000000/(100*60*60*12*20) = 5.78 days  -- that's less than 1 week
> >
> > Suspiciously short, isn't it?
>
> it also depends on the rate at which new URLs are discovered and hence on
> your seedlist.

Yeah.  I want Ken's seed list! :)

> You will also inevitably hit slow servers which will have an impact the
> fetchrate - although not as bad as before the introduction of the timeout on
> fetching.

Right, I remember this problem.  So now one can specify how long each fetch
should last and fetching will stop when that time is reached?

How does one guess what time limit to pick, especially since fetch runs can
vary in speed depending on what hosts are in them?

Wouldn't it be better to express this in requests/second instead of time, so
that you can say "when fetching goes below N requests per second and stays like
that for M minutes, abort fetch"?

What if you have a really fast fetch run going on, but the time is still reached
and fetch aborted?  What do you do?  Restart the fetch with the same list of
generated URLs as before?  Somehow restart with only unfetched URLs?  Generate a
whole new fetchlist (which ends up being slow)?

A ton of questions, I know. :(

> The main point being that you will definitely get plenty of new
> URLs to fetch but will need to pay attention to the *quality* of what is
> fetched. Unless you are dealing with a limited number of target hosts, you
> will inevitable get loads of porn if you crawl in the open and adult URLs
> (mostly redirections to other porn sites) will quickly take over your
> crawldb. As a result what your crawl will just be churning URLs generated
> automatically from adult sites and despite the fact that your crawldb will
> contain loads of URLs there will be very little useful ones.

One man's trash is another man's...
But this is very good to know, thanks!

> Anyway, it's not just a matter of pages / seconds. Doing large, open crawls
> brings up a lot of interesting challenges :-)

Yup.  Thanks Julien!

Otis

Re: Pages per second on EC2?

kkrugler
In reply to this post by Otis Gospodnetic-2-2
Hi Otis,

>>> I'm trying to do some basic calculations trying to figure out what, in terms of
>>> time, resources, and cost, it would take to crawl 500M URLs.
>>
>> I can't directly comment on Nutch, but we recently did something similar to
>> this (563M pages via EC2) using Bixo.
>
> Eh, I was going to go over to Bixo's list and ask if Bixo is suitable for such
> wide crawls or whether it's meant more for vertical crawls.  For some reason in
> my head I had it in the "for vertical crawl" bucket, but it seems I was wrong,
> ha?

Well, it's a toolkit - so hooking it up for a wide/big crawl means  
writing some code, but there's nothing in the Bixo architecture (after  
a few revs) that precludes using it in this manner.

>> Since we're also using a sequence file to store the crawldb, the update time
>> should be comparable, but we ran only one loop (since we started with a large
>> set of known URLs).
>>
>> Some parameters that obviously impact crawl performance:
>>
>> * A default crawl delay of 15 seconds
>
> That's very polite.  Is 3 seconds delay acceptable?

It depends on the site. For somebody big like (say) CNN, 3 seconds  
would probably be OK.

For smaller sites, 30 seconds is actually a better value.

We actually modified the fetch policy we're using so that for low-traffic
sites (based on Alexa/Quantcast data) we use a 60 second
delay, down to about 5 seconds for top sites.

Even more important is to limit pages/day for smaller sites to <  
1000...unless you enjoy getting angry emails from irate webmasters :)
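
A policy along these lines could be sketched in Python as follows. This is not Bixo's actual fetch policy: the rank cut-offs, the log-scale interpolation between 5 and 60 seconds, and the function names are assumptions made for illustration. Only the 5 second and 60 second endpoints and the "fewer than 1000 pages/day" cap come from the description above.

```python
import math


def crawl_delay_seconds(traffic_rank, top_rank=1_000, low_rank=1_000_000,
                        min_delay=5.0, max_delay=60.0):
    """Crawl delay growing from min_delay (top sites) to max_delay (low-traffic).

    traffic_rank: 1 = most visited site. The cut-offs top_rank and low_rank
    are hypothetical; ranks in between are interpolated on a log scale.
    """
    if traffic_rank <= top_rank:
        return min_delay
    if traffic_rank >= low_rank:
        return max_delay
    frac = (math.log10(traffic_rank) - math.log10(top_rank)) / (
        math.log10(low_rank) - math.log10(top_rank))
    return min_delay + frac * (max_delay - min_delay)


def max_pages_per_day(delay_seconds, daily_cap=None):
    """Pages/day allowed for one site: limited by the delay and an optional cap."""
    by_delay = int(86_400 // delay_seconds)
    return by_delay if daily_cap is None else min(by_delay, daily_cap)


if __name__ == "__main__":
    for rank in (100, 50_000, 5_000_000):
        delay = crawl_delay_seconds(rank)
        print(rank, round(delay, 1), max_pages_per_day(delay, daily_cap=999))
```

Note that even a 60 second delay still allows 1440 requests per day, so the separate daily cap for small sites is not redundant.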

>> * Batching (via keep-alive) 50 URLs per connection per IP address.
>
> Does Nutch automatically do this?  I don't recall seeing this setting in Nutch,
> but it's been a while...

No, I don't believe Nutch has this implemented.

>> * 500 fetch threads/server (250 per each of two reducers per server)
>> * Crawling 1.7M domains
>
> Is this because you restricted it to 1.7M domains, or is that how many distinct
> domains were in your seed list, or is that how many domains you've discovered
> while crawling?

We restricted it, based on top domains (where top == most US-based  
traffic).

>> * Starting with about 1.2B known links
>
> Where did you get that many of them?

From a previous crawl, of roughly the same size.

> Also, if you start with 1.2B known links, how do you end up with just 563M pages
> fetched?  Maybe out of 1.2B you simply got to only 563M before you stopped
> crawling?

Because we're running with Bixo's "efficient" mode (see below).

>> * Running in "efficient" mode - skip batches of URLs that can't be fetched due
>> to politeness.
>
> Doesn't Nutch (and Bixo) do this automatically?

Nutch will block and not fetch a URL until sufficient time has passed.

Bixo can do the same thing, but when you run a crawl like this, you  
often wind up blocked on a few slow sites.

>> * Fetching text, HTML, and image files
>> * Cluster size of 50 slaves, using m1.large instances (with spot pricing)
>
> I've never used spot instances.  Isn't it the case that when you use spot
> instances you can use them as long as the price you paid is adequate.  When the
> price goes up due to demand, and you are using a spot instance, don't you get
> kicked off (because you are not paying enough to meet the price any more)?  If
> that's so, what happens with the cluster?  You keep to adding new spot instances
> (at new/higher prices) to keep the cluster of more or less consistent size?

We run the master without spot pricing, and use a very high max bid to  
help ensure we rarely (if ever) lose servers.

>> The results were:
>>
>> * CPU cost was only $250
>> * data-in was $2100 ($0.10/GB, and we fetched 21TB)
>
> That's fast!
>
>> * Major performance issue was not enough domains with lots of URLs to fetch
>> (power curve for URLs/domain)
>
> Why is this a problem?  Isn't this actually good?  Isn't it better to have 100
> hosts/domains with 10 pages each than 10 hosts/domains with 100 each?  Wouldn't
> fetching of the former complete faster?

The problem is that there are too many domains with only a handful of  
pages.

So very quickly, the set of available domains to fetch from is reduced  
down to a fraction of that initial 1.7M, and then politeness starts  
causing either (a) very inefficient utilization of resources, as most  
threads are spinning w/o doing any work, or (b) you start skipping
lots of URLs for domains that aren't ready yet (not enough time has
elapsed since the prior batch of URLs was fetched).
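
The shrinking set of fetchable domains can be illustrated with a toy model in Python. It is not how Bixo schedules fetches: it assumes one URL per domain per politeness interval (ignoring keep-alive batching), unlimited threads, and instant responses, and the URL counts below are invented to mimic a skewed distribution.

```python
def active_domains(urls_per_domain, crawl_delay, elapsed):
    """Count domains that still have unfetched URLs after `elapsed` seconds.

    Toy model: each domain gets one fetch at t=0 and one more every
    crawl_delay seconds.
    """
    fetched_each = int(elapsed // crawl_delay) + 1
    return sum(1 for n in urls_per_domain if n > fetched_each)


if __name__ == "__main__":
    # Hypothetical skewed distribution: many tiny domains, one large one.
    counts = [1, 1, 1, 1, 2, 2, 5, 100]
    for t in (0, 15, 60, 600):
        print(t, active_domains(counts, crawl_delay=15, elapsed=t))
```

After a few intervals only the largest domains remain, so the fetch threads have very few hosts they are allowed to contact and the overall rate is bounded by the crawl delay on those hosts.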

>> * total cluster time of 22 hours, fetch time of about 12 hours
>
> That's fast.  Where did the delta of 10h go?

Jobs to extract links from the crawldb, partition by IP address, fetch  
robots.txt, etc.

>> We didn't parse the fetched pages, which would have added some significant CPU
>> cost.
>
> Yeah.  Would you dare to guess how much that would add in terms of
> time/servers/cost?

I've got some data, but I'd need to dig it up after we finish a  
deliverable that's due by 5pm :(

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Pages per second on EC2?

kkrugler
In reply to this post by Otis Gospodnetic-2-2
Hi Otis,

More input, though mostly from recent experience w/Bixo...

>>> I'm trying to do some basic calculations trying to figure out what, in terms of
>>> time, resources, and cost, it would take to crawl 500M URLs.
>>> The obvious environment for this is EC2, so I'm wondering what people are seeing
>>> in terms of fetch rate there these days? 50 pages/second? 100? 200?
>>
>> Depends mostly on the distributions of URLS / host, whether you have a DNS
>> cache etc... Using large instances, you can start with a conservative
>
> In my case the crawl would be wide, which is good for URL distribution, but bad
> for DNS.
> What's recommended for DNS caching?  I do see
> http://wiki.apache.org/nutch/OptimizingCrawls -- does that mean setting up a
> local DNS server (e.g. bind) or something like pdnsdr or something else?

We fire up nscd on every server in the cluster - check out the Bixo  
remote-init.sh script.

And we tweak the config, so that negative lookups (for example) have a  
longer TTL than by default.

>> estimate at 125K URLs fetched per node and per hour
>
> 125K URLs per node per hour.... so again assuming I'm fetching only 12h out of
> 24h and with 50 machines (to match Ken's example):
>
> 125000*12*50=75M / day
>
> That means 525M in 7 days.  That is close to Ken's number, good. :)
>
>>> Here's a time calculation that assumes 100 pages/second per EC2 instance:
>>>
>>> 100*60*60*12 = 4,320,000 URLs/day per EC2 instance
>>>
>>> That 12 means 12 hours because last time I used Nutch I recall about half of
>>> the time being spent in updatedb, generate, and other non-fetching steps.
>>
>> The time spent in generate and update is proportional to the size of the
>> crawldb. Might take half the time at one point but will take more than that.
>> The best option would probably be to generate multiple segments in one go
>> (see options for the Generator), fetch all the segments one by one, then
>> merge them with the crawldb in a single call to update
>
> Right.
> But with time (or, more precisely, as crawldb grows) this generation will start
> taking more and more time, and there is no way around that, right?

Correct.

> How does Bixo deal with that?

We don't. Or rather, we live with the pain of having to update the  
crawldb, via building a new version from old + fetch loop results.

One solution is to use something like HBase, which we could easily do,  
since there's a Cascading Tap for it.

We could partition the DB by pending/processed, which would reduce  
time for later fetch phases. Early on, though, most entries are "new",
so this doesn't save much.

[snip]

>> You will also inevitably hit slow servers which will have an impact the
>> fetchrate - although not as bad as before the introduction of the timeout on
>> fetching.
>
> Right, I remember this problem.  So now one can specify how long each fetch
> should last and fetching will stop when that time is reached?

You can in Bixo; I don't know about the current version of Nutch.

> How does one guess what that time limit to pick, especially since fetch runs can
> vary in terms of how fast they are depending on what hosts are in it?
>
> Wouldn't it be better to express this in requests/second instead of time, so
> that you can say "when fetching goes below N requests per second and stays like
> that for M minutes, abort fetch"?

We implemented something like that in Nutch back in 2006, but at the  
time the Nutch fetching architecture was such that this felt very flaky.

In Bixo we have support for aborting requests if the response rate is  
less than some specified limit, which seems to work well to avoid  
problems with slow sites.
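
The rate-based abort discussed above ("below N requests per second for M minutes") could look roughly like the following Python sketch. It is not how Bixo or Nutch implement it: the class name, its parameters, and the polling model (the caller periodically reports a timestamp and the cumulative number of fetched pages) are all assumptions made for illustration.

```python
class ThroughputGuard:
    """Signals an abort once the fetch rate stays below min_rate for grace_seconds."""

    def __init__(self, min_rate, grace_seconds):
        self.min_rate = min_rate            # pages/second considered too slow
        self.grace_seconds = grace_seconds  # how long slowness is tolerated
        self._last_time = None
        self._last_count = None
        self._slow_since = None

    def observe(self, now, total_fetched):
        """Record a sample; return True if the fetch should be aborted."""
        if self._last_time is not None and now > self._last_time:
            rate = (total_fetched - self._last_count) / (now - self._last_time)
            if rate < self.min_rate:
                # Remember when the current slow stretch started.
                if self._slow_since is None:
                    self._slow_since = self._last_time
            else:
                # Any fast interval resets the slow stretch.
                self._slow_since = None
        self._last_time, self._last_count = now, total_fetched
        return (self._slow_since is not None
                and now - self._slow_since >= self.grace_seconds)


if __name__ == "__main__":
    guard = ThroughputGuard(min_rate=10, grace_seconds=120)
    samples = [(0, 0), (60, 6000), (120, 6060), (180, 6120)]
    for now, count in samples:
        print(now, guard.observe(now, count))
```

A fetch that is still running fast is never cut off by this rule, which addresses the concern about a fixed time limit aborting a healthy run.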

> What if you have a really fast fetch run going on, but the time is still reached
> and fetch aborted?  What do you do?  Restart the fetch with the same list of
> generated URLs as before?  Somehow restart with only unfetched URLs?  Generate a
> whole new fetchlist (which ends up being slow)?

In Bixo, at least, what happens is that fetched URLs are then  
processed, and unfetched URLs have their state unchanged - so they'll  
get fetched in the next loop. I assume something similar is possible  
in Nutch.

[snip]

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Pages per second on EC2?

Julien Nioche-4
In reply to this post by Otis Gospodnetic-2-2
Hi Otis

> In my case the crawl would be wide, which is good for URL distribution, but bad
> for DNS.
> What's recommended for DNS caching?  I do see
> http://wiki.apache.org/nutch/OptimizingCrawls -- does that mean setting up a
> local DNS server (e.g. bind) or something like pdnsdr or something else?

I used bind for local DNS caching when running a 400-node cluster on EC2 for
Similarpages; I am sure there are other tools which work just as well.

[...]


> > The time spent in generate and update is proportional to the size of the
> > crawldb. Might take half the time at one point but will take more than that.
> > The best option would probably be to generate multiple segments in one go
> > (see options for the Generator), fetch all the segments one by one, then
> > merge them with the crawldb in a single call to update
>
> Right.
> But with time (or, more precisely, as crawldb grows) this generation will start
> taking more and more time, and there is no way around that, right?

Nope, there is no way around it. Nutch 2.0 will be faster for the updates
compared to 1.x, but the generation will still be proportional to the size of
the crawldb.


> > You will also inevitably hit slow servers which will have an impact the
> > fetchrate - although not as bad as before the introduction of the timeout on
> > fetching.
>
> Right, I remember this problem.  So now one can specify how long each fetch
> should last and fetching will stop when that time is reached?

Exactly - you give it, say, 60 mins and it will stop fetching after that.


> How does one guess what that time limit to pick, especially since fetch runs can
> vary in terms of how fast they are depending on what hosts are in it?

Empirically :-) Take a largish value, observe the fetch to see at which point
it starts to slow down, and reduce the limit accordingly.
Sounds a bit like a recipe, doesn't it?


> Wouldn't it be better to express this in requests/second instead of time, so
> that you can say "when fetching goes below N requests per second and stays like
> that for M minutes, abort fetch"?

This would be a nice feature indeed. The timeout is an efficient but
somewhat crude mechanism. It has proved useful, though, as fetches could hang
on a single host for a looooooooooooooong time, which on a large cluster
means big money.


> What if you have a really fast fetch run going on, but the time is still reached
> and fetch aborted?  What do you do?  Restart the fetch with the same list of
> generated URLs as before?  Somehow restart with only unfetched URLs?  Generate a
> whole new fetchlist (which ends up being slow)?

You won't need to restart the fetch with the same list. The unfetched URLs
should end up in the next round of generation.


> > As a result what your crawl will just be churning URLs generated
> > automatically from adult sites and despite the fact that your crawldb will
> > contain loads of URLs there will be very little useful ones.
>
> One man's trash is another man's...

Even if adult sites are what you really want to crawl, there is still
a need for filtering / normalisation strategies.


> > Anyway, it's not just a matter of pages / seconds. Doing large, open crawls
> > brings up a lot of interesting challenges :-)
>
> Yup.  Thanks Julien!

You are welcome.

Julien



--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

mergesegs on HDFS fails

Patricio Galeas-5
In reply to this post by Julien Nioche-4
Hi,
A few months ago I started a crawl on a single machine (one process).
Now I'm trying to continue this crawl with the Hadoop file system on the same
machine, following the tutorial "How to Setup Nutch (V1.1) and Hadoop".
When I run a crawl (TopN=25000, depth=7) with the new configuration, mergesegs
fails.

The failed job details show:
-----------------
java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:250)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
        at org.apache.hadoop.io.Text.readString(Text.java:400)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2901)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
------------------

Any idea?

Regards
Patricio