EC2 storage needs for 500M URL crawl?


EC2 storage needs for 500M URL crawl?

Otis Gospodnetic-2
Hi,

Here's another Q about wide, large-scale crawl resource requirements on EC2 -
primarily storage and bandwidth needs.
Please correct any mistakes you see.
I'll use 500M pages as the crawl target.
I'll assume 10 KB/page on average.

500M pages * 10 KB/page = 5000 GB, which is 5 TB

5 TB is the size of just the raw fetched pages.
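In script form, the same back-of-the-envelope estimate (both inputs are assumptions, and the 10 KB/page average in particular may be low):

    # Back-of-the-envelope raw storage estimate; both inputs are assumptions.
    pages = 500_000_000      # crawl target
    avg_page_kb = 10         # assumed average fetched page size

    raw_gb = pages * avg_page_kb / 1_000_000   # KB -> GB (decimal units)
    print(f"raw fetched data: {raw_gb:,.0f} GB (~{raw_gb / 1000:.0f} TB)")
    # -> raw fetched data: 5,000 GB (~5 TB)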

Q:
- What about any overhead besides the obvious replication factor, such as sizes
of linkdb and crawldb, any temporary data, any non-raw data in HDFS, and such?
- If parsed data is stored in addition to raw data, can we assume the parsed
content will be up to 50% of the raw fetched data?

Here are some calculations:

- 50 small EC2 instances at $0.085/hour give us 160 GB * 50 = 8 TB for $714/week
- 50 large EC2 instances at $0.34/hour give us 850 GB * 50 = 42 TB for $2856/week
(we can lower the cost by using Spot instances, but I'm just trying to keep this
simple for now)

Sounds like either one needs more smaller instances (which should make fetching
faster) or one needs to use large instances to be able to store 500M pages + any
overhead.  I'm assuming 42 TB is enough for that.... is it?

Bandwidth is relatively cheap:
At $0.1 / GB for IN data, 5000 GB * $0.1 = $500
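Same thing for the instance and bandwidth math (the hourly rates, per-instance storage, and $0.1/GB inbound rate are the assumptions from above; HDFS replication and the crawldb/linkdb/temp overhead I asked about are not included):

    # Cluster storage and weekly cost sketch; prices and per-instance storage
    # are the assumed figures from above; replication/overhead are ignored.
    HOURS_PER_WEEK = 24 * 7

    def cluster(n, dollars_per_hour, local_gb):
        return n * local_gb / 1000, n * dollars_per_hour * HOURS_PER_WEEK

    for name, n, rate, gb in [("small", 50, 0.085, 160), ("large", 50, 0.34, 850)]:
        tb, cost = cluster(n, rate, gb)
        print(f"{n} {name}: {tb:g} TB for ${cost:.0f}/week")
    # -> 50 small: 8 TB for $714/week
    # -> 50 large: 42.5 TB for $2856/week

    inbound_gb, rate_per_gb = 5000, 0.10
    print(f"inbound bandwidth: ${inbound_gb * rate_per_gb:.0f}")   # -> $500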

What mistakes did I make above?
Did I miss anything important?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/


Re: EC2 storage needs for 500M URL crawl?

kkrugler

On Mar 9, 2011, at 8:45am, Otis Gospodnetic wrote:

> Hi,
>
> Here's another Q about a wide, large-scale crawl resource  
> requirements on EC2 -
> primarily storage and bandwidth needs.
> Please correct any mistakes you see.
> I'll use 500M pages as the crawl target.
> I'll assume 10 KB/page on average.

It depends on what you're crawling, e.g. a number closer to 40KB/page  
is what we've seen for text/HTML + images.

> 500M pages * 10 KB/page = 5000 GB, which is 5 TB

For our 550M page crawl, we pulled 21TB.
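(Which works out to roughly that 40 KB/page figure:)

    # 21 TB spread over 550M pages, in KB per page
    print(f"{21e12 / 550e6 / 1e3:.1f} KB/page on average")   # -> 38.2 KB/page on average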

> 5 TB is the size of just the raw fetched pages.
>
> Q:
> - What about any overhead besides the obvious replication factor,  
> such as sizes
> of linkdb and crawldb, any temporary data, any non-raw data in HDFS,  
> and such?
> - If parsed data is stored in addition to raw data, can we assume  
> the parsed
> content will be up to 50% of the raw fetched data?
>
> Here are some calculations:
>
> - 50 small EC2 instances at 0.085/hour give us 160 GB *  50 = 8 TB  
> for $714/week
> - 50 large EC2 instances at 0.34/hour give us 850 GB *  50 = 42 TB for
> $2856/week
> (we can lower the cost by using Spot instances, but I'm just trying  
> to keep this
> simple for now)
>
> Sounds like either one needs more smaller instances (which should  
> make fetching
> faster) or one needs to use large instances to be able to store 500M  
> pages + any
> overhead.

If you're planning on parsing the pages (sounds like it) then the  
m1.small instances are going to take a very long time - their disk I/O  
and CPU are pretty low-end.

> I'm assuming 42 TB is enough for that.... is it?
>
> Bandwidth is relatively cheap:
> At $0.1 / GB for IN data, 5000 GB * $0.1 = $500

As per above, for us it was closer to $2100.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: EC2 storage needs for 500M URL crawl?

Paul Dhaliwal
Hello,

On Wed, Mar 9, 2011 at 9:26 AM, Ken Krugler <[hidden email]> wrote:

>
> On Mar 9, 2011, at 8:45am, Otis Gospodnetic wrote:
>
>  Hi,
>>
>> Here's another Q about a wide, large-scale crawl resource requirements on
>> EC2 -
>> primarily storage and bandwidth needs.
>> Please correct any mistakes you see.
>> I'll use 500M pages as the crawl target.
>> I'll assume 10 KB/page on average.
>>
>
> It depends on what you're crawling, e.g. a number closer to 40KB/page is
> what we've seen for text/HTML + images.


I would double or triple the page size estimate. Not sure what kind of pages
you are crawling, but there are some large pages out there. Take a quick look
at amazon.com or tripadvisor.com.

>
>
>  500M pages * 10 KB/page = 5000 GB, which is 5 TB
>>
>
> For our 550M page crawl, we pulled 21TB.
>
>
>  5 TB is the size of just the raw fetched pages.
>>
>> Q:
>> - What about any overhead besides the obvious replication factor, such as
>> sizes
>> of linkdb and crawldb, any temporary data, any non-raw data in HDFS, and
>> such?
>> - If parsed data is stored in addition to raw data, can we assume the
>> parsed
>> content will be up to 50% of the raw fetched data?
>>
>> Here are some calculations:
>>
>> - 50 small EC2 instances at 0.085/hour give us 160 GB *  50 = 8 TB for
>> $714/week
>> - 50 large EC2 instances at 0.34/hour give us 850 GB *  50 = 42 TB for
>> $2856/week
>> (we can lower the cost by using Spot instances, but I'm just trying to
>> keep this
>> simple for now)
>>
>
Money-wise, it was cheaper for us to use vps.net. We also didn't use the VPSes
for storage; we have a rack elsewhere that stores all of our data, and we used
the cloud nature of vps.net just for crawling. It didn't make sense for us to
pay $4,000 a month for something that costs $2,000 one time plus a few hundred
a month in colo.
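Rough break-even math, with $300/month standing in for "a few hundred" in colo
(just a placeholder, not our actual bill):

    # Crude cloud-vs-colo break-even; the colo monthly figure is a placeholder.
    cloud_monthly = 4000        # roughly what the equivalent cloud setup would run
    hardware_once = 2000        # one-time hardware cost
    colo_monthly = 300          # placeholder for "a few hundred a month"

    months = hardware_once / (cloud_monthly - colo_monthly)
    print(f"hardware pays for itself in ~{months:.1f} months")   # -> ~0.5 months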

>
>> Sounds like either one needs more smaller instances (which should make
>> fetching
>> faster) or one needs to use large instances to be able to store 500M pages
>> + any
>> overhead.
>>
>
> If you're planning on parsing the pages (sounds like it) then the m1.small
> instances are going to take a very long time - their disk I/O and CPU are
> pretty low-end.
>
>
>  I'm assuming 42 TB is enough for that.... is it?
>>
>> Bandwidth is relatively cheap:
>> At $0.1 / GB for IN data, 5000 GB * $0.1 = $500
>>
>
> As per above, for us it was closer to $2100.
>
vps.net bandwidth is included, so we saved a bundle there.

Paul

>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>
>
>

Re: EC2 storage needs for 500M URL crawl?

kkrugler

On Mar 9, 2011, at 11:52am, Paul Dhaliwal wrote:

> Hello,
>
> On Wed, Mar 9, 2011 at 9:26 AM, Ken Krugler <[hidden email]> wrote:
>
> On Mar 9, 2011, at 8:45am, Otis Gospodnetic wrote:
>
> Hi,
>
> Here's another Q about a wide, large-scale crawl resource  
> requirements on EC2 -
> primarily storage and bandwidth needs.
> Please correct any mistakes you see.
> I'll use 500M pages as the crawl target.
> I'll assume 10 KB/page on average.
>
> It depends on what you're crawling, e.g. a number closer to 40KB/page
> is what we've seen for text/HTML + images.
>
> I would double/triple the page size estimate. Not sure what kind of  
> pages you are crawling, but there are some large pages out there.  
> Take a quick look at amazon.com, tripadvisor.com

There are large pages out there, but the _average_ size over 550M  
pages was about 40K.

[snip]

> I'm assuming 42 TB is enough for that.... is it?
>
> Bandwidth is relatively cheap:
> At $0.1 / GB for IN data, 5000 GB * $0.1 = $500
>
> As per above, for us it was closer to $2100.
> vps.net bandwidth is included, so we saved a bundle there.

Interesting, thanks for the ref.

So you immediately push data to your colo, which (I assume) also  
doesn't have much of a data-in cost?

When you get one VPS system (composed of say 12 nodes), how many  
virtual cores do you get? Is it also 12?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: EC2 storage needs for 500M URL crawl?

Otis Gospodnetic-2
In reply to this post by kkrugler
Hi,


----- Original Message ----

> From: Ken Krugler <[hidden email]>
> To: [hidden email]
> Sent: Wed, March 9, 2011 12:26:59 PM
> Subject: Re: EC2 storage needs for 500M URL crawl?
>
>
> On Mar 9, 2011, at 8:45am, Otis Gospodnetic wrote:
>
> > Hi,
> >
> > Here's another Q about a wide, large-scale crawl resource requirements on EC2 -
> > primarily storage and bandwidth needs.
> > Please correct any mistakes you see.
> > I'll use 500M pages as the crawl target.
> > I'll assume 10 KB/page on average.
>
> It depends on what you're crawling, e.g. a number closer to 40KB/page is what
> we've seen for text/HTML + images.
>
> > 500M pages * 10 KB/page = 5000 GB, which is 5 TB
>
> For our 550M page crawl, we pulled 21TB.

OK, so 40 KB/page.  How things have changed.... :)

> > 5 TB is the size of just the raw fetched pages.
> >
> > Q:
> > - What about any overhead besides the obvious replication factor, such as sizes
> > of linkdb and crawldb, any temporary data, any non-raw data in HDFS, and such?
> > - If parsed data is stored in addition to raw data, can we assume the parsed
> > content will be up to 50% of the raw fetched data?
> >
> > Here are some calculations:
> >
> > - 50 small EC2 instances at 0.085/hour give us 160 GB * 50 = 8 TB for $714/week
> > - 50 large EC2 instances at 0.34/hour give us 850 GB * 50 = 42 TB for $2856/week
> > (we can lower the cost by using Spot instances, but I'm just trying to keep this
> > simple for now)
> >
> > Sounds like either one needs more smaller instances (which should make fetching
> > faster) or one needs to use large instances to be able to store 500M pages + any
> > overhead.
>
> If you're planning on parsing the pages (sounds like it) then the m1.small
> instances are going to take a very long time - their disk I/O and CPU are
> pretty low-end.

Yeah, I can imagine! :)
But if your 550M page crawl pulled 21 TB of *raw*(?) data, then I have a feeling
that even 40 large EC2 instances won't have enough storage, right?
Would you recommend 75 of them (63 TB) or 100 of them (84 TB)?
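For what it's worth, here's the kind of arithmetic making me think 50 won't cut
it - the 40 KB/page, the 50% parsed copy, and the 850 GB of local disk per large
instance are all assumptions, crawldb/linkdb/temp data are not counted at all,
and I'm sweeping the HDFS replication factor since that's the part I'm least
sure about:

    # Rough large-instance count sizing; every input here is an assumption.
    import math

    pages = 500_000_000
    avg_page_kb = 40           # closer to 21 TB / 550M pages than my 10 KB guess
    parsed_fraction = 0.5      # parsed data stored alongside raw, as a fraction of raw
    local_gb = 850             # assumed local storage per large instance

    raw_tb = pages * avg_page_kb / 1e9                  # KB -> TB (decimal)
    logical_tb = raw_tb * (1 + parsed_fraction)
    for replication in (1, 2, 3):                       # HDFS replication factor
        physical_tb = logical_tb * replication
        n = math.ceil(physical_tb * 1000 / local_gb)
        print(f"replication {replication}: {physical_tb:g} TB -> ~{n} large instances")
    # -> replication 1: 30 TB -> ~36 large instances
    # -> replication 2: 60 TB -> ~71 large instances
    # -> replication 3: 90 TB -> ~106 large instances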

> > I'm assuming 42 TB is enough for that.... is it?
> >
> > Bandwidth is relatively cheap:
> > At $0.1 / GB for IN data, 5000 GB * $0.1 = $500
>
> As per above, for us it was closer to $2100.

Thanks!

Otis

Re: EC2 storage needs for 500M URL crawl?

kkrugler
Hi Otis,

[snip]

>> If you're planning on parsing the pages (sounds like it) then  the  
>> m1.small
>> instances are going to take a very long time - their disk I/O and  
>> CPU are
>> pretty low-end.
>
> Yeah, I can imagine! :)
> But if your 550M page crawl pulled 21 TB of *raw*(?) data, then I  
> have a feeling
> that even 40 large EC2 instances won't have enough storage, right?
> Would you recommend 75 of them (63 TB) or 100 of them (84 TB)?

We ran with 50, IIRC - but we used compressed output files, so there  
was enough space.
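As a very rough illustration of why that worked out (the 5:1 compression ratio
for HTML-heavy content and the per-node disk below are assumed round numbers,
not measurements from our crawl):

    # Why compressed output makes 50 nodes workable; ratios are assumed, not measured.
    raw_tb = 21                  # roughly what the 550M-page crawl pulled
    compression_ratio = 5        # assumed gzip-style ratio for HTML-heavy content
    replication = 3              # default HDFS replication
    nodes, local_gb = 50, 850    # assumed local disk per node

    stored_tb = raw_tb / compression_ratio * replication
    capacity_tb = nodes * local_gb / 1000
    print(f"~{stored_tb:.1f} TB stored vs ~{capacity_tb:.1f} TB of local disk")
    # -> ~12.6 TB stored vs ~42.5 TB of local disk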

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g