hadoop on single machine

Tomislav Poljak
Would it be recommended to use Hadoop for crawling (100 sites with 1,000
pages each) on a single machine? What would be the benefit?
Something like what is described at
http://wiki.apache.org/nutch/NutchHadoopTutorial, but on a single
machine.


Or is the simple crawl/recrawl approach (without Hadoop, as described in the
Nutch tutorial on the wiki, http://wiki.apache.org/nutch/NutchTutorial, plus
the recrawl script from the wiki) the way to go?

Thanks,
       Tomislav

Re: hadoop on single machine

Renaud Richardet
hi Tomislav,

The NutchTutorial is the way to go. Fetching 100 sites of 1,000 pages each
with a single machine should definitely be fine. You might want to add
more machines if a lot of people are searching your index.

BTW, Nutch is "always" using Hadoop. When testing locally or when using
only one machine, Hadoop just uses the local file system. So even the
NutchTutorial uses Hadoop.
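
To make that concrete, here is a rough way to check it (just a sketch,
assuming your Nutch release ships hadoop-default.xml under conf/; in some
releases the defaults live inside the Hadoop jar instead):

  # Show the filesystem and job-tracker defaults Hadoop falls back to when
  # conf/hadoop-site.xml is left empty, as in the NutchTutorial setup.
  grep -A 1 -E '<name>(fs\.default\.name|mapred\.job\.tracker)</name>' \
      conf/hadoop-default.xml
  # On a single machine you should see "local" (or "file:///") for the
  # filesystem and "local" for the job tracker, i.e. no HDFS and no separate
  # MapReduce daemons.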

HTH,
Renaud

Re: hadoop on single machine

Tomislav Poljak
Hi Renaud,
thank you for your reply. This is valuable information, but can you
elaborate a little more on the following:

you say: Nutch is "always" using Hadoop.

I assume it does not use the Hadoop Distributed File System (HDFS) when
running on a single machine by default?

The Hadoop homepage says: Hadoop implements MapReduce, using the HDFS.

If there is no distributed file system over compute nodes (single-machine
configuration), what does Hadoop do?

When running the crawl/recrawl cycle (generate/fetch/update), what processes
is Hadoop running? How can I monitor them to see what is going on (e.g. how
many URLs have been fetched and how many from the fetchlist are still
unfetched)? Is there a GUI for this?

you say: Fetching 100 sites of 1,000 pages each with a single machine should
definitely be fine

What about recrawl on a regular basis (once a day or even more often)?

Sorry if these are basic questions, but I am trying to learn about Nutch
and Hadoop.

Thanks,
     Tomislav

Re: hadoop on single machine

Renaud Richardet
hi Tomislav,

> Hi Renaud,
> thank you for your reply. This is valuable information, but can you
> elaborate a little more on the following:
>
> you say: Nutch is "always" using Hadoop.
>
> I assume it does not use the Hadoop Distributed File System (HDFS) when
> running on a single machine by default?
>
> The Hadoop homepage says: Hadoop implements MapReduce, using the HDFS.
>
> If there is no distributed file system over compute nodes (single-machine
> configuration), what does Hadoop do?
>  
Well, you're not using the full potential of Hadoop's HDFS when using
Nutch on a single machine (still, Hadoop is handling the MapReduce
logic, the configuration objects, etc.). It's like using a chainsaw to
cut a toothpick ;-) Nevertheless, Nutch is a very good choice for
single-machine deployments: high-performance, reliable, and easy to
customize.
> When running the crawl/recrawl cycle (generate/fetch/update), what
> processes is Hadoop running?
Have a look at the Crawl.java class; it just chains the individual tools
together.
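Roughly, Crawl.java runs the individual Nutch tools in sequence. A
hand-rolled equivalent on the command line looks something like this (a
sketch only -- directory names like crawl/ and urls/ are just examples, and
you would repeat the generate/fetch/updatedb loop once per depth):

  bin/nutch inject crawl/crawldb urls              # seed the crawldb
  bin/nutch generate crawl/crawldb crawl/segments  # create a fetchlist
  s=`ls -d crawl/segments/* | tail -1`             # pick the newest segment
  bin/nutch fetch $s                               # fetch the pages
  bin/nutch updatedb crawl/crawldb $s              # fold results back in
  # ... repeat generate/fetch/updatedb per depth, then:
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Each of those steps is a Hadoop map-reduce job; on one machine they simply
run with the local job runner.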
> How can I monitor them to see what is going on (e.g. how many URLs have
> been fetched and how many from the fetchlist are still unfetched)? Is
> there a GUI for this?
>  
No GUI, but the command-line tools can give you that information (e.g.
readdb, see http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch%20readdb).
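For example, to get the fetched/unfetched counts you asked about, the crawldb
stats are usually enough (directory names are just examples, adjust them to
your crawl directory):

  bin/nutch readdb crawl/crawldb -stats      # totals per status, e.g.
                                             # fetched vs. unfetched pages
  bin/nutch readdb crawl/crawldb -dump dump  # full dump, if you need to
                                             # inspect individual URLs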
> you say: Fetching 100 sites of 1,000 pages each with a single machine
> should definitely be fine
>
> What about recrawl on a regular basis (once a day or even more often)?
>  
It depends on your configuration and connection, but you can expect to
fetch 10-30 pages per second, so 100K pages will take less than 3 hours.
Regarding disk space, with an estimate of 10 KB per page for the index, it
will take about 1 GB of disk space.
See http://wiki.apache.org/nutch/HardwareRequirements for more.
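
At 10 pages/second, 100K pages is roughly 10,000 seconds, so a nightly window
is plenty for a daily recrawl. A common way to automate it is a cron entry
around the recrawl script from the wiki -- just a sketch, assuming you save
that script as bin/recrawl (its name and arguments depend on the version you
pick up from the wiki):

  # crontab -e: recrawl every night at 02:00 and keep a log
  0 2 * * *  cd /path/to/nutch && bin/recrawl crawl >> logs/recrawl.log 2>&1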

HTH,
Renaud

Re: hadoop on single machine

Tomislav Poljak
Hi Renaud,
what would be a recommended hardware specification for a machine running the
searcher web application, with 15K users per day searching this index (100K
pages)? And what is a good practice for getting the index from the crawl
machine to the search machine (if using separate machines for crawling and
searching)?

Thanks,
     Tomislav
