Optimizing Nutch 2.2.1


Optimizing Nutch 2.2.1

BlackIce
Hi,

I'm using Nutch 2.2.1, HBase 0.90.6 in pseudo-distributed mode, Hadoop
1.2.1, and Oracle Java 8 on a quad-core Intel i5 with 16 GB RAM.

Currently the fetch cycle is limited by my Internet connection.

The parse cycle uses an average of 10% per CPU core.

The updatedb cycle uses an average of 3% per CPU core.

Currently I'm only running HBase in pseudo-distributed mode, not Nutch.

As the DB grows, everything slows down significantly, but as you can see, CPU
resources are barely used. Heck, during updatedb my web browsing creates
higher utilization spikes than the updatedb process does. I feel that my
hardware is very underutilized and that adding more physical machines would
be a waste.

What are the bottlenecks? How can I optimize them? Should I run a cluster
on 3 virtual machines?

Thank you for any help you can give!


Ralf R. Kotowski

Re: Optimizing Nutch 2.2.1

Talat Uyarer
In reply to this post by BlackIce
Hi,

When you use Hadoop in pseudo-distributed mode, it creates 2 map and 2
reduce tasks. If you want to speed up a job, you should decrease your map
and reduce counts. But optimization is a very general concept. You should
tune the Nutch, HDFS, JobTracker, and HBase settings.
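
For reference, in Hadoop 1.x the per-node map and reduce counts mentioned
above are, as far as I know, the TaskTracker slot limits
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum, which are normally set in
mapred-site.xml and default to 2 each. Below is a minimal sketch that
generates such a fragment. The helper name build_mapred_site is hypothetical,
and the 2/2 values only mirror the counts mentioned above; they are not a
recommendation.

```python
import xml.etree.ElementTree as ET


def build_mapred_site(map_slots, reduce_slots):
    """Return a mapred-site.xml fragment (as a string) that sets the
    Hadoop 1.x TaskTracker slot limits. Hypothetical helper for illustration."""
    root = ET.Element("configuration")
    settings = (
        ("mapred.tasktracker.map.tasks.maximum", map_slots),
        ("mapred.tasktracker.reduce.tasks.maximum", reduce_slots),
    )
    for name, value in settings:
        # Each setting becomes <property><name>..</name><value>..</value></property>
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Illustrative values only; pick your own after measuring.
    print(build_mapred_site(map_slots=2, reduce_slots=2))
```

The printed XML would go into conf/mapred-site.xml, and the TaskTracker
would need a restart to pick it up. Change one value at a time and re-run
the same job to see whether it actually helps.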

Good luck ;)


(In reply to BlackIce's message of 2014-03-18 14:00 GMT+02:00.)


--
Talat UYARER
Website: http://talat.uyarer.com
Twitter: http://twitter.com/talatuyarer
LinkedIn: http://tr.linkedin.com/pub/talat-uyarer/10/142/304

Re: Optimizing Nutch 2.2.1

BlackIce
Thank you,

What are some good starting points for tuning?

Thanks


(In reply to Talat Uyarer's message of Tue, Mar 18, 2014 at 8:20 PM.)

Re: Optimizing Nutch 2.2.1

Talat Uyarer
IMHO you shouldn't expect performance in pseudo-distributed mode. You should
really learn how Hadoop runs. I read Hadoop: The Definitive Guide, and I
recommend it as a starting point.

(In reply to BlackIce's message of 19 Mar 2014, 20:48.)

Re: Optimizing Nutch 2.2.1

BlackIce
Thanks,

It seems that anything related to Hadoop is a MUST read!


(In reply to Talat Uyarer's message of Wed, Mar 19, 2014 at 8:25 PM.)