Full CPU usage

Full CPU usage

Weiwei Xiong
Hi All,

I am trying to use Nutch to crawl some websites, but CPU usage goes to 100%
after the crawl reaches depth 2 or 3. The machine becomes unusable and I have
to stop the crawl. This happens even when I configure only ONE fetcher thread.
One odd thing I noticed is that the number of threads keeps growing after the
crawl has been running for some time.
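
A growing thread count like the one described above can be tracked over time. The following is only a minimal sketch, assuming a Linux machine (the post does not state the operating system) where `/proc/<pid>/status` contains a `Threads:` field; the function names are hypothetical helpers, not part of Nutch.

```python
def thread_count_from_status(status_text: str) -> int:
    """Extract the thread count from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            # The field looks like "Threads:<tab>212".
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def thread_count(pid: int) -> int:
    """Read the current thread count of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return thread_count_from_status(status_file.read())
```

Calling `thread_count(pid)` with the pid of the crawl's Java process every few seconds would show whether the count really keeps climbing.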

Does anyone have any hint to solve this problem?

Thanks.
-- ww

Re: Full CPU usage

xiao yang
Hi, Weiwei

What about the configuration of Hadoop?
Maybe there are 10 processes with 1 thread each.

Thanks!
Xiao

On 11/27/10, Weiwei Xiong <[hidden email]> wrote:

> Hi All,
>
> I am trying to use Nutch to crawl some websites, but CPU usage goes to 100%
> after the crawl reaches depth 2 or 3. The machine becomes unusable and I have
> to stop the crawl. This happens even when I configure only ONE fetcher thread.
> One odd thing I noticed is that the number of threads keeps growing after the
> crawl has been running for some time.
>
> Does anyone have any hint to solve this problem?
>
> Thanks.
> -- ww
>

Re: Full CPU usage

Weiwei Xiong
Thanks for your tips, Xiao.

I am currently running Nutch on a single machine, so I didn't change any
Hadoop-related configuration. Or should I? I assume Nutch sets the default
number of map/reduce tasks to 1. Is this true?

If I have to change the Hadoop MapReduce configuration in a single-machine
environment, could anyone tell me which file I should change? I tried to
specify the number of map and reduce tasks, but it didn't work out.
Below is the configuration I added to mapred-site.xml:

<property>
  <name>mapred.map.tasks</name>
  <value>1</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
</property>
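
Hadoop configuration files keep their `<property>` elements inside a single `<configuration>` root element, and the post does not show whether the fragment above was placed there. The sketch below is one way to check that a file parses and to list the properties it defines. `read_hadoop_properties` and `SAMPLE_SITE_XML` are hypothetical names used for illustration, and the check says nothing about whether Nutch actually reads this particular file in local mode.

```python
import xml.etree.ElementTree as ET


def read_hadoop_properties(xml_text: str) -> dict:
    """Return {name: value} for each <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)  # raises ParseError on malformed XML
    if root.tag != "configuration":
        raise ValueError("expected <configuration> as the root element")
    properties = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            properties[name.strip()] = (value or "").strip()
    return properties


# The two properties from the post, wrapped in the root element.
SAMPLE_SITE_XML = """\
<configuration>
  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
  </property>
</configuration>
"""

print(read_hadoop_properties(SAMPLE_SITE_XML))
```

Passing the bare fragment without the `<configuration>` wrapper makes the parser raise an error, because an XML document can have only one root element.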


Thanks,
-- Weiwei

On Sat, Nov 27, 2010 at 7:36 AM, xiao yang <[hidden email]> wrote:

> Hi, Weiwei
>
> What about the configuration of Hadoop?
> Maybe there are 10 processes with 1 thread each.
>
> Thanks!
> Xiao
>

Re: Full CPU usage

Alexis
Dear Weiwei,

I ran into a similar issue with the Nutch 1.2 release. This was already
discussed here:
http://search.lucidimagination.com/search/document/e63dfbb91194cbbd/cpu_100#464de23fdacc40f5


Running jstack (a command in the bin/ directory of the Sun JDK that takes
the pid as input) shows around 200 running threads with stack traces
like this:

   java.lang.Thread.State: RUNNABLE
        at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:248)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:7)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
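
With a couple of hundred threads in a dump, it helps to group the RUNNABLE ones by the method at the top of their stack. The following is a rough sketch that works on saved jstack output. `runnable_top_frames` is a hypothetical helper, and the thread header lines in `SAMPLE_DUMP` are made up for illustration; only the stack frames come from the trace above.

```python
import re
from collections import Counter


def runnable_top_frames(jstack_output: str) -> Counter:
    """Count RUNNABLE threads by the method at the top of their stack."""
    counts = Counter()
    lines = jstack_output.splitlines()
    for index, line in enumerate(lines):
        if "java.lang.Thread.State: RUNNABLE" not in line:
            continue
        # The first "at ..." line after the state line is the top frame.
        for following in lines[index + 1:]:
            stripped = following.strip()
            if not stripped:
                break  # a blank line ends this thread's section
            match = re.match(r"at\s+([\w.$<>]+)\(", stripped)
            if match:
                counts[match.group(1)] += 1
                break
    return counts


# Illustrative dump: header lines are invented, frames are from the post.
SAMPLE_DUMP = """\
"thread-a" prio=6 runnable
   java.lang.Thread.State: RUNNABLE
        at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:248)
        at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)

"thread-b" prio=6 runnable
   java.lang.Thread.State: RUNNABLE
        at org.apache.tika.parser.video.FLVParser.parse(FLVParser.java:248)
        at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:18)

"thread-c" prio=6 waiting
   java.lang.Thread.State: WAITING
        at java.lang.Object.wait(Native Method)
"""

for method, count in runnable_top_frames(SAMPLE_DUMP).most_common():
    print(count, method)
```

If one parser method dominates the counts, as `FLVParser.parse` does in this sample, that points at the parser rather than at the fetcher threads.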

The shipped jar that includes the Tika parser library is:
$NUTCH_HOME/plugins/parse-tika/tika-parsers-0.7.jar

I no longer ran into the problem after switching to the Tika 0.8
snapshot. I guess one way to fix it is to replace the shipped jar with
one built from the SVN trunk using Maven:

# Use an absolute path so $TIKA_HOME still resolves after the cd below
$ export TIKA_HOME=$PWD/tika
$ svn co http://svn.apache.org/repos/asf/tika/trunk $TIKA_HOME
$ cd $TIKA_HOME
$ mvn install
$ rm $NUTCH_HOME/plugins/parse-tika/tika-parsers-0.7.jar
$ cp $TIKA_HOME/tika-parsers/target/tika-parsers-0.9-SNAPSHOT.jar \
    $NUTCH_HOME/plugins/parse-tika/


Hope it helps. Please let us know if that fixes your issue.

Alexis



On Sat, Nov 27, 2010 at 10:55 AM, Weiwei Xiong <[hidden email]> wrote:

> Thanks for your tips, Xiao.
>
> I am currently running Nutch on a single machine, so I didn't change any
> Hadoop-related configuration. Or should I? I assume Nutch sets the default
> number of map/reduce tasks to 1. Is this true?
>
> If I have to change the Hadoop MapReduce configuration in a single-machine
> environment, could anyone tell me which file I should change? I tried to
> specify the number of map and reduce tasks, but it didn't work out.
> Below is the configuration I added to mapred-site.xml:
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>1</value>
> </property>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>1</value>
> </property>
>
>
> Thanks,
> -- Weiwei
>