dual-core cpu usage while parsing and indexing

Tomislav Poljak
Hi,
I have noticed that Nutch uses only one CPU (core) while parsing segment
data, updating, indexing, or doing other CPU-intensive operations.
Actually it uses both, but alternately: when one CPU goes to 100% the
other is at 1%, and then they switch (both CPUs are never at 100%). For
example, when parsing segment data the Nutch java process uses 100% CPU
(according to top) for a long time, but when I look at the CPU history
(with System Monitor) I see that only one CPU is at 100% while the other
is barely used (and then they switch). Is it possible to configure Nutch
to use both CPUs (cores) simultaneously to get maximum performance (more
threads for parse, update, index, merge)? I am also curious why Nutch
uses so much CPU time when parsing fetched data while there is
absolutely no IO (no disk reads or writes).


Thanks,
     Tomislav

Re: dual-core cpu usage while parsing and indexing

Andrzej Białecki-2
Tomislav Poljak wrote:

> Hi,
> I have noticed that Nutch uses only one CPU (core) while parsing segment
> data, updating, indexing, or doing other CPU-intensive operations.
> Actually it uses both, but alternately: when one CPU goes to 100% the
> other is at 1%, and then they switch (both CPUs are never at 100%). For
> example, when parsing segment data the Nutch java process uses 100% CPU
> (according to top) for a long time, but when I look at the CPU history
> (with System Monitor) I see that only one CPU is at 100% while the other
> is barely used (and then they switch). Is it possible to configure Nutch
> to use both CPUs (cores) simultaneously to get maximum performance (more
> threads for parse, update, index, merge)? I am also curious why Nutch
> uses so much CPU time when parsing fetched data while there is
> absolutely no IO (no disk reads or writes).
>

This problem is likely related to the OS scheduler, or to the way this
JVM implementation uses kernel threads. Perhaps the OS offers a way to
select how application threads are mapped to kernel threads? (FreeBSD
does; I'm not that familiar with Linux.)
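
One way to narrow this down is to check, outside of Nutch, whether the JVM can keep both cores busy at all. The sketch below is not Nutch code; the class and method names (CoreCheck, burn, timeParallel) are made up for illustration. It prints how many processors the JVM reports and times the same CPU-only loop on one thread and then on one thread per core. If the multi-threaded run takes about as long as the single-threaded one and System Monitor shows both cores loaded, the JVM and scheduler are probably fine, and the alternating pattern would be consistent with a single busy thread being moved between cores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class CoreCheck {

    // Number of processors the JVM reports as available to it.
    static int availableCores() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Pure CPU work with no IO; the result is returned so the loop is not discarded.
    static long burn(long iterations) {
        long acc = 0;
        for (long i = 0; i < iterations; i++) {
            acc += i % 7;
        }
        return acc;
    }

    // Runs one copy of burn() on each of `threads` threads and returns elapsed milliseconds.
    static long timeParallel(int threads, long iterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> results = new ArrayList<>();
            long start = System.nanoTime();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> burn(iterations)));
            }
            for (Future<Long> f : results) {
                f.get(); // wait for every worker to finish
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int cores = availableCores();
        long iterations = 2_000_000_000L; // arbitrary; adjust so one run takes a few seconds
        System.out.println("JVM sees " + cores + " processor(s)");
        System.out.println("1 thread:  " + timeParallel(1, iterations) + " ms");
        System.out.println(cores + " thread(s): " + timeParallel(cores, iterations) + " ms");
    }
}
```

Watch top or System Monitor while it runs. If the second run takes roughly twice as long as the first on a dual-core machine, the threads are probably being confined to one core.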

Long periods of no IO during parsing are probably related to the fact
that Hadoop uses internal buffers which are several MB in size.
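
To illustrate the buffering effect in general terms, here is a small sketch using java.io.BufferedOutputStream as a stand-in. It is not Hadoop's actual buffer code, and the class names (BufferDemo, CountingSink) and the sizes are assumptions chosen for the example. It counts how many writes reach the underlying sink before the final flush: with a buffer larger than the data, the program does only CPU and memory work until the flush.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferDemo {

    // Stand-in for a disk file: counts how often and how much it is written to.
    static class CountingSink extends OutputStream {
        int writeCalls = 0;
        long bytes = 0;

        @Override
        public void write(int b) {
            writeCalls++;
            bytes++;
        }

        @Override
        public void write(byte[] b, int off, int len) {
            writeCalls++;
            bytes += len;
        }
    }

    // Writes `records` records of `recordSize` bytes through a buffer of `bufferSize`
    // bytes and returns how many writes reached the sink before the final flush.
    static int writesBeforeFlush(int bufferSize, int records, int recordSize) throws IOException {
        CountingSink sink = new CountingSink();
        BufferedOutputStream out = new BufferedOutputStream(sink, bufferSize);
        byte[] record = new byte[recordSize];
        for (int i = 0; i < records; i++) {
            out.write(record, 0, recordSize);
        }
        int before = sink.writeCalls;
        out.flush(); // only now is the remaining buffered data pushed to the sink
        return before;
    }

    public static void main(String[] args) throws IOException {
        // About 1 MB of data in 100-byte records (sizes are arbitrary).
        int records = 10_000;
        int recordSize = 100;
        System.out.println("64 KB buffer, writes before flush: "
                + writesBeforeFlush(64 * 1024, records, recordSize));
        System.out.println("4 MB buffer, writes before flush:  "
                + writesBeforeFlush(4 * 1024 * 1024, records, recordSize));
    }
}
```

With the larger buffer, nothing reaches the sink until flush() is called, which would look like a long stretch of CPU activity with no disk IO.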

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com