Performance problems and segmenting

Performance problems and segmenting

JoostRuiter
Hi All,

First off, I'm quite the noob when it comes to Nutch, so don't bash me if the following is an enormously stupid question.

We're running Nutch on a dual-core P4 system (800 MHz FSB) with 4 GB of RAM and a 500 GB SATA (3 Gb/s) disk. We indexed 350,000 pages into a single 15 GB segment.


Performance is really poor; when we do get search results, they take several minutes to come back. With longer queries we get the following:

"java.lang.OutOfMemoryError: Java heap memory"

What we have tried to improve this:
- Slice the segments into smaller chunks (max 50,000 URLs per segment)
- Set io.map.index.skip to 8
- Set indexer.termIndexInterval to 1024
- Cluster with Hadoop (4 nodes to search)

Any ideas? Missing information? Please let me know; this is my graduation internship and I would really like to get a good grade ;)
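For reference, the two tuning properties above live in the Hadoop/Nutch configuration files; the following is a minimal nutch-site.xml sketch, assuming a Nutch 0.8/0.9-era layout, with the values as posted (the file itself is illustrative, not the poster's actual configuration):

<!-- nutch-site.xml: only the two properties mentioned above; values as posted -->
<configuration>
  <property>
    <name>io.map.index.skip</name>
    <!-- skip this many MapFile index entries when loading them into memory,
         trading lookup speed for a smaller heap footprint -->
    <value>8</value>
  </property>
  <property>
    <name>indexer.termIndexInterval</name>
    <!-- Lucene term index interval; larger values mean the searcher holds
         fewer terms in memory at the cost of slower term lookup -->
    <value>1024</value>
  </property>
</configuration>

Note that indexer.termIndexInterval is applied when the index is written, so it only takes effect on indexes built after the change.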

Re: Performance problems and segmenting

Briggs
How much memory are you currently allocating to the search servers?





--
"Conscious decisions by concious minds are what make reality real"

Re: Performance problems and segmenting

JoostRuiter
Dear Briggs,

Currently we have allocated 1 GB to the JVM running Resin/Tomcat.

Greetings,

Joost
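For comparison, a minimal sketch of giving the search webapp's container a larger heap, assuming a stock Tomcat started through catalina.sh (Resin passes the same -Xmx flag through its own configuration); the exact value is an assumption, and a 32-bit Windows JVM typically cannot go much beyond roughly 1.5 GB:

# Heap flags for the JVM running the Nutch search webapp; the value is an assumption.
export CATALINA_OPTS="-Xms512m -Xmx1400m"
$CATALINA_HOME/bin/catalina.sh start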


Re: Performance problems and segmenting

Dennis Kubes
In reply to this post by JoostRuiter
Without more information, this sounds like the nutch-site.xml used by your Tomcat search webapp is set up to use the DFS rather than the local file system.  Remember that job processing occurs on the DFS, but for searching, indexes are best moved to the local file system.

Dennis Kubes
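A minimal sketch of what the search webapp's nutch-site.xml might look like when pointed at the local file system, assuming a Nutch 0.8/0.9-era configuration; the directory path is illustrative, not the poster's actual layout:

<!-- nutch-site.xml for the search webapp only; crawl/index jobs keep their own config -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- "local" makes the searcher read the local file system instead of DFS -->
    <value>local</value>
  </property>
  <property>
    <name>searcher.dir</name>
    <!-- local directory containing the index/ and segments/ directories -->
    <value>/data/local/crawl</value>
  </property>
</configuration>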


Re: Performance problems and segmenting

Briggs
One more thing...

Are you using a distributed index?  If so, you do not want to do that; indexes should be local to the machine that is being searched.
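If the crawl data currently sits in DFS, one way to follow that advice is to copy it down to the search machine's local disk and point searcher.dir at the copy; a sketch, assuming the Hadoop dfs shell and with paths that are illustrative rather than the poster's layout:

# Pull the crawl directory (index, segments, linkdb) out of DFS onto local disk.
bin/hadoop dfs -copyToLocal crawl /data/local/crawl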


--
"Conscious decisions by conscious minds are what make reality real"

Re: Performance problems and segmenting

JoostRuiter
Ok, thanks for all your input, guys! I'll discuss this with my co-worker. Dennis, what more information do you need?

Thanks everyone!


Re: Performance problems and segmenting

JoostRuiter
Hey guys,

One more addition: we're not using DFS. We have a single Windows XP box with NTFS (so no distributed index).

Hope this helps, greetings..

And for some strange reason we got the following error after slicing the segments into 50,000-URL pieces:

$ nutch mergesegs arscrminternal/outseg -dir arscrminternal/segments/ -slice 50000
Merging 1 segments to arscrminternal/outseg/20070423163605
SegmentMerger:   adding arscrminternal/segments/20070421110321
SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
Slice size: 50000 URLs.
Slice size: 50000 URLs.
Slice size: 50000 URLs.
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:357)
        at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:627)
        at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:675)


We thought making smaller chunks would help performance, but we didn't even get around to testing it because of the above error. Any ideas?
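The bare "Job failed!" from JobClient.runJob usually hides the underlying exception, which normally ends up in the Hadoop log instead. A quick way to look, assuming the default log4j setup in which bin/nutch writes to logs/hadoop.log under the Nutch directory:

# The real cause is usually logged here rather than printed on the console.
tail -n 200 logs/hadoop.log
# or search for it directly
grep -i -A 10 "exception" logs/hadoop.log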



Re: Performance problems and segmenting

JoostRuiter
I got some additional info from our developer:

"I never
had much luck with the merge tools but you might post this snippit from
your log to the board:

2007-04-23 20:01:56,656 INFO  segment.SegmentMerger - Slice size: 50000 URLs.
2007-04-23 20:01:56,656 INFO  segment.SegmentMerger - Slice size: 50000 URLs.
2007-04-23 21:28:09,031 WARN  mapred.LocalJobRunner - job_gai7an
java.lang.OutOfMemoryError: Java heap space

Which might give them a little more info, since it tells them when it happened."
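The LocalJobRunner in that log line indicates the merge ran inside the single client JVM rather than as distributed tasks, so the heap that overflowed is the one bin/nutch starts. A minimal sketch of raising it, assuming the stock bin/nutch script, which reads NUTCH_HEAPSIZE (in MB) and passes it to the JVM as -Xmx; the value below is an assumption, and a 32-bit Windows JVM usually cannot go much higher:

# bin/nutch turns NUTCH_HEAPSIZE into -Xmx<value>m for the client JVM.
export NUTCH_HEAPSIZE=1500
nutch mergesegs arscrminternal/outseg -dir arscrminternal/segments/ -slice 50000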

