Nutch Search Speed Concern

tl-3
Search Speed

What are the most important factors in Nutch/Lucene's
search speed?

I've been testing Nutch's search speed on a search
pool with about 100M records (separated evenly into 30
segments), and discovered that certain search terms
have a significantly higher search time than others.
Some searches take 30 ms while others take upwards of
3000 ms.

At first, there seemed to be a direct relationship
between the total number of results for a given query
and the time it took to complete. But after further
testing, that relationship did not hold true in all
cases. There seem to be other factors that directly
affect the speed of a search.

Has anyone else encountered this issue? Or does anyone
have some insight into the impact of certain factors on
search speed?

Thanks.

- T


               

RE: Nutch Search Speed Concern

Paul Harrison-2
I too would love to hear some answers on this one.  We have a 100 million
page implementation on 5 machines, each with 4 GB of RAM and 2 SATA drives of
250 GB each.  Part of what I have noticed is that Lucene seems to do some sort
of strange caching: if you run the same search again, the results come back
quite quickly.  I too have noticed that different terms have different
response times, and that the problem gets worse as the number of terms in the
query grows.  I have also noticed that distributed search has problems.  The
main search machine waits for the other machines to serve up their results
before it responds.  So it appears that a search is only as fast as the
slowest responding machine, or as the timeout, whichever comes first.  If
anyone has suggestions on tuning the distributed search, or general
suggestions on speeding up retrieval times with a large set, I am all ears.

Thanks,

Paul  

-----Original Message-----
From: TL [mailto:[hidden email]]
Sent: Thursday, October 13, 2005 12:15 PM
To: [hidden email]
Subject: Nutch Search Speed Concern


Re: [Nutch-general] RE: Nutch Search Speed Concern

Otis Gospodnetic-2-2
Hello,

--- Paul Harrison <[hidden email]> wrote:

> I too would love to hear some answers on this one.  We have a 100 million
> page implementation on 5 machines, 4 GB of ram, and 2 SATA drives of 250 GB
> each.  Part of what I have noticed is that Lucene does some sort of strange
> caching in that if you do subsequent searches on a search the return results
> are quite quick.  I too have noticed that different terms have

That's probably your OS/FS caching.  Lucene doesn't cache anything.

> different search responses and that the problem gets worse with the number
> of terms in the query.

Yes, that makes sense.  More complex queries will have to dig through
the index more than simple ones, consequently taking more time to
return hits.

> I have also noticed that distributed search has problems.  The main search
> machine waits on other machines to serve up their results before it will
> respond.  So it appears that your search is only as fast as your slowest
> responding machine or whenever the timeout hits (whichever comes first).

I'm no expert, but this sounds reasonable to me - what if your closest
matches happen to be in the index on the slowest search server?

Otis
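
The "slowest machine or timeout" behaviour discussed above can be sketched in a few lines of Python. This is only an illustration of the scatter-gather pattern, not Nutch's distributed search code: `search_shard`, `distributed_search`, the delays, and the scores are all hypothetical stand-ins.

```python
import concurrent.futures as cf
import time

def search_shard(delay_s, hits):
    """Stand-in for one remote search server: sleeps, then returns its hits."""
    time.sleep(delay_s)
    return hits

def distributed_search(shards, timeout_s, top_n=10):
    """Query every shard in parallel and merge whatever arrives in time.

    shards is a list of (delay_s, hits) pairs; hits are (score, doc_id) tuples.
    Returns the merged top hits and the number of shards that missed the timeout.
    """
    pool = cf.ThreadPoolExecutor(max_workers=len(shards))
    futures = [pool.submit(search_shard, d, h) for d, h in shards]
    # The front end blocks here until all shards answer or the timeout fires.
    done, not_done = cf.wait(futures, timeout=timeout_s)
    merged = [hit for f in done for hit in f.result()]
    pool.shutdown(wait=False)  # do not block on the stragglers
    merged.sort(reverse=True)  # highest score first
    return merged[:top_n], len(not_done)

shards = [
    (0.01, [(0.9, "a"), (0.2, "b")]),
    (0.02, [(0.5, "c")]),
    (0.60, [(0.99, "slow")]),  # the best hit lives on the slowest shard
]
print(distributed_search(shards, timeout_s=0.2))
print(distributed_search(shards, timeout_s=2.0))
```

With the short timeout the answer comes back quickly but without the best hit; with the long timeout the best hit is included, at the price of waiting for the slowest shard.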


RE: Nutch Search Speed Concern

mhunter
In reply to this post by Paul Harrison-2
Paul and TL,
I was wondering if you could detail how you have your clusters configured
hardware-wise, i.e. how many servers are used for each purpose, especially
with regard to how your storage is configured.

We tested search for a 20 million page index on a dual-core 64-bit machine
with 8 GB of RAM, with the Nutch data stored on another server and accessed
through Linux NFS, and its performance was terrible. It looks like the
bottleneck was NFS, so I was wondering how you had your storage set up.  Are
you using NDFS, or is it split up over multiple servers?  We are trying to
build a system that could handle at least 50 million pages, so we would
appreciate any advice on the best way to configure the servers.  Originally
we were thinking that 3 servers (1 for crawling and indexing and 2 for
search) would be enough for that size of index.

Thanks,
Murray  

-----Original Message-----
From: Paul Harrison [mailto:[hidden email]]
Sent: Friday, October 14, 2005 7:40 PM
To: [hidden email]
Subject: RE: Nutch Search Speed Concern


RE: Nutch Search Speed Concern

Paul Harrison-2
Murray,

We are running on the following:

5 Pentium 4 3.2 GHz machines with 4 GB of RAM each, one 40 GB OS drive and two
250 GB SATA data drives each.  We are running the latest version of Fedora and
have the data drives set up with ReiserFS.  We are running JDK 1.5 and Tomcat
5.5.

On a small set of 20 million I don't see much performance degradation,
especially if it is all on one machine.  Where things get bad is in the
distributed search.  We are actually contemplating rewriting the distributed
search code.

Thanks,

Paul

-----Original Message-----
From: Murray Hunter [mailto:[hidden email]]
Sent: Monday, October 17, 2005 9:11 AM
To: [hidden email]
Subject: RE: Nutch Search Speed Concern


RE: Nutch Search Speed Concern

tl-3
Hey Paul and Murray,

We're running a similar setup. We have two machines
running distributed search, and we're seeing decent
performance gains from it.

Our machines have similar specs to Paul's: P4 CPUs,
4 GB of RAM, SATA drives.

NFS would be a performance killer.

But my problem only occurs for specific searches. We
have some search terms that return in under 50
ms, and others that take up to 4000 ms.



Re: Nutch Search Speed Concern

Doug Cutting-2
In reply to this post by mhunter
Murray Hunter wrote:
> We tested search for a 20 Million page index on a dual core 64 bit machine
> with 8 GB of ram using storage of the nutch data on another server through
> linux nfs, and it's performance was terrible. It looks like the bottleneck
> was nfs, so I was wondering how you had your storage set up.  Are you using
> NDFS, or is it split up over multiple servers?

For good search performance, indexes and segments should always reside
on local volumes, not in NDFS and not in NFS.  Ideally these can be
spread across the available local volumes, to permit more parallel disk
i/o.  As a rule of thumb, searching starts to get slow with more than
around 20M pages per node.  Systems larger than that should benefit from
distributed search.

Doug
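
As a back-of-the-envelope check, the 20M-pages-per-node rule of thumb can be turned into a node count. The sketch below assumes pages are split evenly across nodes and treats the 20M figure as a rough guideline rather than a hard limit; the function names are made up for illustration.

```python
import math

PAGES_PER_NODE = 20_000_000  # rule of thumb from this thread, not a hard limit

def search_nodes_needed(total_pages, limit=PAGES_PER_NODE):
    """Smallest number of search nodes that keeps each node at or under the limit."""
    return max(1, math.ceil(total_pages / limit))

def average_load(total_pages, nodes):
    """Pages per node if the index is split evenly."""
    return total_pages / nodes

# Index sizes mentioned in this thread.
for pages in (20_000_000, 50_000_000, 100_000_000):
    print(pages, "pages ->", search_nodes_needed(pages), "search node(s)")
```

By this estimate a 50M-page index on two search servers would carry about 25M pages per node, somewhat above the guideline, while 100M pages over five machines sits right at it.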

RE: Nutch Search Speed Concern

Goldschmidt, Dave
In reply to this post by tl-3
What additional rules of thumb exist beyond the 20M pages per node
threshold -- i.e. when distributed search becomes necessary?

Thanks,
DaveG

-----Original Message-----
From: Doug Cutting [mailto:[hidden email]]
Sent: Monday, October 17, 2005 1:38 PM
To: [hidden email]
Subject: Re: Nutch Search Speed Concern


RE: Nutch Search Speed Concern

Earl Cahill
In reply to this post by Paul Harrison-2
Any way you could post your conf/nutch-site.xml and
walk through your crawl process a bit?

Thanks,
Earl


RE: Nutch Search Speed Concern

mhunter
In reply to this post by Doug Cutting-2
Doug,
Our frontend server has 6 SCSI drive bays. Would you suggest 6 separate
volumes, one RAID 10 volume, or perhaps three RAID 0 arrays? We have
networked storage that we could use for backups, so we are not too concerned
about data loss.

Thanks,
Murray


Re: Nutch Search Speed Concern

Doug Cutting-2
Murray Hunter wrote:
> Our frontend server has 6 SCSI drive bays, would you suggest 6 separate
> volumes or one raid 10 volume, or perhaps three raid 0 arrays? We have
> networked storage that we could use for backups so we are not to concerned
> about data loss.

One raid 0 or 10 volume should work well for searching, and simplifies
allocation.

Doug

Re: Nutch Search Speed Concern

Jay Pound
Doug,
    Actually, splitting the segments across the drives as individual drives
is much faster. You're waiting for it to access random data, so your
limitation is the hard drives' random access time. 15k SCSI drives can hold
about 4-5 million pages each before they become slower than 1 sec per search.
Test out my current search engine setup: I'm running 6 hard drives (5 are 10k
9.1 GB SCSI and 1 is a 15k 18 GB SCSI). The server is a quad Xeon 550 MHz with
1 GB of RAM, running Windows 2000. There is no RAID controller, just Ultra
Wide SCSI 2 (80 MB/s). Each drive just has a segments folder containing
separate segments without a combined index. The machine has 3 million pages
indexed for searching. It's running distributed search with 6 separate
servers running, and Tomcat is also running on the same machine, which is
using all of its 1 GB of RAM.
-J
PS: http://search.fromped.com
PPS: I have 8x 320 GB SATA drives hooked into my big server running RAID 0,
and it can do fewer searches per second as RAID 0 than the old Xeon with its
10k and 15k drives. Even though the SCSI drives are only 55 MB/s each and the
8x 320 GB array is over 400 MB/s, it's still slower for random accesses! Can
anyone say non-volatile storage? If only they made 300 GB flash drives for
under $100K!
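
The access-time argument can be illustrated with a rough cost model. Everything numeric below is an assumption for illustration except the 55 MB/s and 400 MB/s transfer rates quoted above: the access times (6 ms for a fast SCSI drive, 13 ms for a SATA drive), the 500 random reads per query, and the 4 KB read size are guesses, and the model ignores caching and concurrent queries.

```python
def random_read_time_s(reads, read_bytes, access_s, mb_per_s):
    """Time for `reads` serial random reads: one access plus one transfer each."""
    transfer_s = read_bytes / (mb_per_s * 1_000_000)
    return reads * (access_s + transfer_s)

def split_across_drives_s(reads, drives, read_bytes, access_s, mb_per_s):
    """Same reads divided evenly over independent drives searched in parallel."""
    per_drive = -(-reads // drives)  # ceiling division
    return random_read_time_s(per_drive, read_bytes, access_s, mb_per_s)

READS, READ_BYTES = 500, 4096  # assumed workload per query

one_scsi = random_read_time_s(READS, READ_BYTES, access_s=0.006, mb_per_s=55)
sata_raid0 = random_read_time_s(READS, READ_BYTES, access_s=0.013, mb_per_s=400)
six_scsi = split_across_drives_s(READS, 6, READ_BYTES, access_s=0.006, mb_per_s=55)

print(f"one SCSI drive:        {one_scsi:.2f} s")
print(f"SATA RAID 0 volume:    {sata_raid0:.2f} s")
print(f"six SCSI drives split: {six_scsi:.2f} s")
```

Under these assumptions the access-time term dominates each small read, so the much higher sequential transfer rate of the RAID 0 volume barely helps, while dividing the reads over independent drives cuts the time roughly in proportion to the number of drives.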
----- Original Message -----
From: "Doug Cutting" <[hidden email]>
To: <[hidden email]>
Sent: Monday, October 17, 2005 4:50 PM
Subject: Re: Nutch Search Speed Concern



Re: Nutch Search Speed Concern

tl-3
In reply to this post by Doug Cutting-2
You mentioned that as a rule of thumb each node should
only have about 20M pages. What's the main bottleneck
that's encountered around 20M pages? Disk I/O, CPU
speed?

Thanks.

TL



Re: Nutch Search Speed Concern

Doug Cutting-2
TL wrote:
> You mentioned that as a rule of thumb each node should
> only have about 20M pages. What's the main bottleneck
> that's encountered around 20M pages? Disk i/o , cpu
> speed?

Either or both, depending on your hardware, index, traffic, etc.
CPU-time to compute results serially can average up to a second or more
with ~20M page indexes.  And the total amount of i/o time per query on
indexes this size can be more than a second.  If you can spread the i/o
over multiple spindles then it may not be the bottleneck.

Doug
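
A toy latency model may help show how spreading I/O over spindles shifts the bottleneck. The 1.0 s CPU and 1.5 s I/O figures below are illustrative values in the range mentioned above, and the assumption that I/O time divides evenly by the number of spindles is a simplification; the function names are hypothetical.

```python
def query_time_s(cpu_s, io_s, spindles=1, overlap=False):
    """Rough per-query latency with I/O spread evenly over `spindles` disks.

    overlap=False adds CPU and I/O time; overlap=True assumes they fully overlap.
    """
    io_per_query = io_s / spindles
    return max(cpu_s, io_per_query) if overlap else cpu_s + io_per_query

def bottleneck(cpu_s, io_s, spindles=1):
    """Name whichever component is larger once I/O is spread out."""
    return "io" if io_s / spindles > cpu_s else "cpu"

for n in (1, 2, 4):
    print(n, "spindle(s):", bottleneck(1.0, 1.5, n), query_time_s(1.0, 1.5, n))
```

In this model a single spindle leaves I/O as the larger term, while two or more spindles make CPU time the limit, at which point adding disks stops helping.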

Re: Nutch Search Speed Concern

Stefan Groschupf-2
In reply to this post by mhunter
Murray,
you may find this article interesting, especially the part about
disks.
http://www.acmqueue.org/modules.php?name=Content&pa=showpage&pid=143

Stefan
On 17.10.2005 at 22:09, Murray Hunter wrote:

> Doug,
> Our frontend server has 6 SCSI drive bays, would you suggest 6 separate
> volumes or one raid 10 volume, or perhaps three raid 0 arrays? We have
> networked storage that we could use for backups so we are not to concerned
> about data loss.
>
> Thanks,
> Murray