Optimizing search speed & performance for a 10G Index


Chun Wei Ho
Hi,

We run a search engine based on Lucene 1.9.1 / Nutch 0.7.2. Our index
has approximately 2 million documents and the physical size of it is
about 10 GB. We run it as a Tomcat web application on a Fedora Core 4
server with dual Xeon 3.2GHz processors and 4GB RAM.

We receive about 46500 web search requests a day (ranging from 50-300
requests per 5 minutes across the day). Each web search request could
spawn about one to three actual Lucene searches. Our average response
time (measured on the server side, so it excludes network latency) is
about 2 seconds.
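
To put the figures above in perspective, here is a rough back-of-the-envelope sketch in Java. The class and method names are made up for illustration, and the in-flight estimate uses Little's law (rate times latency), which is an assumption brought in here and not something measured on the server. With these numbers the load works out to roughly 0.5 web requests per second on average, 1 per second in the busiest 5-minute window, and up to 3 Lucene searches per second at peak.

```java
// Rough load estimate from the figures quoted in the post.
// All names here are hypothetical; nothing below touches Lucene or Nutch.
final class SearchLoadEstimate {

    /** Converts a request count observed over a window into requests per second. */
    static double perSecond(long requests, long windowSeconds) {
        return (double) requests / windowSeconds;
    }

    /** Little's law: average number of requests in flight = arrival rate * latency. */
    static double inFlight(double requestsPerSecond, double latencySeconds) {
        return requestsPerSecond * latencySeconds;
    }

    public static void main(String[] args) {
        double dailyAverage = perSecond(46_500, 24L * 60 * 60); // whole-day average
        double quiet = perSecond(50, 5 * 60);                   // quietest 5-minute window
        double peak = perSecond(300, 5 * 60);                   // busiest 5-minute window
        double peakLucene = peak * 3;                           // up to three Lucene searches per request

        System.out.printf("average web requests/s:      %.2f%n", dailyAverage);
        System.out.printf("quiet-window web requests/s: %.2f%n", quiet);
        System.out.printf("peak web requests/s:         %.2f%n", peak);
        System.out.printf("peak Lucene searches/s:      %.2f%n", peakLucene);
        System.out.printf("requests in flight at peak:  %.2f%n", inFlight(peak, 2.0));
    }
}
```

The estimate only says that the request rate itself is modest; it says nothing about where the 2 seconds are spent.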

Does this timing of 2 seconds appear plausible for Lucene, based on
the machine specifications above?


Our index is somewhat complex, with multiple fields such as title,
location, site, and content. For example, a search for entries related
to "Linux and Lucene" in "Australia" might result in a Lucene query
such as:

((title:linux^1.0 title:lucene^1.0)^4.0)
+((
+(title:linux^5.0 location:linux^1.5 content:linux^1.0)
+(title:lucene^5.0 location:lucene^1.5 content:lucene^1.0))
((+(+content:linux +content:lucene)) +(site:contentsite1
site:contentsite2 site:contentsite3 site:contentsite4
site:contentsite5 site:contentsite6 site:contentsite7)))^0.01))
+location:australia)
+newsdate:[20061107 TO 20061208]
+region:au)
-jobsite:badsite1 -region:badregion1 -jobsite:badsite2
-jobsite:badsite3 -jobsite:badsite4

Does anyone have ideas, or could anyone point us to resources, that
would help us improve this performance? A 2-second response time plus
network latency makes our site feel slow, and that is what we are
trying to reduce.

Thank you.

Re: Optimizing search speed & performance for a 10G Index

Zaheed Haque
On 12/8/06, Chun Wei Ho <[hidden email]> wrote:

> Hi,
>
> We run a search engine based on Lucene 1.9.1 / Nutch 0.7.2. Our index
> has approximately 2 million documents and the physical size of it is
> about 10 GB. We run it as a Tomcat web application on a Fedora Core 4
> server with dual Xeon 3.2GHz processors and 4GB RAM.
>
> We receive about 46500 web search requests a day (ranging from 50-300
> requests per 5 minutes across the day). Each web search request could
> spawn about one to three actual Lucene searches. Our average response
> time (measured on the server side, so it excludes network latency) is
> about 2 seconds.
>
> Does this timing of 2 seconds appear plausible for Lucene, based on
> the machine specifications above?
>
>
> Our index is somewhat complex, with multiple fields such as title,
> location, site, and content. For example, a search for entries related
> to "Linux and Lucene" in "Australia" might result in a Lucene query
> such as:

Just a thought: wouldn't you want to move the additional info into a
separate Lucene index, and keep one field in the main index that acts
as the look-up key into that "additional info" index? In my use case I
have a Nutch index with one added field, infoid, and a separate index
that contains only the site-related info (address, email, phone, etc.)
for that specific page.

My experience is that the more fields you have in the index, the slower
the search becomes.
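
The split described above could be sketched roughly as follows. This is only an illustration of the look-up pattern: plain Java collections stand in for the two Lucene indexes, and every name except the infoid field (the class, its methods, the substring match) is hypothetical and not taken from Nutch or Lucene.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// In-memory stand-in for "slim main index + separate info index keyed by infoid".
final class TwoIndexLookupSketch {

    /** A document in the main index: searchable text plus the infoid key only. */
    static final class PageHit {
        final String url;
        final String content;
        final String infoid;

        PageHit(String url, String content, String infoid) {
            this.url = url;
            this.content = content;
            this.infoid = infoid;
        }
    }

    private final List<PageHit> mainIndex = new ArrayList<>();
    private final Map<String, Map<String, String>> infoIndex = new HashMap<>();

    void addPage(String url, String content, String infoid) {
        mainIndex.add(new PageHit(url, content, infoid));
    }

    void addInfo(String infoid, String address, String email, String phone) {
        Map<String, String> info = new HashMap<>();
        info.put("address", address);
        info.put("email", email);
        info.put("phone", phone);
        infoIndex.put(infoid, info);
    }

    /** Step 1: search the main index only (a naive substring match stands in for a real query). */
    List<PageHit> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<PageHit> hits = new ArrayList<>();
        for (PageHit page : mainIndex) {
            if (page.content.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(page);
            }
        }
        return hits;
    }

    /** Step 2: resolve the extra site info for one hit via its infoid. */
    Map<String, String> infoFor(PageHit hit) {
        Map<String, String> info = infoIndex.get(hit.infoid);
        return info != null ? info : Collections.<String, String>emptyMap();
    }
}
```

The point of the pattern is that the second look-up runs only for the handful of hits actually displayed, so the extra fields never take part in the main search.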

> ((title:linux^1.0 title:lucene^1.0)^4.0)
> +((
> +(title:linux^5.0 location:linux^1.5 content:linux^1.0)
> +(title:lucene^5.0 location:lucene^1.5 content:lucene^1.0))
> ((+(+content:linux +content:lucene)) +(site:contentsite1
> site:contentsite2 site:contentsite3 site:contentsite4
> site:contentsite5 site:contentsite6 site:contentsite7)))^0.01))
> +location:australia)
> +newsdate:[20061107 TO 20061208]
> +region:au)
> -jobsite:badsite1 -region:badregion1 -jobsite:badsite2
> -jobsite:badsite3 -jobsite:badsite4
>
> Does anyone have ideas or could point us to resources that would allow
> us to improve this performance? A 2-second response time plus network
> latency makes our site feel slow, and that is what we are trying to
> reduce.

You could do a bunch of tweaking in the nutch-site.xml file, but I am
guessing you have already tried that.

I have also heard of (but never tried) using something like memcached
or OSCache. There is an OSCache-type implementation under contrib/web2
in Nutch 0.8; maybe you can look there to see how you could implement
it in 0.7. Just some wild guesses...
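
For what it is worth, the caching idea can be illustrated without any extra library. The sketch below is a small LRU cache built on the JDK's LinkedHashMap; it is not the OSCache API and not the Nutch contrib/web2 code, and the class name and the choice of the query string as the key are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache for search results, keyed by the query string.
// Methods are synchronized because a servlet container serves requests on many threads.
final class QueryResultCache<V> {

    private final Map<String, V> entries;

    QueryResultCache(final int maxEntries) {
        // accessOrder = true makes iteration order "least recently used first".
        this.entries = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                // Evict the least recently used entry once the cache is over capacity.
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached result for this query, or null on a cache miss. */
    synchronized V get(String query) {
        return entries.get(query);
    }

    synchronized void put(String query, V result) {
        entries.put(query, result);
    }

    /** Call this when the index is reopened so stale results are not served. */
    synchronized void clear() {
        entries.clear();
    }

    synchronized int size() {
        return entries.size();
    }
}
```

A caller would check get(query) first and run the Lucene search only on a miss, then put the result. How much this helps depends on how often the same queries repeat, which the thread does not say.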

> Thank you.
>