What's the bottleneck?


Jason Rennie-2
We have a 14 million document index that we only use for querying
(optimized, read-only).  When we issue queries with a few relatively rare
words, the query returns quickly.  However, when the query is longer and
uses more common words (hitting, say, ~1 million docs), it can take
seconds to return.  I'd like to know: what's the bottleneck?  It doesn't
seem to be disk: I/O wait times on the machine are much, much lower than
on our database servers (e.g. 3% vs. 50%).  Our search server is an
8-core machine and we regularly see CPU holding above 100%, so CPU seems
plausible, but would it really take that long to compute scores?

We're using DisMax.  We search over a number of different fields (5, to
be exact).  We also have an fq on a single-digit status field.  Does it
make sense that computation time could easily exceed a second?  If CPU is
the bottleneck, is there anything we could do to easily trim down
computation time (besides removing common words from the query)?

Jason

--
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/

Re: What's the bottleneck?

Mark Miller-3
What kind of traffic are you getting when it takes seconds? 1 request? 12?

Re: What's the bottleneck?

Jason Rennie-2
On Thu, Sep 11, 2008 at 11:54 AM, Mark Miller <[hidden email]> wrote:

> What kind of traffic are you getting when it takes seconds? 1 request? 12?
>

I'd estimate concurrency around 3, though the speed doesn't change much when
we run the same query on a server with zero traffic.

Jason

RE: What's the bottleneck?

r.prieto
In reply to this post by Jason Rennie-2
Hi Jason, some questions:

What is your index configuration?
What is the average size of your returned fields?
How much memory does your system have?
Do you have long fields that are returned in the queries?
Do you have highlighting activated in the request?
Are you using a multi-valued field for the filter?




Re: What's the bottleneck?

Jason Rennie-2
On Thu, Sep 11, 2008 at 1:29 PM, <[hidden email]> wrote:

> What is your index configuration?


Not sure what you mean.  We're using Solr 1.2, though we've tested with a
recent nightly and didn't see a significant change in performance...


> What is the average size of your returned fields?


Returned fields are relatively small, ~200 characters total per document.
We're requesting the top 10 or so docs.

> How much memory does your system have?


8GB.  We give the JVM a 2GB (max) heap.  We have another Solr instance
running on the same box, also with a 2GB heap.  The Linux kernel caches
~2.5GB of disk.


> Do you have long fields that are returned in the queries?


No.  The searched and returned fields are relatively short.  One
searched-over (but not returned) field can get up to a few hundred
characters, but it's safe to assume they're all < 1k.


> Do you have highlighting activated in the request?


Nope.


> Are you using a multi-valued field for the filter?


No, the field does not have the multiValued attribute turned on.  The
field we filter on (via fq) is just an integer.

Any thoughts/comments are appreciated.

Thanks,

Jason

Re: What's the bottleneck?

Mike Klaas
In reply to this post by Jason Rennie-2

Are you using pf?  Phrase queries are much more expensive than term
queries.

If you have a restrictive fq, you might try an approach similar to the
one in https://issues.apache.org/jira/browse/SOLR-407 .
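
The general idea of filter-first evaluation, sketched as toy Java
(hypothetical names and data layout, not the actual SOLR-407 patch):
consult the cached filter's bitset before doing any scoring work for a
candidate document.

    import java.util.BitSet;

    // Toy sketch: score only documents that survive the cheap, cached
    // filter (e.g. the bitset of docs matching fq=status:0).
    final class FilterFirst {
        static float sumScores(int[] postings, float[] rawScores, BitSet fqDocs) {
            float total = 0f;
            for (int i = 0; i < postings.length; i++) {
                if (!fqDocs.get(postings[i])) {
                    continue;          // filtered out: skip scoring entirely
                }
                total += rawScores[i]; // expensive scoring only for survivors
            }
            return total;
        }
    }

With a restrictive fq, this skips the per-document score computation for
most of the posting list.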

-Mike

Re: What's the bottleneck?

JerylCook
In reply to this post by Jason Rennie-2
I think you should just break up your index across boxes and do a
"federated search" across them, since you mentioned everything runs on a
single machine.
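
For reference, the distributed search support slated for Solr 1.3 exposes
this via a shards parameter; a hypothetical request (host names invented)
would look like:

    http://search1:8983/solr/select?q=shirt&shards=search1:8983/solr,search2:8983/solr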

Jeryl Cook
/^\ Pharaoh /^\
http://pharaohofkush.blogspot.com/
"Whether we bring our enemies to justice, or bring justice to our
enemies, justice will be done."
--George W. Bush, Address to a Joint Session of Congress and the
American People, September 20, 2001



RE: What's the bottleneck?

r.prieto
In reply to this post by Jason Rennie-2
OK, do you have an average figure for Solr's memory usage?

You should look at the actual memory usage of the cached fields, and try
raising the Java heap.

Have you evaluated the performance factors listed here?
http://wiki.apache.org/solr/SolrPerformanceFactors

I think it is a memory problem.  When you issue queries that match few
documents, those documents are loaded into memory (Solr's caches), and
subsequent queries avoid disk I/O.  But when queries match too many
documents, they can't all fit in memory, so every query forces Solr to
load/unload cache entries and read from disk.

Another cause could be Lucene's memory footprint, but I'd need to know
the index's actual memory usage to say.

Sorry for my English :-(




Re: What's the bottleneck?

Grant Ingersoll-2
In reply to this post by Jason Rennie-2
The bottleneck may simply be that there are a lot of docs to score,
since you are using fairly common terms.

Also, what file format (compound, non-compound) are you using?  Is it
optimized?  Have you profiled your app for these queries?  When you say
the "query is longer", define "longer"...  5 terms?  50 terms?  Do you
have lots of deleted docs?  Can you share your DisMax params?  Are you
doing wildcard queries?  Can you share the syntax of one of the
offending queries?

Since you want to keep "stopwords", you might consider a slightly better
use of them, whereby you use them in n-grams only during query parsing.

See also https://issues.apache.org/jira/browse/LUCENE-494 for related
stuff.

-Grant



Re: What's the bottleneck?

Jason Rennie-2
Thanks for all the replies!

Mike: we're not using pf.  Our fq is always "status:0".  The "status"
field is "0" for all good docs (90%+) and some other integer for any
docs we don't want returned.

Jeryl: federated search is definitely something we'll consider.

On Fri, Sep 12, 2008 at 8:39 AM, Grant Ingersoll <[hidden email]> wrote:

> The bottleneck may simply be there are a lot of docs to score since you are
> using fairly common terms.


Yeah, I'm coming to the realization that it may be as simple as that.  Even
a short, simple query like "shirt" can take seconds to return, presumably
because it hits ("numFound") 2 million docs.


> Also, what file format (compound, non-compound) are you using?  Is it
> optimized?  Have you profiled your app for these queries?  When you say the
> "query is longer", define "longer"...  5 terms?  50 terms?  Do you have lots
> of deleted docs?  Can you share your DisMax params?  Are you doing wildcard
> queries?  Can you share the syntax of one of the offending queries?


I think we're using the non-compound format.  We see eight different
files (fdt, fdx, fnm, etc.) in an optimized index.  Yes, it's optimized.
It's also read-only; we don't update/delete.  DisMax: we specify qf, fl,
mm, and fq; mm=1; we use boosts for qf.  No wildcards.  Example query:
"shirt"; it takes 2 seconds to run according to the Solr log and hits 2
million docs.
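
Spelled out as a request, that setup looks roughly like the following
(field names and boost values invented for illustration):

    http://localhost:8983/solr/select?qt=dismax&q=shirt&qf=title^2.0+description^1.0&mm=1&fq=status:0&fl=id,title&rows=10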


> Since you want to keep "stopwords", you might consider a slightly better
> use of them, whereby you use them in n-grams only during query parsing.


Not sure what you mean here...


> See also https://issues.apache.org/jira/browse/LUCENE-494 for related
> stuff.
>

Thanks for the pointer.

Jason

Re: What's the bottleneck?

kkrugler
>> Since you want to keep "stopwords", you might consider a slightly better
>> use of them, whereby you use them in n-grams only during query parsing.
>
> Not sure what you mean here...

You might want to look at how Nutch handles this issue.  Nutch also
has stopwords that it wants to keep around.  So what it does is
generate combo terms like the-<next term> in the index.  The query
parser does the same thing, so that if your query phrase has common
terms, you wind up searching across a much smaller slice of your
index.

This comes, of course, at the expense of a larger index with a lot
more unique terms (due to all of the combo terms).
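
A minimal sketch of the idea in plain Java (hypothetical code, not the
Nutch implementation; note the same expansion has to run at both index
and query time):

    import java.util.*;

    // Fuse each common word with the token that follows it, so a query
    // containing stopwords hits a much rarer posting list.
    final class ComboTerms {
        private static final Set<String> COMMON =
            new HashSet<String>(Arrays.asList("the", "a", "of", "to", "in"));

        static List<String> expand(String[] tokens) {
            List<String> out = new ArrayList<String>();
            for (int i = 0; i < tokens.length; i++) {
                if (COMMON.contains(tokens[i]) && i + 1 < tokens.length) {
                    out.add(tokens[i] + "_" + tokens[i + 1]); // e.g. "the_shirt"
                } else {
                    out.add(tokens[i]);
                }
            }
            return out;
        }

        public static void main(String[] args) {
            // ["the", "red", "shirt"] -> [the_red, red, shirt]
            System.out.println(expand(new String[] {"the", "red", "shirt"}));
        }
    }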

But this can be a big win - for example, at our site
(http://www.krugle.org) we index source files. Without this
optimization, searches could take several seconds. With it, we got
down to < 100ms with lots of breathing room.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: What's the bottleneck?

Otis Gospodnetic-2
In reply to this post by Jason Rennie-2
Jason, you could also post what the final query looks like (after DisMax
chews on it): use &debugQuery=true and let's see if there is anything
strange there.
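
For example (host and handler name assumed for illustration):

    http://localhost:8983/solr/select?qt=dismax&q=shirt&debugQuery=true

The debug section of the response shows the parsed/expanded DisMax query
and per-document score explanations.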

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: What's the bottleneck?

Grant Ingersoll-2
In reply to this post by Jason Rennie-2
See also https://issues.apache.org/jira/browse/SOLR-502 (timeout
searches) and https://issues.apache.org/jira/browse/LUCENE-997.

This is committed on trunk and will be in 1.3.  Don't ask me how it
works, b/c I haven't tried it yet, but maybe Sean Timm or someone can
help out.  I'm not sure if it returns partial results or not.

Also, what kind of caching/warming do you do?  How often do these slow
queries appear?  Have you profiled your application yet?  How many
results are you retrieving?

In some cases, you may just want to figure out how to return a cached
set of results for your most frequent, slow queries.  I mean, if you
know "shirt" is going to retrieve 2 million docs, what difference does
it make if it really has 2 million and 1 documents?  Do the query once,
cache the top, oh, 1000, and be done.  It doesn't even necessarily need
to hit Solr.  I know, I know, it's not search, but most search
applications do these kinds of things.
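
A minimal sketch of that front-side cache (hypothetical Java, not a Solr
feature; a real version would also need invalidation whenever the index
changes):

    import java.util.*;

    // LRU map from frequent slow queries to their cached top-N doc ids.
    final class TopResultsCache {
        interface QueryRunner {            // adapter around the Solr client
            List<Integer> search(String q, int rows);
        }

        private final int topN;
        private final Map<String, List<Integer>> lru;

        TopResultsCache(final int maxQueries, int topN) {
            this.topN = topN;
            this.lru = new LinkedHashMap<String, List<Integer>>(16, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<String, List<Integer>> e) {
                    return size() > maxQueries; // evict least-recently-used query
                }
            };
        }

        synchronized List<Integer> topDocs(String q, QueryRunner solr) {
            List<Integer> cached = lru.get(q);
            if (cached == null) {
                cached = solr.search(q, topN); // only cache misses hit Solr
                lru.put(q, cached);
            }
            return cached;
        }
    }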

Still, it would be nice if there were a little better solution for you.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ








RE: What's the bottleneck?

r.prieto
In reply to this post by Jason Rennie-2
Hi Jason,

I'd like to know how you solved the problem.
Could you post the solution?

Thanks

Raúl

Re: What's the bottleneck?

Sean Timm
In reply to this post by Grant Ingersoll-2
The HitCollector used by the Searcher is wrapped by a TimeLimitedCollector
<http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/TimeLimitedCollector.html>,
which times out search requests that take longer than the maximum
allowed search time limit during the collect.  Any hits that have been
collected before the time expires are returned, and a partialResults
flag is set.
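
In code, the wiring is roughly the following (a sketch against the Lucene
2.4-era API; Solr does this for you once a time limit is configured):

    import org.apache.lucene.search.*;

    // Collect hits normally, but bail out when the time budget expires;
    // whatever was collected before the timeout is still returned.
    final class TimedSearch {
        static TopDocs search(Searcher searcher, Query q, long millisAllowed)
                throws java.io.IOException {
            TopDocCollector collector = new TopDocCollector(10);
            HitCollector timed = new TimeLimitedCollector(collector, millisAllowed);
            try {
                searcher.search(q, timed);
            } catch (TimeLimitedCollector.TimeExceededException e) {
                // timed out: fall through and return partial results
            }
            return collector.topDocs();
        }
    }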

This is the use case that I had in mind:

    The timeout is to protect the server side. The client side can be
    largely protected by setting a read timeout, but if the client
    aborts before the server responds, the server is just wasting
    resources processing a request that will never be used. The partial
    results is useful in a couple of scenarios, probably the most
    important is a large distributed complex where you would rather get
    whatever results you can from a slow shard than throw them away.

    As a real world example, the query "contact us about our site" on a
    2.3MM document index (partial Dmoz crawl) takes several seconds to
    complete, while the mean response time is sub 50 ms. We've had cases
    where a bot walks the next page links (including expensive queries
    such as this). Also users are prone to repeatedly click the query
    button if they get impatient on a slow site. Without a server side
    timeout, this is a real issue.

But, you may find it useful for your scenario.  You aren't guaranteed to
get the most relevant documents returned, however, since they may not
have been collected.  The new distributed search features of 1.3 may be
something you want to look into.  That will allow you to decrease your
response time by dividing your index into smaller partitions.

-Sean
