Performance help for heavy indexing workload

James Brady-3
Hello,
I'm looking for some configuration guidance to help improve  
performance of my application, which tends to do a lot more indexing  
than searching.

At present, it needs to index around two documents / sec - a document  
being the stripped content of a webpage. However, performance was so  
poor that I've had to disable indexing of the webpage content as an  
emergency measure. In addition, some search queries take an  
inordinate length of time - regularly over 60 seconds.

This is running on a medium sized EC2 instance (2 x 2GHz Opterons and  
8GB RAM), and there's not too much else going on on the box. In  
total, there are about 1.5m documents in the index.

I'm using a fairly standard configuration - the things I've tried  
changing so far have been parameters like maxMergeDocs, mergeFactor  
and the autoCommit options. I'm only using the  
StandardRequestHandler, no faceting. I have a scheduled task causing  
a database commit every 15 seconds.

Obviously, every workload varies, but could anyone comment on whether  
this sort of hardware should, with proper configuration, be able to  
manage this sort of workload?

I can't see signs of Solr being IO-bound, CPU-bound or memory-bound,  
although my scheduled commit operation, or perhaps GC, does spike up  
the CPU utilisation at intervals.

Any help appreciated!
James

Fwd: Performance help for heavy indexing workload

James Brady-3
Hi again,
More analysis showed that the extraordinarily long query times only
appear when I specify a sort. A concrete example:

For a querystring such as:
?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=
the QTime is ~500ms.
For a querystring such as:
?indent=on&version=2.2&q=apache+user_id%3A39&start=0&rows=1&fl=*%2Cscore&qt=standard&wt=standard&explainOther=&sort=date_added%20asc
the QTime is ~75s.

I.e. I am using the StandardRequestHandler to search for a
user-entered term ("apache" above) and filtering by a user_id field.

This seems to be the case for every sort option except score asc and  
score desc. Please tell me Solr doesn't sort all matching documents  
before applying boolean filters?

James



Re: Performance help for heavy indexing workload

Erick Erickson
Well, the *first* sort to the underlying Lucene engine is expensive since
it builds up the terms to sort. I wonder if you're closing and opening the
underlying searcher for every request? This is a definite limiter.
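
To make the point concrete, here is a toy Python model of a searcher that builds a per-field sort cache on first use. It is only a sketch: ToySearcher and its methods are invented for illustration and are not Lucene's or Solr's API. The cache belongs to the searcher, so a freshly opened searcher has to pay the build cost again.

```python
class ToySearcher:
    """Hypothetical stand-in for an index searcher; not Lucene's API."""

    def __init__(self, docs):
        self.docs = docs            # list of dicts, position = doc id
        self.sort_caches = {}       # field name -> per-doc sort values
        self.cache_builds = 0       # how many times a cache was built

    def _sort_cache(self, field):
        # Built once per searcher and per field. It covers every doc in
        # the index, not only the docs that match the query.
        if field not in self.sort_caches:
            self.sort_caches[field] = [d[field] for d in self.docs]
            self.cache_builds += 1
        return self.sort_caches[field]

    def search(self, predicate, sort_field, rows=10):
        cache = self._sort_cache(sort_field)
        hits = [i for i, d in enumerate(self.docs) if predicate(d)]
        hits.sort(key=lambda i: cache[i])
        return hits[:rows]


docs = [
    {"user_id": 39, "date_added": 3},
    {"user_id": 7, "date_added": 1},
    {"user_id": 39, "date_added": 2},
]
searcher = ToySearcher(docs)
first = searcher.search(lambda d: d["user_id"] == 39, "date_added")
second = searcher.search(lambda d: d["user_id"] == 39, "date_added")
# A newly opened searcher starts with an empty cache.
reopened = ToySearcher(docs)
```

In this model the second sorted query reuses the cache, while the reopened searcher would rebuild it on its first sorted query.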

Disclaimer: I mostly do Lucene, not SOLR (yet), so don't *even* ask
me how to change this behavior <G>. But your comment about
frequent updates to the index prompted this question....

Best
Erick


Re: Fwd: Performance help for heavy indexing workload

kkrugler
In reply to this post by James Brady-3
Hi James,

>>I'm looking for some configuration guidance to help improve
>>performance of my application, which tends to do a lot more
>>indexing than searching.
>>
>>At present, it needs to index around two documents / sec - a
>>document being the stripped content of a webpage. However,
>>performance was so poor that I've had to disable indexing of the
>>webpage content as an emergency measure. In addition, some search
>>queries take an inordinate length of time - regularly over 60
>>seconds.

In general, immediately updating an index with a continuous stream of
new content and returning fast search results work in opposition. The
searcher's various caches are continuously flushed to avoid stale
content, which can easily kill your performance.

This issue was one of the more interesting topics discussed during
the Lucene BoF meeting at ApacheCon. You're not alone in wanting to
have it both ways, but it's clear this is A Hard Problem.

If you can relax the need for immediate updates to the index, and
accept some level of lag time between receiving new content and this
showing up in the index, then I'd suggest splitting the two
processes. Have a backend system that deals with updates, and then at
some slower interval update the search index.
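
A minimal Python sketch of that split, assuming the backend can simply buffer documents in memory. BatchingIndexer, its parameters and the 600-second interval are hypothetical; the flush callable stands in for whatever actually writes to the search index.

```python
class BatchingIndexer:
    """Buffers incoming documents and pushes them to the search index
    only when `interval` seconds have passed since the last flush."""

    def __init__(self, flush, interval, now):
        self.flush = flush          # callable taking a list of documents
        self.interval = interval    # seconds of lag we are willing to accept
        self.now = now              # clock function, injected for testing
        self.pending = []
        self.last_flush = now()

    def add(self, doc):
        self.pending.append(doc)
        if self.now() - self.last_flush >= self.interval:
            self.flush_now()

    def flush_now(self):
        if self.pending:
            self.flush(self.pending)
            self.pending = []
        self.last_flush = self.now()


# Usage with a fake clock and a list standing in for the search index.
clock = [0]
batches = []
indexer = BatchingIndexer(batches.append, interval=600, now=lambda: clock[0])
indexer.add({"id": 1})
clock[0] = 300
indexer.add({"id": 2})
clock[0] = 600
indexer.add({"id": 3})   # interval elapsed: all three go out as one batch
```

The search index then sees one update (and one cache flush) per interval rather than one per document.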

-- Ken



--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"

Re: Performance help for heavy indexing workload

Walter Underwood, Netflix
In reply to this post by Erick Erickson
That does seem really slow. Is the index on NFS-mounted storage?

wunder



Re: Performance help for heavy indexing workload

Walter Underwood, Netflix
In reply to this post by kkrugler
On 2/12/08 7:40 AM, "Ken Krugler" <[hidden email]> wrote:

> In general immediate updating of an index with a continuous stream of
> new content, and fast search results, work in opposition. The
> searcher's various caches are getting continuously flushed to avoid
> stale content, which can easily kill your performance.

One approach is to have a big, rarely-updated index and a small index
for new or changed content. Once a day, add everything from the small
index into the big one. You may need external bookkeeping for deleted
documents.
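
As a rough illustration of the big-index/small-index idea, here is a toy Python model in which plain dicts stand in for the two indexes. TwoTierIndex and its method names are made up for this sketch, and the deleted set plays the role of the external bookkeeping mentioned above.

```python
class TwoTierIndex:
    """Toy model: dicts of doc id -> document stand in for real indexes."""

    def __init__(self):
        self.big = {}         # rarely updated
        self.small = {}       # receives new or changed content
        self.deleted = set()  # external bookkeeping for deleted doc ids

    def add(self, doc_id, doc):
        self.small[doc_id] = doc
        self.deleted.discard(doc_id)

    def delete(self, doc_id):
        self.small.pop(doc_id, None)
        self.deleted.add(doc_id)   # hides any copy still in the big index

    def search(self, predicate):
        merged = {**self.big, **self.small}   # small index wins on conflicts
        return sorted(i for i, d in merged.items()
                      if i not in self.deleted and predicate(d))

    def daily_merge(self):
        # Fold the small index into the big one, then apply the deletions.
        self.big.update(self.small)
        for doc_id in self.deleted:
            self.big.pop(doc_id, None)
        self.small.clear()
        self.deleted.clear()


index = TwoTierIndex()
index.big = {1: "apache solr", 2: "apache lucene"}
index.add(3, "apache tomcat")
index.delete(2)
before = index.search(lambda text: "apache" in text)
index.daily_merge()
after = index.search(lambda text: "apache" in text)
```

Searches give the same answer before and after the merge; only the big index pays the cost of a large update, and only once a day.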

Another trick from Infoseek.

wunder


Re: Performance help for heavy indexing workload

Mike Klaas-2
In reply to this post by James Brady-3
On 11-Feb-08, at 11:38 PM, James Brady wrote:

> I'm using a fairly standard configuration - the things I've tried  
> changing so far have been parameters like maxMergeDocs, mergeFactor  
> and the autoCommit options. I'm only using the  
> StandardRequestHandler, no faceting. I have a scheduled task  
> causing a database commit every 15 seconds.

By "database commit" do you mean "solr commit"?  If so, that is far  
too frequent if you are sorting on big fields.

I use Solr to serve queries for ~10m docs on a medium size EC2
instance.  This is an optimized configuration where highlighting is
broken off into a separate index, and load balanced into two
subindices of 5m docs apiece.  I do a good deal of faceting but no
sorting.  The only reason that this is possible is that the index is
only updated every few days.

On another box we have a several hundred thousand document index  
which is updated relatively frequently (autocommit time: 20s).  These  
are merged with the static-er index to create an illusion of real-
time index updates.

When Lucene supports efficient, reopen()able fieldcache updates, this
situation might improve, but the above architecture would still
probably be better.  Note that the second index can be on the same
machine.

-Mike

Re: Performance help for heavy indexing workload

James Brady-3
Hi - thanks to everyone for their responses.

A couple of extra pieces of data which should help me optimise -  
documents are very rarely updated once in the index, and I can throw  
away index data older than 7 days.

So, based on advice from Mike and Walter, it seems my best option  
will be to have seven separate indices. 6 indices will never change  
and hold data from the six previous days. One index will change and  
will hold data from the current day. Deletions and updates will be  
handled by effectively storing a revocation list in the mutable index.

In this way, I will only need to perform Solr commits (yes, I did
mean Solr commits rather than database commits in my original post -
my apologies) on the current day's index, and closing and opening new
searchers for these commits shouldn't be as painful as it is currently.

To do this, I need to work out how to do the following:
- parallel multi search through Solr
- move to a new index on a scheduled basis (probably commit and  
optimise the index at this point)
- ideally, properly warm new searchers in the background to further  
improve search performance on the changing index
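
To pin down what I mean, here is a rough Python sketch of the seven-index layout. The "docs-" naming scheme, the function names and the in-memory dicts are all placeholders of mine, and the loop queries the indices one after another rather than in parallel.

```python
from datetime import date, timedelta

RETENTION_DAYS = 7  # data older than 7 days can be thrown away


def index_name(day):
    # Hypothetical naming scheme, one index per calendar day.
    return "docs-" + day.isoformat()


def live_indices(today):
    """Names of the indices to search: today plus the six previous days."""
    return [index_name(today - timedelta(days=n)) for n in range(RETENTION_DAYS)]


def search_all(indices, revoked, today, predicate):
    """Query every live index and drop ids on the revocation list.

    `indices` maps index name -> {doc id: document}; `revoked` is the set of
    ids recorded as deleted or superseded in the mutable (current) index.
    """
    hits = []
    for name in live_indices(today):
        for doc_id, doc in indices.get(name, {}).items():
            if doc_id not in revoked and predicate(doc):
                hits.append(doc_id)
    return hits


today = date(2008, 2, 12)
indices = {
    index_name(today): {"a": "apache news"},
    index_name(today - timedelta(days=6)): {"b": "apache release", "c": "apache old"},
    index_name(today - timedelta(days=7)): {"d": "apache expired"},
}
hits = search_all(indices, revoked={"c"}, today=today,
                  predicate=lambda text: "apache" in text)
```

Moving to a new index each day then amounts to changing which name counts as "today"; the oldest index simply falls out of the live list.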

Does that sound like a reasonable strategy in general, and has anyone  
got advice on the specific points I raise above?

Thanks,
James



RE: Performance help for heavy indexing workload

Lance Norskog-2
1) autowarming: it means that if you have a cached query or similar, and do
a commit, it then reloads each cached query. This is configured in
solrconfig.xml.

2) sorting is a pig. A sort creates an array of N integers where N is the
size of the index, not the query. If the sorted field is anything but an
integer, a second array of size N is created with a copy of the field's
contents.  If you want a field to sort fast, you have to make it an int or
make an integer-format shadow field.

3) Large query return sets cause out-of-memory exceptions. If the Solr
instance is only doing queries, this is OK: the instance keeps working. We
find that if Solr is also indexing when you hit an out-of-memory error, the
instance is unusable until you restart the Java container. This is with
Tomcat 5 and Linux RHEL4 with the standard Linux file system.

4) This can also be done by having one index. You do a mass delete on stuff
from 8 days ago.  There is a larger IT commitment in running multiple Solrs
or Lucene files. This is not Oracle or MySQL, where it is well-behaved and
you get cute little UIs to run everything. A large Solr index with
continuous indexing is not a turnkey application.

5) Be sure to check out 'filters'. These are really useful for trimming
queries if you have commonly used subsets of the index, like "language =
English".
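
As an example of point 5, the user_id restriction from James's query could be sent as a filter query (Solr's fq parameter) instead of being part of q. The Python sketch below only builds the URL; the host, port and path are placeholders, build_query_url is a made-up helper, and it assumes the user_id restriction is meant as a hard filter rather than a scoring clause.

```python
from urllib.parse import urlencode

# Hypothetical host and path; adjust to the actual Solr deployment.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_query_url(term, user_id, sort=None, rows=1):
    params = [
        ("q", term),                      # the user-entered term
        ("fq", "user_id:%d" % user_id),   # restriction moved into a filter query
        ("start", 0),
        ("rows", rows),
        ("fl", "*,score"),
    ]
    if sort:
        params.append(("sort", sort))
    return SOLR_SELECT + "?" + urlencode(params)


url = build_query_url("apache", 39, sort="date_added asc")
```

Whether this helps with the sort time is a separate question; the filter only trims the set of documents the main query has to consider.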

We were new to Solr and Lucene and transferred over a several-million-record
index from FAST in 3 weeks. There is a learning curve, but it is an
impressive app.

Lance
