Improving Index Search Performance

Improving Index Search Performance

Shailendra Mudgal
Hi Everyone,

We are using Lucene to search an index of around 20G in size with around 3
million documents. We are facing performance issues loading large result
sets from the index. Based on various posts on the forum and the
documentation, we have made the following code changes to improve performance:

i. Modified the code to use HitCollector instead of Hits, since we will be
loading all the documents in the index based on keyword matching
ii. Added a MapFieldSelector to load only selected fields (2 fields only)
instead of all 14

After all these changes, it is still taking around 90 secs to load 17k
documents. After profiling, we found that most of the time is spent in
searcher.doc(id, selector).

Here is the code:

                public void collect(int id, float score) {
                    try {
                        MapFieldSelector selector =
                            new MapFieldSelector(new String[] {COMPANY_ID, ID});
                        doc = searcher.doc(id, selector);
                        mappedCompanies = doc.getValues(COMPANY_ID);
                    } catch (IOException e) {
                        logger.debug("inside IDCollector.collect(): " + e.getMessage());
                    }
                }

We also read in one of the posts that we should use bitSet.set(doc) instead
of calling searcher.doc(id). But we are unable to understand how this might
help in our case, since we will anyway have to load the document to get the
other required field (company_id). We also observed that the searcher is
actually using only 1G of RAM, though we have 4G allocated to it.

Can someone suggest any other optimization that can be done to improve
search performance on MultiSearcher? Any help would be appreciated.

Thanks,
Vipin

Re: Improving Index Search Performance

Toke Eskildsen
On Tue, 2008-03-25 at 18:13 +0530, Shailendra Mudgal wrote:
> We are using Lucene to search on a index of around 20G size with around 3
> million documents. We are facing performance issues loading large results
> from the index. [...]
> After all these changes, it seems to be  taking around 90 secs to load 17k
> documents. [...]

That's fairly slow. Are you doing any warm-up? It is my experience that
it helps tremendously with performance.

I tried requesting a stored field from all hits for all searches with
logged queries on our index (9 million documents, 37GB), no fancy
tricks, just Hits and hit.get(fieldname). For the first couple of
minutes, using standard hard disks, performance was about 200-300
field requests/second. After that, the speed increased to about
2,000-3,000 field requests/second.

Using solid state drives, the same pattern could be seen, just with much
lower warm-up time before the full speed kicked in.

> Here is the code:
>
>                 public void collect(int id, float score) {
>                     try {
>                         MapFieldSelector selector =
>                             new MapFieldSelector(new String[] {COMPANY_ID, ID});
>                         doc = searcher.doc(id, selector);
>                         mappedCompanies = doc.getValues(COMPANY_ID);
>                     } catch (IOException e) {
>                         logger.debug("inside IDCollector.collect(): " + e.getMessage());
>                     }
>                 }

There's no need to initialize the selector for every collect-call.
Try moving the initialization outside of the collect method.

> [...] Also we observed that the searcher is actually using only 1G RAM though
>  we have 4G allocated to it.

The system will (hopefully) utilize the free RAM for disk-cache, so the
last 3GB are not wasted.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Improving Index Search Performance

Paul Elschot
In reply to this post by Shailendra Mudgal
Shailendra,

Have a look at the javadocs of HitCollector:
http://lucene.apache.org/java/2_3_0/api/core/org/apache/lucene/search/HitCollector.html

The problem is disk head movement: when retrieving the documents
during collecting, the disk head has to move back and forth between
the inverted index and the stored documents; see also the file
formats documentation.

To avoid such excessive disk head movement, you need to collect all
(or at least many more than one of) your document ids during
collect(), for example into an int[]. After collecting, retrieve all
the docs with Searcher.doc().

Also, for the same reason, retrieving docs is best done in doc id
order, but that is unlikely to go wrong as doc ids are normally
collected in increasing order.
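A minimal sketch of that collecting phase in plain Java (illustrative only: a real implementation would extend org.apache.lucene.search.HitCollector, and the buffer-growth policy shown is just one choice):

```java
// Phase 1: a collector that only records document ids. No stored-field
// access happens inside collect(), so the disk head stays on the
// inverted index while the search runs.
class IdCollector {
    private int[] ids = new int[1024];
    private int count = 0;

    public void collect(int id, float score) {
        if (count == ids.length) {              // grow the buffer as needed
            int[] bigger = new int[ids.length * 2];
            System.arraycopy(ids, 0, bigger, 0, count);
            ids = bigger;
        }
        ids[count++] = id;                      // ids arrive in increasing order
    }

    public int[] getIds() {
        int[] result = new int[count];          // trim to the exact size
        System.arraycopy(ids, 0, result, 0, count);
        return result;
    }
}
```

Phase 2 then walks getIds() in order, calling searcher.doc(id, selector) for each id, with the MapFieldSelector created once outside the loop.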

Regards,
Paul Elschot




Re: Improving Index Search Performance

hossman
In reply to this post by Shailendra Mudgal

: We also read in one of the posts that we should use bitSet.set(doc)
: instead of calling searcher.doc(id). But we are unable to understand how
: this might help in our case since we will anyway have to load the document
: to get the other required field (company_id). Also we observed that the
: searcher is actually using only 1G RAM though we have 4G allocated to it.

in addition to Paul's previous excellent suggestion, note that if:
  * companyId is a single-value field (ie: no document has more than one)
  * companyId is indexed

you can use the FieldCache to look up the companyId for each doc; in the
aggregate this will most likely be much faster than accessing the stored
fields.
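Conceptually, FieldCache.DEFAULT.getStrings(reader, "companyId") hands back a String[] indexed by Lucene doc id, loaded once per reader. The sketch below simulates that array in plain Java to show the access pattern; the field name and doc ids are made up for illustration:

```java
// Sketch of the FieldCache access pattern, assuming companyId were
// single-valued and indexed. In real code the value array would come from
//   String[] byDocId = FieldCache.DEFAULT.getStrings(reader, "companyId");
// loaded once per IndexReader; here it stands alone for illustration.
class FieldCacheSketch {

    // Resolve the field value for each matched doc by plain array access:
    // O(1) per hit, memory only, no per-hit stored-field (disk) read.
    static String[] lookup(String[] valuesByDocId, int[] matchedDocIds) {
        String[] out = new String[matchedDocIds.length];
        for (int i = 0; i < matchedDocIds.length; i++) {
            out[i] = valuesByDocId[matchedDocIds[i]];
        }
        return out;
    }
}
```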


-Hoss



Re: Improving Index Search Performance

Shailendra Mudgal
Hi All,

Thanks for your replies. I would like to mention that the companyId is
a multivalued field. I tried Paul's suggestion also, but there doesn't seem
to be much gain; the searcher.doc() method is still taking almost the same
amount of time.


> you can use the FieldCache to look up the companyId for each doc; in the
> aggregate this will most likely be much faster than accessing the stored
> fields.
>

As I understand FieldCache, it will load fields for all the documents,
but in our case we want to load fields only for the matched documents.

Here is the code snippet after using the BitSet:

                public Map getIds() {
                    MapFieldSelector selector =
                        new MapFieldSelector(new String[] {COMPANY_ID, ID});
                    for (int i = bitSet.nextSetBit(0); i >= 0; i = bitSet.nextSetBit(i + 1)) {
                        try {
                            doc = searcher.doc(i, selector);
                            mappedCompanies = doc.getValues(COMPANY_ID);
                        } catch (CorruptIndexException e) {
                            // swallowed
                        } catch (IOException e) {
                            // swallowed
                        }
                    }
                    return results;
                }

Any suggestions for further optimizing the code?

Thanks and Regards,
Vipin

Re: Improving Index Search Performance

Ian Lea
Hi


The bottom line is that reading fields from docs is expensive.
FieldCache will, I believe, load fields for all documents but only
once - so the second and subsequent times it will be fast.  Even
without using a cache it is likely that things will speed up because
of caching by the OS.

If you've got plenty of memory vs index size you could look at
RAMDirectory or MMapDirectory.  Or how about some solid state disks?
Someone recently posted some very impressive performance stats.

Another approach we've used is to implement our own simple in-memory
cache of field values: read values from the cache if present, otherwise
read from Lucene and cache them. This works for us on an index of 3.5
million+ docs. It helps a lot that we only update the index once a
day, so we only open new readers once a day and the cache has plenty of
time to fill up.
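A minimal sketch of such a read-through cache, with a Loader interface standing in for the real searcher.doc(id, selector) call (all names here are made up for illustration, not our actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Read-through cache of field values keyed by Lucene doc id: return the
// cached value if present, otherwise load it (in real code, via
// searcher.doc(id, selector)) and remember it for next time.
class FieldValueCache {

    // Stand-in for the expensive per-document stored-field lookup.
    interface Loader {
        String[] load(int docId);
    }

    private final Map<Integer, String[]> cache = new HashMap<Integer, String[]>();
    private final Loader loader;

    FieldValueCache(Loader loader) {
        this.loader = loader;
    }

    String[] get(int docId) {
        String[] values = cache.get(docId);
        if (values == null) {              // miss: hit the index once
            values = loader.load(docId);
            cache.put(docId, values);
        }
        return values;
    }
}
```

One caveat: Lucene doc ids are only stable for a given reader, so a cache like this must be discarded whenever a new IndexReader is opened, which fits a once-a-day reopen schedule.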


--
Ian.



Re: Improving Index Search Performance

Toke Eskildsen
On Wed, 2008-03-26 at 10:45 +0000, Ian Lea wrote:
> If you've got plenty of memory vs index size you could look at
> RAMDirectory or MMapDirectory.  Or how about some solid state disks?
> Someone recently posted some very impressive performance stats.

That was probably me. A (very) quick test of field look-ups done the
simple way (hit.get(fieldname)) showed about a factor of 10 speed-up for
SSDs over hard disks. It doesn't seem to be enough in this case though -
9 seconds instead of 90 is still a long time to wait.



Re: Improving Index Search Performance

Shailendra Mudgal
In reply to this post by Ian Lea
> The bottom line is that reading fields from docs is expensive.
> FieldCache will, I believe, load fields for all documents but only
> once - so the second and subsequent times it will be fast.  Even
> without using a cache it is likely that things will speed up because
> of caching by the OS.


As I mentioned in my previous mail, the companyId is a multivalued
field, so caching it will consume a lot of memory. And this way we'll
also have to keep the document-to-field mapping in memory.


> If you've got plenty of memory vs index size you could look at
> RAMDirectory or MMapDirectory.  Or how about some solid state disks?
> Someone recently posted some very impressive performance stats.


The index size is around 20G and the available memory is 4G, so keeping
the entire index in memory is not possible. But as I mentioned earlier,
it is using only 1G out of the 4G, so is there a way to tell Lucene to
cache more documents, say to use 2G for caching the index?

I'd appreciate more suggestions on the same problem.

Regards,
Vipin

Re: Improving Index Search Performance

Ian Lea
Well, caching is designed to use memory. If you are saying that you
haven't got enough memory to cache all your values, then caching them
all isn't going to work, at any level. If you implemented your own
cache you could control memory usage with an LRU algorithm or whatever
made sense for your application. We use an array as the cache, with
the Lucene document id as the index, so we don't have to store a
document vs field mapping.
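The LRU variant mentioned above can be sketched in a few lines with java.util.LinkedHashMap in access order plus removeEldestEntry; the capacity and value type here are illustrative, not what our code actually does:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded LRU cache of field values keyed by Lucene doc id.
// LinkedHashMap in access order moves each entry touched by get() to the
// back, so the eldest entry is the least recently used one.
class LruFieldCache extends LinkedHashMap<Integer, String[]> {
    private final int capacity;

    LruFieldCache(int capacity) {
        super(16, 0.75f, true);            // true = access order (LRU)
        this.capacity = capacity;
    }

    protected boolean removeEldestEntry(Map.Entry<Integer, String[]> eldest) {
        return size() > capacity;          // evict once over capacity
    }
}
```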

I'm not aware of any way to tell Lucene to cache documents, let alone
up to a user-supplied memory threshold.

If you are on a Unix-type OS and your Lucene app is the only or main
thing on the machine, the OS is likely using the spare memory as disk
cache behind the scenes. I don't know if the same applies to MS Windows.


--
Ian.



Re: Improving Index Search Performance

Paul Elschot
In reply to this post by Shailendra Mudgal
Since you're using all the results for a query and ignoring the
score value, you might try doing the same thing with a relational
database. But I would not expect that to be much faster,
especially when a field cache is used.

Other than that, you could also go the other way and add more data to
the Lucene index that can be used to reduce the number of results to
be fetched.

Regards,
Paul Elschot


