Fastest way to fetch N documents with unique keys within large numbers of indexes..


Kevin Burton
Hey.

I'm trying to figure out the FASTEST way to solve this problem.

We have a system where I'll be given 10 or 20 unique keys, which are
stored as non-tokenized fields within Lucene.  Each key identifies a
unique document.

Internally I'm creating a new Term and then calling
IndexReader.termDocs() on it.  Then, if termDocs.next() matches, I
return that document.

The problem is that this isn't very fast.  This is not an academic
debate: I've put the system in a profiler, and Lucene is the top
bottleneck (by far).

I don't think there's anything faster than this, right?  Could I maybe
cache a TermEnum, keep it as a pointer to the FIRST field for these
IDs, and then reuse it?  That might let me seek to the start of my
terms faster.

Does Lucene internally do a binary search for my term?

I could of course merge all this content into one index, but that's a
separate problem.  We often have more than 40 indexes, and constantly
merging these into a multi-gigabyte index just takes FOREVER.

Internally, I/O seems to be the problem, and I'm about as fast on I/O
as I can get: a RAID 0 array of fast SCSI disks.  I also tried tweaking
InputStream.BUFFER_SIZE, with no visible change in performance.

Kevin

--


Use Rojo (RSS/Atom aggregator)! - visit http://rojo.com.
See irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html

   Kevin A. Burton, Location - San Francisco, CA
      AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Chris Hostetter-3

I haven't profiled either of these suggestions, but:

1) have you tried constructing a BooleanQuery of all 10-20 terms? Is the
   total time to execute the search, and access each Hit, slower than
   your termDocs approach?

2) have you tried sorting your terms first, then opening a TermDocs on
   the first one, and seeking to each of the remaining terms?  It seems
   like that would be faster than opening a new TermDocs for each Term.

: The problem is that this doesn't work very fast either.  This is not an
: academic debate as I've put the system in a profiler and Lucene is the
: top bottleneck (by far).

The tricky thing about profiling code: something is always the bottleneck.




-Hoss



Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Matt Quail
In reply to this post by Kevin Burton
> We have a system where I'll be given 10 or 20 unique keys.

I assume you mean you have one unique-key field, and you are given  
10-20 values to find for this one field?
>
> Internally I'm creating a new Term and then calling  
> IndexReader.termDocs() on this term.  Then if termdocs.next()  
> matches then I'll return this document.

Are you calling reader.termDocs() inside a (tight) loop? It might be
better to create one TermDocs, and use "seek". Something like this:

IndexReader reader = ...;
TermDocs td = reader.termDocs();

String[] keys = ...; // your unique keys
Arrays.sort(keys);   // this probably helps seek() below

for (int i = 0; i < keys.length; i++) {
     Term t = new Term("unique", keys[i]);
     td.seek(t);
     if (td.next()) {
         // found a match: td.doc() is the document number
     }
}

I'm pretty sure that will work. And if you can avoid the multi-
threading issues, you might try and use the same TermDocs object for  
as long as possible (that is, move it up out of as many tight loops  
as you can).

=Matt


Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Kevin Burton
In reply to this post by Chris Hostetter-3
Chris Hostetter wrote:

>I haven't profiled either of thse suggestions but:
>
>1) have you tried constructing a BooleanQuery of all 10-20 terms? Is the
>   total time to execute the search, and access each Hit slower then your
>   termDocs approach?
>  
>
Actually, using any type of query was very slow; the time went into
computing the score.  Dropping down to termDocs() was a big performance
gain, about 2x, and since it's the slowest part of our app it was a
nice one. :)

We were using a TermQuery, though.

I wonder if there's a way to tell Lucene not to score.  Maybe I could
then use a BooleanQuery with internal TermQueries and scan each index
only once.

>2) have you tried sorting your terms first, then opening a TermDocs on the
>   first one, and seeking to each of the remaining terms?  it seems like
>   that would be faster then opening a new TermDocs for each Term.
>  
>
How do I do this?  I just assumed that termDocs was already sorted...

I don't see any mention of this in the API...

Kevin


Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Kevin Burton
In reply to this post by Matt Quail
Matt Quail wrote:

>> We have a system where I'll be given 10 or 20 unique keys.
>
>
> I assume you mean you have one unique-key field, and you are given  
> 10-20 values to find for this one field?
>
>>
>> Internally I'm creating a new Term and then calling  
>> IndexReader.termDocs() on this term.  Then if termdocs.next()  
>> matches then I'll return this document.
>
>
> Are you calling reader.termDocs() inside a (tight) loop? It might be  
> better to create one TermEnum, and use "seek". Something like this:

Yes. This is another approach I was thinking of taking: building up a
list of indexes with a high probability of holding the given document
and then searching each of them.

What I'm worried about, though, is that it would be a bit slower...  I'm
just going to have to test out different implementations to see.

<snip>

>
>
> I'm pretty sure that will work. And if you can avoid the multi-
> threading issues, you might try and use the same TermDocs object for  
> as long as possible (that is, move it up out of as many tight loops  
> as you can).

Well... that doesn't look like the biggest overhead.  The bottleneck
seems to be in seek() and the fact that it's using an InputStream to
read bytes off disk.  I actually tried to speed that up by cranking the
InputStream.BUFFER_SIZE var higher, but that didn't help, though I'm
not sure if it's a caching issue.  I sent an email to the list about
this earlier but no one responded.

So it seems like my bottleneck is in seek(), so it would make sense to
figure out how to limit it.

Is this O(log(N)), btw, or is it O(N)?

Kevin


Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Chris Hostetter-3
In reply to this post by Kevin Burton

: was computing the score.  This was a big performance gain.  About 2x and
: since its the slowest part of our app it was a nice one. :)
:
: We were using a TermQuery though.

I believe that one search on one BooleanQuery containing 20
TermQueries should be faster than 20 searches on 20 TermQueries.

: >2) have you tried sorting your terms first, then opening a TermDocs on the
: >   first one, and seeking to each of the remaining terms?  it seems like
: >   that would be faster then opening a new TermDocs for each Term.
: >
: >
: How do I do this?  I just assumed that termDocs was already sorted...
:
: I don't see any mention of this in the API...

I'm not talking about any special API to sort a TermDocs -- it is sorted.

I'm saying you should start by sorting your input (the 10-20 unique IDs
you mentioned) and then seek to the first ID, then using the same
TermDocs instance, seek to the second ID, etc....


-Hoss



Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Paul Elschot
In reply to this post by Kevin Burton
On Tuesday 07 June 2005 07:17, Kevin Burton wrote:

> <snip>
>
> So it seems like my bottleneck is in seek() so It would make sense to
> figure out how to limit this.
>
> Is this O(log(N))  btw or is it O(N) ?

Seeking terms should be O(log(N)), and the fastest way to do that
is by sorting the terms first, but that does not really help much.

The biggest increase in performance that I had for a similar problem
came from first collecting all the document numbers, then sorting them,
and then fetching all the corresponding docs (see also the file formats
documentation).  This avoids the disk head moving between the terms and
the stored docs, and it also minimizes head movement when retrieving
the docs.

For a single term the document numbers are already ordered; for
multiple terms the sort is needed.

For a large number of indexes, it may be necessary to do this over
multiple indexes by first getting the doc numbers for all indexes,
then sorting these per index, then retrieving them
from all indexes, and repeating the whole thing using terms determined
from the retrieved docs.
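Paul's two-phase idea can be sketched against the Lucene 1.4-era API
(IndexReader.termDocs(), TermDocs.seek(), IndexReader.document()).  The
field name "unique" is borrowed from Matt's earlier snippet, and this is
an untested sketch, not a drop-in implementation; it needs Lucene on the
classpath:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class BatchFetch {
    /**
     * Phase 1: resolve the unique keys to document numbers.
     * Phase 2: sort the numbers and fetch the stored documents in
     * doc-number (roughly on-disk) order, so the disk head is not
     * bouncing between the term dictionary and the stored fields.
     */
    public static List fetchByKeys(IndexReader reader, String[] keys)
            throws IOException {
        Arrays.sort(keys); // sequential seeks through the term dictionary

        int[] docIds = new int[keys.length];
        int found = 0;
        TermDocs td = reader.termDocs();
        try {
            for (int i = 0; i < keys.length; i++) {
                td.seek(new Term("unique", keys[i]));
                if (td.next()) {
                    docIds[found++] = td.doc();
                }
            }
        } finally {
            td.close();
        }

        // Trim to the matches actually found, then sort by doc number.
        int[] sorted = new int[found];
        System.arraycopy(docIds, 0, sorted, 0, found);
        Arrays.sort(sorted);

        List docs = new ArrayList();
        for (int i = 0; i < found; i++) {
            docs.add(reader.document(sorted[i]));
        }
        return docs;
    }
}
```

For many indexes, the same two phases would be repeated per index, as
Paul describes: gather doc numbers for all indexes first, sort per
index, then retrieve.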

With the indexes on multiple disks, some parallelism can be introduced.
A thread per disk could be used.
In case there are multiple requests pending, they can be serialized just
before the sorting of the terms, and just before the sorting of the document
numbers.

Regards,
Paul Elschot



Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Paul Elschot
On Tuesday 07 June 2005 09:22, Paul Elschot wrote:
...
>
> With the indexes on multiple disks, some parallelism can be introduced.
> A thread per disk could be used.
> In case there are multiple requests pending, they can be serialized just
> before the sorting of the terms, and just before the sorting of the document
> numbers.

That should be 'merged' instead of 'serialized'.
 
> Regards,
> Paul Elschot
>



Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Kevin Burton
In reply to this post by Chris Hostetter-3
Chris Hostetter wrote:

>: was computing the score.  This was a big performance gain.  About 2x and
>: since its the slowest part of our app it was a nice one. :)
>:
>: We were using a TermQuery though.
>
>I believe that one search on one BooleanQuery containing 20
>TermQueries should be faster then 20 searches on 20 TermQueries.
>  
>
Actually... it wasn't. :-/

It was about 4x slower.

Ugh...

Kevin


Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Kevin Burton
In reply to this post by Paul Elschot
Paul Elschot wrote:

>For a large number of indexes, it may be necessary to do this over
>multiple indexes by first getting the doc numbers for all indexes,
>then sorting these per index, then retrieving them
>from all indexes, and repeating the whole thing using terms determined
>from the retrieved docs.
>  
>
Well this was a BIG win.  Just benchmarking it out shows a 10x -> 50x
performance increase.

Times in milliseconds:

Before:

duration: 1127
duration: 449
duration: 394
duration: 564

After:

duration: 182
duration: 39
duration: 12
duration: 11

Runs 2-4, I'm sure, are due to the filesystem buffer cache, but I
can't imagine why they'd be faster in the second round.  It might be
that Linux is deciding not to buffer the document blocks.

Kevin


Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Paul Elschot
On Wednesday 08 June 2005 01:18, Kevin Burton wrote:

> <snip>
>
> Well this was a BIG win.  Just benchmarking it out shows a 10x -> 50x
> performance increase.
>
> <snip benchmark numbers>

There is no need for a relational db when you have Lucene :)
Thanks for reporting the old and the new times.

Regards,
Paul Elschot



Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Andrew Boyd
In reply to this post by Kevin Burton
Kevin,
  Those results are awesome.  Could you please give those of us who were following, but not quite understanding everything, some pseudocode or some more explanation?

Thanks,

andrew


Re: Fastest way to fetch N documents with unique keys within large numbers of indexes..

Kevin Burton
Andrew Boyd wrote:

>Kevin,
>  Those results are awesome.  Could you please give those of us who were following, but not quite understanding everything, some pseudocode or some more explanation?
>
>  
>
Ugh... I hate to say this, but ignore those numbers.  It turns out I
was hitting a cache... I thought I had disabled it, but I set the
wrong var. :-/

If only it were true ;)

Kevin
