The best way to iterate over document

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

The best way to iterate over document

wojtek hury
Hi all,

our problem is to choose the best (the fastest) way to iterate over huge set
of documents (basic and most important case is to iterate over all documents
in the index). Some slow process accesses documents and now it is done via
repeating query (for instance MatchAllDocsQuery). It processess first N docs
then repeats query and processes next N docs and so on. Repeating query
means in fact quadratic time! So we think about changing the way docs are
accessed.
In case of generic query the only way to speed it up we see is to keep
HitCollector in memory between requests for docs. Isn't this approach too
memory consuming?
In case of iterating over all documents I was wondering if there is a way to
determine set of index ids over which we could iterate (and of course
control index changes - if index is changed between requests we should
probably invalidate 'iterating session').
What is the best solution for this problem?
Thanks and regards,

wojtek
Reply | Threaded
Open this post in threaded view
|

Re: The best way to iterate over document

Erick Erickson
Why not keep a Filter in memory? It consists of a single bit per document
and the ordinal position of that bit is the Lucene doc ID. You could create
this reasonably quickly for the *first* query that came in via HitCollector.

Then each time you wanted another chunk, use the filter to know which
docs to return. You could either, say, extend the Filter class and add
some bookeeping or just zero out each bit that you returned to the user.

NOTE: you don't get relevance this way, but for the case of returning all
docs do you really want it?

About updating the index. Remember that there is no "update in place".
So you'll only have to check whether any document in the filter has been
deleted when you are returning. Then you'd have to do something about
looking for any new additions as you returned the last document in the
set...
But remember that until you close/reopen the searcher, you won't see changes
anyway.....

But you may not need to do any of this. If, each time you return a chunk,
you're using a Hits object, then this is the first thing I'd change. A Hits
object re-executes the query every 100th element you look at. So, assume
you have something like

(bad pseudo code here)
for (int idx = 0; idx < firstdocinchunk && Hits.next(); ++idx)
{
}

for (idx = 0; idx < chunksize && Hits.next(); ++idx)
{
    assemble doc for return
}
and the first doc you want to return is number 1,000, you'll actually
be re-executing the query 10 times. Which probably accounts for your
quadratic time.

So I'd try just using a new HitCollector each time and see if that solves
your problems before getting fancy. There really shouldn't be any
noticeable difference between the first and last request unless you're
doing something like accessing the documents before you get to
the first one you expect to return. And a TopDocs should even
preserve scoring.......

Best
Erick



On Wed, Mar 26, 2008 at 5:48 AM, Wojtek H <[hidden email]> wrote:

> Hi all,
>
> our problem is to choose the best (the fastest) way to iterate over huge
> set
> of documents (basic and most important case is to iterate over all
> documents
> in the index). Some slow process accesses documents and now it is done via
> repeating query (for instance MatchAllDocsQuery). It processess first N
> docs
> then repeats query and processes next N docs and so on. Repeating query
> means in fact quadratic time! So we think about changing the way docs are
> accessed.
> In case of generic query the only way to speed it up we see is to keep
> HitCollector in memory between requests for docs. Isn't this approach too
> memory consuming?
> In case of iterating over all documents I was wondering if there is a way
> to
> determine set of index ids over which we could iterate (and of course
> control index changes - if index is changed between requests we should
> probably invalidate 'iterating session').
> What is the best solution for this problem?
> Thanks and regards,
>
> wojtek
>
Reply | Threaded
Open this post in threaded view
|

Re: The best way to iterate over document

wojtek hury
Thank you for reply. What I did not mention before was that for
iteration we don't care about scoring, so that's not the issue at all.
Creating Filter with BitSet seems much better idea than keeping
HitIterator in memory. Am I right that in such a case with
MatchAllDocsQuery memory usage would be around
( NUM_OF_DOCS_IN_INDEX / 8 )   bytes?
We didn't check it yet, but do you think that time for accessing
documents (reader.doc(i)) is big enough to make iteration in
HitCollector (without accessing any objects) over documents
already returned almost un-noticable?
And another question - if I don't care about scoring is there a way to
make Lucene don't spend time on calculating score (I don't know if
that time matters)? HitCollector receives doc and its score (as far as
I remember the difference here is that it is not normalized to value
between 0 and 1). Is there a way (and does it make sense) to make
scoring faster in such a case?
And to make things clear - am I right that if I operate on the same
searcher over requests for docs chunks I don't see neither additions
nor deletions which could happen meanwhile? So if I would like to
iterate over point-in-time keeping the same searcher opened would do.
Thanks and regards,
wojtek


2008/3/26, Erick Erickson <[hidden email]>:

> Why not keep a Filter in memory? It consists of a single bit per document
>  and the ordinal position of that bit is the Lucene doc ID. You could create
>  this reasonably quickly for the *first* query that came in via HitCollector.
>
>  Then each time you wanted another chunk, use the filter to know which
>  docs to return. You could either, say, extend the Filter class and add
>  some bookeeping or just zero out each bit that you returned to the user.
>
>  NOTE: you don't get relevance this way, but for the case of returning all
>  docs do you really want it?
>
>  About updating the index. Remember that there is no "update in place".
>  So you'll only have to check whether any document in the filter has been
>  deleted when you are returning. Then you'd have to do something about
>  looking for any new additions as you returned the last document in the
>  set...
>  But remember that until you close/reopen the searcher, you won't see changes
>  anyway.....
>
>  But you may not need to do any of this. If, each time you return a chunk,
>  you're using a Hits object, then this is the first thing I'd change. A Hits
>  object re-executes the query every 100th element you look at. So, assume
>  you have something like
>
>  (bad pseudo code here)
>  for (int idx = 0; idx < firstdocinchunk && Hits.next(); ++idx)
>  {
>  }
>
>  for (idx = 0; idx < chunksize && Hits.next(); ++idx)
>  {
>     assemble doc for return
>  }
>  and the first doc you want to return is number 1,000, you'll actually
>  be re-executing the query 10 times. Which probably accounts for your
>  quadratic time.
>
>  So I'd try just using a new HitCollector each time and see if that solves
>  your problems before getting fancy. There really shouldn't be any
>  noticeable difference between the first and last request unless you're
>  doing something like accessing the documents before you get to
>  the first one you expect to return. And a TopDocs should even
>  preserve scoring.......
>
>  Best
>
> Erick
>
>
>
>
>  On Wed, Mar 26, 2008 at 5:48 AM, Wojtek H <[hidden email]> wrote:
>
>  > Hi all,
>  >
>  > our problem is to choose the best (the fastest) way to iterate over huge
>  > set
>  > of documents (basic and most important case is to iterate over all
>  > documents
>  > in the index). Some slow process accesses documents and now it is done via
>  > repeating query (for instance MatchAllDocsQuery). It processess first N
>  > docs
>  > then repeats query and processes next N docs and so on. Repeating query
>  > means in fact quadratic time! So we think about changing the way docs are
>  > accessed.
>  > In case of generic query the only way to speed it up we see is to keep
>  > HitCollector in memory between requests for docs. Isn't this approach too
>  > memory consuming?
>  > In case of iterating over all documents I was wondering if there is a way
>  > to
>  > determine set of index ids over which we could iterate (and of course
>  > control index changes - if index is changed between requests we should
>  > probably invalidate 'iterating session').
>  > What is the best solution for this problem?
>  > Thanks and regards,
>  >
>  > wojtek
>  >
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: The best way to iterate over document

Erick Erickson
See below...

On Wed, Mar 26, 2008 at 11:29 AM, Wojtek H <[hidden email]> wrote:

> Thank you for reply. What I did not mention before was that for
> iteration we don't care about scoring, so that's not the issue at all.
> Creating Filter with BitSet seems much better idea than keeping
> HitIterator in memory. Am I right that in such a case with
> MatchAllDocsQuery memory usage would be around
> ( NUM_OF_DOCS_IN_INDEX / 8 )   bytes?


Yes, it's very close to that number as far as I can tell.


>
> We didn't check it yet, but do you think that time for accessing
> documents (reader.doc(i)) is big enough to make iteration in
> HitCollector (without accessing any objects) over documents
> already returned almost un-noticable?


I don't quite understand this. Accessing a single document
in HitCollector via reader.doc(i) is probably unnoticeable. Accessing
every documents in the HitCollector is bad, very bad. If you're doing
something like spinning through the HitCollector N times, then
returning the Nth document and breaking, I don't know. You'll just have
to experiment I'm afraid.



> And another question - if I don't care about scoring is there a way to
> make Lucene don't spend time on calculating score (I don't know if
> that time matters)? HitCollector receives doc and its score (as far as
> I remember the difference here is that it is not normalized to value
> between 0 and 1). Is there a way (and does it make sense) to make
> scoring faster in such a case?


I *think* ConstantScoreQuery is your friend here, although I haven't
used it personally.


>
> And to make things clear - am I right that if I operate on the same
> searcher over requests for docs chunks I don't see neither additions
> nor deletions which could happen meanwhile? So if I would like to
> iterate over point-in-time keeping the same searcher opened would do.
> Thanks and regards,
>

You must close and re-open a reader to see any changes since the last
time you opened that reader. So I think your assumption is OK.

Have fun,
Erick


> wojtek
>
>
> 2008/3/26, Erick Erickson <[hidden email]>:
> > Why not keep a Filter in memory? It consists of a single bit per
> document
> >  and the ordinal position of that bit is the Lucene doc ID. You could
> create
> >  this reasonably quickly for the *first* query that came in via
> HitCollector.
> >
> >  Then each time you wanted another chunk, use the filter to know which
> >  docs to return. You could either, say, extend the Filter class and add
> >  some bookeeping or just zero out each bit that you returned to the
> user.
> >
> >  NOTE: you don't get relevance this way, but for the case of returning
> all
> >  docs do you really want it?
> >
> >  About updating the index. Remember that there is no "update in place".
> >  So you'll only have to check whether any document in the filter has
> been
> >  deleted when you are returning. Then you'd have to do something about
> >  looking for any new additions as you returned the last document in the
> >  set...
> >  But remember that until you close/reopen the searcher, you won't see
> changes
> >  anyway.....
> >
> >  But you may not need to do any of this. If, each time you return a
> chunk,
> >  you're using a Hits object, then this is the first thing I'd change. A
> Hits
> >  object re-executes the query every 100th element you look at. So,
> assume
> >  you have something like
> >
> >  (bad pseudo code here)
> >  for (int idx = 0; idx < firstdocinchunk && Hits.next(); ++idx)
> >  {
> >  }
> >
> >  for (idx = 0; idx < chunksize && Hits.next(); ++idx)
> >  {
> >     assemble doc for return
> >  }
> >  and the first doc you want to return is number 1,000, you'll actually
> >  be re-executing the query 10 times. Which probably accounts for your
> >  quadratic time.
> >
> >  So I'd try just using a new HitCollector each time and see if that
> solves
> >  your problems before getting fancy. There really shouldn't be any
> >  noticeable difference between the first and last request unless you're
> >  doing something like accessing the documents before you get to
> >  the first one you expect to return. And a TopDocs should even
> >  preserve scoring.......
> >
> >  Best
> >
> > Erick
> >
> >
> >
> >
> >  On Wed, Mar 26, 2008 at 5:48 AM, Wojtek H <[hidden email]> wrote:
> >
> >  > Hi all,
> >  >
> >  > our problem is to choose the best (the fastest) way to iterate over
> huge
> >  > set
> >  > of documents (basic and most important case is to iterate over all
> >  > documents
> >  > in the index). Some slow process accesses documents and now it is
> done via
> >  > repeating query (for instance MatchAllDocsQuery). It processess first
> N
> >  > docs
> >  > then repeats query and processes next N docs and so on. Repeating
> query
> >  > means in fact quadratic time! So we think about changing the way docs
> are
> >  > accessed.
> >  > In case of generic query the only way to speed it up we see is to
> keep
> >  > HitCollector in memory between requests for docs. Isn't this approach
> too
> >  > memory consuming?
> >  > In case of iterating over all documents I was wondering if there is a
> way
> >  > to
> >  > determine set of index ids over which we could iterate (and of course
> >  > control index changes - if index is changed between requests we
> should
> >  > probably invalidate 'iterating session').
> >  > What is the best solution for this problem?
> >  > Thanks and regards,
> >  >
> >  > wojtek
> >  >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>