IndexWriter.deleteDocuments(Query query)

John Wang-9
Hi guys:
    The IndexWriter.deleteDocuments(Query query) API doesn't really make
sense to me. Wouldn't IndexWriter.deleteDocuments(DocIdSet set) be better,
since we don't really care about scoring for this call?

Also, can we expose IndexWriter.deleteDocuments(int[] docids)? The current
API is very cumbersome when the docids to be deleted are already known:
I would have to use the API above and create a Query instance.

Thanks

-John

Re: IndexWriter.deleteDocuments(Query query)

Yonik Seeley-2-2
On Tue, Mar 31, 2009 at 3:41 PM, John Wang <[hidden email]> wrote:
> Also, can we expose  IndexWriter.deleteDocuments(int[] docids)?

Exposing internal ids from the IndexWriter may not be a good idea
given that they are transient.


-Yonik
http://www.lucidimagination.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: IndexWriter.deleteDocuments(Query query)

John Wang-9
I fail to see the difference between exposing an API that takes a Query
instance and one that takes a DocIdSet. In this specific case, Query is
essentially a factory that produces a DocIdSetIterator (or Scorer). Isn't
that what DocIdSet is?
Thanks

-John

Re: IndexWriter.deleteDocuments(Query query)

Yonik Seeley-2-2
On Tue, Mar 31, 2009 at 4:58 PM, John Wang <[hidden email]> wrote:
> I fail to see the difference of exposing the api to allow for a Query
> instance to be passed in vs a DocIdSet.

I was commenting specifically on your idea to allow deletion by int[]
(docids) on the IndexWriter.

DocIdSet is a different issue - it didn't exist when the conversation
to add deleteByQuery was going on.

-Yonik
http://www.lucidimagination.com


Re: IndexWriter.deleteDocuments(Query query)

John Wang-9
So do you think it is a good addition/change to the current api now?

-John

Re: IndexWriter.deleteDocuments(Query query)

Michael McCandless-2
John,

I think this has the same problem as exposing delete by docID, ie, how
would you produce that docIdSet?

We could consider delete by Filter instead, since that exposes the
necessary getDocIdSet(IndexReader) method.

Or, with near real-time search, we could enhance it to allow deletions
via the obtained reader (the first approach doesn't).

Mike

Re: IndexWriter.deleteDocuments(Query query)

John Wang-9
Hi Michael:

    Let me first share what I am doing w.r.t. deleting by docid:

I have a customized index reader that stores a mapping of docid -> uid in
the payload (something Michael Bush and Ning Li suggested a while back).
That mapping is loaded at IndexReader load time and is shared by searchers.

I do realtime updates, so I get a batch of updates, each with a uid
associated with it. For each one I delete on the uid and add the document.
I implemented this using IndexWriter.deleteDocuments(Term[]).

I realized I already have an IndexReader around with a docid->uid mapping,
so I can just find the docid from that list and simply call
IndexReader.deleteDocument(int). So out of curiosity, I compared the times
for doing deletes with these two mechanisms with 1 batch of 10000 deletes.
On my MacBook Pro, I see a difference/overhead of 3-4 seconds (varying
between runs and with how much of the term table is cached, etc.). That is
something I would expect, because we are essentially doing a "query" per
element in the batch, albeit with a posting list of length 1, but still...

Now to me that is significant enough to move away from
IndexWriter.deleteDocuments().

However, to actually implement the delete with IndexWriter on docids, I have
to create a customized Query object that iterates my int[] of docids, which
is kind of silly, since IndexWriter calls delete on the docid anyway. I don't
want to open an IndexReader for the deletes (because that's where the API is)
and then open another IndexWriter to add documents because:
1) it seems like we are moving away from this paradigm with the delete APIs
on IndexWriter
2) I want to have the delete and the add in 1 commit.

That is my use case, and I don't think it is that far-fetched.

Having IndexWriter.deleteDocuments take a Filter rather than a DocIdSet makes
sense. For me at least, IndexWriter.deleteDocument(int) would be useful.

Just my $0.02.

-John

Re: IndexWriter.deleteDocuments(Query query)

Yonik Seeley-2-2
In reply to this post by Michael McCandless-2
On Wed, Apr 1, 2009 at 4:02 AM, Michael McCandless
<[hidden email]> wrote:
> I think this has the same problem as exposing delete by docID, ie, how
> would you produce that docIdSet?

Whoops, right.  I was going by memory that there was a
get(IndexReader) type method there... but that's on Filter of course.


-Yonik
http://www.lucidimagination.com

Re: IndexWriter.deleteDocuments(Query query)

Michael McCandless-2
In reply to this post by John Wang-9
> For me at least, IndexWriter.deleteDocument(int) would be useful.

I completely agree: delete-by-docID in IndexWriter would be a great
feature.  Long ago I became convinced of that.

Where this feature always gets stuck (search the lists -- it's gotten
stuck a lot) is how to implement it.  At any time, a merge can commit,
which invalidates all docIDs stored anywhere.  We need to solve that
before we can delete by docID.

I don't see a clean solution.  Do you?
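
To make the problem concrete, here is a toy model of why a stored docID goes stale once a merge commits. It is only a sketch under stated assumptions: `MergeRenumbering`, `merge`, and `docIdOf` are hypothetical names, and the array-based "segment" stands in for Lucene's real merge code, which is far more involved. The only point is that dropping deleted docs renumbers the survivors, while the uid stays stable.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a merge: deleted docs are dropped and survivors are renumbered. */
final class MergeRenumbering {

    /**
     * Simulates merging one segment. uids[i] is the uid of docID i and
     * deleted[i] marks docs deleted before the merge. Returns the uid array
     * of the merged segment, indexed by the new docID.
     */
    static int[] merge(int[] uids, boolean[] deleted) {
        List<Integer> survivors = new ArrayList<>();
        for (int docId = 0; docId < uids.length; docId++) {
            if (!deleted[docId]) {
                survivors.add(uids[docId]);
            }
        }
        int[] merged = new int[survivors.size()];
        for (int i = 0; i < merged.length; i++) {
            merged[i] = survivors.get(i);
        }
        return merged;
    }

    /** Linear scan for the docID currently holding the given uid, or -1. */
    static int docIdOf(int[] uids, int uid) {
        for (int docId = 0; docId < uids.length; docId++) {
            if (uids[docId] == uid) {
                return docId;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] uids = {100, 101, 102, 103};
        boolean[] deleted = {false, true, false, false}; // uid 101 was deleted

        int before = docIdOf(uids, 103);   // 3
        int[] merged = merge(uids, deleted);
        int after = docIdOf(merged, 103);  // 2

        System.out.println("uid 103: docID " + before + " before merge, " + after + " after");
    }
}
```

A docID of 3 cached before the merge would point past the end of the merged segment afterwards, which is the staleness described above.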

> I have a customized index reader that stores a mapping of docid -> uid in
> the payload (something Michael Bush and Ning Li suggested a while back) And
> that mapping is loaded at IndexReader load time and is shared by searchers.

OK

> I do realtime update, so I get a batch of updates with a uid associated with
> each batch. So I do a delete on the uid and add the document. And I
> implemented using IndexWriter.deleteDocuments(Term[])

OK

> I realized I have an IndexReader around already with a docId->uid mapping, I
> can just find out the docid from that list and simply call
> IndexReader.deleteDocument(int). So out of curiosity, I compared the times
> doing deletes with these two mechanisms with 1 batch of 10000 deletes. And
> on my macbook pro, I see a difference/overhead of 3-4 seconds (with various
> runs and how much term table is cached etc.) And that is something I would
> expect because we are essentially doing a "query" per element in the batch,
> albeit posting list length is only 1, but still...

But, your mapping is stale (you can't trust the docIDs) as soon as you
open an IndexWriter on the same index, so this isn't really a valid
test.

3-4 seconds out of how much total time?

Can you give more details on this test?  Are you including time to
open IndexReader, time to load your docID/uid mapping, and time to
commit the changes (to be apples/apples)?

> Now to me that is significant enough to move away from
> IndexWriter.deleteDocuments().
>

> However, to actually implement the delete with IndexWriter on
> docids, I have to create a customized Query object that iterates my
> int[] of docids.

That won't work (the docIDs might be invalid by the time your Query is
visited).  The point of delete-by-Query is that IW hands you a reader,
which you must use right then to find the docIDs; only the docIDs
from that reader are valid.

> Having IndexWriter.deleteDocuments take a Filter rather than a DocIdSet
> makes sense.

Well... when IW deletes-by-Query, it's already using the Query as a
Filter (ie, not doing any scoring).  Changing the API to
delete-by-Filter won't change the performance.

Mike

Re: IndexWriter.deleteDocuments(Query query)

Jason Rutherglen
In reply to this post by John Wang-9
John,

We looked at implementing delete by doc id for LUCENE-1516; however, it
seemed to be something we could implement as a later patch if enough
people wanted it.

The implementation involves maintaining a genealogy of SegmentReaders within
IndexWriter so that deletes to a reader that has been merged away will be
imported by later readers.

-J

Re: IndexWriter.deleteDocuments(Query query)

John Wang-9
In reply to this post by Michael McCandless-2
Thanks Michael for the info.
I do guarantee there are no modifications between when
"MySpecialIndexReader" is loaded and when I iterate and find the deleted
docids. I was, however, not aware that docids move when an IndexWriter is
opened. I thought they moved only when docs are added and committed.

With this information, I agree I cannot reuse the uid mapping array from
"MySpecialIndexReader", and I would have to load this mapping using the
IndexReader given to me by the IndexWriter.

My test is essentially this. I took out the reader.deleteDocuments call from
both scenarios. I took an index of 5m docs and a batch of 10000 randomly
generated uids.

Compared the following scenarios:
1)
* open index reader
* for each uid in the batch, find the corresponding docid and add it to an
IntList
* close reader

2)
* open index reader
* load uid array from payload field
* iterate uid array, check whether each uid is in the deleted set, and if so
add its docid to an IntList

The data structure holding the deleted set is IntOpenHashSet from fastutil.

1) took about 3500 - 4500 ms
2) took about 815 ms

-John

Re: IndexWriter.deleteDocuments(Query query)

Michael McCandless-2
On Wed, Apr 1, 2009 at 2:04 PM, John Wang <[hidden email]> wrote:

> My test is essentially this. I took out the reader.deleteDocuments call from
> both scenarios. I took an index of 5m docs and a batch of 10000 randomly
> generated uids.
>
> Compared the following scenarios:
> 1)
> * open index reader
> * for each uid in the batch, find the corresponding docid and add to an
> IntList.
> * close reader

How exactly do you find the corresponding docid?  TermDocs?

> 2)
> * open index reader
> * load uid array from payload field
> * iterate uid array, and check to see if uid is in deleted set, and add to
> an IntList

In this case, each doc has a dedicated field that only has a payload
that stores the one uid for that doc?  But I'm confused how you then
map from uid -> docID.  I must be missing something.

> The datastructure holding deleted set is IntOpenHashSet from fastutil.
>
> 1) took about 3500 - 4500 ms
> 2) took about 815 ms

Mike

Re: IndexWriter.deleteDocuments(Query query)

John Wang-9
Hi Michael:

    1) Yes, we use TermDocs, exactly what IndexWriter.deleteDocuments(Term)
is doing under the covers.
    2) We iterate the docid->uid mapping: for each docid, we get the
corresponding uid and check whether it is in the deleted set. If so, we
add the docid to the list. No uid->docid lookup is needed.

      However, in our sharded architecture, we partition by contiguous uids,
so we know the range of the uids and keep both mappings. In that case, the
uid->docid mapping is available.
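
As an illustration of keeping both mappings when the uid range is known, the reverse (uid -> docid) table can be a plain array offset by the smallest uid. This is a minimal sketch, not the poster's actual code: the class name `UidRangeIndex`, its constructor arguments, and the use of -1 for "no doc" are all assumptions made for the example.

```java
import java.util.Arrays;

/** Builds a uid -> docID table when uids fall in a known contiguous range. */
final class UidRangeIndex {
    private final int minUid;
    private final int maxUid;
    private final int[] docIdByUid; // index: uid - minUid, value: docID or -1

    /** uidByDocId[docId] is the uid of that doc, or -1 if the doc has none. */
    UidRangeIndex(int[] uidByDocId, int minUid, int maxUid) {
        this.minUid = minUid;
        this.maxUid = maxUid;
        this.docIdByUid = new int[maxUid - minUid + 1];
        Arrays.fill(docIdByUid, -1);
        for (int docId = 0; docId < uidByDocId.length; docId++) {
            int uid = uidByDocId[docId];
            if (uid >= minUid && uid <= maxUid) {
                docIdByUid[uid - minUid] = docId;
            }
        }
    }

    /** Returns the docID for uid, or -1 if the uid is out of range or absent. */
    int docId(int uid) {
        if (uid < minUid || uid > maxUid) {
            return -1;
        }
        return docIdByUid[uid - minUid];
    }
}
```

The lookup is a single array read, but the table costs one int per uid in the range, so it only pays off when the range is dense. Like any stored docID, the table is only valid for the reader it was built from.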

-John

Re: IndexWriter.deleteDocuments(Query query)

Michael McCandless-2
On Wed, Apr 1, 2009 at 5:22 PM, John Wang <[hidden email]> wrote:
> Hi Michael:
>
>    1) Yes, we use TermDocs, exactly what IndexWriter.deleteDocuments(Term)
> is doing under the cover.

This part I understand :)

>    2) We iterate the docid->uid mapping, for each docid, get the
> corresponding uid and check that to see if that is in the deleted set. If so,
> add the docid to the list. There is no uid->docid lookup needed.

Does this mean you iterate all docs in the index, and only when you
come across a UID that's deleted, you add to deleted set?

Do you have a separate payload field per document?  (I'm still unclear
how you use payloads to encode the full docID -> UID map).

>      However, in our sharded architecture, we partition by continuous uids,
> in which case we keep both mappings since we know the range of the uid.
> In which case, uid->docid mapping is available.

OK

Mike

Re: IndexWriter.deleteDocuments(Query query)

John Wang-9
a code snippet is worth 1000 words :)


private static final Term UID_TERM = new Term("uid_payload", "_UID");

private static class SinglePayloadTokenStream extends TokenStream {
   private Token token = new Token(UID_TERM.text(), 0, 0);
   private byte[] buffer = new byte[4];
   private boolean returnToken = false;

   void setUID(int uid) {
     buffer[0] = (byte) (uid);
     buffer[1] = (byte) (uid >> 8);
     buffer[2] = (byte) (uid >> 16);
     buffer[3] = (byte) (uid >> 24);
     token.setPayload(new Payload(buffer));
     returnToken = true;
   }

   public Token next() throws IOException {
     if (returnToken) {
       returnToken = false;
       return token;
     } else {
       return null;
     }
   }
 }


When building docs:

f=new Field(UID_TERM.field(), singlePayloadTokenStream);
doc.add(f);


When we load the index, we do:

int maxDoc = reader.maxDoc();
_uidArray = new int[maxDoc];
TermPositions tp = null;
byte[] payloadBuffer = new byte[4];       // four bytes for an int
try
{
  tp = reader.termPositions(UID_TERM);
  int idx = 0;
  while (tp.next())
  {
    int doc = tp.doc();
    assert doc < maxDoc;

    while(idx < doc) _uidArray[idx++] = -1; // fill the gap

    tp.nextPosition();
    tp.getPayload(payloadBuffer, 0);
    int uid = bytesToInt(payloadBuffer);
    if(uid < _minUID) _minUID = uid;
    if(uid > _maxUID) _maxUID = uid;
    _uidArray[idx++] = uid;
  }
}
finally
{
  if (tp!=null)
  {
    tp.close();
  }
}



This is actually code Mike B. posted a while back.
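
The loading code calls a bytesToInt helper that the post does not show. A minimal sketch of what it might look like, assuming it simply reverses the low-byte-first packing done in setUID; the class name `UidBytes` and the `intToBytes` counterpart are hypothetical and exist only so the round trip can be checked.

```java
/** Little-endian int packing that mirrors setUID in the post. */
final class UidBytes {

    /** Packs uid low byte first, the same byte order setUID uses. */
    static byte[] intToBytes(int uid) {
        return new byte[] {
            (byte) uid,
            (byte) (uid >> 8),
            (byte) (uid >> 16),
            (byte) (uid >> 24)
        };
    }

    /** Reverses intToBytes; the 0xFF mask strips sign extension from each byte. */
    static int bytesToInt(byte[] b) {
        return (b[0] & 0xFF)
             | (b[1] & 0xFF) << 8
             | (b[2] & 0xFF) << 16
             | (b[3] & 0xFF) << 24;
    }
}
```

Without the mask, any byte with its high bit set would be sign-extended when widened to int and corrupt the upper bits of the result.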


-John


Re: IndexWriter.deleteDocuments(Query query)

Michael McCandless-2
On Wed, Apr 1, 2009 at 6:37 PM, John Wang <[hidden email]> wrote:
> a code snippet is worth 1000 words :)

Hear, hear!

OK, now I understand the difference.

With approach 1, for each of N UIDs you use a TermDocs to find the
postings for that UID, and retrieve the one docID corresponding to
that UID.  You retrieve UID -> docID.

With approach 2, you iterate through all docs in the index, using a
single full walk through the single TermPositions instance for your
special UID_TERM, and retrieve the UID stored in the 4-byte payload.
You retrieve docID -> UID.

Approach 1 is expected to be more costly, per UID - Lucene must
consult the terms dict (binary search on the terms index, followed by
scan on disk within the 128 term block) to find the posting, then seek
to the posting and read that.

Approach 2 is an efficient "bulk" walk, but it loads all docID -> UIDs
into RAM (ie, you cannot be selective about which UIDs you load).

So if the number of UIDs you need to process is small, approach 1
should win; but after that number crosses X (apparently X < 10000 for
you), approach 2's "bulk walk" will win.
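The trade-off can be sketched in plain Java, without Lucene. In this hypothetical model a TreeMap stands in for the terms dict plus postings (approach 1), and an int[] stands in for the docID -> UID values decoded from the payloads (approach 2). All names are made up for the example, and it says nothing about where the real crossover X lies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

class DeleteLookupSketch {
    // Approach 1: one dictionary lookup per deleted UID (UID -> docID).
    // Cost grows with the number of deleted UIDs.
    static List<Integer> perUidLookup(TreeMap<Integer, Integer> uidToDoc,
                                      Set<Integer> deletedUids) {
        List<Integer> docs = new ArrayList<>();
        for (int uid : deletedUids) {
            Integer doc = uidToDoc.get(uid);
            if (doc != null) {
                docs.add(doc);
            }
        }
        return docs;
    }

    // Approach 2: one sequential walk over every doc (docID -> UID).
    // Cost grows with the index size, regardless of how many UIDs are deleted.
    static List<Integer> bulkWalk(int[] docToUid, Set<Integer> deletedUids) {
        List<Integer> docs = new ArrayList<>();
        for (int doc = 0; doc < docToUid.length; doc++) {
            if (deletedUids.contains(docToUid[doc])) {
                docs.add(doc);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        int[] docToUid = {40, 10, 30, 20};
        TreeMap<Integer, Integer> uidToDoc = new TreeMap<>();
        for (int doc = 0; doc < docToUid.length; doc++) {
            uidToDoc.put(docToUid[doc], doc);
        }
        Set<Integer> deleted = new HashSet<>(Arrays.asList(10, 20, 99));
        System.out.println(perUidLookup(uidToDoc, deleted));
        System.out.println(bulkWalk(docToUid, deleted));
    }
}
```

Both methods find the same docIDs; only the order of work differs, which is why the winner depends on how many UIDs are deleted relative to the size of the index.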

Approach 1 will get faster with the "pulsing" approach for inlining
low-frequency postings directly into the terms dict (discussed on
java-dev and implemented as a codec in the experimental flexible
indexing patch on LUCENE-1458), because we save the second seek.

Approach 2 will get much faster with column-stride fields
(LUCENE-1231).

Though we may want to take this even further and allow inversion for
special fields ("primary key int" field, ie your UID) to be stored as
a column-stride field.  Probably this could simply be another codec in
LUCENE-1458.  Then, delete-by-Term would be exceptionally fast for
such fields.

Mike


Re: IndexWriter.deleteDocuments(Query query)

John Wang-9
Hi Michael:
    Thanks for looking into this.

    Approach 2 depends on how fast the delete set can check a given id;
approach 1 doesn't. After replacing my delete set with a simple bitset,
approach 2 gets a 25-30% improvement.
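As a rough illustration of a bitset-backed delete set, here is a minimal sketch built on java.util.BitSet. The class name and the assumption that UIDs fall in a known range [minUid, maxUid] are hypothetical, chosen only to keep the example small.

```java
import java.util.BitSet;

class BitSetDeleteSet {
    private final int minUid;
    private final int size;
    private final BitSet bits;

    // Assumes every UID lies in [minUid, maxUid] (an assumption of this sketch).
    BitSetDeleteSet(int minUid, int maxUid) {
        this.minUid = minUid;
        this.size = maxUid - minUid + 1;
        this.bits = new BitSet(size);
    }

    void add(int uid) {
        int slot = uid - minUid;
        if (slot < 0 || slot >= size) {
            throw new IllegalArgumentException("uid out of range: " + uid);
        }
        bits.set(slot);
    }

    // Membership is a single bit test: no hashing, no boxing.
    boolean contains(int uid) {
        int slot = uid - minUid;
        return slot >= 0 && slot < size && bits.get(slot);
    }

    public static void main(String[] args) {
        BitSetDeleteSet deleted = new BitSetDeleteSet(1000, 1999);
        deleted.add(1005);
        System.out.println(deleted.contains(1005));
        System.out.println(deleted.contains(1006));
    }
}
```

Since approach 2 calls contains once per document in the index, the cost of that single check is what dominates the walk.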

   I understand that if the delete set is small, approach 1 would be faster,
while approach 2 has more constant/deterministic performance. I would
also avoid indexing the UID terms into the index if I go with approach
2.

   However, I don't see how column-stride fields would help here; aren't
they a generalization of what I am doing?

  BTW, can you shed some light on why IndexWriter would move docids around
when it is opened and no docs have been added to it?

Thanks

-John

On Thu, Apr 2, 2009 at 2:20 AM, Michael McCandless <
[hidden email]> wrote:

> On Wed, Apr 1, 2009 at 6:37 PM, John Wang <[hidden email]> wrote:
> > a code snippet is worth 1000 words :)
>
> Hear, hear!
>
> OK, now I understand the difference.
>
> With approach 1, for each of N UIDs you use a TermDocs to find the
> postings for that UID, and retrieve the one docID corresponding to
> that UID.  You retrieve UID -> docID.
>
> With approach 2, you iterate through all docs in the index, using a
> single full walk through the single TermPositions instance for your
> special UID_TERM, and retrieve the UID stored in the 4-byte payload.
> You retrieve docID -> UID.
>
> Approach 1 is expected to be more costly, per UID - Lucene must
> consult the terms dict (binary search on the terms index, followed by
> scan on disk within the 128 term block) to find the posting, then seek
> to the posting and read that.
>
> Approach 2 is an efficient "bulk" walk, but it loads all docID -> UIDs
> into RAM (ie, you cannot be selective about which UIDs you load).
>
> So if the number of UIDs you need to process is small, approach 1
> should win; but after that number crosses X (apparently X < 10000 for
> you), approach 2's "bulk walk" will win.
>
> Approach 1 will get faster with the "pulsing" approach for inlining
> low-frequency postings directly into the terms dict (discussed on
> java-dev and implemented as a codec in the experimental flexible
> indexing patch on LUCENE-1458), because we save the second seek.
>
> Approach 2 will get much faster with column-stride fields
> (LUCENE-1231).
>
> Though we may want to take this even further and allow inversion for
> special fields ("primary key int" field, ie your UID) to be stored as
> a column-stride field.  Probably this could simply be another codec in
> LUCENE-1458.  Then, delete-by-Term would be exceptionally fast for
> such fields.
>
> Mike
>

Re: IndexWriter.deleteDocuments(Query query)

Michael McCandless-2
On Thu, Apr 2, 2009 at 2:26 PM, John Wang <[hidden email]> wrote:
> Hi Michael:
>    Thanks for looking into this.
>
>    Approach 2 depends on how fast the delete set can check a given id;
> approach 1 doesn't. After replacing my delete set with a simple bitset,
> approach 2 gets a 25-30% improvement.

Excellent.

>   I understand that if the delete set is small, approach 1 would be faster,
> while approach 2 has more constant/deterministic performance. I would
> also avoid indexing the UID terms into the index if I go with approach
> 2.

True.

>   However, I don't see how column-stride fields would help here; aren't
> they a generalization of what I am doing?

Sorry, yes, and I shouldn't have said "much faster".  What I'm
picturing with column-stride fields is that you'd be able to load an
int[] per segment, mapping docID -> UID.  That load may be faster than
the decode process you do now, though probably not by much.
If we do the inverted column-stride field, then you'd have an array
mapping UID -> docID, and deletes should then be faster (load time would
be the same, but you could visit only the deleted UIDs instead of
sweeping all docs).
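To make the two array layouts concrete, here is a minimal plain-Java sketch. It is not the LUCENE-1231 API, which is not described in this thread; the class and method names, and the assumption that UIDs are dense within a known range [minUid, maxUid], are hypothetical.

```java
import java.util.Arrays;

class UidColumnSketch {
    // Invert a per-segment docID -> UID array into UID -> docID.
    // Slots for UIDs that have no document hold -1.
    static int[] invert(int[] docToUid, int minUid, int maxUid) {
        int[] uidToDoc = new int[maxUid - minUid + 1];
        Arrays.fill(uidToDoc, -1);
        for (int doc = 0; doc < docToUid.length; doc++) {
            uidToDoc[docToUid[doc] - minUid] = doc;
        }
        return uidToDoc;
    }

    // Visit only the deleted UIDs instead of sweeping every doc.
    static int[] docsToDelete(int[] uidToDoc, int minUid, int[] deletedUids) {
        int[] docs = new int[deletedUids.length];
        int n = 0;
        for (int uid : deletedUids) {
            int slot = uid - minUid;
            if (slot >= 0 && slot < uidToDoc.length && uidToDoc[slot] >= 0) {
                docs[n++] = uidToDoc[slot];
            }
        }
        return Arrays.copyOf(docs, n);
    }

    public static void main(String[] args) {
        int[] docToUid = {40, 10, 30, 20};
        int[] uidToDoc = invert(docToUid, 10, 40);
        int[] docs = docsToDelete(uidToDoc, 10, new int[] {10, 20, 99});
        System.out.println(Arrays.toString(docs));
    }
}
```

With the inverted array, the work per delete batch is proportional to the number of deleted UIDs rather than to the number of docs in the segment.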

>  BTW, can you shed some light on why IndexWriter would move docids around
> when it is opened and no docs have been added to it?

Actually, sorry, I'm wrong about this: in IndexWriter.init, we don't
actually kick off merges.

Though I don't think it's safe to rely on that: Lucene could do so
someday (eg, if the index was closed with close(false), it may need
merging on reopening).

Mike
