Which is faster/better

adb
In 2.4, as well as IndexWriter.deleteDocuments(Term) there is also
IndexReader.deleteDocuments(Term).

I understand opening a reader is expensive, so does this mean that using
IndexWriter.deleteDocuments would be faster when starting from a closed index?

As the IndexReader instance is newer, it has better Javadocs, so it's unclear
which is the 'right' one to use.

Any pointers?
Antony


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Which is faster/better

Michael McCandless-2

If you have nothing open already, and all you want to do is delete
certain documents and make a commit point, then using IndexReader vs
IndexWriter should show very little difference in speed.

But if you have mixed adds/deletes, especially a batch of them where
you don't need any commit points until the end, doing everything with
a single IndexWriter is faster because IndexWriter will buffer up a
bunch of deletes before applying them.

That said, I'd really like to deprecate IndexReader.deleteDocuments,
eventually.  I prefer that there be one obvious way to do things (I
copied this from Python, btw).

As of 2.4, IndexWriter now provides delete-by-Query, which I think
ought to meet nearly all of the cases where someone wants to
delete-by-docID using IndexReader.

Or are there situations out there where delete-by-docID is still
compelling?

Mike

Antony Bowesman wrote:

> In 2.4, as well as IndexWriter.deleteDocuments(Term) there is also  
> IndexReader.deleteDocuments(Term).
>
> I understand opening a reader is expensive, so does this mean that using
> IndexWriter.deleteDocuments would be faster when starting from a closed index?
>
> As the IndexReader instance is newer, it has better Javadocs, so  
> it's unclear which is the 'right' one to use.
>
> Any pointers?
> Antony

Re: Which is faster/better

Grant Ingersoll-2

On Nov 25, 2008, at 7:53 AM, Michael McCandless wrote:

>
> As of 2.4, IndexWriter now provides delete-by-Query, which I think
> ought to meet nearly all of the cases where someone wants to
> delete-by-docID using IndexReader.
>
> Or are there situations out there where delete-by-docID is still
> compelling?


I assume delete-by-docID means IndexReader.deleteDocument(int), right?
That is, you mean the internal Lucene doc id?

If you already have the docId, would you need to/want to do
delete-by-Query or even delete-by-Term?  Isn't delete-by-id a lot lighter
weight since it only marks the doc as deleted, whereas d-b-Q can
potentially force a flush, etc?

-Grant

Re: Which is faster/better

Michael McCandless-2

Grant Ingersoll wrote:

>
> On Nov 25, 2008, at 7:53 AM, Michael McCandless wrote:
>
>>
>> As of 2.4, IndexWriter now provides delete-by-Query, which I think
>> ought to meet nearly all of the cases where someone wants to
>> delete-by-docID using IndexReader.
>>
>> Or are there situations out there where delete-by-docID is still
>> compelling?
>
>
> Assuming delete-by-DocId means IndexReader.deleteDocument(int)  
> right?  That is, you mean the internal Lucene doc id, right?

Right, I mean delete by internal docID.

> If you already have the docId, would you need to/want to do
> delete-by-Query or even delete-by-Term?  Isn't delete-by-id a lot lighter
> weight since it only marks the doc as deleted, whereas d-b-Q can
> potentially force a flush, etc?

I guess the question is how you got that docID in the first place?  If
you got it by running a query, and deleting all docIDs that are
returned, then you could dBQ instead?

Lucene's (IndexWriter's) dbQ doesn't force a flush: it's buffered just
like other deletes and then applied in bulk at certain times.  When
autoCommit is false, currently the deletes are applied when a
merge wants to start (ie not at each segment flush).  Or, if you call
commit().
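
The buffering described above can be pictured with a toy model in plain Java. This is only a sketch and not Lucene code: ToyBufferedWriter, deleteByKey and the list-backed "index" are names invented for illustration, and the model applies buffered operations only on commit(), leaving out the merge-triggered case.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these names are Lucene classes or methods. */
final class ToyBufferedWriter {
    private final List<String> committed = new ArrayList<>(); // keys visible after a commit
    private final List<String[]> pending = new ArrayList<>(); // buffered {operation, key} pairs
    private int applyPasses = 0;                              // number of bulk-apply passes so far

    void add(String key)         { pending.add(new String[] {"add", key}); }
    void deleteByKey(String key) { pending.add(new String[] {"delete", key}); }

    /** Applies every buffered operation, in order, in a single pass. */
    void commit() {
        for (String[] op : pending) {
            if (op[0].equals("add")) {
                committed.add(op[1]);
            } else {
                committed.removeIf(k -> k.equals(op[1]));
            }
        }
        pending.clear();
        applyPasses++;
    }

    List<String> committedView() { return new ArrayList<>(committed); }
    int pendingCount()           { return pending.size(); }
    int applyPasses()            { return applyPasses; }

    public static void main(String[] args) {
        ToyBufferedWriter writer = new ToyBufferedWriter();
        writer.add("a");
        writer.add("b");
        writer.commit();
        writer.deleteByKey("a");   // buffered, nothing applied yet
        writer.add("c");           // buffered as well
        System.out.println("before commit: " + writer.committedView());
        writer.commit();           // both buffered operations applied in one pass
        System.out.println("after commit:  " + writer.committedView());
    }
}
```

The point of the model is only that a delete costs nothing beyond a buffer entry until the next bulk apply, which is why mixing many adds and deletes on one writer stays cheap.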

Mike

Re: Which is faster/better

Grant Ingersoll-2

On Nov 25, 2008, at 10:46 AM, Michael McCandless wrote:

>> If you already have the docId, would you need to/want to do
>> delete-by-Query or even delete-by-Term?  Isn't delete-by-id a lot lighter
>> weight since it only marks the doc as deleted, whereas d-b-Q can
>> potentially force a flush, etc?
>
> I guess the question is how you got that docID in the first place?  If
> you got it by running a query, and deleting all docIDs that are
> returned, then you could dBQ instead?

User does a search.  Gets back a set of docs.  Picks docs to delete,  
deletes them.

>
>
> Lucene's (IndexWriter's) dbQ doesn't force a flush: it's buffered just
> like other deletes and then applied in bulk at certain times.  When
> autoCommit is false, currently the deletes are applied when a
> merge wants to start (ie not at each segment flush).  Or, if you call
> commit().

I was just going based on the code of the two: in the IndexReader,
all it's doing is marking a bit in a bit vector, right?  Whereas in
the IndexWriter, it's checking if it's time to flush, etc.

Re: Which is faster/better

Michael McCandless-2

Grant Ingersoll wrote:

> On Nov 25, 2008, at 10:46 AM, Michael McCandless wrote:
>
>>> If you already have the docId, would you need to/want to do
>>> delete-by-Query or even delete-by-Term?  Isn't delete-by-id a lot lighter
>>> weight since it only marks the doc as deleted, whereas d-b-Q can
>>> potentially force a flush, etc?
>>
>> I guess the question is how you got that docID in the first place?  If
>> you got it by running a query, and deleting all docIDs that are
>> returned, then you could dBQ instead?
>
> User does a search.  Gets back a set of docs.  Picks docs to delete,  
> deletes them.

User means end-user, e.g. via a UI?  Probably delete-by-term would
suffice here?

If user means a developer who wrote some interesting programmatic logic
that iterates through the docs returned by a search and deletes
certain ones, that could be implemented as a Filter, right?  I guess
it's sort of a hassle now since IndexWriter doesn't have a
delete-by-Filter (you'd have to wrap it in ConstantScoreQuery, which is
sort of silly).

>> Lucene's (IndexWriter's) dbQ doesn't force a flush: it's buffered just
>> like other deletes and then applied in bulk at certain times.  When
>> autoCommit is false, currently the deletes are applied when a
>> merge wants to start (ie not at each segment flush).  Or, if you call
>> commit().
>
> I was just going based on the code of the two: in the IndexReader,
> all it's doing is marking a bit in a bit vector, right?  Whereas in
> the IndexWriter, it's checking if it's time to flush, etc.


Sure, IndexWriter has to manage other things (flushing new segments,  
merging, etc).

But the actual mechanics of deletion (marking bits in the BitVector)  
are actually the same because under the hood, when IndexWriter applies  
the deletes, it's asking a [private] SegmentReader to do so.

Mike

Re: Which is faster/better

Khawaja Shams
In reply to this post by Grant Ingersoll-2
On Tue, Nov 25, 2008 at 8:42 AM, Grant Ingersoll <[hidden email]> wrote:

> On Nov 25, 2008, at 10:46 AM, Michael McCandless wrote:
>
>>> If you already have the docId, would you need to/want to do
>>> delete-by-Query or even delete-by-Term?  Isn't delete-by-id a lot lighter
>>> weight since it only marks the doc as deleted, whereas d-b-Q can
>>> potentially force a flush, etc?
>>
>> I guess the question is how you got that docID in the first place?  If
>> you got it by running a query, and deleting all docIDs that are
>> returned, then you could dBQ instead?
>
> User does a search.  Gets back a set of docs.  Picks docs to delete,
> deletes them.


Grant, can we assume that the document id will remain consistent between the
time the user obtained the results and the time they click delete? I was under
the impression that document ids can change on optimize, etc.

Re: Which is faster/better

Grant Ingersoll-2

On Nov 25, 2008, at 12:59 PM, Khawaja Shams wrote:

> On Tue, Nov 25, 2008 at 8:42 AM, Grant Ingersoll <[hidden email]> wrote:
>
>> User does a search.  Gets back a set of docs.  Picks docs to delete,
>> deletes them.
>
> Grant, can we assume that the document id will remain consistent between
> the time the user obtained the results and the time they click delete? I
> was under the impression that document ids can change on optimize, etc.


That's up to your business logic to determine.  The id is consistent
for the life of the IndexReader.  Whether you change your reader or not
is up to you.
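
A toy illustration of that point, in plain Java rather than Lucene: here a "docID" is simply a position in an immutable snapshot, and an optimize is modelled as dropping deleted entries and renumbering the rest. ToyDocIds, Snapshot and optimize are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: docIDs are list positions; nothing here is Lucene API. */
final class ToyDocIds {
    /** An immutable snapshot: the docID of a key is its position in the list. */
    static final class Snapshot {
        private final List<String> keys;
        Snapshot(List<String> keys) { this.keys = new ArrayList<>(keys); }
        int docIdOf(String key)     { return keys.indexOf(key); }
        String keyAt(int docId)     { return keys.get(docId); }
    }

    /** Models an optimize: the deleted entry is dropped and the rest renumbered. */
    static Snapshot optimize(Snapshot old, String deletedKey) {
        List<String> compacted = new ArrayList<>(old.keys);
        compacted.remove(deletedKey);
        return new Snapshot(compacted);
    }

    public static void main(String[] args) {
        Snapshot first = new Snapshot(Arrays.asList("mail-1", "mail-2", "mail-3"));
        int rememberedId = first.docIdOf("mail-3");     // id held by the UI
        Snapshot reopened = optimize(first, "mail-1");  // index changed, new snapshot opened
        System.out.println("same snapshot: " + first.keyAt(rememberedId));
        System.out.println("after reopen, mail-3 is at docID " + reopened.docIdOf("mail-3"));
    }
}
```

In this model a remembered id is safe only against the snapshot it came from; once a new snapshot is opened, a stable key (in Lucene terms, a unique Term) is the safer handle for a delete.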





Re: Which is faster/better

adb
In reply to this post by Michael McCandless-2
Michael McCandless wrote:
>
> If you have nothing open already, and all you want to do is delete
> certain documents and make a commit point, then using IndexReader vs
> IndexWriter should show very little difference in speed.

Thanks.  This use case can assume there may be nothing open.  I prefer
IndexWriter as delete=write is a much clearer concept than delete=read...

> As of 2.4, IndexWriter now provides delete-by-Query, which I think
> ought to meet nearly all of the cases where someone wants to
> delete-by-docID using IndexReader.

Yes, that is an excellent addition.  Up to now, our only use case for
delete-by-docId has been to perform a dBQ, and so far we have been using your
suggestion from last year about how to delete documents for ALL terms.

Antony



Re: Which is faster/better

Ganesh - yahoo
I have to tag a document based on a user request. On deletion, I should flag
it as 'marked for delete', and on a document state change I need to update the
document. An update internally does a delete and an add. I am committing the
writer and re-opening the reader every minute.

Consider that, within that minute, the user deletes a document from the UI.
If I use IndexWriter, it updates the document, but the change becomes visible
only when the reader is refreshed, up to a minute later. If the user refreshes
the page, they could see the deleted item again.

To avoid this situation, I plan to:
 1. Delete the document using the reader.
 2. Add the document with its new state using the writer.

I think we can't avoid using IndexReader's deleteDocument. Please suggest
any other way, if there is one.

Regards
Ganesh


Re: Which is faster/better

Michael McCandless-2

So in your UI, you'd like the delete to happen immediately and then  
it's OK if the updated (added) document then takes a minute to appear?

OK, I agree this (the immediacy of doing deletes via IndexReader) is a  
good reason to keep IndexReader.deleteDocument for now.
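
The visibility difference being discussed can be sketched with a toy model in plain Java. It is an illustration under simplifying assumptions, not Lucene code: ToyVisibility, writerDelete, writerCommit and the nested Reader are invented names, a reader is modelled as a copy of the committed documents plus a set of deletion bits, and neither locking nor the persistence of reader-side deletes is modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Toy model only: a reader is a snapshot plus deletion bits; not Lucene API. */
final class ToyVisibility {
    private final List<String> committed = new ArrayList<>();       // committed documents
    private final List<String> bufferedDeletes = new ArrayList<>(); // writer-side, uncommitted

    ToyVisibility(List<String> initialDocs) { committed.addAll(initialDocs); }

    void writerDelete(String doc) { bufferedDeletes.add(doc); }
    void writerCommit()           { committed.removeAll(bufferedDeletes); bufferedDeletes.clear(); }
    Reader openReader()           { return new Reader(committed); }

    static final class Reader {
        private final List<String> docs;            // snapshot taken when the reader was opened
        private final BitSet deleted = new BitSet();
        Reader(List<String> snapshot) { this.docs = new ArrayList<>(snapshot); }

        /** Marks a bit; searches on this same reader skip the document at once. */
        void deleteDocument(int docId) { deleted.set(docId); }

        List<String> search() {
            List<String> hits = new ArrayList<>();
            for (int i = 0; i < docs.size(); i++) {
                if (!deleted.get(i)) hits.add(docs.get(i));
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        ToyVisibility index = new ToyVisibility(Arrays.asList("mail-1", "mail-2"));
        Reader open = index.openReader();
        index.writerDelete("mail-1");
        index.writerCommit();
        System.out.println("old reader after writer delete: " + open.search());
        System.out.println("reopened reader:                " + index.openReader().search());
        open.deleteDocument(0);
        System.out.println("old reader after its own delete: " + open.search());
    }
}
```

One caveat the model leaves out: as far as I know, in Lucene 2.x an IndexReader must obtain the index write lock before it can delete, so it cannot delete while an IndexWriter is open on the same index. Anyone following the reader-delete plus writer-add approach would need to sequence the two accordingly.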

Mike

Re: Which is faster/better

Jason Rutherglen
It would be nice to have a pluggable solution for deleteddocs in IndexReader
that accepts a Filter, and have BitVector implement Filter.  This way I do
not have to implement IndexReader.clone.

Re: Which is faster/better

Ganesh - yahoo
In reply to this post by Michael McCandless-2
>
> So in your UI, you'd like the delete to happen immediately and then it's
> OK if the updated (added) document then takes a minute to appear?

Yes. Whenever a document's state changes, it moves to a different store
(this is basically a mail application; each mail has a state such as deleted,
junk, delivered, etc.). Each store has a separate UI. When the user is viewing
a store and updates a document, the record is deleted, certain actions are
performed, and it is added back with its new state so that it can be viewed
from a different store. I am using only Lucene as my DB, with no other
database.

Regards
Ganesh

