TermEnum with deleted dccuments

classic Classic list List threaded Threaded
3 messages Options
adb
Reply | Threaded
Open this post in threaded view
|

TermEnum with deleted dccuments

adb
I am merging Index A to Index B.  First I read the terms for a particular field
from index A and some of the documents in A get deleted.

I then enumerate the terms on a different field also in index A, but the terms
from the deleted document are still present.

The termEnum.docFreq() also returns > 0 for those terms even though the docs are
deleted.

Should this be the case?  I have tried closing the reader between enumerations,
but no difference.

Antony




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: TermEnum with deleted dccuments

Michael McCandless-2
This is known & expected.

Lucene does not update the terms dictionary (meaning which terms are
in the index, and their frequency) in response to deleted docs.

It does update TermDocs enumeration, ie once you get the TermDocs for
a given term and step through its docs, the deleted docs will not be
returned.

One workaround is to call IndexWriter.expungeDeletes, but that's a
costly operation (forces merges of any segments containing deletes).

https://issues.apache.org/jira/browse/LUCENE-1613 was opened to gather
use cases / issues on this... if this is impacting your application,
can you post some details to that issue?

Mike

On Thu, May 7, 2009 at 1:04 AM, Antony Bowesman <[hidden email]> wrote:

> I am merging Index A to Index B.  First I read the terms for a particular
> field from index A and some of the documents in A get deleted.
>
> I then enumerate the terms on a different field also in index A, but the
> terms from the deleted document are still present.
>
> The termEnum.docFreq() also returns > 0 for those terms even though the docs
> are deleted.
>
> Should this be the case?  I have tried closing the reader between
> enumerations, but no difference.
>
> Antony
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

adb
Reply | Threaded
Open this post in threaded view
|

Re: TermEnum with deleted dccuments

adb
Hi Mike,

Thanks for the response.

I looked at that issue, but my case is trivial to fix.  I just keep the Set of
terms I have deleted and ignore those during my second interation.

Thanks
Antony



Michael McCandless wrote:

> This is known & expected.
>
> Lucene does not update the terms dictionary (meaning which terms are
> in the index, and their frequency) in response to deleted docs.
>
> It does update TermDocs enumeration, ie once you get the TermDocs for
> a given term and step through its docs, the deleted docs will not be
> returned.
>
> One workaround is to call IndexWriter.expungeDeletes, but that's a
> costly operation (forces merges of any segments containing deletes).
>
> https://issues.apache.org/jira/browse/LUCENE-1613 was opened to gather
> use cases / issues on this... if this is impacting your application,
> can you post some details to that issue?
>
> Mike
>
> On Thu, May 7, 2009 at 1:04 AM, Antony Bowesman <[hidden email]> wrote:
>> I am merging Index A to Index B.  First I read the terms for a particular
>> field from index A and some of the documents in A get deleted.
>>
>> I then enumerate the terms on a different field also in index A, but the
>> terms from the deleted document are still present.
>>
>> The termEnum.docFreq() also returns > 0 for those terms even though the docs
>> are deleted.
>>
>> Should this be the case?  I have tried closing the reader between
>> enumerations, but no difference.
>>
>> Antony



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]