"Deleting" documents without deleting them

"Deleting" documents without deleting them

Daniel Noll-3-2
Hi all.

I'm trying to implement a form of document deletion where the previous
versions are kept around forever (a primitive form of versioning) but
excluded from the search results.

I notice that after calling IndexWriter.deleteDocuments, even if you
close and reopen the index, the documents are still accessible using
document(int) but are not returned from queries, which is exactly the
behaviour I want.  However, if I call optimize() they will obviously
be obliterated.

My question is: as long as I never call optimize() -- will the deleted
documents hang around forever, or will a merge due to adding the new
documents eventually cause them to be removed?

If they will be removed then I need some other way to avoid them being
returned.  I was thinking of actually *not* deleting them, but
maintaining a giant filter - I could store this filter on disk but
it's going to be pretty large even if I use a BitSet. :-(   Is there
any other way to go about it?
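[For scale, the on-disk cost of such a filter may be smaller than feared: a bit set over doc IDs is about one bit per document. A minimal sketch of persisting it with plain java.util.BitSet — the class name is made up for illustration, and a real implementation would wire this into the index directory:]

```java
import java.util.BitSet;

// Sketch: persist a "fake-deleted" filter as one bit per doc ID.
// BitSet.toByteArray() stores bits up to the highest set bit, so the
// file is at most ceil(maxDoc / 8) bytes -- about 1.2 MB for a
// 10-million-document index.
class FakeDeletedStore {
    static byte[] serialize(BitSet deleted) {
        return deleted.toByteArray();   // little-endian bit dump
    }

    static BitSet deserialize(byte[] bytes) {
        return BitSet.valueOf(bytes);   // exact round trip
    }
}
```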

Daniel




--
Daniel Noll                            Forensic and eDiscovery Software
Senior Developer                              The world's most advanced
Nuix                                                email data analysis
http://nuix.com/                                and eDiscovery software

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: "Deleting" documents without deleting them

Michael McCandless-2
An incidental merge will delete them.

I think you'll have to maintain your own filter... but it shouldn't be
that large?  I.e. it's as large as the deleted-docs BitVector would be
anyway... except that the docs never go away.

Mike

On Mon, Mar 15, 2010 at 11:20 PM, Daniel Noll <[hidden email]> wrote:
> [...]



Re: "Deleting" documents without deleting them

Rene Hackl-Sommer
Hi Daniel,

Unless you have only a few documents and a small index, I don't think
relying on never calling optimize is an approach you should depend on.

What about if you reindexed the documents you are deleting, adding a
field <excludeFromSearch> with the value "true"? This would imply that
either

1) all fields are stored, so you may retrieve them from the original doc
and add them to the new one plus the exclusion field, or
2) if a lot of fields are only indexed, you'd need access to the
original source. (With limitations it is also possible to reconstruct a
field from indexed data only, but this is not generally recommended.)

During search, just add "NOT excludeFromSearch:true" to the query.
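[One caveat with that query form: a query consisting only of a negative clause matches nothing in Lucene's query parser, so the exclusion has to ride along with a positive clause. A tiny sketch of wrapping a user query this way — the helper and class names are made up, and excludeFromSearch is the field suggested above:]

```java
// Sketch: wrap a user query so fake-deleted docs are filtered out.
// The positive part must come first, because "NOT field:value" on its
// own is a purely negative query and matches no documents.
class ExclusionQuery {
    static String excludeDeleted(String userQuery) {
        return "(" + userQuery + ") NOT excludeFromSearch:true";
    }
}
```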

If you need to keep track of which versions belong together, you may
need to think about how you uniquely identify documents, how this
changes between versions, and if the update dates might be of any help.

Cheers
Rene


On 16.03.2010 05:20, Daniel Noll wrote:

> [...]




Re: "Deleting" documents without deleting them

TCK-2
Wouldn't these excluded/filtered documents skew the scores even though they
are supposed to be marked as deleted? Don't the idf values used in scoring
depend on the entire document set and not just the matching hits for a
query?

Thanks,
TCK




On Tue, Mar 16, 2010 at 5:45 AM, Rene Hackl-Sommer <[hidden email]> wrote:
> [...]

Re: "Deleting" documents without deleting them

Rene Hackl-Sommer
I cannot comment on the "marked-as-deleted" documents, but for the
approach I outlined: this might impact the scores. I prefer to say
'impact' instead of 'skew', because to me 'skew' would imply that the
original scores are some kind of ideal state which is distorted. I don't
think this is necessarily the case with term weight shifts.

It really depends on the specific setup. If there are millions of
documents in the index, and some of them contribute to the statistical
figures ten times over while others contribute a hundred times (not as
real physical multiple instances), I don't think this would lead to a
significant change overall. With a large index, I would be surprised if
it affected precision by anything drastic, say 5%.

And if marginal shifts are troublesome, you can always maintain two
indexes: one with all the document versions for reference if required
and the other one with only the current documents for everyday searches.

Cheers
Rene

On 16.03.2010 14:05, TCK wrote:

> [...]




Re: "Deleting" documents without deleting them

Daniel Noll-3-2
On Tue, Mar 16, 2010 at 20:45, Rene Hackl-Sommer <[hidden email]> wrote:

> Hi Daniel,
>
> Unless you have only a few documents and a small index, I don't think never
> calling optimize is going to be a means you should rely upon.
>
> What about if you reindexed the documents you are deleting, adding a field
> <excludeFromSearch> with the value "true"? This would imply that either
>
> 1) all fields are stored, so you may retrieve them from the original doc and
> add them to the new one plus the exclusion field
> 2) or if a lot of fields are only indexed you'd need access to the original
> source. (With limitations it is also possible to reconstruct a field from
> indexed data only, but not generally recommendable)

Unfortunately it also makes the assumption that it's OK for the doc
IDs to shift - in our application this is not the case, as we use them
to key into various databases.  So for us, the effects would be like this:
   Relocating one document to the end of the index and marking the
earlier one as fake-deleted => one query to each relevant table to
update that one document.
   Deleting a document in order to re-add a fake-deleted version at the
end (implying that merging occurs) => if 100,000 documents shift, up
to 100,000 IDs in each table need to be updated.

(Why don't we use a separate int field?  Because for tables like tags,
it's too slow to do an additional query into Lucene to map the virtual
ID back to the real doc ID when building filters by tag.)

Of course, if it were possible to add one field to a document without
deleting and re-adding it, yes -- then this would be the way to go for
sure.  In fact, if Lucene had the ability to incrementally update a
document in the first place, I would never have needed to embark on
this whole exercise, as I could just update the document I want to
update and move the old fields to new fields.  At some point a
replaceDocument() which maintains doc IDs would be a very nice thing
to have.

> If you need to keep track of which versions belong together, you may need to
> think about how you uniquely identify documents, how this changes between
> versions, and if the update dates might be of any help.

That gives me an idea.  We have a GUID field already, which is
actually for other purposes, but I could go over TermEnum/TermDocs for
that field and build a filter which only matches the last doc for each
term.  Then I don't have to pay for the storage of a filter... but I
guess it will cost to build this filter anyway so I don't know if it's
practical yet.
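[The core of that idea - for each GUID, only the last doc ID survives - can be sketched independently of the TermEnum/TermDocs plumbing. Assuming the scan yields (guid, docId) pairs, the filter is just the per-GUID maximum doc ID; all names below are illustrative:]

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: build a "latest version only" filter from (guid, docId)
// pairs, as a TermEnum/TermDocs walk over the GUID field would yield
// them. For each GUID, keep only the highest doc ID (newest version).
class LatestVersionFilter {
    static BitSet build(List<Map.Entry<String, Integer>> postings, int maxDoc) {
        Map<String, Integer> latest = new HashMap<>();
        for (Map.Entry<String, Integer> p : postings) {
            latest.merge(p.getKey(), p.getValue(), Integer::max);
        }
        BitSet bits = new BitSet(maxDoc);
        for (int doc : latest.values()) {
            bits.set(doc);
        }
        return bits;   // set bit => doc is the current version, searchable
    }
}
```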

I guess storing the filter on disk would be an easier way to go, with
the caveat that it will cost a bit to flip bits each time a new
document is fake-deleted.

Daniel


