How to not overwrite a Document if it 'already exists'?

adb
I'm adding Documents in batches to an index with IndexWriter.  In certain
circumstances, I do not want to add the Document if it already exists, where
existence is determined by field id=myId.

Is there any way to do this with IndexWriter or do I have to open a reader and
look for the term id:XXX?  Given that opening a reader is expensive, is there
any way to do this efficiently?

I guess what I want is

IndexWriter.addDocumentIfMissing(Term term, Document doc, Analyzer analyzer)

Thanks
Antony




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: How to not overwrite a Document if it 'already exists'?

Michael McCandless-2
Lucene doesn't provide any way to do this, except opening a reader.

Opening a reader is not "that" expensive if you use it for this
purpose.  EG neither norms nor FieldCache will be loaded if you just
enumerate the term docs.
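
A minimal sketch of the check-before-add pattern, written in plain Java against a hypothetical in-memory stand-in rather than the real Lucene classes. The names (AddIfMissingSketch, addIfMissing, commit, get) are illustrative assumptions, not Lucene API; in a real index the lookup in `committed` would correspond to asking an IndexReader for the term docs of id:XXX. The sketch also assumes the reader only sees committed documents, so ids added earlier in the same batch are tracked separately.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an index keyed by a unique id field; not a Lucene class.
class AddIfMissingSketch {
    // Documents a reader opened before the batch would see (assumed: committed only).
    private final Map<String, String> committed = new HashMap<>();
    // Documents added during the current batch, not yet visible to that reader.
    private final Map<String, String> pending = new HashMap<>();

    // Adds the document only if no document with this id exists; returns true if added.
    boolean addIfMissing(String id, String doc) {
        if (committed.containsKey(id) || pending.containsKey(id)) {
            return false; // already present: leave the existing document alone
        }
        pending.put(id, doc);
        return true;
    }

    // Makes the batch visible, as a commit followed by reopening the reader would.
    void commit() {
        committed.putAll(pending);
        pending.clear();
    }

    // Returns the committed document for this id, or null if there is none.
    String get(String id) {
        return committed.get(id);
    }
}
```

Without the `pending` map, two documents with the same id arriving in one batch would both pass the check, since neither is visible to the reader yet.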

But, you can let Lucene do the same thing for you by just always using
updateDocument, which'll remove the old doc if it's present.

Mike

On Tue, May 5, 2009 at 6:45 AM, Antony Bowesman <[hidden email]> wrote:

> I'm adding Documents in batches to an index with IndexWriter.  In certain
> circumstances, I do not want to add the Document if it already exists, where
> existence is determined by field id=myId.
>
> Is there any way to do this with IndexWriter or do I have to open a reader
> and look for the term id:XXX?  Given that opening a reader is expensive, is
> there any way to do this efficiently?
>
> I guess what I want is
>
> IndexWriter.addDocumentIfMissing(Term term, Document doc, Analyzer analyzer)
>
> Thanks
> Antony

Re: How to not overwrite a Document if it 'already exists'?

adb
Michael McCandless wrote:
> Lucene doesn't provide any way to do this, except opening a reader.
>
> Opening a reader is not "that" expensive if you use it for this
> purpose.  EG neither norms nor FieldCache will be loaded if you just
> enumerate the term docs.

Thanks for that info.  These indexes will be large, in the 10s of millions.  id
field is unique and is 29 bytes.  I guess that's still a lot of data to trawl
through to get to the term.

> But, you can let Lucene do the same thing for you by just always using
> updateDocument, which'll remove the old doc if it's present.

That's precisely what I don't want to occur.  I have two forms of a Document,
which represent mail items.  One 'full' version containing all index and stored
data, which represents a searchable mail item and one 'base', which is simply a
marker Document which represents a mail in a forwarded mail chain, with just a
couple of stored fields containing the mail meta data.

Under normal circumstances there are no problems as mails arrive in sequence and
are never handled twice, but there is one case, during a reindex op, when the
arrival of those mails can come out of sequence, i.e. a full mail is indexed
first, but that mail is later processed as part of a forwarded mail chain of
another mail.

It is the second time that mail is handled as a base mail that I do not want it
to overwrite the full version.
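
To make the difference concrete, here is a small illustration in plain Java that uses a HashMap as a hypothetical stand-in for the index (none of these names are Lucene API, and the id "mail-42" is made up). `put` plays the role of updateDocument, replacing whatever is stored under the id, while `putIfAbsent` plays the role of the add-if-missing behaviour wanted here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a map keyed by mail id stands in for the index.
class OverwriteVsKeepSketch {
    // Replays the out-of-sequence case: the full mail first, then the same mail as a base marker.
    static String replay(boolean overwrite) {
        Map<String, String> index = new HashMap<>();
        index.put("mail-42", "full");              // full version is indexed first
        if (overwrite) {
            index.put("mail-42", "base");          // update semantics: the old doc is replaced
        } else {
            index.putIfAbsent("mail-42", "base");  // add-if-missing: the existing doc is kept
        }
        return index.get("mail-42");
    }
}
```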

Would it be technically difficult to support something like this in the
IndexWriter API and, if not, would it end up being more efficient than using a
reader/terms to check this?

Antony





Re: How to not overwrite a Document if it 'already exists'?

Michael McCandless-2
On Tue, May 5, 2009 at 7:24 PM, Antony Bowesman <[hidden email]> wrote:

> Michael McCandless wrote:
>>
>> Lucene doesn't provide any way to do this, except opening a reader.
>>
>> Opening a reader is not "that" expensive if you use it for this
>> purpose.  EG neither norms nor FieldCache will be loaded if you just
>> enumerate the term docs.
>
> Thanks for that info.  These indexes will be large, in the 10s of millions.
>  id field is unique and is 29 bytes.  I guess that's still a lot of data to
> trawl through to get to the term.

Have you tested how long it takes to look up docs from your id?

>> But, you can let Lucene do the same thing for you by just always using
>> updateDocument, which'll remove the old doc if it's present.
>
> That's precisely what I don't want to occur.  I have two forms of a
> Document, which represent mail items.  One 'full' version containing all
> index and stored data, which represents a searchable mail item and one
> 'base', which is simply a marker Document which represents a mail in a
> forwarded mail chain, with just a couple of stored fields containing the
> mail meta data.
>
> Under normal circumstances there are no problems as mails arrive in sequence
> and are never handled twice, but there is one case, during a reindex op,
> when the arrival of those mails can come out of sequence, i.e. a full mail
> is indexed first, but that mail is later processed as part of a forwarded
> mail chain of another mail.
>
> It is the second time that mail is handled as a base mail that I do not want
> it to overwrite the full version.
>
> Would it be technically difficult to support something like this in the
> IndexWriter API and, if not, would it end up being more efficient than using
> a reader/terms to check this?

Couldn't you just give the base & full docs different ids?  Then you
can independently choose which one to update?
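
One way this suggestion might look, sketched in plain Java with a hypothetical in-memory map instead of a real index. The "full:" and "base:" key prefixes and the lookup helper are assumptions made for illustration; the thread does not specify an id scheme.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: base and full documents live under different ids.
class SeparateIdsSketch {
    private final Map<String, String> index = new HashMap<>();

    // Each form gets its own id, so updating one never touches the other.
    void updateFull(String mailId, String doc) {
        index.put("full:" + mailId, doc);
    }

    void updateBase(String mailId, String doc) {
        index.put("base:" + mailId, doc);
    }

    // Prefers the full document and falls back to the base marker; null if neither exists.
    String lookup(String mailId) {
        String full = index.get("full:" + mailId);
        return full != null ? full : index.get("base:" + mailId);
    }
}
```

The trade-off in this sketch is that a mail stored only as a base marker costs two lookups instead of one.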

Mike


Re: How to not overwrite a Document if it 'already exists'?

adb
>> Thanks for that info.  These indexes will be large, in the 10s of millions.
>>  id field is unique and is 29 bytes.  I guess that's still a lot of data to
>> trawl through to get to the term.
>
> Have you tested how long it takes to look up docs from your id?

Not in indexes that size in a live environment, as I don't have the hardware to
run those sorts of tests :( although I know that, in general, lookup is fast.

> Couldn't you just give the base & full docs different ids?  Then you
> can independently choose which one to update?

I considered that, but the normal case will not need to worry about this
scenario.

There is only ever one instance of a mail Doc, whether it is a root mail or part
of a forward chain and a root mail can of course be part of a forward chain at
some point, so it should be optimal to just fetch the one Document for the mail
Id without first trying the true Id, then some pseudo Id if it isn't found.

Unfortunately, I'm having to solve this problem in my Lucene app as the tool
that's generating this data is unable to know what has or has not been handled
previously.

I'm implementing it using the IndexReader approach for now and will try to get
some performance data, so thanks for your comments Mike.

Antony







