Deleting a single TermPosition <doc, frequency, position> for a Document


adb
I'd like to 'update' a single Document in a Lucene index.  In practice, this
'update' is actually just a removal of a single TermPosition for a given Term
for a given doc Id.

I don't think this is currently possible, but would it be easy to change Lucene
to support this type of usage?

The reason for this is to optimise my index usage.  I'm using Lucene to index
arbitrary data sets.  In some data sets, however, each Document is indexed once
for each user who has an interest in the document.  For example, with mail data,
a mail item (with a single recipient) is stored as two Documents, once with the
'user' field set to the sender's user Id and again with the 'user' field set to
the recipient's user Id.  Searches just filter mail for a given user by the
'user' field.

When one of those users deletes the mail, the Document carrying that user's
'user' field value is simply deleted.  One of the original reasons for doing this
was to enable horizontal partitioning of the index.  This works nicely, but of
course the index is bigger than necessary and the number of term positions is at
least double what is necessary.
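
To make the scheme described above concrete, here is a minimal sketch in plain Java. It is a toy in-memory model, not Lucene code: the class `PerUserCopies` and its methods are hypothetical names chosen for illustration, and the "search" is a naive substring match standing in for a real query filtered by the 'user' field.

```java
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of the one-copy-per-user scheme (assumption: not Lucene).
final class PerUserCopies {
    // One indexed copy of a mail item, tagged with the user who can see it.
    private static final class Copy {
        final String mailId;
        final String user;
        final String body;

        Copy(String mailId, String user, String body) {
            this.mailId = mailId;
            this.user = user;
            this.body = body;
        }
    }

    private final List<Copy> index = new ArrayList<>();

    // Index the same mail once for each interested user.
    void addMail(String mailId, String body, String... users) {
        for (String user : users) {
            index.add(new Copy(mailId, user, body));
        }
    }

    // Search is a text match restricted to copies owned by the given user.
    List<String> search(String user, String word) {
        List<String> hits = new ArrayList<>();
        for (Copy c : index) {
            if (c.user.equals(user) && c.body.contains(word)) {
                hits.add(c.mailId);
            }
        }
        return hits;
    }

    // A user deleting the mail removes only that user's copy.
    void deleteForUser(String mailId, String user) {
        index.removeIf(c -> c.mailId.equals(mailId) && c.user.equals(user));
    }

    // Number of indexed copies; grows with the number of interested users.
    int size() {
        return index.size();
    }
}
```

Deletion stays trivial in this model, but every interested user adds a full copy of the body, which is the size cost described in the post.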

I had originally thought to index the data once, with the user field set to
both the sender and recipient user Ids, but when the sender or recipient deletes
the mail from their mailbox, searching becomes more complicated, as the index
does not reflect the external database state unless the mail is reindexed.
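
As a rough illustration of the operation being asked for, the sketch below models an inverted index as a map from term to a set of doc ids and removes one doc from one term's postings. Everything here is an assumption made for illustration: `ToyPostings` and `removePosting` are hypothetical names, frequencies and positions are left out, and this is not how Lucene stores postings or an API it provides.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: term -> sorted set of doc ids (assumption: not Lucene's format).
final class ToyPostings {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Record that docId contains term.
    void add(String term, int docId) {
        postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // Hypothetical operation: drop one doc from one term's postings,
    // leaving every other term of that doc untouched.
    boolean removePosting(String term, int docId) {
        TreeSet<Integer> docs = postings.get(term);
        if (docs == null || !docs.remove(docId)) {
            return false; // nothing to remove
        }
        if (docs.isEmpty()) {
            postings.remove(term); // no docs left for this term
        }
        return true;
    }

    // Doc ids currently listed for a term (empty if the term is unknown).
    Set<Integer> docsFor(String term) {
        TreeSet<Integer> docs = postings.get(term);
        if (docs == null) {
            return Collections.<Integer>emptySet();
        }
        return Collections.unmodifiableSet(docs);
    }
}
```

With a mail indexed once as doc 7 under both `user:sender` and `user:recipient`, calling `removePosting("user:sender", 7)` would hide it from the sender while the body terms and the recipient's posting stay in place.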

Is this something others have wanted, or are there other solutions to this problem?

Thanks
Antony




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Deleting a single TermPosition <doc, frequency, position> for a Document

Otis Gospodnetic-2
Is your user field stored?  If so, you could find the target Document, get the user field value, modify it, and re-add it to the Document (or something close to this -- I am doing this with one of the indices on simpy.com and it's working well).
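
One possible reading of this suggestion, sketched in plain Java under stated assumptions: the 'user' field is stored as a space-separated list of user ids, and an update is a delete followed by a re-add of the whole document. `StoredFieldUpdate` and `removeUser` are hypothetical names, and the map stands in for an index with stored fields; this is not Lucene code and may differ from what is actually done on simpy.com.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy document store keyed by id; stands in for an index whose fields are all stored.
final class StoredFieldUpdate {
    private final Map<String, Map<String, String>> docs = new HashMap<>();

    void add(String id, Map<String, String> storedFields) {
        docs.put(id, new HashMap<>(storedFields));
    }

    // Stored fields of a document, or null if it is not in the store.
    Map<String, String> get(String id) {
        return docs.get(id);
    }

    // Read the stored "user" value, drop one user id, then delete and re-add the document.
    void removeUser(String id, String user) {
        Map<String, String> old = docs.remove(id); // delete the existing document
        if (old == null) {
            return;
        }
        List<String> users =
                new ArrayList<>(Arrays.asList(old.getOrDefault("user", "").split(" ")));
        users.removeIf(String::isEmpty);
        users.remove(user);
        if (users.isEmpty()) {
            return; // no interested users left, so the document stays deleted
        }
        Map<String, String> updated = new HashMap<>(old);
        updated.put("user", String.join(" ", users));
        docs.put(id, updated); // re-add with the modified field
    }
}
```

The sketch only works because every field it needs can be read back; it says nothing about fields that are indexed but not stored.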

Otis

--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Antony Bowesman <[hidden email]>
To: [hidden email]
Sent: Tuesday, January 8, 2008 12:47:05 AM
Subject: Deleting a single TermPosition <doc, frequency, position> for a Document


Re: Deleting a single TermPosition <doc, frequency, position> for a Document

adb
Otis Gospodnetic wrote:
> Is your user field stored?  If so, you could find the target Document, get the
> user field value, modify it, and re-add it to the Document (or something
> close to this -- I am doing this with one of the indices on simpy.com and
> it's working well).

No, it's not stored.  I'm not sure I understand how you 'modify it', as it's not
possible to modify an existing Document.  Or do you mean you fetch all the stored
fields from the existing Document, delete the existing Document, then add it back
with the modified field?

I have ~20 fields per Document and most are not stored.
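
The concern can be shown with a small plain-Java sketch: if a document is rebuilt only from what can be read back, the terms of unstored fields disappear. This is a toy model under assumptions, not Lucene code; `UnstoredFieldLoss`, `Field`, `indexedTerms` and `retrieve` are hypothetical names, and each field is treated as a single term.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of stored vs. indexed-only fields (assumption: not Lucene).
final class UnstoredFieldLoss {
    static final class Field {
        final String value;
        final boolean stored;

        Field(String value, boolean stored) {
            this.value = value;
            this.stored = stored;
        }
    }

    // Searchable terms ("field:value") come from every field, stored or not.
    static Set<String> indexedTerms(Map<String, Field> doc) {
        Set<String> terms = new HashSet<>();
        for (Map.Entry<String, Field> e : doc.entrySet()) {
            terms.add(e.getKey() + ":" + e.getValue().value);
        }
        return terms;
    }

    // What can be read back from the index: only the stored fields.
    static Map<String, Field> retrieve(Map<String, Field> doc) {
        Map<String, Field> out = new HashMap<>();
        for (Map.Entry<String, Field> e : doc.entrySet()) {
            if (e.getValue().stored) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

Re-adding `retrieve(doc)` in place of `doc` keeps the stored fields searchable but drops every indexed-only field, so a delete and re-add of this kind needs the original source data for those fields.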

Antony

