Modifying a document by updating a payload?


adb
I seem to recall some discussion about updating a payload, but I can't find it.

I was wondering whether it is possible to use a payload to implement 'modify' of a
Lucene document.  For example, I have an ID field, which has a unique ID
referring to an external DB, and I would like to store a short bitmap giving
state information about aspects of the Document; this state could change during
the life of the Document and should be available to my searchers.
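
To make the "short bitmap" idea concrete, here is a minimal sketch in plain Java of packing a set of state flags into two bytes, the kind of value that could be carried as a payload or a stored field. The `StateBitmap` class and the `Flag` names are hypothetical; the real flags would come from the application, and nothing here is a Lucene API.

```java
import java.util.EnumSet;

/** Illustrative only: packs per-document state flags into a 2-byte bitmap. */
final class StateBitmap {

    /** Hypothetical state flags; an application would define its own. */
    enum Flag { READ, FLAGGED, ARCHIVED, LOCKED }

    /** Encodes the flags as a big-endian 16-bit bitmap, one bit per flag ordinal. */
    static byte[] encode(EnumSet<Flag> flags) {
        int bits = 0;
        for (Flag f : flags) {
            bits |= 1 << f.ordinal();
        }
        return new byte[] { (byte) (bits >>> 8), (byte) bits };
    }

    /** Decodes a 2-byte bitmap produced by encode() back into a flag set. */
    static EnumSet<Flag> decode(byte[] payload) {
        if (payload.length != 2) {
            throw new IllegalArgumentException("expected exactly 2 bytes");
        }
        int bits = ((payload[0] & 0xFF) << 8) | (payload[1] & 0xFF);
        EnumSet<Flag> flags = EnumSet.noneOf(Flag.class);
        for (Flag f : Flag.values()) {
            if ((bits & (1 << f.ordinal())) != 0) {
                flags.add(f);
            }
        }
        return flags;
    }
}
```

A searcher that reads the two bytes back can call `StateBitmap.decode` to recover the flags. How the bytes get attached to a document, and whether they can be changed afterwards, is the actual question of this thread.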

I've not yet played with payloads and I understand there is something in the
pipeline about updating Documents, but is it possible to update a payload for an
existing Document?

Antony



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Modifying a document by updating a payload?

Michael McCandless-2

Unfortunately you will have to delete the old doc, then reindex a new  
doc, in order to change any payloads in the document's Tokens.
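
As a rough illustration of the delete-then-reindex pattern, here is a minimal sketch in plain Java. `ToyIndex` and its methods are hypothetical stand-ins, not Lucene classes; they only model the idea that an "update" is a deletion marker on the old doc plus a complete new doc. In Lucene itself, `IndexWriter.updateDocument(Term, Document)` packages the same two steps into one call.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy append-only index keyed by an external ID; NOT Lucene's API. */
final class ToyIndex {
    private final List<String> ids = new ArrayList<>();       // doc number -> external ID
    private final List<byte[]> payloads = new ArrayList<>();  // doc number -> payload bytes
    private final BitSet deleted = new BitSet();              // deletion markers
    private final Map<String, Integer> liveDocById = new HashMap<>();

    /** Appends a new doc and returns its doc number. Does not check for duplicate IDs. */
    int add(String id, byte[] payload) {
        int docNum = ids.size();
        ids.add(id);
        payloads.add(payload.clone());
        liveDocById.put(id, docNum);
        return docNum;
    }

    /** "Update": mark the old doc deleted, then index a complete replacement. */
    int update(String id, byte[] newPayload) {
        Integer old = liveDocById.get(id);
        if (old != null) {
            deleted.set(old);   // the old doc's bytes are never modified
        }
        return add(id, newPayload);
    }

    /** Payload of the live doc with this ID, or null if there is none. */
    byte[] payloadOf(String id) {
        Integer docNum = liveDocById.get(id);
        return docNum == null ? null : payloads.get(docNum).clone();
    }

    int liveCount() { return ids.size() - deleted.cardinality(); }

    int totalCount() { return ids.size(); }
}
```

After one `add` and one `update` for the same ID, the toy index holds two docs but only one live one. That is the cost of this approach: the whole document is re-added even when only a few payload bytes changed.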

This issue:

     https://issues.apache.org/jira/browse/LUCENE-1231

which is still in progress, could make updating stored (but not  
indexed) fields a much lower cost operation, but that's not for sure  
and it's not clear when that issue will be done.

Mike

Antony Bowesman wrote:

> I seem to recall some discussion about updating a payload, but I  
> can't find it.
>
> I was wondering whether it is possible to use a payload to implement
> 'modify' of a Lucene document.  For example, I have an ID field,
> which has a unique ID referring to an external DB, and I would like
> to store a short bitmap giving state information about aspects of
> the Document; this state could change during the life of the
> Document and should be available to my searchers.
>
> I've not yet played with payloads and I understand there is  
> something in the pipeline about updating Documents, but is it  
> possible to update a payload for an existing Document?
>
> Antony


Re: Modifying a document by updating a payload?

adb
Hi Mike,

> Unfortunately you will have to delete the old doc, then reindex a new
> doc, in order to change any payloads in the document's Tokens.
>
> This issue:
>
>     https://issues.apache.org/jira/browse/LUCENE-1231
>
> which is still in progress, could make updating stored (but not indexed)
> fields a much lower cost operation, but that's not for sure and it's not
> clear when that issue will be done.

Michael Busch's ApacheCon (2006/7??) presentation summary included the bullet

"Per-document Payloads – updateable"

Is making a document 'updatable' (in _some_ way) still seen as a long-term
goal for Lucene?

As far as implementation is concerned, if a stored (not indexed) field may
become updatable with LUCENE-1231, is there some difficulty with making
payloads, which as I understand it are attached to a posting of an indexed
field, updatable as well?  I guess the two ultimately equate to the same thing,
i.e. using a stored field to hold the document's "payload", but it would be an
extra field to load.

Antony





Re: Modifying a document by updating a payload?

Michael McCandless-2

Antony Bowesman wrote:

> Hi Mike,
>
>> Unfortunately you will have to delete the old doc, then reindex a  
>> new doc, in order to change any payloads in the document's Tokens.
>> This issue:
>>    https://issues.apache.org/jira/browse/LUCENE-1231
>> which is still in progress, could make updating stored (but not  
>> indexed) fields a much lower cost operation, but that's not for  
>> sure and it's not clear when that issue will be done.
>
> Michael Busch's ApacheCon (2006/7??) presentation summary included
> the bullet
>
> "Per-document Payloads – updateable"

Ahh -- this is just another name for "column-stride fields" (which is
what the issue I linked to above covers).

Normal payloads are per term occurrence, i.e., every position in the
document can have its own payload.

By contrast, "per-document payloads" means there is a single payload per
field in the document, which logically is no different from a stored
field, except that the underlying storage would be more efficient
(column-stride, where that field's value for all docs is stored
together, vs the normal row-stride used by current stored fields,
where all field values for a single document are stored together).
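
To make the row-stride vs column-stride distinction concrete, here is a small sketch in plain Java. The class names, and the fixed-width value assumption that lets `set` overwrite a value in place, are assumptions for illustration only; this is not how Lucene or LUCENE-1231 actually lays out its files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: contrasts the two layouts; not Lucene's file format. */
final class StrideSketch {

    /** Row-stride: all field values of one document are kept together. */
    static final class RowStore {
        private final List<Map<String, byte[]>> rows = new ArrayList<>();

        int add(Map<String, byte[]> fields) {
            rows.add(new HashMap<>(fields));
            return rows.size() - 1;
        }

        /** Reading one field means locating the whole document record first. */
        byte[] get(int doc, String field) {
            return rows.get(doc).get(field);
        }
    }

    /** Column-stride: one field's values for all docs sit in one contiguous array. */
    static final class ColumnStore {
        private final int width;            // assumed fixed value width in bytes
        private byte[] data = new byte[0];
        private int docCount;

        ColumnStore(int width) {
            if (width <= 0) throw new IllegalArgumentException("width must be positive");
            this.width = width;
        }

        int add(byte[] value) {
            checkWidth(value);
            data = Arrays.copyOf(data, (docCount + 1) * width); // simple, not efficient
            System.arraycopy(value, 0, data, docCount * width, width);
            return docCount++;
        }

        byte[] get(int doc) {
            checkDoc(doc);
            return Arrays.copyOfRange(data, doc * width, (doc + 1) * width);
        }

        /** Fixed width lets one doc's value be overwritten without touching others. */
        void set(int doc, byte[] value) {
            checkDoc(doc);
            checkWidth(value);
            System.arraycopy(value, 0, data, doc * width, width);
        }

        private void checkDoc(int doc) {
            if (doc < 0 || doc >= docCount) throw new IndexOutOfBoundsException("doc " + doc);
        }

        private void checkWidth(byte[] value) {
            if (value.length != width) throw new IllegalArgumentException("value must be " + width + " bytes");
        }
    }
}
```

In the column layout, the value for doc `n` lives at a computable offset (`n * width`), which is one reason a per-document payload stored this way looks like a cheaper thing to update than a field buried inside a per-document record.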

> Is making a document 'updatable' (in _some_ way) still seen as a
> long-term goal for Lucene?

I would say it is a goal in that there is a lot of interest and
discussion around how to do this.  I think LUCENE-1231 is the most
concrete recent effort and the most likely to be the first path that
makes updating documents possible.

> As far as implementation is concerned, if a stored (not indexed)
> field may become updatable with LUCENE-1231, is there some
> difficulty with making payloads, which as I understand it are
> attached to a posting of an indexed field, updatable as well?  I
> guess the two ultimately equate to the same thing, i.e. using a
> stored field to hold the document's "payload", but it would be an
> extra field to load.

Updating the postings lists (frq/prx and payloads) is unfortunately
quite a bit trickier than updating column-stride or row-stride stored
fields.

I think the approach we need to eventually take is to allow "patches"
onto a segment's postings lists.

For example, segment _X would have the original large _X.frq/prx but
could then have, say, _X_1.frq/prx, a much smaller file containing
postings for those docs that have been updated since the segment was
originally created.  If more docs are updated, that would produce
_X_2.frq/prx, etc.

IndexReaders would then need to hold open all of these postings and
dynamically "apply" the patch such that a doc's postings are iterated
from the newest frq/prx file that it exists in.  Calling optimize() or
a partial optimize() would then coalesce these files back into one (or
maybe a few) frq/prx files.
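
The patch idea can be sketched with a toy in-memory model in plain Java. `PatchedPostings`, `writeGeneration`, and `coalesce` are hypothetical names; each generation stands in for one frq/prx file pair, and only term frequencies are modeled (no positions or payloads). This is a sketch of the thinking above under those assumptions, not an existing Lucene feature.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy model of a segment whose postings can be patched by newer generations. */
final class PatchedPostings {

    /** One generation: the docs it covers and, per term, docId -> term frequency. */
    private static final class Generation {
        final Set<Integer> docs = new HashSet<>();
        final Map<String, TreeMap<Integer, Integer>> freqs = new HashMap<>();
    }

    private final List<Generation> generations = new ArrayList<>(); // oldest first

    /** Writes a new generation with complete postings for the given docs. */
    void writeGeneration(Map<Integer, Map<String, Integer>> docTerms) {
        Generation g = new Generation();
        for (Map.Entry<Integer, Map<String, Integer>> doc : docTerms.entrySet()) {
            g.docs.add(doc.getKey());
            for (Map.Entry<String, Integer> tf : doc.getValue().entrySet()) {
                g.freqs.computeIfAbsent(tf.getKey(), t -> new TreeMap<>())
                       .put(doc.getKey(), tf.getValue());
            }
        }
        generations.add(g);
    }

    /** Postings for a term, taking each doc from the newest generation containing it. */
    SortedMap<Integer, Integer> postings(String term) {
        TreeMap<Integer, Integer> result = new TreeMap<>();
        Set<Integer> superseded = new HashSet<>();
        for (int i = generations.size() - 1; i >= 0; i--) {   // newest first
            Generation g = generations.get(i);
            TreeMap<Integer, Integer> forTerm = g.freqs.get(term);
            if (forTerm != null) {
                for (Map.Entry<Integer, Integer> e : forTerm.entrySet()) {
                    if (!superseded.contains(e.getKey())) {
                        result.put(e.getKey(), e.getValue());
                    }
                }
            }
            superseded.addAll(g.docs); // older generations are stale for these docs
        }
        return result;
    }

    /** Folds all generations into one, as an optimize-style merge would. */
    void coalesce() {
        Generation merged = new Generation();
        Set<String> terms = new HashSet<>();
        for (Generation g : generations) {
            terms.addAll(g.freqs.keySet());
            merged.docs.addAll(g.docs);
        }
        for (String term : terms) {
            TreeMap<Integer, Integer> p = new TreeMap<>(postings(term));
            if (!p.isEmpty()) merged.freqs.put(term, p);
        }
        generations.clear();
        generations.add(merged);
    }

    int generationCount() { return generations.size(); }
}
```

Note that a doc covered by a newer generation is hidden in older generations for every term, including terms the updated doc no longer contains. The reader pays for that bookkeeping on each lookup until `coalesce` folds the generations back together.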

At least that's my current thinking on how we would approach updating  
postings... but realistically these are just thoughts and are quite a  
ways off from becoming a reality!

Mike