TimestampUpdateProcessorFactory updates the field even if the value is present

gnandre
Hi,

The following is my update request processor chain:

<updateRequestProcessorChain name="DefaultProcessorChain" default="true">
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">index_time_stamp_create</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

And here is how the field is defined in schema.xml:

<field name="index_time_stamp_create" type="date" indexed="true" stored="true" />

Every time I index the same document, the above field changes its value to
the latest timestamp. According to the TimestampUpdateProcessorFactory javadoc
page, if a document does not contain a value in the timestamp field, a new
Date will be generated and added as the value of that field. After the
first indexing this document should always have a value, so why does it
get updated later?

I am using Solr Admin UI's Documents tab to index the document for testing.
I am using Solr 6.3 in master-slave architecture mode.

kamaci
Hi,

How do you index that document? Do you index it with an empty
*index_time_stamp_create* field the second time too?

Kind Regards,
Furkan KAMACI

On Fri, May 22, 2020 at 12:05 AM gnandre <[hidden email]> wrote:

gnandre
Hi,

I do not pass that field at all.

Here is the document that I index again and again to test through Solr
Admin UI.
{
  asset_id:"x:1",
  title:"x"
}

On Thu, May 21, 2020 at 5:25 PM Furkan KAMACI <[hidden email]>
wrote:

kamaci
Hi,

Do you have an id field for your documents? Also, does your
document count increase when you index it again?

Kind Regards,
Furkan KAMACI

On Fri, May 22, 2020 at 1:03 AM gnandre <[hidden email]> wrote:

Chris Hostetter-3
In reply to this post by gnandre

Based on the wording of your question, I suspect you are confused about
the overall behavior of how "updating" an existing document works in Solr,
and how update processors "see" an *input document* when processing an
add/update command.


First off, completely ignoring TimestampUpdateProcessorFactory and
assuming just the simplest possible update chain, let's clarify how
"updates" work. Let's assume that when you say you "index the same
document" twice, you do so with a few different field values ...

First Time...

{  id:"x",  title:"xxxx" }

Second time...

{  id:"x",  body:"xxxx xxxx xxxx xxxx xxxx xxxx xxx" }

Solr does not implicitly know that you are trying to *update* that
document. The final result will not be a document containing both a
"title" field and a "body" field in addition to the "id"; it will *only*
have the "id" and "body" fields, and the title field will be lost.

The way to "update" a document *and keep existing field values* is with
one of the "Atomic Update" command options...

https://lucene.apache.org/solr/guide/8_4/updating-parts-of-documents.html#UpdatingPartsofDocuments-AtomicUpdates

First time...

{  id:"x",  title:"xxxx" }

Second time...

{  id:"x",  body: { set: "xxxx xxxx xxxx xxxx xxxx xxxx xxx" } }
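
The difference between the two cases above can be sketched with a toy
in-memory model. This is only an illustration in Python, not Solr code: the
`index` dict and the `add` and `atomic_update` helpers are made-up names,
and only the "set" modifier is modeled.

```python
# Toy in-memory "index" keyed by id. This is NOT Solr code, only a model.
index = {}


def add(doc):
    """Plain add: the incoming document replaces any stored document with the same id."""
    index[doc["id"]] = dict(doc)


def atomic_update(doc):
    """Model only the atomic "set" modifier: merge the change into the stored document."""
    stored = dict(index.get(doc["id"], {}))
    for field, value in doc.items():
        if isinstance(value, dict) and "set" in value:
            stored[field] = value["set"]
        else:
            stored[field] = value
    index[doc["id"]] = stored


# Case 1: two plain adds. The second document replaces the first one.
add({"id": "x", "title": "xxxx"})
add({"id": "x", "body": "xxxx xxxx"})
overwritten = dict(index["x"])

# Case 2: plain add followed by an atomic "set". The title survives.
add({"id": "x", "title": "xxxx"})
atomic_update({"id": "x", "body": {"set": "xxxx xxxx"}})
merged = dict(index["x"])
```

In this model `overwritten` has no "title" field, while `merged` keeps it.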


Now, with that background info clarified: let's talk about update
processors....


The docs for TimestampUpdateProcessorFactory are referring to how it
modifies an *input* document that it receives (as part of the processor
chain). It adds the timestamp field if it's not already in the *input*
document; it doesn't know anything about whether that document is already
in the index, or if it has a value for that field in the index.


When processors like TimestampUpdateProcessorFactory (or any other
processor that modifies an *input* document) are run, they don't know if the
document you are "indexing" already exists in the index or not. Even if
you are using the "atomic update" options to set/remove/add a field value,
with the intent of preserving all other field values, the documents passed
down the processor chain don't include those values until the "document
merger" logic is run -- as part of the DistributedUpdateProcessor (which,
if not explicit in your chain, happens immediately before the
RunUpdateProcessorFactory).
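
As a rough illustration of the point about *input* documents, here is a toy
Python model. It is not Solr code: `timestamp_processor`, `index_doc`, and
the sample dates are hypothetical. The processor only looks at the document
it is handed, so re-sending a document without the field stamps it again,
while a client that sends the field itself keeps its value.

```python
from datetime import datetime, timezone

FIELD = "index_time_stamp_create"
index = {}  # toy stand-in for the stored documents, keyed by asset_id


def timestamp_processor(input_doc, field, now):
    """Add `field` only if the *input* document lacks it. The index is never consulted."""
    doc = dict(input_doc)
    if field not in doc:
        doc[field] = now
    return doc


def index_doc(input_doc, now):
    """Run the toy processor, then overwrite whatever is stored under the same key."""
    doc = timestamp_processor(input_doc, FIELD, now)
    index[doc["asset_id"]] = doc
    return doc


# Hypothetical times of two indexing requests.
t1 = datetime(2020, 5, 21, 12, 0, tzinfo=timezone.utc)
t2 = datetime(2020, 5, 22, 12, 0, tzinfo=timezone.utc)

first = index_doc({"asset_id": "x:1", "title": "x"}, t1)
second = index_doc({"asset_id": "x:1", "title": "x"}, t2)  # same doc, field omitted again
kept = index_doc({"asset_id": "x:1", "title": "x", FIELD: t1}, t2)  # client sends the field
```

Here `second` carries the later time `t2`, which matches the behavior
reported at the top of the thread, and `kept` still carries `t1`.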

Off the top of my head I don't know if there is an "easy" way to have a
timestamp added to "new" documents, but left "as is" for existing
documents.

Untested idea: explicitly configure DistributedUpdateProcessorFactory,
so that (in addition to putting TimestampUpdateProcessorFactory before it)
you can also put MinFieldValueUpdateProcessorFactory on the timestamp field
*after* DistributedUpdateProcessorFactory (but before
RunUpdateProcessorFactory).

I think that would work?

Just putting TimestampUpdateProcessorFactory after the
DistributedUpdateProcessorFactory would be dangerous, because it would
introduce discrepancies -- each replica would wind up with its own
locally computed timestamp. Having the timestamp generated before the
distributed update processor ensures the value is computed only once.

-Hoss
http://www.lucidworks.com/

gnandre
Thanks for the detailed response, Chris. I am aware of the partial (atomic)
updates. Thanks for clarifying the confusion about input document vs
indexed document. I was thinking that TimestampUpdateProcessorFactory
checks whether the value exists in the field inside the indexed document
before updating it, but actually it checks whether it is present in the
input request. But then why do we require an explicit processor for that?
This can be done with a simple field in the schema that has a default value
of NOW.

I tried your idea about MinFieldValueUpdateProcessorFactory but it does not
work. Here is the configuration:

<updateRequestProcessorChain name="DefaultProcessorChain" default="true">
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">index_time_stamp_create</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.MinFieldValueUpdateProcessorFactory">
    <str name="fieldName">index_time_stamp_create</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

I think MinFieldValueUpdateProcessorFactory keeps the min value in a
multivalued field, which index_time_stamp_create is not.
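
To make that reading concrete, here is a small Python sketch of what I
understand the behavior to be. It is a toy model with a made-up function
name and sample dates, not the actual Solr implementation: the minimum is
taken over the values found in the one document being processed, so a
field holding a single value passes through unchanged.

```python
def min_field_value(doc, field):
    """If `field` holds several values in this one document, keep only the smallest."""
    out = dict(doc)
    value = out.get(field)
    if isinstance(value, list) and len(value) > 1:
        out[field] = min(value)
    return out


# ISO-8601 UTC strings of equal length sort chronologically as plain strings.
single = min_field_value({"id": "x", "ts": "2020-05-27T00:00:00Z"}, "ts")
multi = min_field_value(
    {"id": "x", "ts": ["2020-05-27T00:00:00Z", "2020-05-21T00:00:00Z"]}, "ts"
)
```

With only the freshly generated timestamp in the document (the `single`
case), there is nothing older to pick, so the new value is what gets
indexed.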

On Tue, May 26, 2020 at 2:31 PM Chris Hostetter <[hidden email]>
wrote:

Erick Erickson
When is “NOW” ;) ? The process for updating a doc in SolrCloud is:

1> the doc is received by some solr node.

2> the doc is forwarded to the shard leader if necessary.

3> the doc is distributed from the shard leader to all replicas of that shard.

4> the doc is indexed on each replica.

So just using NOW as the default value, the timestamp would be assigned in
step <4> and would almost certainly be different on the different replicas of
the single shard for any number of reasons from the servers not being
exactly in sync to propagation delays to replica N happening to hit a GC pause
to….

The update processor factory assigns the timestamp once on the leader so
it’s the same on all copies of the doc, assuming it is in the chain before
DistributedUpdateProcessorFactory.

So with a single-replica (leader only) setup,  or non-cloud setups, the two
would produce near enough to identical results. But if there are multiple replicas
you have to use the factory.
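
The difference can be sketched with a toy Python model. This is not Solr
code; the function names and the per-replica clock readings are made up
for illustration.

```python
def index_with_default_now(doc, replica_clocks):
    """Schema default NOW: each replica fills the field from its own clock at index time."""
    return [dict(doc, ts=clock) for clock in replica_clocks]


def index_with_processor(doc, leader_clock, replica_count):
    """Processor before distribution: stamp once, then send identical copies to replicas."""
    stamped = dict(doc, ts=leader_clock)
    return [dict(stamped) for _ in range(replica_count)]


# Hypothetical clock readings (in seconds) at the moment each replica indexes the doc.
clocks = [1000.000, 1000.012, 1000.250]

by_default = index_with_default_now({"id": "x"}, clocks)
by_processor = index_with_processor({"id": "x"}, clocks[0], len(clocks))

distinct_default = {d["ts"] for d in by_default}
distinct_processor = {d["ts"] for d in by_processor}
```

In this model the default-NOW copies end up with three different
timestamps, while the processor copies all share one.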

Hmm, I suppose if you are using TLOG/PULL replicas it wouldn’t matter which
approach you used insofar as the doc on each replica would have the
same timestamp.

Best,
Erick

> On May 27, 2020, at 3:49 PM, gnandre <[hidden email]> wrote: