UUIDUpdateProcessorFactory can cause duplicate documents?

S G
Hi,

Is it correct to assume that UUIDUpdateProcessorFactory will produce 2
documents if the same document is indexed twice without the "id" field?

And to avoid that, can we use the technique mentioned in
https://wiki.apache.org/solr/Deduplication ?

Thanks
SG
Re: UUIDUpdateProcessorFactory can cause duplicate documents?

Aman Tandon
Hi,

Suppose the id field is the one configured for UUID generation. If it is
missing from a document being indexed, the processor will generate a UUID
and set it in the id field. However, if the id field is already present
with some value, it shouldn't generate one.

Kindly refer to
http://lucene.apache.org/solr/5_5_0/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html


Re: UUIDUpdateProcessorFactory can cause duplicate documents?

Erick Erickson
First, your assumption is correct. It would be A Bad Thing if two
identical UUIDs were generated....

Is this SolrCloud? If so, then the deduplication idea won't work. The
problem is that the uuid is used for routing and there is a decent (1
- 1/numShards) chance that the two "identical" docs would land on
different shards. Deduplication at the hash level is local to the
replica.
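
To put a number on that estimate, here is a small sketch. It assumes the
router spreads random UUIDs uniformly and independently across shards, which
is an assumption made for illustration; `prob_different_shards` is a
hypothetical helper, not a Solr API.

```python
from fractions import Fraction


def prob_different_shards(num_shards: int) -> Fraction:
    """Chance that two independently, uniformly routed docs land on different shards.

    Assumption: every shard is equally likely for each doc.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # The second doc hits the first doc's shard with probability 1/num_shards,
    # so it misses that shard with probability 1 - 1/num_shards.
    return 1 - Fraction(1, num_shards)


for n in (1, 2, 4, 8):
    print(n, float(prob_different_shards(n)))
```

With a single shard the chance is zero, and it grows toward 1 as shards are
added, which is why a replica-local check gets less useful as the collection
is split further.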

But why not make the hash of the doc's content the "id" field? Your
ETL process would generate the hash and stuff it into the "id" field.
Then in both SolrCloud and stand-alone it would "just work".
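
A minimal sketch of what such an ETL step might look like, assuming the
documents are flat JSON-serializable dicts and that SHA-256 over a canonical
JSON form is an acceptable hash. `content_id` and `with_content_id` are
hypothetical helper names, and nothing here is Solr code.

```python
import hashlib
import json


def content_id(doc: dict) -> str:
    """Derive a deterministic id from the document's content (hypothetical helper)."""
    # Leave out any existing "id" so hashing an already-tagged doc is stable.
    body = {k: v for k, v in doc.items() if k != "id"}
    # sort_keys plus fixed separators give one canonical string per content,
    # so the order of fields in the input does not change the hash.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def with_content_id(doc: dict) -> dict:
    """Return a copy of doc with "id" set to its content hash."""
    return {**doc, "id": content_id(doc)}


a = with_content_id({"color": "red", "size": "L"})
b = with_content_id({"size": "L", "color": "red"})
print(a["id"] == b["id"])
```

Because the id is computed before the document reaches Solr, two copies of
the same content carry the same uniqueKey and the second one overwrites the
first, wherever it is routed.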

Best,
Erick

Re: UUIDUpdateProcessorFactory can cause duplicate documents?

S G
We do not want to generate the "id" ourselves and hence were looking for
something that would generate the "id" automatically.

UUIDUpdateProcessorFactory documentation says nothing about the
automatic "id" generation process identifying whether the document received
is the same as an existing document.

That means if I send {"color":"red", "size":"L"} once,
UUIDUpdateProcessorFactory will generate an "id" X, and if I send the same
document {"color":"red", "size":"L"} again, UUIDUpdateProcessorFactory will
not know that it's the same document and will generate an "id" Y.

That way I will end up with two documents:
{"id": X, "color":"red", "size":"L"}
{"id": Y, "color":"red", "size":"L"}
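
The scenario can be imitated outside Solr with a few lines of Python. This
is only a sketch of the behaviour described in this thread (a random UUID is
added when "id" is missing); `assign_random_id` is a hypothetical stand-in,
not the processor's actual code.

```python
import uuid


def assign_random_id(doc: dict) -> dict:
    """Add a random UUID as "id" only when the doc has none (illustrative stand-in)."""
    if "id" in doc:
        # An id supplied by the client is left untouched.
        return dict(doc)
    return {**doc, "id": str(uuid.uuid4())}


first = assign_random_id({"color": "red", "size": "L"})
second = assign_random_id({"color": "red", "size": "L"})

# Same content, but the two ids are generated independently of it.
print(first["id"], second["id"])
```

Nothing in the id depends on the field values, so the two calls have no way
to agree on one.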

And that situation can only be avoided if I use the
https://wiki.apache.org/solr/Deduplication technique of generating an "id"
based on the signature of some other fields. That will avoid duplication and
auto-generate the "id" field too.

Is that a correct understanding?

Thanks
SG


Re: UUIDUpdateProcessorFactory can cause duplicate documents?

Shawn Heisey-2
On 6/9/2018 1:15 AM, S G wrote:

> That means if I send {"color":"red", "size":"L"} once,
> UUIDUpdateProcessorFactory will generate an "id" X, and if I send the same
> document {"color":"red", "size":"L"} again, UUIDUpdateProcessorFactory will
> not know that it's the same document and will generate an "id" Y.
>
> That way I will end up with two documents:
> {"id": X, "color":"red", "size":"L"}
> {"id": Y, "color":"red", "size":"L"}

Correct, that's exactly what will happen.  That update processor's name
makes it sound like it can be used to completely cover situations where
the source data doesn't already have a unique key.  But all it does is
randomly generate a unique ID. It won't EVER assign the same ID twice,
even if the document is absolutely identical to one that was indexed before.

> And that situation can only be avoided if I use the
> https://wiki.apache.org/solr/Deduplication technique of generating an "id"
> based on the signature of some other fields. That will avoid duplication and
> auto-generate the "id" field too.
>
> Is that a correct understanding?

The deduplication support generates a signature from the contents of the
named fields.  I haven't used this functionality, but I believe that if
you write the signature to the field designated uniqueKey in the Solr
schema, it would do everything you're hoping for.  The first complete
example on that page you referenced sets signatureField to "id", which
is typically the uniqueKey in Solr's example schemas.
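
To illustrate the idea of a signature built from named fields, here is a
client-side sketch in Python. It is not the code Solr runs and the hash
will not match anything Solr produces; the field list, the `received_at`
field, and the `signature` helper are all made up for the example.

```python
import hashlib


def signature(doc: dict, fields) -> str:
    """Hash only the named fields, in the order given (illustrative, not Solr's algorithm)."""
    h = hashlib.sha256()
    for name in fields:
        value = doc.get(name)
        if value is not None:
            # Feed the field name and a terminator too, so that values cannot
            # run together and collide across field boundaries.
            h.update(f"{name}={value}\x00".encode("utf-8"))
    return h.hexdigest()


SIGNATURE_FIELDS = ("color", "size")  # hypothetical choice of fields

doc1 = {"color": "red", "size": "L", "received_at": "2018-06-09T01:15:00Z"}
doc2 = {"color": "red", "size": "L", "received_at": "2018-06-10T09:30:00Z"}
for doc in (doc1, doc2):
    # Writing the signature into the uniqueKey field is what makes a
    # re-sent document overwrite the earlier copy.
    doc["id"] = signature(doc, SIGNATURE_FIELDS)

print(doc1["id"] == doc2["id"])
```

Fields left out of the list, such as the timestamp here, do not affect the
id, so the choice of fields decides what counts as "the same document".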

Thanks,
Shawn