Solr document duplicated during pagination

Anil-2
Hi,

I am loading Solr records for a particular query into an application cache.

Let's say the total number of eligible records (numFound) is 501.

My Solr queries would be:

page 1 : q=*:*&start=0&rows=100
page 2 : q=*:*&start=100&rows=100
page 3 : q=*:*&start=200&rows=100
page 4 : q=*:*&start=300&rows=100
page 5 : q=*:*&start=400&rows=100
page 6 : q=*:*&start=500&rows=100

I see that pages 1 and 2 have documents in common, and the same happens with other pages.
Is this the expected behavior? Please correct me if I am wrong.
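
For reference, the paging scheme above can be sketched as follows. This is only an illustration of the start/rows arithmetic; the function name page_params is hypothetical and nothing here talks to a real Solr server.

```python
def page_params(num_found, rows=100, query="*:*"):
    """Return one dict of request parameters per page (start/rows paging)."""
    return [
        {"q": query, "start": start, "rows": rows}
        for start in range(0, num_found, rows)
    ]


# 501 matching records at 100 rows per page need six requests;
# the last one (start=500) returns a single document.
pages = page_params(501)
for number, params in enumerate(pages, start=1):
    print(f"page {number}: {params}")
```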


Thanks,
Anil

Re: Solr document duplicated during pagination

Lior Sapir
It will not happen, but you must:
1. Have a unique ID for each document.
2. Make sure you define this field in schema.xml:
 <uniqueKey>YOUR_DOC_UQ_ID_FIELD_NAME</uniqueKey>
3. If you are querying multiple shards and not using SolrCloud, then
you have to make sure you are not inserting the same document into two
different shards. The uniqueness I mentioned in points 1 and 2 applies only
within a specific shard/core. There is no way for one Solr core to enforce
uniqueness on other shards/cores unless you use SolrCloud.


On Sun, Apr 10, 2016 at 2:53 PM, Anil <[hidden email]> wrote:

> I see that pages 1 and 2 have documents in common, and the same happens with other pages.
> Is this the expected behavior?

Re: Solr document duplicated during pagination

Erick Erickson
If the index is being updated while you are paging through results, this
can happen.

But what do you mean by "i see page 1 & 2 has
common documents and similarly in other pages as well"?

Is it the _same_ id (<uniqueKey>, as Lior mentions)?
Docs are "the same" to Solr if and only if they have the
same <uniqueKey>.
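
One way to check this on the client side is to collect the uniqueKey value of every document on every page and report the values that show up more than once. The sketch below assumes the uniqueKey field is called "id" and that each page has already been parsed into a list of dicts; both the field name and the helper duplicate_ids are assumptions made for illustration.

```python
from collections import defaultdict


def duplicate_ids(pages, key="id"):
    """Map each uniqueKey value seen more than once to the page numbers it appeared on."""
    seen = defaultdict(list)
    for page_number, docs in enumerate(pages, start=1):
        for doc in docs:
            seen[doc[key]].append(page_number)
    return {doc_id: nums for doc_id, nums in seen.items() if len(nums) > 1}


# Hypothetical, hand-written pages standing in for parsed Solr responses.
sample_pages = [
    [{"id": "a"}, {"id": "b"}],
    [{"id": "b"}, {"id": "c"}],
]
print(duplicate_ids(sample_pages))
```

If this reports nothing, the "duplicates" differ in their uniqueKey and are distinct documents as far as Solr is concerned.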

Best,
Erick

On Sun, Apr 10, 2016 at 6:13 AM, Lior Sapir <[hidden email]> wrote:

> It will not happen, but you must:
> 1. Have a unique ID for each document.
> 2. Make sure you define this field in schema.xml:
>  <uniqueKey>YOUR_DOC_UQ_ID_FIELD_NAME</uniqueKey>

Re: Solr document duplicated during pagination

Anil-2
Yes, Erick.

I have attached the queries extracted from the logs.

I see many duplicate records :( but I could not see any duplicates in the Solr admin console.

Each run gives a different number of duplicates.

Do you think the NOT (-) operator in the query is an issue? Please advise.

Thanks,
Anil




On 10 April 2016 at 21:28, Erick Erickson <[hidden email]> wrote:
> Is it the _same_ id (<uniqueKey>, as Lior mentions)?
> Docs are "the same" to Solr if and only if they have the
> same <uniqueKey>.

Attachment: Queries.txt (35K)

Re: Solr document duplicated during pagination

Shawn Heisey-2
On 4/13/2016 4:57 AM, Anil wrote:

> Yes, Erick.
>
> I have attached the queries extracted from the logs.
>
> I see many duplicate records :( but I could not see any duplicates in the
> Solr admin console.
>
> Each run gives a different number of duplicates.
>
> Do you think the NOT (-) operator in the query is an issue? Please advise.

There are two ways this can happen.  One is that the index has changed
between different queries, pushing or pulling results between the end of
one page and the beginning of the next page.  The other is having the
same uniqueKey value in more than one shard.

Lior Sapir indicated that SolrCloud would behave differently and
eliminate all duplicates from multiple shards, but this is *not* the
case.  Both cloud and non-cloud behave the same.  When the duplicates
are on different pages, they will not be filtered out.  Solr *will*
eliminate duplicates from all results *in the same query* ... but
different pages are different queries.
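
The first case can be illustrated without Solr at all. The sketch below is a toy model, assuming the result set is just an ordered Python list and that a new document happens to sort ahead of the existing ones; fetch_page and the document ids are made up for the example.

```python
def fetch_page(index, start, rows):
    """Mimic start/rows paging over an ordered result list."""
    return index[start:start + rows]


# Hypothetical index: ten document ids in result order.
index = [f"doc{n}" for n in range(10)]

page1 = fetch_page(index, start=0, rows=5)

# Between the two requests a new document is indexed and sorts first,
# pushing every existing result down by one position.
index.insert(0, "new-doc")

page2 = fetch_page(index, start=5, rows=5)

# The last document of page 1 reappears as the first document of page 2.
repeated = set(page1) & set(page2)
print(repeated)
```

A deletion between requests has the opposite effect: results are pulled up by one position and a document is skipped rather than repeated.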

Thanks,
Shawn