Solr 7.0.1 Duplicate document appearing in search results

Adam Walz
In my solr schema I have set a uniqueKey of "id" where the id field is a
solr.StrField. When querying with this field as a filter I would expect to
always get 1 or 0 documents as a result. However I am getting back multiple
documents with the same "id" field, but different internal `docid`s. This
problem is intermittent and seems to resolve itself when the document is
updated. This is happening on solr 7.0.1 without SolrCloud and while only
querying a single shard without routing.

Any thoughts on what could be causing this behavior? This is a very large
single shard with 300 million documents and an index size of 750GB. I know
that is not recommended for a single shard, but could it explain these
duplicate results possibly because of the time it takes to commit, merge,
or something with tlogs?

-- Query --
http://solr:8983/solr/filesearch/select?fl=id,[docid],score&fq=id:file_382506116&q=*:*
-- Response --

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "mm":" 1<-0% ",
      "q.alt":"*:*",
      "ps":"100",
      "echoParams":"all",
      "fl":"id,[docid],score",
      "fq":"id:file_413041895994",
      "sort":"score desc",
      "rows":"35",
      "version":"2.2",
      "q":"*:*",
      "tie":"0.01",
      "defType":"edismax",
      "qf":"id name_combined^10 name_zh-cn^10 name_shingle name_shingle_zh-cn name_token^60 description file_content_en file_content_fr file_content_de file_content_it file_content_es file_content_zh-cn user_name user_email comments tags",
      "pf":"description name_shingle^100 name_shingle_zh-cn^100 comments tags",
      "wt":"json",
      "debugQuery":"off"}},
  "response":{"numFound":2,"start":0,"maxScore":1.0,"docs":[
      {
        "id":"file_382506116",
        "[docid]":346266675,
        "score":1.0},
      {
        "id":"file_382506116",
        "[docid]":170442733,
        "score":1.0}]
  }}


-- Schema snippet --
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
</fields>
<uniqueKey>id</uniqueKey>

--
Adam Walz
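
A minimal sketch of how the check described above could be automated: build the same kind of filter query for one id and report any id that comes back with more than one internal docid. The helper names `build_select_url` and `find_duplicate_ids` are hypothetical, the base URL is the one from the post, and the sample response is a trimmed copy of the one posted. An id containing query-syntax characters would need escaping, which this sketch does not do.

```python
from collections import defaultdict
from urllib.parse import urlencode


def build_select_url(base_url: str, doc_id: str) -> str:
    """Build a /select URL that filters on a single uniqueKey value."""
    params = {
        "q": "*:*",
        "fq": f"id:{doc_id}",
        "fl": "id,[docid],score",
        "wt": "json",
    }
    # urlencode percent-encodes the brackets in [docid] and the colon in fq.
    return f"{base_url}/select?{urlencode(params)}"


def find_duplicate_ids(response: dict) -> dict:
    """Map each id that appears more than once to its internal docids."""
    docids_by_id = defaultdict(list)
    for doc in response["response"]["docs"]:
        docids_by_id[doc["id"]].append(doc["[docid]"])
    return {k: v for k, v in docids_by_id.items() if len(v) > 1}


# Trimmed copy of the response posted above (responseHeader omitted).
sample = {
    "response": {
        "numFound": 2,
        "start": 0,
        "maxScore": 1.0,
        "docs": [
            {"id": "file_382506116", "[docid]": 346266675, "score": 1.0},
            {"id": "file_382506116", "[docid]": 170442733, "score": 1.0},
        ],
    }
}

url = build_select_url("http://solr:8983/solr/filesearch", "file_382506116")
duplicates = find_duplicate_ids(sample)
print(url)
print(duplicates)
```

Fetching `url` with any HTTP client and passing the parsed JSON body to `find_duplicate_ids` would give the same kind of report against a live core.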

Re: Solr 7.0.1 Duplicate document appearing in search results

Erick Erickson
This is indeed strange. First of all, forget about explanations that involve the transaction log etc. When Lucene opens a searcher, it only sees closed segments; the tlog has nothing to do with that.

Have you ever merged indexes? The MapReduceIndexerTool, if you ever used it, does not de-duplicate. Ditto if you ever changed the <uniqueKey>. The fact that you say that this clears up when you re-index the document leads me to wonder whether you have manipulated the index outside the normal Solr framework.

IOW, I’ve never seen this before, so I suspect there’s something you did in your setup that seemed innocent at the time that led to this (temporary) situation.

Best,
Erick



Re: Solr 7.0.1 Duplicate document appearing in search results

Adam Walz
Thanks Erick,

We've never merged indexes. We don't use the MapReduceIndexerTool, but do
use an external map reduce process to reindex. To reindex from an empty
state we have a map reduce job which runs on a separate HBase cluster and
indexes into this shard. During this job each mapper is concurrently making
http update requests to the shard, but only 1 mapper should post a document
per unique "id".

Reindexing from scratch is done roughly every 3 months. In between that
time we have a worker external to solr which reads from an event stream and
posts http updates to the solr cluster.

The <uniqueKey> has never been updated to my knowledge, but if it has, it
definitely wasn't updated in the last 3 months since the last reindexing.

Also since the last reindexing nothing in the solrconfig.xml or
managed-schema has been updated, nor has the index been manipulated outside
of the solr framework.
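
One way the "only 1 mapper per id" invariant could be enforced is to assign each id to a mapper from a stable hash, sketched below. This is an assumption for illustration, not a description of the job above; `mapper_for` and `assign` are hypothetical helpers, and the ids and mapper count are made up.

```python
import hashlib
from collections import defaultdict


def mapper_for(doc_id: str, num_mappers: int) -> int:
    """Pick a mapper from a stable hash of the id (hypothetical helper)."""
    # hashlib is used instead of hash(), which is randomized per process.
    digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_mappers


def assign(doc_ids, num_mappers):
    """Group ids by mapper; a repeated id always lands in the same group."""
    groups = defaultdict(list)
    for doc_id in doc_ids:
        groups[mapper_for(doc_id, num_mappers)].append(doc_id)
    return dict(groups)


groups = assign(["file_1", "file_2", "file_1", "file_3"], num_mappers=4)
owners = {m for m, ids in groups.items() if "file_1" in ids}
print(len(owners))  # 1: both copies of file_1 go to the same mapper
```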


Re: Solr 7.0.1 Duplicate document appearing in search results

Erick Erickson

> On May 14, 2019, at 7:46 PM, Adam Walz <[hidden email]> wrote:
>
> but do
> use an external map reduce process to reindex

Here’s where I’d look then. Not knowing any details of your process, this may be totally wrong, of course.

If there’s any step that performs a MERGEINDEXES operation, _and_ somehow the same <uniqueKey> got indexed to the sub-indexes that are being merged, then no deduplication takes place and you will have multiple docs with the same <uniqueKey>. I strongly suspect that that, or something similar, is happening. That’s how MapReduceIndexerTool operated: N sub-indexes were produced totally independently, and then a MERGEINDEXES operation happened on a per-shard basis.
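
A rough sketch of a pre-merge check for the situation described above: given the ids written to each sub-index, list every id that shows up in more than one of them. `ids_in_multiple_subindexes` is a hypothetical helper and the ids are made up; for an index of this size the ids would have to be streamed or sampled rather than held in memory.

```python
from collections import defaultdict


def ids_in_multiple_subindexes(sub_indexes):
    """Return {id: [sub-index numbers]} for ids found in more than one sub-index."""
    seen = defaultdict(set)
    for number, ids in enumerate(sub_indexes):
        for doc_id in ids:
            seen[doc_id].add(number)
    return {doc_id: sorted(ns) for doc_id, ns in seen.items() if len(ns) > 1}


# Three hypothetical sub-indexes; file_2 was written to two of them.
sub_indexes = [
    ["file_1", "file_2"],
    ["file_3"],
    ["file_2", "file_4"],
]
overlap = ids_in_multiple_subindexes(sub_indexes)
print(overlap)  # {'file_2': [0, 2]}
```

Any id reported here would survive a merge as two separate documents, since the merge itself does not compare keys.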

Or something unexpected like there being no <uniqueKey> defined in the schema somehow.

I have never heard of Solr failing to remove old documents when a new one with the same ID is being indexed without something like the above being the problem.

One bit of background: Lucene has no notion of <uniqueKey>; that is totally a Solr construct and is up to Solr to enforce. So anything that bypasses Solr could produce this…
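
To illustrate the point, here is a toy model (not Lucene code) of the difference between a blind add and a delete-then-add update. Lucene's `IndexWriter` exposes both `addDocument` and `updateDocument`; the `ToyIndex` class below is a hypothetical stand-in that only mimics that distinction, with list positions playing the role of internal docids.

```python
class ToyIndex:
    """Append-only document list plus a set of deleted positions."""

    def __init__(self):
        self.docs = []        # list position stands in for the internal docid
        self.deleted = set()  # positions marked as deleted

    def add_document(self, doc):
        """Blind append: nothing checks whether the key already exists."""
        self.docs.append(doc)
        return len(self.docs) - 1

    def update_document(self, key_field, doc):
        """Mark every live doc with the same key as deleted, then append."""
        for docid, old in enumerate(self.docs):
            if docid not in self.deleted and old[key_field] == doc[key_field]:
                self.deleted.add(docid)
        return self.add_document(doc)

    def search(self, field, value):
        """Return the docids of live documents whose field equals value."""
        return [
            docid
            for docid, doc in enumerate(self.docs)
            if docid not in self.deleted and doc[field] == value
        ]


index = ToyIndex()
index.add_document({"id": "file_382506116"})
index.add_document({"id": "file_382506116"})   # bypasses the uniqueness check
print(index.search("id", "file_382506116"))    # [0, 1]

index.update_document("id", {"id": "file_382506116"})
print(index.search("id", "file_382506116"))    # [2]
```

In this model the duplicates disappear as soon as the document goes through the update path, which is consistent with the observation that the problem clears up when the document is re-indexed.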

FWIW,
Erick



Re: Solr 7.0.1 Duplicate document appearing in search results

Erick Erickson


> On May 15, 2019, at 10:53 AM, Erick Erickson <[hidden email]> wrote:
>
> Or something unexpected like there being no <uniqueKey> defined in the schema somehow.

Meant to say that somehow the schemas used during your process weren’t what you thought they were and “somehow” didn’t have a <uniqueKey> defined.

That would require the same doc to be indexed twice of course.

And this also assumes that your process does something like use EmbeddedSolrServer to index docs. If it uses Lucene directly, then your process is responsible for handling duplicate <uniqueKeys> properly at the Lucene level.

Best,
Erick