NRT replicas miss hits and return duplicate hits when paging solrcloud searches


WebsterHomer
I have an application which implements several different searches against a
SolrCloud collection.
We are using Solr 7.2 and Solr 6.1.

The collection b2b-catalog-material is created with the default Near Real
Time (NRT) replicas. The collection has 2 shards each with 2 replicas.

The application launches a search and pages through the results up to a
maximum, typically about 1000, and returns them to the caller. It pages by
the standard method of incrementing the start parameter by rows until we
reach the maximum we need or have returned all the hits. Typically we set
rows to 200.

If a search matches 2000 results, the app will call Solr 10 times to
retrieve 200 results per call. This is configurable.

The documents in the collection are product skus but the searchable fields
are mostly product oriented, and we have between 2 and 500 skus per
product. There are about 2,463,442 documents in the collection.

We need the results by relevancy so the application sorts the results by
score desc, and the unique id ascending as the tie breaker.
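
The paging scheme and sort order described above could be sketched roughly
as follows. This is an illustration only, not the application's actual
code: the function name page_requests and the unique key field name "id"
are assumptions, and each parameter dict would still have to be sent to
Solr by whatever client is in use.

```python
from typing import Dict, Iterator


def page_requests(query: str, num_found: int, rows: int = 200,
                  max_results: int = 1000) -> Iterator[Dict[str, str]]:
    """Yield one Solr parameter dict per page, up to the cap or the last hit."""
    limit = min(num_found, max_results)
    for start in range(0, limit, rows):
        yield {
            "q": query,
            # "id" stands in for the collection's unique key field (assumed name).
            "sort": "score desc,id asc",
            "start": str(start),
            # The final page may be shorter than a full page.
            "rows": str(min(rows, limit - start)),
        }
```

For a query matching 2000 documents with rows=200 and a cap of 2000, this
yields 10 requests with start values 0, 200, ..., 1800.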

We discovered that the application often returns duplicate records from a
search. I believe that this is due to the NRT replicas having slightly
different index data due to commit orders and different numbers of deleted
records. For many queries we see about 20 to 30 results duplicated. The
results from Solr are sent to another system to retrieve pricing
information. That system is not yet fully populated, so out of 1000
results we may return 350 or so. The problem is that each time we called
the application with the same query we saw different results. I saw the
count vary between 351, which was correct, and 341 or 346. I believe that
for each "duplicate" found by the application, there is also a result that
was missed.

The numFound value in the Solr query response does not vary.

This variability in the same query is unacceptable to the business. For
quite a while I thought it was in our code, or in the call to the other
system. However, we now know that it is Solr.

I created a simple test driver that calls solr and pages through the
results. It maintains a set of all the ids that we've encountered and it
will regularly find 20 or more duplicates depending upon the query.
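
A test driver of the kind described above might look like the sketch
below. It is not the original driver: fetch_page is a hypothetical
callable standing in for the real Solr request (it takes start and rows
and returns the ids on that page), and find_duplicates is an illustrative
name.

```python
from typing import Callable, List, Set, Tuple


def find_duplicates(fetch_page: Callable[[int, int], List[str]],
                    num_found: int, rows: int = 200) -> Tuple[Set[str], List[str]]:
    """Page through all hits; return (ids seen, ids returned more than once)."""
    seen: Set[str] = set()
    duplicates: List[str] = []
    for start in range(0, num_found, rows):
        for doc_id in fetch_page(start, rows):
            if doc_id in seen:
                # Same id came back on an earlier page.
                duplicates.append(doc_id)
            seen.add(doc_id)
    return seen, duplicates
```

Because numFound stays constant, num_found - len(seen) gives the number of
hits that were never returned, which is one way to check the suspicion
that every duplicate is matched by a missed result.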

Some observations:
The unique id really is unique; it is used in other systems for this data.

If we do an optimize on the collection, the duplicates won't show up until
the next data load.

I created a second collection that used the TLOG replica type, and we don't
see the problem even with repeated data loads.
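
For reference, a collection with TLOG replicas can be requested through
the Collections API CREATE action (the tlogReplicas parameter exists in
Solr 7.0 and later). The sketch below only builds the request URL; the
base URL and collection name are placeholders, the helper name is made
up, and a real call may need further parameters such as a configset name.

```python
from urllib.parse import urlencode


def create_collection_url(base_url: str, name: str, num_shards: int = 2,
                          tlog_replicas: int = 2) -> str:
    """Build a Collections API CREATE URL asking for TLOG replicas only."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        # Number of TLOG replicas to create per shard.
        "tlogReplicas": tlog_replicas,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Placeholder host and collection name, for illustration only.
url = create_collection_url("http://localhost:8983/solr", "catalog-tlog-test")
```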


The data in the collection is kept up to date by an ETL process that
completely reindexes the data once a week. That is how it will work once
in production; for now we reload it more frequently because we are testing
the app.

My boss has lost all confidence in Solrcloud. It seems that it cannot find
the same data in subsequent searches. Returning consistent results from a
search is job #1 and solrcloud is failing at that.

Using TLOG replicas seems to address the issue; it appears that you cannot
trust NRT replicas to return consistent results.

The scores for many searches are fairly flat with not a lot of variability
in them, which means that a small difference in a score can change the
order of results.

We found that upgrading our production servers to 7.2 and using TLOG
replicas worked. The alternative of optimizing after each load, while a
hack, does seem to address the problem too. However, determining when to
optimize would be difficult to automate, since we use CDCR to replicate the
data to a cloud environment and it is not easy to determine when the remote
collections are fully loaded.

The only other thing I can think of is tweaking the Lucene merge algorithm
to better remove deleted documents from the index.

Have others encountered this kind of inconsistency in solrcloud? I cannot
believe that we're the first to have encountered it.

How have you addressed it?

We have settled on using TLOG replicas as they provide consistent results
and don't return duplicate hits, which also means that there are no missing
hits.

Unless you need real time indexing, NRT replicas should be avoided in favor
of TLOG replicas or a mix of TLOG and PULL replicas.

I wrote a test program and verified that we actually have this issue with
all of our collections. We hadn't noticed it before because most of the
time the missing/duplicate results were 5 to 10 pages into the result set.

--


This message and any attachment are confidential and may be privileged or
otherwise protected from disclosure. If you are not the intended recipient,
you must not copy this message or attachment or disclose the contents to
any other person. If you have received this transmission in error, please
notify the sender immediately and delete the message and any attachment
from your system. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not accept liability for any omissions or errors in this
message which may arise as a result of E-Mail-transmission or for damages
resulting from any unauthorized changes of the content of this message and
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not guarantee that this message is free of viruses and does
not accept liability for any damages caused by any virus transmitted
therewith.

Click http://www.emdgroup.com/disclaimer to access the German, French,
Spanish and Portuguese versions of this disclaimer.

Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

Shawn Heisey-2
On 2/26/2018 10:26 AM, Webster Homer wrote:
> We need the results by relevancy so the application sorts the results by
> score desc, and the unique id ascending as the tie breaker

This is the reason for the discrepancy, and why the different replica
types don't have the same issue.

Each NRT replica can have different deleted documents than the others,
just due to the way that NRT replicas work.  Deleted documents affect
relevancy scoring.  When one replica has say 5000 deleted documents and
another has 200, or has 5000 but they're different docs, a relevancy
sort can end up different.  So when Solr goes to one replica for page 1
and another for page 2 (which is expected due to SolrCloud's internal
load balancing), you may end up with duplicate documents or documents
missing.  Because deleted documents are not counted or returned,
numFound will be consistent, as long as the index doesn't change between
the queries for pages.
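
To illustrate the mechanism, here is a toy sketch, not Solr's actual
scoring path. It uses the IDF formula from Lucene's BM25Similarity and
assumes, as described above, that deleted-but-unmerged documents still
count in the term statistics. The document ids, terms and counts are
invented, and each document's score is reduced to the IDF of the one term
it matches.

```python
import math

# Hypothetical documents, each matching a single query term.
DOC_TERMS = {"sku-1": "alpha", "sku-2": "beta"}


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """IDF in the form used by Lucene's BM25Similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def rank(doc_freqs, doc_count):
    """Sort by score descending, then id ascending as the tie breaker."""
    scores = {doc: bm25_idf(doc_freqs[term], doc_count)
              for doc, term in DOC_TERMS.items()}
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Replica A has few deleted docs; replica B still holds stale "alpha" docs.
replica_a = rank({"alpha": 100, "beta": 101}, doc_count=1_000_200)
replica_b = rank({"alpha": 103, "beta": 101}, doc_count=1_005_000)
```

Here replica_a puts sku-1 first and replica_b puts sku-2 first. If a page
boundary falls between the two documents and consecutive pages are served
by different replicas, one document can appear twice and the other not at
all.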

If you were using a deterministic sort rather than relevancy, this
wouldn't be happening, because deleted documents have no influence on
that kind of sort.

With TLOG or PULL, the replicas are absolutely identical, so there is no
difference, unless the index is changing as you page through the results.

I think changing replica types is the only solution here.  NRT replicas
are working as they were designed -- there's no bug, even though
problems like this do sometimes turn up.

Thanks,
Shawn


Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

WebsterHomer
Thanks Shawn, I had settled on this as a solution.

All our use cases for Solr are to return results in order of relevancy to
the query, so having a deterministic sort would defeat that purpose. Since
we wanted to be able to return all the results for a query, I originally
looked at using the Streaming API, but that doesn't support returning
results sorted by relevancy.

I disagree with you about NRT replicas though. They may function as
designed, but since they cannot guarantee consistent results their design
is buggy, at least it is for a search engine.


Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

Erick Erickson
Did you try enabling distributed IDF (statsCache)? See:
https://lucene.apache.org/solr/guide/6_6/distributed-requests.html

It may not totally fix the issue, but it's worth trying. It does
come with a performance penalty of course.
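
For readers who have not used it: per the reference guide page linked
above, distributed IDF is switched on with a statsCache element in
solrconfig.xml. The fragment below is a minimal sketch using the
ExactStatsCache implementation named there; whether it removes the
replica-to-replica differences discussed in this thread is not something
this fragment demonstrates.

```xml
<!-- Inside the <config> element of solrconfig.xml -->
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
```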

Best,
Erick

Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

WebsterHomer
Erick,

No, we didn't look at that. I will add it to the list. We have not seen
performance issues with Solr. We have much slower technologies in our
stack. This project was to replace a system that was too slow.

Thank you, I will look into it.

Webster

Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

Emir Arnautović
Hi Webster,
Since you are returning all hits, returning the last page is almost as heavy for Solr as returning all documents. Maybe you should consider just returning one large page and avoiding this issue completely.
I agree with you that this should be handled by Solr. ES solved this issue with the “preference” search parameter: you can set a session id as the preference and it will stick to the same shards. I guess you could try a similar thing on your own, but that would require you to send the list of shards as a parameter for your search and balance it across different sessions.
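
The do-it-yourself variant could be sketched like this: hash the session
id to pick one replica per shard, then pass the chosen core addresses in
Solr's shards request parameter so every page of that session hits the
same replicas. The function name, the replica addresses and the idea of
keeping the replica list in a dict are all assumptions made for
illustration.

```python
import hashlib
from typing import Dict, List


def sticky_shards_param(session_id: str,
                        replicas_by_shard: Dict[str, List[str]]) -> str:
    """Pick one replica per shard, always the same one for a given session."""
    # hashlib gives a digest that is stable across processes, unlike hash().
    digest = int(hashlib.sha256(session_id.encode("utf-8")).hexdigest(), 16)
    chosen = []
    for shard in sorted(replicas_by_shard):
        replicas = sorted(replicas_by_shard[shard])
        chosen.append(replicas[digest % len(replicas)])
    # Comma-separated value for the "shards" request parameter.
    return ",".join(chosen)
```

Different session ids spread across the replicas, while a single session
keeps seeing one consistent view of each shard. The replica list would
have to be refreshed when the cluster layout changes.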

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

WebsterHomer
Emir,

Using tlog replica types addresses my immediate problem.

The secondary issue is that all of our searches show inconsistent results.
These are all normal paging use cases. We regularly test our relevancy, and
these differences create confusion among the testers. Moreover, we are
migrating from Endeca, which has very consistent results.

I'm hoping that using the global stats cache will make the other searches
more stable. I think we will eventually move to favoring TLOG replicas. We
have a couple of collections where NRT makes sense, but those collections
don't need to return data in relevancy order. I think NRT should be
considered a niche use case; TLOG and PULL replicas are a much better fit
for a search engine (imho).

On Tue, Feb 27, 2018 at 4:01 AM, Emir Arnautović <
[hidden email]> wrote:

> Hi Webster,
> Since you are returning all hits, returning the last page is almost as
> heavy for Solr as returning all documents. Maybe you should consider just
> returning one large page and completely avoid this issue.
> I agree with you that this should be handled by Solr. ES solved this issue
> with “preference” search parameter where you can set session id as
> preference and it will stick to the same shards. I guess you could try
> similar thing on your own but that would require you to send list of shards
> as parameter for your search and balance it for different sessions.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 26 Feb 2018, at 21:03, Webster Homer <[hidden email]> wrote:
> >
> > Erick,
> >
> > No we didn't look at that. I will add it to the list. We have  not seen
> > performance issues with solr. We have much slower technologies in our
> > stack. This project was to replace a system that was too slow.
> >
> > Thank you, I will look into it
> >
> > Webster
> >
> > On Mon, Feb 26, 2018 at 1:13 PM, Erick Erickson <[hidden email]
> >
> > wrote:
> >
> >> Did you try enabling distributed IDF (statsCache)? See:
> >> https://lucene.apache.org/solr/guide/6_6/distributed-requests.html
> >>
> >> It's may not totally fix the issue, but it's worth trying. It does
> >> come with a performance penalty of course.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Feb 26, 2018 at 11:00 AM, Webster Homer <[hidden email]
> >
> >> wrote:
> >>> Thanks Shawn, I had settled on this as a solution.
> >>>
> >>> All our use cases for Solr is to return results in order of relevancy
> to
> >>> the query, so having a deterministic sort would defeat that purpose.
> >> Since
> >>> we wanted to be able to return all the results for a query, I
> originally
> >>> looked at using the Streaming API, but that doesn't support returning
> >>> results sorted by relevancy
> >>>
> >>> I disagree with you about NRT replicas though. They may function as
> >>> designed, but since they cannot guarantee consistent results their
> design
> >>> is buggy, at least it is for a search engine.
> >>>
> >>>
> >>> On Mon, Feb 26, 2018 at 12:20 PM, Shawn Heisey <[hidden email]>
> >> wrote:
> >>>
> >>>> On 2/26/2018 10:26 AM, Webster Homer wrote:
> >>>>> We need the results by relevancy so the application sorts the results
> >> by
> >>>>> score desc, and the unique id ascending as the tie breaker
> >>>>
> >>>> This is the reason for the discrepancy, and why the different replica
> >>>> types don't have the same issue.
> >>>>
> >>>> Each NRT replica can have different deleted documents than the others,
> >>>> just due to the way that NRT replicas work.  Deleted documents affect
> >>>> relevancy scoring.  When one replica has say 5000 deleted documents
> and
> >>>> another has 200, or has 5000 but they're different docs, a relevancy
> >>>> sort can end up different.  So when Solr goes to one replica for page
> 1
> >>>> and another for page 2 (which is expected due to SolrCloud's internal
> >>>> load balancing), you may end up with duplicate documents or documents
> >>>> missing.  Because deleted documents are not counted or returned,
> >>>> numFound will be consistent, as long as the index doesn't change
> between
> >>>> the queries for pages.
> >>>>
> >>>> If you were using a deterministic sort rather than relevancy, this
> >>>> wouldn't be happening, because deleted documents have no influence on
> >>>> that kind of sort.
> >>>>
> >>>> With TLOG or PULL, the replicas are absolutely identical, so there is
> no
> >>>> difference, unless the index is changing as you page through the
> >> results.
> >>>>
> >>>> I think changing replica types is the only solution here.  NRT
> replicas
> >>>> are working as they were designed -- there's no bug, even though
> >>>> problems like this do sometimes turn up.
> >>>>
> >>>> Thanks,
> >>>> Shawn
> >>>>
> >>>>
> >>>

--


This message and any attachment are confidential and may be privileged or
otherwise protected from disclosure. If you are not the intended recipient,
you must not copy this message or attachment or disclose the contents to
any other person. If you have received this transmission in error, please
notify the sender immediately and delete the message and any attachment
from your system. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not accept liability for any omissions or errors in this
message which may arise as a result of E-Mail-transmission or for damages
resulting from any unauthorized changes of the content of this message and
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
subsidiaries do not guarantee that this message is free of viruses and does
not accept liability for any damages caused by any virus transmitted
therewith.

Click http://www.emdgroup.com/disclaimer to access the German, French,
Spanish and Portuguese versions of this disclaimer.

Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

WebsterHomer
I am trying to test whether enabling the stats cache, as Erick suggested, would
also address this issue. I added this to my solrconfig.xml:

 <statsCache class="org.apache.solr.search.stats.ExactSharedStatsCache"/>

I executed queries and saw no differences. Then I re-indexed the data and
again saw no difference in behavior.
Then I found SOLR-10952. It seems we need to disable the
queryResultCache for the global stats cache to work.
I've never disabled this before. I edited the solrconfig.xml, setting the
sizes to 0. I'm not sure whether this is how to disable the cache.

    <queryResultCache class="solr.LRUCache"
                     size="0"
                     initialSize="0"
                     autowarmCount="0"/>

I also set this:
   <queryResultMaxDocsCached>0</queryResultMaxDocsCached>

Then I uploaded the solrconfig.xml and reloaded the collection. It still made
no difference. Do I need to restart Solr for this to take effect?
When I look in the admin console, the queryResultCache still seems to have
the old settings.

Does enabling the statsCache require a Solr restart too? Does it require
that the data be re-indexed? The documentation on this feature is skimpy.
Is there a way to see whether it's enabled in the Admin Console?
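
One way to put a number on "no differences" is to page through a query the same way the application does and count the ids that come back more than once. The following is a minimal sketch only. It assumes a hypothetical `fetch_page(start, rows)` callable, supplied by the caller, that returns the unique ids for one page (for example by querying the collection with the same score-then-id sort the application uses). The function name and the `max_results` cap are illustrative and not part of Solr.

```python
from collections import Counter
from typing import Callable, Dict, List


def find_duplicate_ids(fetch_page: Callable[[int, int], List[str]],
                       rows: int = 200,
                       max_results: int = 1000) -> Dict[str, int]:
    """Page with start/rows and report ids returned more than once.

    fetch_page is a hypothetical callable supplied by the caller; it should
    return the unique ids for the page beginning at `start`.
    """
    seen: Counter = Counter()
    start = 0
    # Stop requesting pages once `start` reaches the cap.
    while start < max_results:
        ids = fetch_page(start, rows)
        if not ids:          # no more hits
            break
        seen.update(ids)
        if len(ids) < rows:  # a short page means this was the last one
            break
        start += rows
    # Keep only the ids that appeared on more than one page.
    return {doc_id: count for doc_id, count in seen.items() if count > 1}
```

Running this several times against the same query, before and after a configuration change, gives a figure to compare. An empty result on every run would suggest that paging has become stable.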



Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

Shawn Heisey-2
On 3/2/2018 9:28 AM, Webster Homer wrote:
> I've never disabled this before. I edited the solrconfig.xml setting the
> sizes to 0. I'm not sure if this is how to disable the cache or not.
>
>      <queryResultCache class="solr.LRUCache"
>                       size="0"
>                       initialSize="0"
>                       autowarmCount="0"/>

To completely disable a cache, either comment it out or remove it from
the config.  I do not know whether setting the size to 0 will actually
work or not.

> Does enabling statsCache require a solr restart too? Does enabling the
> statsCache require that the data be re-indexed? The documentation on this
> feature is skimpy.

Most changes to solrconfig.xml just require a reload.  I would expect
any cache configurations to fall into that category.
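
As a rough sketch of how a reload and a follow-up check could be scripted: the Collections API has a RELOAD action, and the Config API can return the query section of the effective configuration. The base URL, the helper names, and the assumed response nesting (config, then query, then the cache name) are illustrative assumptions here. Whether the statsCache shows up in that output is not something this sketch can confirm.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def reload_url(base_url: str, collection: str) -> str:
    """URL for the Collections API RELOAD action."""
    query = urlencode({"action": "RELOAD", "name": collection})
    return f"{base_url}/admin/collections?{query}"


def query_config_url(base_url: str, collection: str) -> str:
    """URL for the Config API view of the query section."""
    return f"{base_url}/{collection}/config/query"


def cache_settings(config_response: dict, cache_name: str) -> Optional[dict]:
    """Return the settings for one cache, or None if it is not listed.

    Assumes the response nests the section as config -> query -> <cache>.
    """
    return config_response.get("config", {}).get("query", {}).get(cache_name)


def check_cache(base_url: str, collection: str, cache_name: str) -> Optional[dict]:
    """Reload the collection, then fetch the reported cache settings."""
    urlopen(reload_url(base_url, collection)).close()
    with urlopen(query_config_url(base_url, collection)) as resp:
        return cache_settings(json.load(resp), cache_name)
```

For example, `check_cache("http://localhost:8983/solr", "b2b-catalog-material", "queryResultCache")` would return `None` if the cache no longer appears in the reported configuration, assuming a node is reachable at that address.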

> Is there a way to see if it's enabled in the Admin Console?

I don't know anything about the statsCache.  If you don't see it in the
Plugins/Stats tab, that's probably something that was forgotten, and
needs to be added to the admin UI.

Thanks,
Shawn


RE: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

Becky Bonner
In reply to this post by WebsterHomer
We are trying to set up one Solr server for several applications, each with a different collection.  Is there a way to have 2 collections under one folder and have the URLs look something like this:
http://mysolrinstance.com/solr/myParent1/collection1
http://mysolrinstance.com/solr/myParent1/collection2
http://mysolrinstance.com/solr/myParent2
http://mysolrinstance.com/solr/myParent3


We organized it like that under the solr folder but the URLs to the collections do not include the "myParent1".
This makes the names of my collections more confusing because you can't tell what application they belong to.  It wasn’t a problem until we had 2 collections for one of the apps.





Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

WebsterHomer
In reply to this post by Shawn Heisey-2
Thanks Shawn.

Commenting it out works to remove it. If I only change the values, e.g. change
the 512 to 0, it does require a restart to take effect.

I tested with statsCache set to
org.apache.solr.search.stats.ExactSharedStatsCache
and with the queryResultCache disabled, and I still see the problem with NRT
replicas. So using TLOG replicas still looks like the best workaround for
the NRT issue.
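
For anyone wanting to try the same workaround, here is a minimal sketch of building a Collections API CREATE request that asks for TLOG replicas. It needs Solr 7 or later, since the TLOG and PULL replica types do not exist in 6.x. The helper name, base URL, and config set name are assumptions made for illustration.

```python
from urllib.parse import urlencode


def create_tlog_collection_url(base_url: str, name: str, num_shards: int,
                               tlog_replicas: int, config_name: str) -> str:
    """Collections API CREATE request that asks for TLOG replicas."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "tlogReplicas": tlog_replicas,  # TLOG replicas per shard
        "collection.configName": config_name,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"
```

Calling `create_tlog_collection_url("http://localhost:8983/solr", "b2b-catalog-material", 2, 2, "myconf")` yields a URL that can be requested with curl or a browser. An existing NRT collection would still have to be re-created or have its replicas replaced, which this sketch does not cover.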

On Fri, Mar 2, 2018 at 10:44 AM, Shawn Heisey <[hidden email]> wrote:

> On 3/2/2018 9:28 AM, Webster Homer wrote:
>
>> I've never disabled this before. I edited the solrconfig.xml setting the
>> sizes to 0. I'm not sure if this is how to disable the cache or not.
>>
>>      <queryResultCache class="solr.LRUCache"
>>                       size="0"
>>                       initialSize="0"
>>                       autowarmCount="0"/>
>
> To completely disable a cache, either comment it out or remove it from the
> config.  I do not know whether setting the size to 0 will actually work or
> not.
>
>> Does enabling statsCache require a solr restart too? Does enabling the
>> statsCache require that the data be re-indexed? The documentation on this
>> feature is skimpy.
>
> Most changes to solrconfig.xml just require a reload.  I would expect any
> cache configurations to fall into that category.
>
>> Is there a way to see if it's enabled in the Admin Console?
>
> I don't know anything about the statsCache.  If you don't see it in the
> Plugins/Stats tab, that's probably something that was forgotten, and needs
> to be added to the admin UI.
>
> Thanks,
> Shawn


Re: NRT replicas miss hits and return duplicate hits when paging solrcloud searches

WebsterHomer
In reply to this post by Becky Bonner
Becky,
This should have been its own question.

SolrCloud is different from standalone Solr: the configurations live in
ZooKeeper and the index is created under SOLR_HOME. You might want to
rethink your solution. What problem are you trying to solve with that
layout? Would it be solved by creating the Parent1 collection with 2 shards?

On Fri, Mar 2, 2018 at 10:56 AM, Becky Bonner <[hidden email]> wrote:

> We are trying to setup one solr server for several applications each with
> a different collection.  Is there a way to have have 2 collections under
> one folder and the url be something like this:
> http://mysolrinstance.com/solr/myParent1/collection1
> http://mysolrinstance.com/solr/myParent1/collection2
> http://mysolrinstance.com/solr/myParent2
> http://mysolrinstance.com/solr/myParent3
>
>
> We organized it like that under the solr folder but the URLs to the
> collections do not include the "myParent1".
> This makes the names of my collections more confusing because you can't
> tell what application they belong to.  It wasn’t a problem until we had 2
> collections for one of the apps.
>
>
>
>
> -----Original Message-----
> From: Webster Homer [mailto:[hidden email]]
> Sent: Friday, March 2, 2018 10:29 AM
> To: [hidden email]
> Subject: Re: NRT replicas miss hits and return duplicate hits when paging
> solrcloud searches
>
> I am trying to test if enabling stats cache as suggested by Eric would
> also address this issue. I added this to my solrconfig.xml
>
>  <statsCache class="org.apache.solr.search.stats.ExactSharedStatsCache"/>
>
> I executed queries and saw no differences. Then I re-indexed the data,
> again I saw no differences in behavior.
> Then I found this,  SOLR-10952. It seems we need to disable the
> queryResultCache for the global stats cache to work.
> I've never disabled this before. I edited the solrconfig.xml setting the
> sizes to 0. I'm not sure if this is how to disable the cache or not.
>
>     <queryResultCache class="solr.LRUCache"
>                      size="0"
>                      initialSize="0"
>                      autowarmCount="0"/>
>
> I also set this:
>    <queryResultMaxDocsCached>0</queryResultMaxDocsCached>
>
> Then uploaded the solrconfig.xml and reloaded the collection. It sill made
> no difference. Do I need to restart solr for this to take effect?
> When I look in the admin console, the queryResultCache still seems to have
> the old settings.
>
> Does enabling statsCache require a solr restart too? Does enabling the
> statsCache require that the data be re-indexed? The documentation on this
> feature is skimpy.
> Is there a way to see if it's enabled in the Admin Console?
>
> On Tue, Feb 27, 2018 at 9:31 AM, Webster Homer <[hidden email]>
> wrote:
>
> > Emir,
> >
> > Using tlog replica types addresses my immediate problem.
> >
> > The secondary issue is that all of our searches show inconsistent
> results.
> > These are all normal paging use cases. We regularly test our
> > relevancy, and these differences creates confusion in the testers.
> > Moreover, we are migrating from Endeca which has very consistent results.
> >
> > I'm hoping that using the global stats cache will make the other
> > searches more stable. I think we will eventually move to favoring tlog
> > replicas. We have a couple of collections where NRT makes sense, but
> > those collections don't need to return data in relevancy order. I
> > think NRT should be considered a niche use case for a search engine,
> > tlog and pull replicas are a much better fit for a search engine
> > (imho)
> >
> > On Tue, Feb 27, 2018 at 4:01 AM, Emir Arnautović <
> > [hidden email]> wrote:
> >
> >> Hi Webster,
> >> Since you are returning all hits, returning the last page is almost
> >> as heavy for Solr as returning all documents. Maybe you should
> >> consider just returning one large page and completely avoid this issue.
> >> I agree with you that this should be handled by Solr. ES solved this
> >> issue with “preference” search parameter where you can set session id
> >> as preference and it will stick to the same shards. I guess you could
> >> try similar thing on your own but that would require you to send list
> >> of shards as parameter for your search and balance it for different
> sessions.
> >>
> >> HTH,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection Solr &
> >> Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >> > On 26 Feb 2018, at 21:03, Webster Homer <[hidden email]>
> wrote:
> >> >
> >> > Erick,
> >> >
> >> > No we didn't look at that. I will add it to the list. We have  not
> >> > seen performance issues with solr. We have much slower technologies
> >> > in our stack. This project was to replace a system that was too slow.
> >> >
> >> > Thank you, I will look into it
> >> >
> >> > Webster
> >> >
> >> > On Mon, Feb 26, 2018 at 1:13 PM, Erick Erickson <
> >> [hidden email]>
> >> > wrote:
> >> >
> >> >> Did you try enabling distributed IDF (statsCache)? See:
> >> >> https://lucene.apache.org/solr/guide/6_6/distributed-requests.html
> >> >>
> >> >> It's may not totally fix the issue, but it's worth trying. It does
> >> >> come with a performance penalty of course.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Mon, Feb 26, 2018 at 11:00 AM, Webster Homer <[hidden email]>
> >> >> wrote:
> >> >>> Thanks Shawn, I had settled on this as a solution.
> >> >>>
> >> >>> All our use cases for Solr are to return results in order of
> >> >>> relevancy to the query, so having a deterministic sort would defeat
> >> >>> that purpose. Since we wanted to be able to return all the results
> >> >>> for a query, I originally looked at using the Streaming API, but
> >> >>> that doesn't support returning results sorted by relevancy.
> >> >>>
> >> >>> I disagree with you about NRT replicas though. They may function
> >> >>> as designed, but since they cannot guarantee consistent results
> >> >>> their design is buggy, at least it is for a search engine.
> >> >>>
> >> >>>
> >> >>> On Mon, Feb 26, 2018 at 12:20 PM, Shawn Heisey <[hidden email]>
> >> >>> wrote:
> >> >>>
> >> >>>> On 2/26/2018 10:26 AM, Webster Homer wrote:
> >> >>>>> We need the results by relevancy so the application sorts the
> >> >>>>> results by score desc, and the unique id ascending as the tie
> >> >>>>> breaker
> >> >>>>
> >> >>>> This is the reason for the discrepancy, and why the different
> >> >>>> replica types don't have the same issue.
> >> >>>>
> >> >>>> Each NRT replica can have different deleted documents than the
> >> >>>> others, just due to the way that NRT replicas work.  Deleted
> >> >>>> documents affect relevancy scoring.  When one replica has say 5000
> >> >>>> deleted documents and another has 200, or has 5000 but they're
> >> >>>> different docs, a relevancy sort can end up different.  So when
> >> >>>> Solr goes to one replica for page 1 and another for page 2 (which
> >> >>>> is expected due to SolrCloud's internal load balancing), you may
> >> >>>> end up with duplicate documents or documents missing.  Because
> >> >>>> deleted documents are not counted or returned, numFound will be
> >> >>>> consistent, as long as the index doesn't change between the
> >> >>>> queries for pages.
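
The paging effect described above can be reproduced with a toy model. This is only an illustration with made-up document ids, not Solr code: two replicas hold the same documents but rank two of them in opposite order, and consecutive pages are served by different replicas.

```python
def page(ranked_ids, start, rows):
    """Return one page from a replica's ranked result list."""
    return ranked_ids[start:start + rows]


# Same live documents on both replicas; d3 and d4 swap places, standing in
# for a small score difference caused by different deleted documents.
replica_a = ["d1", "d2", "d3", "d4", "d5", "d6"]
replica_b = ["d1", "d2", "d4", "d3", "d5", "d6"]

rows = 3
# Page 1 is answered by replica A, page 2 by replica B.
results = page(replica_a, 0, rows) + page(replica_b, rows, rows)

duplicates = sorted(d for d in set(results) if results.count(d) > 1)
missing = sorted(set(replica_a) - set(results))

print(results)     # ['d1', 'd2', 'd3', 'd3', 'd5', 'd6']
print(duplicates)  # ['d3']
print(missing)     # ['d4']
```

Both replicas report six matching documents, so the total looks consistent even though one document is returned twice and another never appears.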
> >> >>>>
> >> >>>> If you were using a deterministic sort rather than relevancy,
> >> >>>> this wouldn't be happening, because deleted documents have no
> >> >>>> influence on that kind of sort.
> >> >>>>
> >> >>>> With TLOG or PULL, the replicas are absolutely identical, so
> >> >>>> there is no difference, unless the index is changing as you page
> >> >>>> through the results.
> >> >>>>
> >> >>>> I think changing replica types is the only solution here.  NRT
> >> >>>> replicas are working as they were designed -- there's no bug, even
> >> >>>> though problems like this do sometimes turn up.
> >> >>>>
> >> >>>> Thanks,
> >> >>>> Shawn
> >> >>>>
> >> >>>>
> >> >>>
> >> >>> --
> >> >>>
> >> >>>
> >> >>> This message and any attachment are confidential and may be
> >> >>> privileged or otherwise protected from disclosure. If you are not
> >> >>> the intended recipient, you must not copy this message or
> >> >>> attachment or disclose the contents to any other person. If you
> >> >>> have received this transmission in error, please notify the sender
> >> >>> immediately and delete the message and any attachment from your
> >> >>> system. Merck KGaA, Darmstadt, Germany and any of its subsidiaries
> >> >>> do not accept liability for any omissions or errors in this
> >> >>> message which may arise as a result of E-Mail-transmission or for
> >> >>> damages resulting from any unauthorized changes of the content of
> >> >>> this message and any attachment thereto. Merck KGaA, Darmstadt,
> >> >>> Germany and any of its subsidiaries do not guarantee that this
> >> >>> message is free of viruses and does not accept liability for any
> >> >>> damages caused by any virus transmitted therewith.
> >> >>>
> >> >>> Click http://www.emdgroup.com/disclaimer to access the German,
> >> >>> French, Spanish and Portuguese versions of this disclaimer.
> >> >>
> >> >
> >>
> >>
> >
>
>
