Why are cursor mark queries recommended over regular start, rows combination?

S G
Hi,

We have use-cases where some queries will return about 100k to 500k records.
As per https://lucene.apache.org/solr/guide/7_2/pagination-of-results.html,
it seems that using start=x, rows=y is a bad combination performance wise.

1) However, it is not clear to me why the alternative, the "cursor query", is
cheaper or recommended. It would have to run the same kind of workload as
the normal start=x, rows=y combination, no?

2) Also, it is not clear if the cursor query runs on a single shard or
uses the same scatter-gather as regular queries to read from all the shards.

3) Lastly, the role of the export handler is not clear. It seems that the
export handler would also have to do exactly the same kind of thing as
start=0 and rows=1,000,000. And that again means bad performance.

What is the difference between all of the three?


Thanks
SG
Re: Why are cursor mark queries recommended over regular start, rows combination?

Erick Erickson
<1> Consider start=100&rows=10. In the absence of cursorMark, Solr has
to sort the top 110 documents in order to throw away the first 100,
since any document scored could turn out to be in the top 110 and
there's no way to know that ahead of time. For 110 that's not very
expensive, but when the list is in the hundreds of thousands, it gets
significantly expensive both in terms of CPU and memory. Now multiply
that by the number of shards (i.e. one replica from each shard would
have to return the top 110 document IDs and scores in order for the
aggregator to sort out the true top 10) and it gets really expensive.

CursorMark essentially passes the score back (well, the last values of
all the sort criteria). I'll skip a lot of details here that make this
more complex, but assume cursorMark is the score of the 100th document
(yes, there's code in there to handle multiple identical scores and a
lot of other stuff; that's why it's required to have uniqueKey in the
sort). Now each node can say "if a doc has a score > cursorMark, I can
throw it out immediately 'cause it was returned already", and now each
shard just keeps a list 10 docs long.
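The two cases can be sketched with a toy priority-queue model (illustrative Python only, not Solr internals): without a cursor each shard must track the top start+rows candidates, while the cursor mark (last score plus uniqueKey of the previous page) lets each shard discard anything at or before the mark and keep only `rows` candidates.

```python
import heapq

# Toy data: each "shard" holds (score, doc_id) pairs; ids make tuples unique.
shard_docs = [(0.1 * i % 7, f"doc{i}") for i in range(1000)]

def top_docs_start_rows(docs, start, rows):
    # Without a cursor, the shard must track the top start+rows docs,
    # because any doc scored later might still belong in that window.
    return heapq.nlargest(start + rows, docs)[start:start + rows]

def top_docs_cursor(docs, cursor_score, cursor_id, rows):
    # With a cursor mark, docs at or above the mark are discarded
    # immediately, so the shard only ever keeps `rows` candidates.
    after_mark = (d for d in docs if d < (cursor_score, cursor_id))
    return heapq.nlargest(rows, after_mark)

# "Page 3" the expensive way: keeps 30 candidates to return 10.
page3_deep = top_docs_start_rows(shard_docs, start=20, rows=10)

# "Page 3" via the cursor: mark = last doc of page 2, keeps only 10.
last_score, last_id = top_docs_start_rows(shard_docs, 10, 10)[-1]
page3_cursor = top_docs_cursor(shard_docs, last_score, last_id, rows=10)

assert page3_deep == page3_cursor  # same page, far less tracked state
```

The deeper the page, the bigger the gap: start=100000 forces a 100010-entry structure per shard in the first function, while the second stays at 10.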

<2> CursorMark is SolrCloud compatible.

<3> First, streaming requests can only return docValues="true"
fields. Second, most streaming operations require sorting on something
besides score. Within those constraints, streaming will be _much_
faster and more efficient than cursorMark. Without tuning I saw 200K
rows/second returned for streaming; the bottleneck will be the speed
at which the client can read from the network. First of all, you only
execute one query rather than one query per N rows. Second, in the
cursorMark case, to return a document (assuming that any field you
return is docValues=false) Solr has to:
- read it from disk
- decompress it
- fetch the stored fields

With streaming, since we're using docValues fields, the disk
seek/read/decompress steps are skipped.

Best,
Erick


Re: Why are cursor mark queries recommended over regular start, rows combination?

Shawn Heisey-2
On 3/12/2018 6:18 PM, S G wrote:
> We have use-cases where some queries will return about 100k to 500k records.
> As per https://lucene.apache.org/solr/guide/7_2/pagination-of-results.html,
> it seems that using start=x, rows=y is a bad combination performance wise.
>
> 1) However, it is not clear to me why the alternative: "cursor-query" is
> cheaper or recommended. It would have to run the same kind of workload as
> the normal start=x, rows=y combination, no?

No.  Through the use of cleverly designed filters, cursorMark is able to
dramatically reduce the amount of information that Solr has to sift
through when paging deeply into results.  Because of the way it works,
cursorMark does not offer any way to jump directly to page 25000 -- you
have to get the previous 24999 pages first.  But the retrieval time of
every one of those pages is going to be about the same as page 1.

If you use start/rows, the retrieval time of every subsequent page is
going to increase, and by the time the page numbers start getting big,
the response time for every page is going to be VERY large.

Hoss, who created cursorMark, explains it all pretty well in this article:

https://lucidworks.com/2013/12/12/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/

> 2) Also, it is not clear if the cursory-query runs on a single shard or
> uses the same scatter gather as regular queries to read from all the shards?

The cursorMark feature works on sharded indexes.  In fact, that's where
it offers the best performance improvement over start/rows.

> 3) Lastly, it is not clear the role of export handler. It seems that the
> export handler would also have to do exactly the same kind of thing as
> start=0 and rows=1000,000. And that again means bad performance.

The standard search handlers must gather all of the information
(documents, etc) in the response into memory all at once, then send that
information to the entity that made the request.  This is why the rows
parameter defaults to 10.  By limiting the amount of information in a
response, that response is sent faster and consumes less memory.

The export handler works differently.  I haven't researched this, but I
*THINK* what it does is gather documents matching the query and sort
parameters a little bit at a time, write that response information out
to the HTTP/TCP socket, and then throw the source data away.  By
repeating this cycle many times, it can send millions of results without
consuming huge amounts of memory.  The HTTP standard supports this kind
of open-ended transfer of data.
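That write-a-little, discard-a-little cycle can be sketched as a generator (a hypothetical illustration, not the actual /export implementation):

```python
# Hypothetical sketch: stream sorted results in small batches instead of
# materializing the whole response in memory at once.
def export_results(matching_doc_ids, batch_size=3):
    batch = []
    for doc_id in sorted(matching_doc_ids):
        batch.append(doc_id)
        if len(batch) == batch_size:
            yield batch      # write this batch to the socket...
            batch = []       # ...then let it be garbage-collected
    if batch:
        yield batch          # flush the final partial batch

sent = []
for chunk in export_results(range(10)):
    sent.extend(chunk)       # stand-in for the client reading the socket

assert sent == list(range(10))  # all results arrive, in sort order
```

At no point does the producer hold more than one small batch, which is why millions of rows don't translate into gigabytes of heap.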

Thanks,
Shawn


Re: Why are cursor mark queries recommended over regular start, rows combination?

Chris Hostetter-3
In reply to this post by Erick Erickson

: > 3) Lastly, it is not clear the role of export handler. It seems that the
: > export handler would also have to do exactly the same kind of thing as
: > start=0 and rows=1000,000. And that again means bad performance.
       
: <3> First, streaming requests can only return docValues="true"
: fields.Second, most streaming operations require sorting on something
: besides score. Within those constraints, streaming will be _much_
: faster and more efficient than cursorMark. Without tuning I saw 200K
: rows/second returned for streaming, the bottleneck will be the speed
: that the client can read from the network. First of all you only
: execute one query rather than one query per N rows. Second, in the
: cursorMark case, to return a document you and assuming that any field
: you return is docValues=false

Just to clarify, there is a big difference between the /export handler
and "streaming expressions".

Unless something has changed drastically in the past few releases, the
/export handler does *NOT* support exporting a full *collection* in
SolrCloud -- it only operates on an individual core (aka: shard/replica).

Streaming expressions is a feature that does work in Cloud mode, and can
make calls to the /export handler on a replica of each shard in order to
process the data of an entire collection -- but when doing so it has to
aggregate *ALL* the results from every shard in memory on the
coordinating node -- meaning that (in addition to the docValues caveat)
streaming expressions requires you to "spend" a lot of RAM on one node,
as a trade-off for the extra time and multiple requests it takes to get
the same data via cursorMark...

https://lucene.apache.org/solr/guide/exporting-result-sets.html
https://lucene.apache.org/solr/guide/streaming-expressions.html

An additional perk of cursorMark that may be relevant to the OP is that
you can "stop" tailing a cursor at any time (ie: if you're post-processing
the results client side and decide you have "enough" results), but a
similar feature isn't available (AFAICT) from streaming expressions...

https://lucene.apache.org/solr/guide/pagination-of-results.html#tailing-a-cursor
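The stop-at-any-time property looks roughly like this on the client side. This is a hedged sketch: `fetch_page` is a stand-in for a real Solr request carrying the cursorMark parameter, paging over an in-memory list so the loop is runnable.

```python
DOCS = [f"doc{i:03d}" for i in range(100)]

def fetch_page(cursor, rows=10):
    # Stand-in for a Solr query with cursorMark=<cursor>&sort=...,id asc.
    start = 0 if cursor == "*" else int(cursor)
    page = DOCS[start:start + rows]
    return page, str(start + rows)   # (docs, nextCursorMark)

collected, cursor = [], "*"          # "*" is the initial cursorMark value
while True:
    page, next_cursor = fetch_page(cursor)
    collected.extend(page)
    if len(collected) >= 30:         # client decides it has "enough"...
        break                        # ...and simply stops tailing the cursor
    if next_cursor == cursor or not page:
        break                        # cursor unchanged => results exhausted
    cursor = next_cursor

assert len(collected) == 30
```

Nothing server-side has to be cancelled; the client just stops issuing requests, which is exactly what a streaming response doesn't easily allow.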


-Hoss
http://www.lucidworks.com/

Re: Why are cursor mark queries recommended over regular start, rows combination?

S G
Thanks everybody. This is a lot of good information.
And we should try to get this into the documentation too, to help users
make the right choice.
I can take a stab at this if someone can point me to how to update the
documentation.

Thanks
SG



Re: Why are cursor mark queries recommended over regular start, rows combination?

Erick Erickson
I'm pretty sure you can use Streaming Expressions to get all the rows
back from a sharded collection without chewing up lots of memory.

Try:
search(collection,
       q="id:*",
       fl="id",
       sort="id asc",
       qt="/export")

on a sharded SolrCloud installation, I believe you'll get all the rows back.

NOTE:
1> Some while ago you couldn't _stop_ the stream part way through.
Down in the SolrJ world you could read from a stream for a while and
call close on it, but that would just spin in the background until it
reached EOF. Search the JIRA list if you need the details (can't find
the JIRA right now; 6.6 IIRC is OK and, of course, 7.3).

This shouldn't chew up memory since the streams are sorted, so what
you get in the response is the ordered set of tuples.
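The sorted-streams point can be illustrated with a lazy merge (a sketch, not Solr's actual code): the coordinator only needs the current head tuple of each shard's stream, never the full result sets.

```python
import heapq

# Each shard returns tuples already sorted by the requested sort field,
# so the coordinator can merge them lazily with one head element per shard.
shard1 = iter(["a", "c", "e"])
shard2 = iter(["b", "d", "f"])

merged = list(heapq.merge(shard1, shard2))
assert merged == ["a", "b", "c", "d", "e", "f"]
```

This is why a plain `search(...)` over sorted streams stays cheap, while the join-style streams Erick mentions (which must buffer one side) do not.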

Some of the join streams _do_ have to hold all the results in memory,
so look at the docs if you wind up using those.


Best,
Erick


Re: Why are cursor mark queries recommended over regular start, rows combination?

Jason Gerlowski
> I can take a stab at this if someone can point me how to update the documentation.


Hey SG,

Please do, that'd be awesome.

Thanks to some work done by Cassandra Targett a release or two ago,
the Solr Ref Guide documentation now lives in the same codebase as the
Solr/Lucene code itself, and the process for updating it is the same
as suggesting a change to the code:


1. Open a JIRA issue detailing the improvement you'd like to make
2. Find the relevant ref guide pages to update, making the changes
you're proposing.
3. Upload a patch to your JIRA and ask for someone to take a look.
(You can tag me on this issue if you'd like).


Some more specific links you might find helpful:

- JIRA: https://issues.apache.org/jira/projects/SOLR/issues
- Pointers on JIRA conventions, creating patches:
https://wiki.apache.org/solr/HowToContribute
- Root directory for the Solr Ref-Guide code:
https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide
- https://lucene.apache.org/solr/guide/7_2/pagination-of-results.html

Best,

Jason


Re: Why are cursor mark queries recommended over regular start, rows combination?

WebsterHomer
Just FYI, I had a project recently where I tried to use cursorMark in
SolrCloud with Solr 7.2.0 and it was very unreliable. It couldn't even
return consistent numFound values. I posted about it in this forum.
Using the start and rows arguments in SolrQuery did work reliably, so I
abandoned cursorMark as just too buggy.

I had originally wanted to try using streaming expressions, but they don't
return results ordered by relevancy, a major limitation for a search
engine, in my opinion.


Re: Why are cursor mark queries recommended over regular start, rows combination?

Shawn Heisey-2
On 3/23/2018 3:47 PM, Webster Homer wrote:
> Just FYI I had a project recently where I tried to use cursorMark in
> Solrcloud and solr 7.2.0 and it was very unreliable. It couldn't even
> return consistent numberFound values. I posted about it in this forum.
> Using the start and rows arguments in SolrQuery did work reliably so I
> abandoned cursorMark as just too buggy
>
> I had originally wanted to try using streaming expressions, but they don't
> return results ordered by relevancy, a major limitation for a search
> engine, in my opinion.

The problems that can affect cursorMark are also problems when using
start/rows pagination.

You've mentioned relevancy ordering, so I think this is what you're
running into:

Trying to use relevancy ranking on SolrCloud with NRT replicas can break
pagination.  The problem happens both with cursorMark and start/rows. 
NRT replicas in a SolrCloud index can have different numbers of deleted
documents.  Even though deleted documents do not appear in search
results, they ARE still part of the index, and can affect scoring. 
Since SolrCloud load balances requests across replicas, page 1 may use
different replicas than page 2, and end up with different scoring, which
can affect the order of results and change which page number they end up
on.  Using TLOG or PULL replicas (available since 7.0) usually fixes
that problem, because different replicas are 100% identical with those
replica types.

Changing the index in the middle of trying to page through results can
also cause issues with pagination.
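A toy calculation shows why deleted documents can shift scores. A classic Lucene-style IDF formula depends on the total document count, which still includes deleted docs until segments are merged; the numbers below are made up purely for illustration.

```python
import math

def idf(doc_freq, max_doc):
    # Classic Lucene-style IDF: grows as the term gets rarer relative
    # to the total number of docs in the index (deleted docs included).
    return 1.0 + math.log(max_doc / (doc_freq + 1.0))

# Both NRT replicas hold the same 1000 live docs; replica B also still
# carries 500 deleted docs that haven't been merged away yet.
idf_replica_a = idf(doc_freq=10, max_doc=1000)
idf_replica_b = idf(doc_freq=10, max_doc=1500)

# Same term, same live data, different score depending on which replica
# served the request -- enough to reshuffle documents between pages.
assert idf_replica_a != idf_replica_b
```

Since SolrCloud may route page 1 and page 2 to different replicas, those small score differences are exactly what makes paginated relevancy ordering inconsistent on NRT replicas.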

Thanks,
Shawn


Re: Why are cursor mark queries recommended over regular start, rows combination?

WebsterHomer
Shawn,
Thanks. It's been a while now, but we did find issues with both cursorMark
AND start/rows; the effect was much more obvious with cursorMark.
We were able to address this by switching to TLOG replicas, which give
consistent results. It's nice to know that the cursorMark problems were
related to relevancy retrieval order.

We found one major drawback with TLOG replicas: CDCR was broken for them.
There is a JIRA on this, and it is being addressed. NRT may have a use
case, but I think that reproducible, correct results should trump
performance every time. We use Solr as a search engine; we almost always
want to retrieve results in order of relevancy.

I think that we will phase out the use of NRT replicas in favor of TLOG
replicas.
