Different docs order in different replicas of the same shard


Different docs order in different replicas of the same shard

SOLR4189
I use Solr 6.5.1 and I want to start using replicas.

Before I do, I want to understand a couple of things:

1) Can asynchronous forwarding of documents from the leader to the replicas,
or some other reason, cause replica A to see update X then Y, while replica B
sees update Y then X?
If so, a particular document on replica A might sort differently relative to
a document on replica B when they have the same score (since ties follow the
order in which the documents were stored in the index). Is this an edge case?

2) What does it mean that "Custom update chain post-processors may never be
invoked on a recovering replica"
<https://lucene.apache.org/solr/guide/7_2/update-request-processors.html>,
if all my UpdateProcessors are post-processors (i.e. they come after
DistributedUpdateProcessor)? Will all the buffered update requests replayed
during recovery be indexed on the replica without my custom processing?




Re: Different docs order in different replicas of the same shard

Erick Erickson
For (1), it's not a problem. Every update goes through the leader,
where it gets a version stamp (the _version_ field). So if doc1 is
updated twice, the leader assigns each update a version stamp. Call the
updated versions doc1.1 and doc1.2. If replica X sees doc1.2 first, it
indexes it. If it subsequently sees doc1.1, it'll reject it and the caller
will have to decide what to do. If the caller thinks their copy should
really be the "one true copy", it can re-submit the doc; it'll be assigned
a new version (say doc1.3) and when replica X sees it, it'll be indexed.

If replica X sees them in order (doc1.1, then doc1.2), then the second
doc replaces the first.

The point is that you can guarantee consistency, i.e. all replicas
have the same document.
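
If you want to see this for yourself, one rough way (a sketch; the host
and core names below are made up, use whatever your replica cores are
actually called) is to query each replica core directly with
distrib=false and compare the _version_ values:

  http://host1:8983/solr/mycoll_shard1_replica1/select?q=id:doc1&fl=id,_version_&distrib=false
  http://host2:8983/solr/mycoll_shard1_replica2/select?q=id:doc1&fl=id,_version_&distrib=false

Once both replicas have caught up they should return the same _version_
for the same document.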

Sorting is a different thing though, and the _same_ document can be
sorted differently depending on which replica it's on. This is for two
reasons:
1> deleted docs still contribute to scoring until they're "merged
away" as part of normal indexing, so the score for the same doc may be
slightly different depending on the replica.
2> tied scores are broken by the internal Lucene document ID, and due
to merging, the relative order of the internal IDs of two docs may be
different on different replicas.

<1> is "just how it works"
<2> can be handled by always specifying a deterministic sort if all
other sorts result in a tie, the "id" field is a good one to use.
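
For example (just a sketch, assuming a collection named mycoll and that
"id" is your uniqueKey), appending the unique key as the last sort field
keeps the order stable across replicas:

  http://localhost:8983/solr/mycoll/select?q=your+query&sort=score+desc,id+asc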

There's a lot more detail here:
https://medium.com/@sarkaramrit2/getting-different-results-while-issuing-a-query-multiple-times-in-solrcloud-632103096076

Best,
Erick


Re: Different docs order in different replicas of the same shard

Shawn Heisey-2
In reply to this post by SOLR4189
On 5/25/2018 7:28 AM, SOLR4189 wrote:

> I use Solr 6.5.1 and I want to start using replicas.
>
> Before I do, I want to understand a couple of things:
>
> 1) Can asynchronous forwarding of documents from the leader to the replicas,
> or some other reason, cause replica A to see update X then Y, while replica B
> sees update Y then X?
> If so, a particular document on replica A might sort differently relative to
> a document on replica B when they have the same score (since ties follow the
> order in which the documents were stored in the index). Is this an edge case?

I can't speak to whether it's possible for updates to be re-ordered.
It probably is possible.  But whether it is or not, there's
absolutely no guarantee that Lucene document ordering will be identical
between NRT replicas.  NRT is the only replica type that Solr 6.x has,
and it is the default type in Solr 7.x.  One replica can have a different
number of deleted documents than another, and may not merge
segments in exactly the same way as another replica.

Because deleted documents can affect score calculation, and one replica
may have different deleted documents than another replica, the default
sort order (relevancy ranking) can differ between replicas.

A workaround to these issues is to always use an explicit field-based
sort.  Deleted documents and the Lucene document order do not affect
that kind of sort.
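
For example (a sketch, assuming a numeric field named "price" and "id" as
the uniqueKey), the sort field would typically be defined with docValues
and the query would sort on it explicitly:

  <field name="price" type="plong" indexed="true" stored="true" docValues="true"/>

  ...&sort=price asc,id asc

The "plong" type name comes from the default 7.x configsets; on 6.x the
Trie-based "long" type works the same way for this purpose.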

> 2) What does it mean that "Custom update chain post-processors may never be
> invoked on a recovering replica"
> <https://lucene.apache.org/solr/guide/7_2/update-request-processors.html>

The name of the update chain that was originally used during indexing is
not stored in the transaction log, so when the transaction log is
replayed, the update chain is not called.

> if all my UpdateProcessors are post-processors (i.e. they come after
> DistributedUpdateProcessor)? Will all the buffered update requests replayed
> during recovery be indexed on the replica without my custom processing?

General advice: In most cases, a post-processor is NOT a good idea.

Changes made to the input document by update processors placed *before*
DistributedUpdateProcessor will be recorded in the transaction log, and
will be identical on all replicas.  Because the transaction log DOES
have the results of the processor, and all replicas are guaranteed to be
the same, this is almost always what you want.

Placing an update processor before DistributedUpdateProcessor ensures
that it is only run once for every document.  If it is placed after
DistributedUpdateProcessor, it will execute once for every replica on
every document.  That can be a big problem if the update processor runs
slowly or consumes a lot of memory/CPU resources.

Because post-processors run independently on every replica, they can
result in different data on each replica. For instance, if you use the
UUID processor after DistributedUpdateProcessor, every replica will end
up with a different UUID for the same document.  Similarly, the
timestamp processor can record a different timestamp on every replica
for the same document, because each replica might do its indexing at a
slightly different time.  Timestamps in a Solr index have millisecond
precision.
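
So, for instance, a UUID or timestamp processor would normally go before
DistributedUpdateProcessorFactory.  Here is a sketch of such a chain in
solrconfig.xml (the chain name, the custom factory class and the field
name are only placeholders):

  <updateRequestProcessorChain name="mychain">
    <!-- runs once per document, on the node that receives the update -->
    <processor class="solr.UUIDUpdateProcessorFactory">
      <str name="fieldName">uuid_field</str>
    </processor>
    <processor class="com.example.MyCustomUpdateProcessorFactory"/>
    <!-- everything from here on runs on every replica -->
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>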

If you actually do intend to have different data in a field on different
replicas, then you might want a post-processor.  But this requirement is
VERY rare.

Thanks,
Shawn


Re: Different docs order in different replicas of the same shard

SOLR4189
You are right, BUT I have two indexers (one in a WCF service and one in
Hadoop), and both of my indexers use atomic updates for every document.
According to the Atomic Update Processor Factory documentation
<https://lucene.apache.org/solr/guide/7_2/update-request-processors.html>
and according to your suggestion (to put all my processors before
DistributedUpdateProcessor), all my processors would run on the partial
documents only, but I need them to run on the full documents. So what can
I do in this situation?
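
For reference, the updates my indexers send look roughly like this (the
collection and field names here are just an example), so a processor
placed before DistributedUpdateProcessor would only ever see the partial
document:

  curl -X POST -H 'Content-Type: application/json' \
    'http://localhost:8983/solr/mycoll/update' \
    --data-binary '[{"id":"doc1","price":{"set":99},"tags":{"add":"solr"}}]'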




Re: Different docs order in different replicas of the same shard

Shawn Heisey-2
On 5/25/2018 11:07 AM, SOLR4189 wrote:
> You are right, BUT I have two indexers (one in a WCF service and one in
> Hadoop), and both of my indexers use atomic updates for every document.
> According to the Atomic Update Processor Factory documentation
> <https://lucene.apache.org/solr/guide/7_2/update-request-processors.html>
> and according to your suggestion (to put all my processors before
> DistributedUpdateProcessor), all my processors would run on the partial
> documents only, but I need them to run on the full documents. So what can
> I do in this situation?

For every best-practice recommendation, there's always at least one
situation where it won't work.

Your situation sounds like one that will require a fix for SOLR-8030.  I
have commented on that issue with some details about your use case.

Thanks,
Shawn


Re: Different docs order in different replicas of the same shard

SOLR4189
I thought about the following solution to my problem: atomic updates + replicas.

I can set my update processor chain in this order:
<MergerUpdateProcessor/>
<CustomUpdateProcessor-1/>
...
<CustomUpdateProcessor-n/>
<DistributedUpdateProcessor/>
<RunUpdateProcessor/>

MergerUpdateProcessor will use the getUpdatedDocument logic of
DistributedUpdateProcessor
<https://github.com/apache/lucene-solr/blob/master/solr/core/src/java/org/apache/solr/update/processor/DistributedUpdateProcessor.java>:

@Override
public void processAdd(AddUpdateCommand cmd) throws IOException {
    // Non-atomic updates pass through untouched.
    if (!AtomicUpdateDocumentMerger.isAtomicUpdate(cmd)) {
        super.processAdd(cmd);
        return;
    }

    AtomicUpdateDocumentMerger docMerger = new AtomicUpdateDocumentMerger(cmd.getReq());

    Set<String> inPlaceUpdatedFields =
        AtomicUpdateDocumentMerger.computeInPlaceUpdatableFields(cmd);
    if (inPlaceUpdatedFields.size() > 0) { // non-empty means this is suitable for in-place updates
        if (docMerger.doInPlaceUpdateMerge(cmd, inPlaceUpdatedFields)) {
            super.processAdd(cmd);
            return;
        }
        // in-place update failed, so fall through and re-try the same with a full atomic update
    }

    // Full (non-in-place) atomic update: merge the partial document with the stored one.
    SolrInputDocument sdoc = cmd.getSolrInputDocument();
    BytesRef id = cmd.getIndexedId();
    SolrInputDocument oldDoc =
        RealTimeGetComponent.getInputDocument(cmd.getReq().getCore(), id);
    if (oldDoc == null) {
        // Create a new doc by default if an old one wasn't found.
        if (cmd.getVersion() <= 0) {
            oldDoc = new SolrInputDocument();
        } else {
            // Could just let the optimistic locking throw the error.
            throw new SolrException(ErrorCode.CONFLICT,
                "Document not found for update. id=" + cmd.getPrintableId());
        }
    } else {
        oldDoc.remove(CommonParams.VERSION_FIELD);
    }
    cmd.solrDoc = docMerger.merge(sdoc, oldDoc);
    super.processAdd(cmd);
}

What do you think about my solution (the code above is adapted from
getUpdatedDocument in DistributedUpdateProcessor)? I checked it in my test
environment and it worked fine. Am I missing something? Any edge cases?




Re: Different docs order in different replicas of the same shard

SOLR4189
In reply to this post by Shawn Heisey-2
I think I found a very simple solution: set my updateProcessorsChain to
default="true". That solves all my problems without moving all the
post-processors to be pre-processors. What do you think about it?
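
What I mean is something like this in solrconfig.xml (just a sketch; the
chain name and the custom factory classes are placeholders for mine),
keeping my processors after DistributedUpdateProcessor and only adding
default="true":

  <updateRequestProcessorChain name="mychain" default="true">
    <processor class="solr.DistributedUpdateProcessorFactory"/>
    <processor class="com.example.CustomUpdateProcessorFactory1"/>
    <processor class="com.example.CustomUpdateProcessorFactoryN"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>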





