CDCR performance issues

CDCR performance issues

Tom Peters
I'm having issues with the target collection staying up-to-date with indexing from the source collection using CDCR.
 
This is what I'm getting back in terms of OPS:

    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=OPS' | jq .
    {
      "responseHeader": {
        "status": 0,
        "QTime": 0
      },
      "operationsPerSecond": [
        "zook01,zook02,zook03/solr",
        [
          "mycollection",
          [
            "all",
            49.10140553500938,
            "adds",
            10.27612635309587,
            "deletes",
            38.82527896994054
          ]
        ]
      ]
    }

The source and target collections are in separate data centers.

A network test between the leader node in the source data center and the ZooKeeper nodes in the target data center shows decent enough network performance: ~181 Mbit/s.
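
(For anyone who wants to reproduce the test, it amounted to something like the following; the hostnames are just examples and iperf3 needs a server running on the far end, so treat this as a sketch rather than the exact invocation:)

    # throughput from the source leader to a node in the target data center
    # (run `iperf3 -s` on the far end first)
    iperf3 -c target-zk01 -t 10

    # round-trip latency between the same two hosts
    ping -c 20 target-zk01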

I've tried playing around with the "batchSize" value (128, 512, 728, 1000, 2000, 2500) and none of them made much of a difference.
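
(If it helps, this is roughly how I'm checking which replicator settings are actually in effect; I'm assuming the Config API exposes the /cdcr handler the same way it does the other request handlers:)

    curl -s 'solr2-a:8080/solr/mycollection/config' | jq '.config.requestHandler["/cdcr"]'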

Any suggestions on potential settings to tune to improve the performance?

Thanks

--

Here are some relevant log lines from the source data center's leader:

    2018-03-07 23:16:11.984 INFO  (cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
    2018-03-07 23:16:23.062 INFO  (cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 510 updates to target mycollection
    2018-03-07 23:16:32.063 INFO  (cdcr-replicator-207-thread-5-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
    2018-03-07 23:16:36.209 INFO  (cdcr-replicator-207-thread-1-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
    2018-03-07 23:16:42.091 INFO  (cdcr-replicator-207-thread-2-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection
    2018-03-07 23:16:46.790 INFO  (cdcr-replicator-207-thread-3-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 511 updates to target mycollection
    2018-03-07 23:16:50.004 INFO  (cdcr-replicator-207-thread-4-processing-n:solr2-a:8080_solr x:mycollection_shard1_replica_n6 s:shard1 c:mycollection r:core_node9) [c:mycollection s:shard1 r:core_node9 x:mycollection_shard1_replica_n6] o.a.s.h.CdcrReplicator Forwarded 512 updates to target mycollection


And what the log looks like in the target:

    2018-03-07 23:18:46.475 INFO  (qtp1595212853-26) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1]  webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067896487950&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
    2018-03-07 23:18:46.500 INFO  (qtp1595212853-25) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1]  webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067896487951&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
    2018-03-07 23:18:46.525 INFO  (qtp1595212853-24) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1]  webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536512&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
    2018-03-07 23:18:46.550 INFO  (qtp1595212853-3793) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1]  webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536513&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
    2018-03-07 23:18:46.575 INFO  (qtp1595212853-30) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1]  webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536514&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
    2018-03-07 23:18:46.600 INFO  (qtp1595212853-26) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1]  webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536515&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
    2018-03-07 23:18:46.625 INFO  (qtp1595212853-25) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1]  webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536516&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
    2018-03-07 23:18:46.651 INFO  (qtp1595212853-24) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1]  webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536517&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
    2018-03-07 23:18:46.676 INFO  (qtp1595212853-3793) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1]  webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536518&cdcr.update=&wt=javabin&version=2} status=0 QTime=0
    2018-03-07 23:18:46.701 INFO  (qtp1595212853-30) [c:mycollection s:shard1 r:core_node2 x:mycollection_shard1_replica_n1] o.a.s.c.S.Request [mycollection_shard1_replica_n1]  webapp=/solr path=/update params={_stateVer_=mycollection:30&_version_=-1594317067897536519&cdcr.update=&wt=javabin&version=2} status=0 QTime=0



Re: CDCR performance issues

Tom Peters
So I'm continuing to look into this and not making much headway, but I have additional questions now as well.

I restarted the nodes in the source data center to see if it would have any impact. It appeared to initiate another bootstrap with the target. The lag and queueSize were brought back down to zero.

Over the next two hours the queueSize has grown back to 106,122 (as reported by solr/mycollection/cdcr?action=QUEUES). When I actually look at what we sent to Solr, though, I only deleted or added a total of 3,805 documents. Could this be part of the problem? Should queueSize represent the total number of document updates, or are there other updates under the hood that I wouldn't see but that Solr would still need to track?
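
For reference, this is how I'm watching the queue, and the ERRORS action seems worth checking alongside it:

    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=QUEUES' | jq .
    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=ERRORS' | jq .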

I'd also welcome any other suggestions on my original issue, which is that CDCR cannot keep up despite the relatively low number of updates (3,805 over two hours).

Thanks.

> On Mar 7, 2018, at 6:19 PM, Tom Peters <[hidden email]> wrote:
>
> I'm having issues with the target collection staying up-to-date with indexing from the source collection using CDCR.



RE: CDCR performance issues

Davis, Daniel (NIH/NLM) [C]
In reply to this post by Tom Peters
These are general guidelines: I've done loads of networking, but I'm less familiar with SolrCloud and CDCR architecture. However, I know it's all TCP sockets, so general guidelines do apply.

Check the round-trip time between the data centers using ping or TCP ping. Throughput tests may look fine, but if Solr has to wait for a response to each request before sending the next one, then, like any network protocol that works that way, it will be slow.
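
Something along these lines will show both the raw round-trip time and what a single HTTP request to the target actually costs end to end (the hostname is an example and this is an untested sketch):

    # ICMP round-trip time
    ping -c 20 target-solr01

    # per-request HTTP timing against a Solr node in the target data center
    curl -o /dev/null -s -w 'connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n' \
        'http://target-solr01:8080/solr/admin/info/system?wt=json'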

I'm pretty sure CDCR uses HTTP/HTTPS rather than raw TCP, so also check whether a proxy or load balancer between the data centers is forcing a single connection per operation. That will *kill* performance. Some proxies default to HTTP/1.0 (open, send request, server sends response, close), and that will hurt.
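
A quick spot-check for that (again, the hostname is an example) is to look at the negotiated HTTP version and connection handling in curl's verbose output:

    # verbose output (stderr) shows the HTTP version in use and whether the
    # response carries "Connection: close", which would force a new connection per request
    curl -sv -o /dev/null 'http://target-solr01:8080/solr/mycollection/select?q=*:*' 2>&1 \
        | grep -iE 'HTTP/1|Connection:'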

Why you should listen to me even without SolrCloud knowledge: check out the paper "Latency performance of SOAP Implementations". Same distribution of skills: I knew TCP well but Apache Axis 1.1 not so well, and I still improved Apache Axis 1.1's response time by 250ms per call with one line of code.

-----Original Message-----
From: Tom Peters [mailto:[hidden email]]
Sent: Wednesday, March 7, 2018 6:19 PM
To: [hidden email]
Subject: CDCR performance issues

I'm having issues with the target collection staying up-to-date with indexing from the source collection using CDCR.
 



Re: CDCR performance issues

Tom Peters
Thanks. This was helpful. I did some tcpdumps and noticed that the requests to the target data center are not batched in any way; each update comes in as an independent update. Some follow-up questions:

1. Is it accurate that updates are not actually batched in transit from the source to the target and instead each document is posted separately?

2. Are they done synchronously? I assume yes, since you wouldn't want operations applied out of order.

3. If they are done synchronously and are not batched in any way, does that mean the best performance I can expect is roughly bounded by the time it takes to round-trip a single document? I.e., if my average ping is 25ms, then peak performance would be roughly 40 ops/s.
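
To sanity-check that math, I'm planning to time a run of strictly sequential requests against the target and compare the rate with 1/RTT (the hostname is an example; this is a rough sketch, not a proper benchmark):

    # 50 sequential requests; ops/s is roughly 50 / elapsed seconds
    time for i in $(seq 1 50); do
        curl -s -o /dev/null 'http://target-solr01:8080/solr/mycollection/admin/ping'
    done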

Thanks



> On Mar 9, 2018, at 11:21 AM, Davis, Daniel (NIH/NLM) [C] <[hidden email]> wrote:
>
> These are general guidelines, I've done loads of networking, but may be less familiar with SolrCloud  and CDCR architecture.  However, I know it's all TCP sockets, so general guidelines do apply.



Re: CDCR performance issues

spoonerk
Please unsubscribe. I tried to manually unsubscribe.


On 3/9/2018 12:59 PM, Tom Peters wrote:

> Thanks. This was helpful. I did some tcpdumps and I'm noticing that the requests to the target data center are not batched in any way. Each update comes in as an independent update. Some follow-up questions:

Re: CDCR performance issues

Erick Erickson
John:

_What_ did you try and how did it fail?

Please follow the instructions here:
http://lucene.apache.org/solr/community.html#mailing-lists-irc

You must use the _exact_ same e-mail as you used to subscribe.


If the initial try doesn't work and following the suggestions at the
"problems" link doesn't work for you, let us know. But note you need
to show us the _entire_ return header to allow anyone to diagnose the
problem.


Best,

Erick

On Fri, Mar 9, 2018 at 1:00 PM, john spooner <[hidden email]> wrote:

> Please unsubscribe. I tried to manually unsubscribe.
Re: CDCR performance issues

Tom Peters
In reply to this post by Tom Peters
Anyone have any thoughts on the questions I raised?

I have another question related to CDCR:
Sometimes we have to reindex a large chunk of our index (1M+ documents). What's the best way to handle this if the normal CDCR process won't be able to keep up? Manually trigger a bootstrap again? Or is there something else we can do?
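
In case it helps frame the question, the approach I'm considering (untested, and based only on the bootstrap I saw after restarting the source nodes) would be roughly:

    # stop CDCR on the source before the bulk reindex
    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=STOP' | jq .

    # ...run the reindex against the source collection...

    # start CDCR again afterwards; in my case restarting it appeared to trigger
    # a fresh bootstrap of the target, which copies the index wholesale instead
    # of replaying each update
    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=START' | jq .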

Thanks.

Re: CDCR performance issues

Tom Peters
I'm also having issues with replicas in the target data center. They will go from recovering to down. And when one of my replicas goes down in the target data center, CDCR will no longer send updates from the source to the target.


Re: CDCR performance issues

Amrit Sarkar
Hey Tom,

> I'm also having issues with replicas in the target data center. They will go
> from recovering to down. And when one of my replicas goes down in the
> target data center, CDCR will no longer send updates from the source to
> the target.


Were you able to figure out the issue? As long as the leader of each shard
in each collection is up and serving, CDCR shouldn't stop.
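
To sanity-check both conditions, it can help to look at the cluster state of the target collection and at the CDCR error counters on the source leader. A minimal sketch, reusing the solr2-a:8080 source leader from earlier in the thread and using target-solr:8080 as a placeholder for a node in the target data center:

    # confirm the target collection's shard leaders are live and active
    curl -s 'target-solr:8080/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection' | jq .

    # on the source leader, check whether the CDCR replicator is accumulating errors
    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=ERRORS' | jq .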

> Sometimes we have to reindex a large chunk of our index (1M+ documents).
> What's the best way to handle this if the normal CDCR process won't be
> able to keep up? Manually trigger a bootstrap again? Or is there something
> else we can do?

That's one of the limitations of CDCR: it cannot handle bulk indexing. The
preferable way to do it is (a sketch of the corresponding curl calls follows this list):
* stop CDCR
* bulk index
* issue a manual BOOTSTRAP (it is independent of stopping and starting CDCR)
* start CDCR
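
A rough sketch of that sequence with the CDCR API, reusing the solr2-a:8080 source leader from earlier in the thread; target-solr:8080 and bulk.json are placeholders, and the core names are taken from the log lines earlier in the thread. The BOOTSTRAP action with a masterUrl parameter mirrors what the source replicator sends internally, so treat it as an assumption and verify it against your Solr version:

    # 1. stop CDCR on the source collection
    curl 'solr2-a:8080/solr/mycollection/cdcr?action=STOP'

    # 2. bulk index directly against the source (bulk.json is a placeholder payload)
    curl 'solr2-a:8080/solr/mycollection/update?commit=true' \
         -H 'Content-Type: application/json' --data-binary @bulk.json

    # 3. ask the target leader core to do a full copy from the source leader core
    #    (BOOTSTRAP and masterUrl are assumptions based on the internal replicator call)
    curl 'target-solr:8080/solr/mycollection_shard1_replica_n1/cdcr?action=BOOTSTRAP&masterUrl=http://solr2-a:8080/solr/mycollection_shard1_replica_n6'

    # 4. start CDCR again on the source
    curl 'solr2-a:8080/solr/mycollection/cdcr?action=START'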

> 1. Is it accurate that updates are not actually batched in transit from the
> source to the target and instead each document is posted separately?


The batchSize and schedule settings regulate how many docs are sent across to
the target in each batch. This page has more details:
https://lucene.apache.org/solr/guide/7_2/cdcr-config.html#the-replicator-element
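
One way to tell whether a given batchSize/schedule combination is keeping up is to watch the source-side queue for the target alongside the throughput numbers. A small sketch, again against the solr2-a:8080 source leader from earlier in the thread:

    # per-target queue size; if it keeps growing, the replicator is falling behind
    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=QUEUES' | jq .

    # adds/deletes forwarded per second (the same call used earlier in the thread)
    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=OPS' | jq .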




Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2


Re: CDCR performance issues

Susheel Kumar-3
Just a simple check: if you go to the source Solr and index a single document
from the Documents tab, then keep querying the target Solr for the same document,
how long does it take the document to appear in the target data center? In our
case, I can see the document show up in the target within 30 seconds, which is
our soft commit time.
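
A scripted version of the same check, assuming the solr2-a:8080 source leader from earlier in the thread, target-solr:8080 as a placeholder for a target node, and a throwaway document id (adjust the fields to your schema):

    # index one test document on the source
    curl 'solr2-a:8080/solr/mycollection/update?commit=true' \
         -H 'Content-Type: application/json' \
         --data-binary '[{"id":"cdcr-latency-test-1"}]'

    # poll the target until numFound goes from 0 to 1
    while true; do
      curl -s 'target-solr:8080/solr/mycollection/select?q=id:cdcr-latency-test-1&rows=0' | jq .response.numFound
      sleep 5
    done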

Thanks,
Susheel


Re: CDCR performance issues

Amrit Sarkar
Susheel,

That is the correct behavior: the "commit" operation is not propagated to the
target, and the documents will become visible in the target as per the commit
strategy devised there.
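
In other words, visibility on the target is governed by the target's own autoCommit/autoSoftCommit settings. If you want to confirm the forwarded documents have arrived without waiting for that, you can issue an explicit commit against the target collection (target-solr:8080 is a placeholder):

    # hard commit: makes forwarded documents durable and searchable immediately
    curl 'target-solr:8080/solr/mycollection/update?commit=true'

    # or just open a new searcher with a soft commit
    curl 'target-solr:8080/solr/mycollection/update?softCommit=true'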

Amrit Sarkar
Search Engineer
Lucidworks, Inc.
415-589-9269
www.lucidworks.com
Twitter http://twitter.com/lucidworks
LinkedIn: https://www.linkedin.com/in/sarkaramrit2
Medium: https://medium.com/@sarkaramrit2

On Fri, Mar 23, 2018 at 6:02 PM, Susheel Kumar <[hidden email]>
wrote:

> Just a simple check, if you go to source solr and index single document
> from Documents tab, then keep querying target solr for the same document.
> How long does it take the document to appear in target data center.  In our
> case, I can see document show up in target within 30 sec which is our soft
> commit time.
>
> Thanks,
> Susheel
>
> On Fri, Mar 23, 2018 at 8:16 AM, Amrit Sarkar <[hidden email]>
> wrote:
>
> > Hey Tom,
> >
> > I'm also having issue with replicas in the target data center. It will go
> > > from recovering to down. And when one of my replicas go to down in the
> > > target data center, CDCR will no longer send updates from the source to
> > > the target.
> >
> >
> > Are you able to figure out the issue? As long as the leaders of each
> shard
> > in each collection is up and serving, CDCR shouldn't stop.
> >
> > Sometimes we have to reindex a large chunk of our index (1M+ documents).
> > > What's the best way to handle this if the normal CDCR process won't be
> > > able to keep up? Manually trigger a bootstrap again? Or is there
> > something
> > > else we can do?
> > >
> >
> > That's one of the limitations of CDCR, it cannot handle bulk indexing,
> > preferable way to do is
> > * stop cdcr
> > * bulk index
> > * issue manual BOOTSTRAP (it is independent of stop and start cdcr)
> > * start cdcr
> >
> > 1. Is it accurate that updates are not actually batched in transit from
> the
> > > source to the target and instead each document is posted separately?
> >
> >
> > The batchsize and schedule regulate how many docs are sent across target.
> > This has more details:
> > https://lucene.apache.org/solr/guide/7_2/cdcr-config.
> > html#the-replicator-element
> >
> >
> >
> >
> > Amrit Sarkar
> > Search Engineer
> > Lucidworks, Inc.
> > 415-589-9269
> > www.lucidworks.com
> > Twitter http://twitter.com/lucidworks
> > LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> > Medium: https://medium.com/@sarkaramrit2
> >
> > On Tue, Mar 13, 2018 at 12:21 AM, Tom Peters <[hidden email]>
> wrote:
> >
> > > I'm also having issue with replicas in the target data center. It will
> go
> > > from recovering to down. And when one of my replicas go to down in the
> > > target data center, CDCR will no longer send updates from the source to
> > the
> > > target.
> > >
> > > > On Mar 12, 2018, at 9:24 AM, Tom Peters <[hidden email]> wrote:
> > > >
> > > > Anyone have any thoughts on the questions I raised?
> > > >
> > > > I have another question related to CDCR:
> > > > Sometimes we have to reindex a large chunk of our index (1M+
> > documents).
> > > What's the best way to handle this if the normal CDCR process won't be
> > able
> > > to keep up? Manually trigger a bootstrap again? Or is there something
> > else
> > > we can do?
> > > >
> > > > Thanks.
> > > >
> > > >
> > > >
> > > >> On Mar 9, 2018, at 3:59 PM, Tom Peters <[hidden email]> wrote:
> > > >>
> > > >> Thanks. This was helpful. I did some tcpdumps and I'm noticing that
> > the
> > > requests to the target data center are not batched in any way. Each
> > update
> > > comes in as an independent update. Some follow-up questions:
> > > >>
> > > >> 1. Is it accurate that updates are not actually batched in transit
> > from
> > > the source to the target and instead each document is posted
> separately?
> > > >>
> > > >> 2. Are they done synchronously? I assume yes (since you wouldn't
> want
> > > operations applied out of order)
> > > >>
> > > >> 3. If they are done synchronously, and are not batched in any way,
> > does
> > > that mean that the best performance I can expect would be roughly how
> > long
> > > it takes to round-trip a single document? ie. If my average ping is
> 25ms,
> > > then I can expect a peak performance of roughly 40 ops/s.
> > > >>
> > > >> Thanks
> > > >>
> > > >>
> > > >>
> > > >>> On Mar 9, 2018, at 11:21 AM, Davis, Daniel (NIH/NLM) [C] <
> > > [hidden email]> wrote:
> > > >>>
> > > >>> These are general guidelines, I've done loads of networking, but
> may
> > > be less familiar with SolrCloud  and CDCR architecture.  However, I
> know
> > > it's all TCP sockets, so general guidelines do apply.
> > > >>>
> > > >>> Check the round-trip time between the data centers using ping or
> TCP
> > > ping.   Throughput tests may be high, but if Solr has to wait for a
> > > response to a request before sending the next action, then just like
> any
> > > network protocol that does that, it will get slow.
> > > >>>
> > > >>> I'm pretty sure CDCR uses HTTP/HTTPS rather than just TCP, so also
> > > check whether some proxy/load balancer between data centers is causing
> it
> > > to be a single connection per operation.   That will *kill*
> performance.
> > >  Some proxies default to HTTP/1.0 (open, send request, server send
> > > response, close), and that will hurt.
> > > >>>
> > > >>> Why you should listen to me even without SolrCloud knowledge -
> > > checkout paper "Latency performance of SOAP Implementations".   Same
> > > distribution of skills - I knew TCP well, but Apache Axis 1.1 not so
> > well.
> > >  I still improved response time of Apache Axis 1.1 by 250ms per call
> with
> > > 1-line of code.
> > > >>>
> > > >>> -----Original Message-----
> > > >>> From: Tom Peters [mailto:[hidden email]]
> > > >>> Sent: Wednesday, March 7, 2018 6:19 PM
> > > >>> To: [hidden email]
> > > >>> Subject: CDCR performance issues
> > > >>>
> > > >>> I'm having issues with the target collection staying up-to-date
> with
> > > indexing from the source collection using CDCR.
> > > >>>
> > > >>> This is what I'm getting back in terms of OPS:
> > > >>>
> > > >>>  curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=OPS' | jq .
> > > >>>  {
> > > >>>    "responseHeader": {
> > > >>>      "status": 0,
> > > >>>      "QTime": 0
> > > >>>    },
> > > >>>    "operationsPerSecond": [
> > > >>>      "zook01,zook02,zook03/solr",
> > > >>>      [
> > > >>>        "mycollection",
> > > >>>        [
> > > >>>          "all",
> > > >>>          49.10140553500938,
> > > >>>          "adds",
> > > >>>          10.27612635309587,
> > > >>>          "deletes",
> > > >>>          38.82527896994054
> > > >>>        ]
> > > >>>      ]
> > > >>>    ]
> > > >>>  }
> > > >>>

Re: CDCR performance issues

Susheel Kumar-3
Yea, Amrit. To clarify: we have a 30 sec soft commit on the target data center,
and for the test, when we use the Documents tab, the default Commit Within=1000
ms makes the commit happen quickly on the source; then we just wait for the
document to appear on the target data center per its commit strategy.
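
If it helps, the same check can be scripted instead of clicking through the
Documents tab. A rough sketch (the host names, collection name, and the id
field below are just placeholders for your own setup):

    # Index one document on the source with a 1s commitWithin
    # (same as the Documents tab default)
    curl -s 'http://solr-source:8080/solr/mycollection/update?commitWithin=1000' \
      -H 'Content-Type: application/json' \
      -d '[{"id":"cdcr-latency-test-1"}]'

    # Poll the target until the document shows up; with our settings the wait
    # is bounded roughly by the target's 30 sec soft commit
    while true; do
      found=$(curl -s 'http://solr-target:8080/solr/mycollection/select?q=id:cdcr-latency-test-1&rows=0' | jq -r '.response.numFound // 0')
      [ "$found" -gt 0 ] && { date; break; }   # print the time the doc became visible
      sleep 1
    done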

On Fri, Mar 23, 2018 at 8:47 AM, Amrit Sarkar <[hidden email]>
wrote:

> Susheel,
>
> That is the correct behavior, "commit" operation is not propagated to
> target and the documents will be visible in the target as per commit
> strategy devised there.
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Fri, Mar 23, 2018 at 6:02 PM, Susheel Kumar <[hidden email]>
> wrote:
>
> > Just a simple check, if you go to source solr and index single document
> > from Documents tab, then keep querying target solr for the same document.
> > How long does it take the document to appear in target data center.  In our
> > case, I can see document show up in target within 30 sec which is our soft
> > commit time.
> >
> > Thanks,
> > Susheel

Re: CDCR performance issues

Tom Peters
In reply to this post by Amrit Sarkar
Thanks for responding. My responses are inline.

> On Mar 23, 2018, at 8:16 AM, Amrit Sarkar <[hidden email]> wrote:
>
> Hey Tom,
>
> I'm also having issue with replicas in the target data center. It will go
>> from recovering to down. And when one of my replicas go to down in the
>> target data center, CDCR will no longer send updates from the source to
>> the target.
>
>
> Are you able to figure out the issue? As long as the leaders of each shard
> in each collection is up and serving, CDCR shouldn't stop.

I cannot replicate the issue I was having. In a test environment, I'm able to knock one of the replicas into recovery mode and can verify that CDCR updates are still being sent.
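
For what it's worth, the way I'm checking that is by watching the CDCR queue
and error counters on the source leader while the replica is in recovery,
along these lines (and assuming the QUEUES/ERRORS actions behave as
documented):

    # Per-target queue sizes; if updates stopped being forwarded,
    # queueSize would keep growing here
    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=QUEUES' | jq .

    # Errors the replicator threads have hit while forwarding updates
    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=ERRORS' | jq .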

>
> Sometimes we have to reindex a large chunk of our index (1M+ documents).
>> What's the best way to handle this if the normal CDCR process won't be
>> able to keep up? Manually trigger a bootstrap again? Or is there something
>> else we can do?
>>
>
> That's one of the limitations of CDCR, it cannot handle bulk indexing,
> preferable way to do is
> * stop cdcr
> * bulk index
> * issue manual BOOTSTRAP (it is independent of stop and start cdcr)
> * start cdcr

I plan on testing this, but if I issue a bootstrap, will I run into the https://issues.apache.org/jira/browse/SOLR-11724 bug where the bootstrap doesn't replicate to the replicas?
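
For reference, the sequence I'm planning to test looks roughly like this. The
target host name is a placeholder, the core names are just the ones from my
logs, and I'm assuming the BOOTSTRAP/BOOTSTRAP_STATUS actions and the
masterUrl parameter work the way the ref guide and SOLR-11724 describe them:

    # 1. Stop CDCR on the source so the bulk load doesn't pile up in the tlog queues
    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=STOP'

    # 2. Run the bulk reindex against the source as usual

    # 3. Trigger a manual bootstrap on the target leader core, pointed at the
    #    source leader core, then poll BOOTSTRAP_STATUS until it reports completed
    curl -s 'http://target-host:8080/solr/mycollection_shard1_replica_n1/cdcr?action=BOOTSTRAP&masterUrl=http://solr2-a:8080/solr/mycollection_shard1_replica_n6'
    curl -s 'http://target-host:8080/solr/mycollection_shard1_replica_n1/cdcr?action=BOOTSTRAP_STATUS'

    # 4. Start CDCR again on the source
    curl -s 'solr2-a:8080/solr/mycollection/cdcr?action=START'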

> 1. Is it accurate that updates are not actually batched in transit from the
>> source to the target and instead each document is posted separately?
>
>
> The batchsize and schedule regulate how many docs are sent across target.
> This has more details:
> https://lucene.apache.org/solr/guide/7_2/cdcr-config.html#the-replicator-element
>

As far as I can tell, I'm not seeing batching. I'm using tcpdump (and a script to decompile the JavaBin bytes) to monitor what is actually being sent and I'm seeing documents arrive one-at-a-time.

POST /solr/synacor/update?cdcr.update=&_stateVer_=synacor%3A199&wt=javabin&version=2 HTTP/1.1
User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
Content-Length: 114
Content-Type: application/javabin
Host: solr02-a.svcs.opal.synacor.com:8080
Connection: Keep-Alive

{params={cdcr.update=,_stateVer_=synacor:199},delByQ=null,docsMap=[MapEntry[SolrInputDocument(fields: [solr_id=Mytest, _version_=1595749902502068224]):null]]}
----------
POST /solr/synacor/update?cdcr.update=&_stateVer_=synacor%3A199&wt=javabin&version=2 HTTP/1.1
User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
Content-Length: 114
Content-Type: application/javabin
Host: solr02-a.svcs.opal.synacor.com:8080
Connection: Keep-Alive

{params={cdcr.update=,_stateVer_=synacor:199},delByQ=null,docsMap=[MapEntry[SolrInputDocument(fields: [solr_id=Mytest, _version_=1595749902600634368]):null]]}
----------
POST /solr/synacor/update?cdcr.update=&_stateVer_=synacor%3A199&wt=javabin&version=2 HTTP/1.1
User-Agent: Solr[org.apache.solr.client.solrj.impl.HttpSolrClient] 1.0
Content-Length: 114
Content-Type: application/javabin
Host: solr02-a.svcs.opal.synacor.com:8080
Connection: Keep-Alive

{params={cdcr.update=,_stateVer_=synacor:199},delByQ=null,docsMap=[MapEntry[SolrInputDocument(fields: [solr_id=Mytest, _version_=1595749902698151936]):null]]}
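
In case anyone wants to reproduce the capture, it's nothing fancy; roughly
this on the target leader (the interface name is specific to my box, 8080 is
our Solr port, and the JavaBin decoding is a separate one-off script):

    # Print inbound Solr traffic with the payload as ASCII, so the HTTP
    # request line and headers of each /update POST are visible
    tcpdump -i eth0 -nn -A -s 0 'tcp port 8080'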

>
>
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
> Medium: https://medium.com/@sarkaramrit2
>
> On Tue, Mar 13, 2018 at 12:21 AM, Tom Peters <[hidden email]> wrote:
>
>> I'm also having issue with replicas in the target data center. It will go
>> from recovering to down. And when one of my replicas go to down in the
>> target data center, CDCR will no longer send updates from the source to the
>> target.
>>
>>> On Mar 12, 2018, at 9:24 AM, Tom Peters <[hidden email]> wrote:
>>>
>>> Anyone have any thoughts on the questions I raised?
>>>
>>> I have another question related to CDCR:
>>> Sometimes we have to reindex a large chunk of our index (1M+ documents).
>> What's the best way to handle this if the normal CDCR process won't be able
>> to keep up? Manually trigger a bootstrap again? Or is there something else
>> we can do?
>>>
>>> Thanks.
>>>
>>>
>>>
>>>> On Mar 9, 2018, at 3:59 PM, Tom Peters <[hidden email]> wrote:
>>>>
>>>> Thanks. This was helpful. I did some tcpdumps and I'm noticing that the
>> requests to the target data center are not batched in any way. Each update
>> comes in as an independent update. Some follow-up questions:
>>>>
>>>> 1. Is it accurate that updates are not actually batched in transit from
>> the source to the target and instead each document is posted separately?
>>>>
>>>> 2. Are they done synchronously? I assume yes (since you wouldn't want
>> operations applied out of order)
>>>>
>>>> 3. If they are done synchronously, and are not batched in any way, does
>> that mean that the best performance I can expect would be roughly how long
>> it takes to round-trip a single document? ie. If my average ping is 25ms,
>> then I can expect a peak performance of roughly 40 ops/s.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>> On Mar 9, 2018, at 11:21 AM, Davis, Daniel (NIH/NLM) [C] <
>> [hidden email]> wrote:
>>>>>
>>>>> These are general guidelines, I've done loads of networking, but may
>> be less familiar with SolrCloud  and CDCR architecture.  However, I know
>> it's all TCP sockets, so general guidelines do apply.
>>>>>
>>>>> Check the round-trip time between the data centers using ping or TCP
>> ping.   Throughput tests may be high, but if Solr has to wait for a
>> response to a request before sending the next action, then just like any
>> network protocol that does that, it will get slow.
>>>>>
>>>>> I'm pretty sure CDCR uses HTTP/HTTPS rather than just TCP, so also
>> check whether some proxy/load balancer between data centers is causing it
>> to be a single connection per operation.   That will *kill* performance.
>> Some proxies default to HTTP/1.0 (open, send request, server send
>> response, close), and that will hurt.
>>>>>
>>>>> Why you should listen to me even without SolrCloud knowledge -
>> checkout paper "Latency performance of SOAP Implementations".   Same
>> distribution of skills - I knew TCP well, but Apache Axis 1.1 not so well.
>> I still improved response time of Apache Axis 1.1 by 250ms per call with
>> 1-line of code.