CDCR sensitive to network failures

WebsterHomer
Recently I ran into some problems with CDCR after we experienced network
problems, and I thought I'd share them.

I'm using Solr 7.2.0.
We have three SolrCloud clusters: we update one of them and use CDCR to
forward the updates to the other two, which are hosted in a cloud.

Usually this works pretty well.
Recently we have had some serious but intermittent network issues.
When that happens we get tons of CDCR warnings:

CdcrReplicator  Failed to forward update request to target: bioreliance-catalog-assay

with errors such as ClassCastException and/or NullPointerException.

Updates accumulate on the server, and cdcr?action=errors reports tons of
errors:
"2018-05-18T16:11:19.860Z","internal","2018-05-18T16:11:18.860Z","internal",
"2018-05-18T16:11:17.860Z","internal",
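
To make a response fragment like the one above easier to read, here is a minimal Python sketch that counts the entries by error type. It assumes the list is a flat sequence of alternating timestamp and error-type strings, which is how the fragment in this post looks; the function name summarize_cdcr_errors is hypothetical and not part of Solr.

```python
from collections import Counter


def summarize_cdcr_errors(flat_entries):
    """Summarize a flat [timestamp, type, timestamp, type, ...] list.

    Assumption: entries alternate between an ISO-8601 UTC timestamp and an
    error type, as in the fragment quoted in this post.
    Returns (Counter of error types, newest timestamp or None).
    """
    if len(flat_entries) % 2 != 0:
        raise ValueError("expected an even number of entries (timestamp/type pairs)")
    # Pair every even-indexed item (timestamp) with the following item (type).
    pairs = list(zip(flat_entries[0::2], flat_entries[1::2]))
    counts = Counter(error_type for _, error_type in pairs)
    # Timestamps in one fixed ISO-8601 format sort correctly as plain strings.
    newest = max((timestamp for timestamp, _ in pairs), default=None)
    return counts, newest


if __name__ == "__main__":
    sample = [
        "2018-05-18T16:11:19.860Z", "internal",
        "2018-05-18T16:11:18.860Z", "internal",
        "2018-05-18T16:11:17.860Z", "internal",
    ]
    counts, newest = summarize_cdcr_errors(sample)
    print(dict(counts), newest)
```

A steadily growing count with a recent newest timestamp would suggest that replication is still stuck rather than recovering on its own.
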
When I looked around on the source collection, I found tlog files like this:
-rw-r--r-- 1 apache apache 1376736 May 10 23:04 tlog.0000000000000000141.1600138985674375168
-rw-r--r-- 1 apache apache       0 May 11 23:05 tlog.0000000000000000143.1600229645842644992
-rw-r--r-- 1 apache apache   65458 May 12 07:50 tlog.0000000000000000142.1600229582225539072
-rw-r--r-- 1 apache apache 1355610 May 18 10:05 tlog.0000000000000000144.1600814785270644736
-rw-r--r-- 1 apache apache 1355610 May 18 10:16 tlog.0000000000000000145.1600815458585411584
-rw-r--r-- 1 apache apache 1355610 May 18 10:21 tlog.0000000000000000146.1600815785277652992
-rw-r--r-- 1 apache apache 1355610 May 18 10:29 tlog.0000000000000000147.1600816282070941696

Note the zero-length file (tlog.0000000000000000143.1600229645842644992)
and the truncated file tlog.0000000000000000142.1600229582225539072.
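
Spotting such files by eye is tedious, so here is a rough Python sketch that scans a tlog directory for candidates. It is only a heuristic under stated assumptions: a file counts as "truncated" when it is smaller than a fraction (min_fraction, an arbitrary default of 0.1) of the median tlog size in the directory. That threshold is my own guess, not something Solr defines, and a small file is not necessarily corrupt. The function name and parameters are hypothetical.

```python
import re
import statistics
from pathlib import Path

# Matches names such as tlog.0000000000000000142.1600229582225539072
TLOG_NAME = re.compile(r"^tlog\.\d+\.\d+$")


def find_suspect_tlogs(tlog_dir, min_fraction=0.1):
    """Return (path, size, reason) for tlogs that look empty or truncated.

    Heuristic only: "truncated" here means smaller than min_fraction times
    the median size of all tlog files in the directory (an assumption).
    """
    files = [
        (entry, entry.stat().st_size)
        for entry in Path(tlog_dir).iterdir()
        if entry.is_file() and TLOG_NAME.match(entry.name)
    ]
    if not files:
        return []
    median_size = statistics.median(size for _, size in files)
    suspects = []
    # Sort by name so results follow the tlog sequence numbers.
    for path, size in sorted(files, key=lambda item: item[0].name):
        if size == 0:
            suspects.append((path, size, "zero length"))
        elif size < min_fraction * median_size:
            suspects.append((path, size, "much smaller than median"))
    return suspects
```

With the sizes from the listing in this post, the median is 1355610 bytes, so the 0-byte file and the 65458-byte file would both be flagged while the others pass.
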

The solution is to delete these files. Once they are removed, the
updates start flowing again.
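
For anyone nervous about deleting tlogs outright, a more cautious variant of the fix above is to move the bad files into a separate directory so they can be inspected or restored later. This is not what I actually did (I deleted them); it is a small Python sketch, with a hypothetical function name and a quarantine directory of your choosing. Whether Solr should be stopped first is something I have not verified, so treat that as an open question.

```python
import shutil
from pathlib import Path


def quarantine_tlogs(paths, quarantine_dir):
    """Move the given tlog files into quarantine_dir instead of deleting them.

    Returns the list of new locations. Refuses to overwrite a file that is
    already present in the quarantine directory.
    """
    dest = Path(quarantine_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in map(Path, paths):
        target = dest / path.name
        if target.exists():
            raise FileExistsError(f"{target} already exists in quarantine")
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

The paths passed in would be the empty or truncated tlog files identified in the source collection's tlog directory.
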

These failures show up as warnings in the log; I would have expected them
to be errors. CDCR doesn't seem to be able to detect that a tlog is
corrupted.

Hope this helps someone else. If there are better solutions, I'd like to
know.
