CDCR tlog corruption leads to infinite loop

CDCR tlog corruption leads to infinite loop

Webster Homer
We are running Solr 7.2.0

Our configuration has several collections loaded into a SolrCloud cluster that replicates via CDCR to 3 different SolrCloud clusters. All of our target collections have 2 shards with 2 replicas per shard. Our source collection has 2 shards and 1 replica per shard.

Frequently we start to see that the target collections are out of date, and the cdcr action=ERRORS endpoint shows large numbers of errors. For example:
{"responseHeader": {
"status": 0,
"QTime": 0},
"errors": [
"uc1f-ecom-mzk01:2181,uc1f-ecom-mzk02:2181,uc1f-ecom-mzk03:2181/solr",
["sial-catalog-product-20190824",
[
"consecutiveErrors",
700357,
"bad_request",
0,
"internal",
700357,
"last",
[
"2019-09-09T19:17:57.453Z",
"internal",
"2019-09-09T19:17:56.949Z",
"internal",
"2019-09-09T19:17:56.448Z"
,"internal",...

We have found that one or more tlogs have become corrupt. It appears that CDCR keeps trying to send data, cannot read the data from the tlog, and then retries, forever.
How does this happen? It happens very frequently, on a weekly basis, and is difficult to troubleshoot.
Today it happened with one of our collections. Here is the listing of the tlog files:

$ ls -alht
total 604M
drwxr-xr-x 2 apache apache  44K Sep  9 14:27 .
-rw-r--r-- 1 apache apache 6.7M Sep  6 19:44 tlog.0000000000000000766.1643975309914013696
-rw-r--r-- 1 apache apache  35M Sep  6 19:43 tlog.0000000000000000765.1643975245907886080
-rw-r--r-- 1 apache apache  30M Sep  6 19:42 tlog.0000000000000000764.1643975182924120064
-rw-r--r-- 1 apache apache  37M Sep  6 19:41 tlog.0000000000000000763.1643975118316109824
-rw-r--r-- 1 apache apache  19M Sep  6 19:40 tlog.0000000000000000762.1643975053918863360
-rw-r--r-- 1 apache apache  21M Sep  6 19:39 tlog.0000000000000000761.1643974989726089216
-rw-r--r-- 1 apache apache  21M Sep  6 19:38 tlog.0000000000000000760.1643974926010417152
-rw-r--r-- 1 apache apache  29M Sep  6 19:37 tlog.0000000000000000759.1643974862567374848
-rw-r--r-- 1 apache apache 6.2M Sep  6 19:10 tlog.0000000000000000758.1643973174027616256
-rw-r--r-- 1 apache apache 228K Sep  5 19:48 tlog.0000000000000000757.1643885009483857920
-rw-r--r-- 1 apache apache  27M Sep  5 19:48 tlog.0000000000000000756.1643884946565103616
-rw-r--r-- 1 apache apache  35M Sep  5 19:47 tlog.0000000000000000755.1643884877912735744
-rw-r--r-- 1 apache apache  30M Sep  5 19:46 tlog.0000000000000000754.1643884812724862976
-rw-r--r-- 1 apache apache  25M Sep  5 19:45 tlog.0000000000000000753.1643884748976685056
-rw-r--r-- 1 apache apache  18M Sep  5 19:44 tlog.0000000000000000752.1643884685794738176
-rw-r--r-- 1 apache apache  21M Sep  5 19:43 tlog.0000000000000000751.1643884621330382848
-rw-r--r-- 1 apache apache  16M Sep  5 19:42 tlog.0000000000000000750.1643884558054064128
-rw-r--r-- 1 apache apache  26M Sep  5 19:41 tlog.0000000000000000749.1643884494725316608
-rw-r--r-- 1 apache apache 5.8M Sep  5 19:12 tlog.0000000000000000748.1643882681969147904
-rw-r--r-- 1 apache apache  31M Sep  4 19:56 tlog.0000000000000000747.1643794877229563904
-rw-r--r-- 1 apache apache  31M Sep  4 19:55 tlog.0000000000000000746.1643794813706829824
-rw-r--r-- 1 apache apache  30M Sep  4 19:54 tlog.0000000000000000745.1643794749615767552
-rw-r--r-- 1 apache apache  22M Sep  4 19:53 tlog.0000000000000000744.1643794686253465600
-rw-r--r-- 1 apache apache  18M Sep  4 19:52 tlog.0000000000000000743.1643794622319689728
-rw-r--r-- 1 apache apache  21M Sep  4 19:51 tlog.0000000000000000742.1643794558055612416
-rw-r--r-- 1 apache apache  15M Sep  4 19:50 tlog.0000000000000000741.1643794493330161664
-rw-r--r-- 1 apache apache  26M Sep  4 19:49 tlog.0000000000000000740.1643794428790308864
-rw-r--r-- 1 apache apache  11M Sep  4 14:58 tlog.0000000000000000737.1643701398824550400
drwxr-xr-x 5 apache apache   53 Aug 21 06:30 ..
[apache@dfw-pauth-msc01 tlog]$ ls -alht tlog.0000000000000000757.1643885009483857920
-rw-r--r-- 1 apache apache 228K Sep  5 19:48 tlog.0000000000000000757.1643885009483857920
$ date
Mon Sep  9 14:27:31 CDT 2019
$ pwd
/var/solr/data/sial-catalog-product-20190824_shard1_replica_n1/data/tlog

CDCR started replicating again after we deleted the oldest tlog file, tlog.0000000000000000737.1643701398824550400, and restarted CDCR.
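
Roughly, the recovery steps looked like this (a sketch; we stop and start CDCR through the standard API actions, paths and names as above):

$ curl -s "http://dfw-pauth-msc01:8983/solr/sial-catalog-product-20190824/cdcr?action=STOP"
$ rm /var/solr/data/sial-catalog-product-20190824_shard1_replica_n1/data/tlog/tlog.0000000000000000737.1643701398824550400
$ curl -s "http://dfw-pauth-msc01:8983/solr/sial-catalog-product-20190824/cdcr?action=START"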

Around the same time I found a number of errors like this in the Solr logs:
2019-09-04 19:58:01.393 ERROR (recoveryExecutor-162-thread-1-processing-n:dfw-pauth-msc01:8983_solr x:sial-catalog-product-20190824_shard1_replica_n1 s:shard1 c:sial-catalog-product-20190824 r:core_node3) [c:sial-catalog-product-20190824 s:shard1 r:core_node3 x:sial-catalog-product-20190824_shard1_replica_n1] o.a.s.u.UpdateLog java.lang.ClassCastException

This was the most common error at the time; I saw it for all of our collections:
2019-09-04 19:57:46.572 ERROR (qtp1355531311-20) [c:sial-catalog-product-20190824 s:shard1 r:core_node3 x:sial-catalog-product-20190824_shard1_replica_n1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Action LASTPROCESSEDVERSION sent to non-leader replica
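
When that error appears, one way to confirm which replica is currently the shard leader is the Collections API CLUSTERSTATUS action (a sketch, using our collection name):

$ curl -s "http://dfw-pauth-msc01:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=sial-catalog-product-20190824"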

I would like to understand how this happens, and find better ways to diagnose it early.
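
One early-warning signal that seems usable is the consecutiveErrors counter in the ERRORS output above; a crude periodic check might look like this (a sketch; the tr/grep parsing is rough and any alert threshold would be a guess):

$ curl -s "http://dfw-pauth-msc01:8983/solr/sial-catalog-product-20190824/cdcr?action=ERRORS" \
    | tr -d '\n ' | grep -o '"consecutiveErrors",[0-9]*'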

This occurs frequently enough that we are thinking about developing an alternative to CDCR, as it does not seem to be very robust.

Any help would be appreciated.

RE: CDCR tlog corruption leads to infinite loop

Webster Homer
We also see an accumulation of tlog files on the target Solr clouds. One of our production clouds crashed due to too many open files:
2019-09-11 15:59:39.570 ERROR (qtp1355531311-81540) [c:bioreliance-catalog-testarticle-20190713 s:shard2 r:core_node8 x:bioreliance-catalog-testarticle-20190713_shard2_replica_n6] o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: java.io.FileNotFoundException: /var/solr/data/bioreliance-catalog-testarticle-20190713_shard2_replica_n6/data/tlog/tlog.0000000000000005307.1642472809370222592 (Too many open files)

We found 9106 open files.
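
For anyone checking the same thing, something along these lines shows the count (a sketch; <solr-pid> is a placeholder for the Solr process id on the affected node):

$ lsof -p <solr-pid> | wc -l                      # total open files for the Solr process
$ lsof -p <solr-pid> | grep 'data/tlog' | wc -l   # how many of those are tlog files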

This is our update handler configuration:
<updateHandler class="solr.DirectUpdateHandler2">

    <!-- Enables a transaction log, used for real-time get, durability,
         and solr cloud replica recovery.  The log can grow as big as
         uncommitted changes to the index, so use of a hard autoCommit
         is recommended (see below).
         "dir" - the target directory for transaction logs, defaults to the
                 solr data directory.  -->

    <updateLog class="solr.CdcrUpdateLog">
      <str name="dir">${solr.ulog.dir:}</str>
    </updateLog>

    <!-- AutoCommit -->
     <autoCommit>
       <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
       <openSearcher>false</openSearcher>
     </autoCommit>

     <autoSoftCommit>
       <maxTime>${solr.autoSoftCommit.maxTime:3000}</maxTime>
     </autoSoftCommit>

  </updateHandler>

solr.autoSoftCommit.maxTime is set to 3000
solr.autoCommit.maxTime is set to 60000
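
Those are plain system properties picked up by the ${...} defaults in the config above; one common way to set them is in solr.in.sh (a sketch, not necessarily how every node here is configured):

SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=60000"
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=3000"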
