SolrCloud Replication Failure

Jeremy Smith
Hi all,

We are currently running a moderately large instance of standalone Solr and are preparing to switch to SolrCloud to help us scale up.  I have been running a number of tests using Docker locally and ran into an issue where replication consistently fails.  I have pared the test case down as far as I could.  Here's a link to the docker-compose.yml (I put it in a directory called solrcloud_simple) and a script to run the test:


https://gist.github.com/smithje/2056209fc4a6fb3bcc8b44d0b7df3489


Here's the basic idea behind the test:

1) Create a cluster with 2 nodes (solr-1 and solr-2), 1 shard, and 2 replicas (each node gets a replica).  Just use the default schema, although I've also tried our schema and got the same result.

2) Shut down solr-2.

3) Add 100 simple docs, just id and a field called num.

4) Start solr-2 and check that it received the documents.  It did!

5) Update a document, commit, and check that solr-2 received the update.  It did!

6) Stop solr-2, update the same document, start solr-2, and make sure that it received the update.  It did!

7) Repeat step 6 with a new value.  This time solr-2 reverts back to what it had in step 5.
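
The per-replica check in steps 4 through 7 can be sketched roughly as below. This is not the script from the gist, only an illustration under assumptions: solr-2:8082 and the core name test_shard1_replica_n2 come from the log line quoted in this post, while the solr-1 port and core name are guesses, and the helper names are made up for this sketch. It re-adds a whole document through the collection's /update handler and then queries each core with distrib=false so that each replica answers only from its own index.

```python
import json
import urllib.parse
import urllib.request

COLLECTION = "test"
# Assumed endpoints: only the solr-2 entry is taken from the quoted log line.
REPLICAS = {
    "solr-1": ("http://solr-1:8081/solr", "test_shard1_replica_n1"),
    "solr-2": ("http://solr-2:8082/solr", "test_shard1_replica_n2"),
}


def update_request(base_url: str, doc_id: str, num: int) -> urllib.request.Request:
    """Build a POST that re-adds one document and commits immediately."""
    body = json.dumps([{"id": doc_id, "num": num}]).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/{COLLECTION}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
    )


def core_query_url(base_url: str, core: str, doc_id: str) -> str:
    """Build a query for one core; distrib=false keeps the query local."""
    params = urllib.parse.urlencode(
        {"q": f"id:{doc_id}", "fl": "id,num,_version_", "distrib": "false", "wt": "json"}
    )
    return f"{base_url}/{core}/select?{params}"


def read_doc(base_url: str, core: str, doc_id: str) -> dict:
    """Fetch the document as one specific replica sees it ({} if missing)."""
    with urllib.request.urlopen(core_query_url(base_url, core, doc_id)) as resp:
        docs = json.load(resp)["response"]["docs"]
    return docs[0] if docs else {}


def replicas_agree(docs_by_node: dict) -> bool:
    """True when every replica reports the same _version_ for the document."""
    return len({doc.get("_version_") for doc in docs_by_node.values()}) == 1


if __name__ == "__main__":
    # Compare what each replica holds for doc 1 once solr-2 is back up.
    seen = {node: read_doc(url, core, "1") for node, (url, core) in REPLICAS.items()}
    print(seen, "consistent" if replicas_agree(seen) else "DIVERGED")
```

With solr-2 stopped, sending update_request(...) through urllib.request.urlopen to solr-1 and then running the comparison after solr-2 restarts would correspond to steps 6 and 7.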


I believe the main issue comes from this in the logs:


solr-2_1  | 2018-10-31 17:04:26.135 INFO  (recoveryExecutor-4-thread-1-processing-n:solr-2:8082_solr x:test_shard1_replica_n2 c:test s:shard1 r:core_node4) [c:test s:shard1 r:core_node4 x:test_shard1_replica_n2] o.a.s.u.PeerSync PeerSync: core=test_shard1_replica_n2 url=http://solr-2:8082/solr  Our versions are newer. ourHighThreshold=1615861330901729280 otherLowThreshold=1615861314086764545 ourHighest=1615861330901729280 otherHighest=1615861335081353216

PeerSync thinks the versions on solr-2 are newer for some reason, so it doesn't try to sync from solr-1.  In the final state, solr-2 will always have a lower version for the updated doc than solr-1.  I've tried this with different commit strategies, both auto and manual, and it doesn't seem to make any difference.
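
One way to sanity-check the numbers in that log line is to decode them. The sketch below assumes that a Solr version number is the wall-clock time in epoch milliseconds shifted left by 20 bits, with the low bits acting as a counter; that is an assumption about Solr internals, not something stated in this thread, and the function names are made up for illustration. It does not reproduce PeerSync's decision logic; it only decodes the four logged values and compares the two "highest" versions.

```python
from datetime import datetime, timezone

# Values copied from the PeerSync log line above ("our" is solr-2).
LOGGED = {
    "ourHighThreshold": 1615861330901729280,
    "otherLowThreshold": 1615861314086764545,
    "ourHighest": 1615861330901729280,
    "otherHighest": 1615861335081353216,
}

# Assumption: version = (epoch millis << 20) + small counter.
CLOCK_SHIFT = 20


def version_to_millis(version: int) -> int:
    """Return the epoch-millisecond part of a version number."""
    return version >> CLOCK_SHIFT


def version_to_utc(version: int) -> datetime:
    """Convert the millisecond part of a version to a UTC datetime."""
    return datetime.fromtimestamp(version_to_millis(version) / 1000, tz=timezone.utc)


def other_is_ahead(logged: dict) -> bool:
    """True when the other replica's highest version exceeds ours."""
    return logged["otherHighest"] > logged["ourHighest"]


if __name__ == "__main__":
    for name, version in LOGGED.items():
        print(f"{name:>18} {version} {version_to_utc(version).isoformat()}")
    print("other replica has a higher version:", other_is_ahead(LOGGED))
```

Under that assumption, otherHighest decodes to roughly four seconds later than ourHighest, which matches the observation that solr-2 ends up with a lower version than solr-1 even though the log says "Our versions are newer".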

Is this a bug with Solr, an issue with using Docker, or am I just expecting too much from Solr?

Thanks for any insights you may have,

Jeremy


Re: SolrCloud Replication Failure

Erick Erickson
What version of Solr? This code was pretty much rewritten in 7.3 IIRC

Re: SolrCloud Replication Failure

Jeremy Smith
Thanks Erick, this is 7.5.0.

Re: SolrCloud Replication Failure

Kevin Risden-3
I haven't dug into why this is happening but it definitely reproduces. I
removed the local requirements (port mapping and such) from the gist you
posted (very helpful). I confirmed this fails locally and on Travis CI.

https://github.com/risdenk/test-solr-start-stop-replica-consistency

I don't even see the first update getting applied from num 10 -> 20. After
the first update there is no more change.

Kevin Risden


Re: SolrCloud Replication Failure

Jeremy Smith
Thanks so much for looking into this and cleaning up my code.


I added a pull request to show some additional strange behavior.  If we restart solr-1, making solr-2 the leader, the out-of-date value of [10] gets propagated back to solr-1.  Perhaps this will give a hint as to what is going on.
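
To confirm which node actually holds leadership after such a restart, the Collections API CLUSTERSTATUS action can be queried. The sketch below is an illustration only: the solr-1 address is a guess, the function names are made up, and the response layout (cluster, collections, shards, replicas, with a "leader" flag on one replica) is written from memory of the Solr 7.x JSON output rather than taken from this thread.

```python
import json
import urllib.request

# Assumed address; any live node should answer Collections API calls.
SOLR = "http://solr-1:8081/solr"


def fetch_cluster_status(base_url: str, collection: str) -> dict:
    """Call CLUSTERSTATUS for one collection and return the parsed JSON."""
    url = (
        f"{base_url}/admin/collections"
        f"?action=CLUSTERSTATUS&collection={collection}&wt=json"
    )
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def shard_leader(cluster_status: dict, collection: str, shard: str) -> str:
    """Return the node_name of the replica flagged as leader, or "" if none."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    for replica in shards[shard]["replicas"].values():
        # The flag is expected as the string "true"; accept a boolean as well.
        if str(replica.get("leader", "")).lower() == "true":
            return replica.get("node_name", "")
    return ""


if __name__ == "__main__":
    status = fetch_cluster_status(SOLR, "test")
    print("shard1 leader:", shard_leader(status, "test", "shard1") or "none")
```

Running this before and after restarting solr-1 would show whether leadership really moved to solr-2 at the point where the stale value reappears.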

Re: SolrCloud Replication Failure

Kevin Risden-3
Ahhh, your PR triggered an idea. I'll open a few PRs adjusting the Solr
version from latest back to earlier 7.x versions, to see which version
introduced the problem.

Kevin Risden


Re: SolrCloud Replication Failure

Kevin Risden-3
So I just added PRs for 5.5, 6.6, 7.1, 7.2, 7.3, 7.4, and 7.5. They all seem
to have the exact same behavior... I don't have much more insight here, but
the behavior doesn't seem to be correct.

Kevin Risden


Re: SolrCloud Replication Failure

Erick Erickson
In reply to this post by Kevin Risden-3
Kevin:

You're also using Docker, right? Docker is not "officially" supported,
although there's some movement in that direction, and if this is only
reproducible in Docker then it's a clue where to look....

Erick

Re: SolrCloud Replication Failure

Kevin Risden-3
Erick - Yeah, that's a fair point. It would be interesting to see if this
fails without Docker.

Kevin Risden


On Thu, Nov 1, 2018 at 11:06 AM Erick Erickson <[hidden email]>
wrote:

> Kevin:
>
> You're also using Docker, right? Docker is not "officially" supported
> although there's some movement in that direction and if this is only
> reproducible in Docker than it's a clue where to look....
>
> Erick
> On Wed, Oct 31, 2018 at 7:24 PM
> Kevin Risden
> <[hidden email]> wrote:
> >
> > I haven't dug into why this is happening but it definitely reproduces. I
> > removed the local requirements (port mapping and such) from the gist you
> > posted (very helpful). I confirmed this fails locally and on Travis CI.
> >
> > https://github.com/risdenk/test-solr-start-stop-replica-consistency
> >
> > I don't even see the first update getting applied from num 10 -> 20.
> After
> > the first update there is no more change.
> >
> > Kevin Risden
> >
> >
> > On Wed, Oct 31, 2018 at 8:26 PM Jeremy Smith <[hidden email]>
> wrote:
> >
> > > Thanks Erick, this is 7.5.0.
> > > ________________________________
> > > From: Erick Erickson <[hidden email]>
> > > Sent: Wednesday, October 31, 2018 8:20:18 PM
> > > To: solr-user
> > > Subject: Re: SolrCloud Replication Failure
> > >
> > > What version of solr? This code was pretty much rewriten in 7.3 IIRC
> > >
> > > On Wed, Oct 31, 2018, 10:47 Jeremy Smith <[hidden email] wrote:
> > >
> > > > Hi all,
> > > >
> > > >      We are currently running a moderately large instance of
> standalone
> > > > solr and are preparing to switch to solr cloud to help us scale up.
> I
> > > have
> > > > been running a number of tests using docker locally and ran into an
> issue
> > > > where replication is consistently failing.  I have pared down the
> test
> > > case
> > > > as minimally as I could.  Here's a link for the docker-compose.yml
> (I put
> > > > it in a directory called solrcloud_simple) and a script to run the
> test:
> > > >
> > > >

Re: SolrCloud Replication Failure

Kevin Risden-3
I pushed 3 branches that modify test.sh to test 5.5, 6.6, and 7.5 locally
without Docker. I still see the same behavior, where the latest updates
aren't on the replicas. I still don't know what is happening, but it happens
without Docker :(

https://github.com/risdenk/test-solr-start-stop-replica-consistency/branches

Kevin Risden



Re: SolrCloud Replication Failure

Erick Erickson
So this seems like it absolutely needs a JIRA.

Re: SolrCloud Replication Failure

Susheel Kumar-3
Are we saying this has something to do with stopping and restarting
replicas? Otherwise, I haven't seen or heard of any issues with document
updates and forwarding to replicas.

Thanks,
Susheel


Re: SolrCloud Replication Failure

Jeremy Smith
Hi Susheel,

     Yes, it appears that under certain conditions, if a follower is down when the leader gets an update, the follower will not receive that update when it comes back (or maybe it receives the update and it is then overwritten by its own transaction logs; I'm not sure).  Furthermore, if that follower then becomes the leader, it will replicate its own out-of-date value back to the former leader, even though its version number is lower.
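
One way to observe this from outside the cluster is to ask each replica directly for the document and compare the `_version_` values. The following is a minimal Python sketch, not part of the test script: it assumes each core can be queried on its own with `distrib=false`, and the replica labels and base URLs passed in are placeholders. The version numbers in the usage lines are borrowed from the PeerSync log line purely as sample values.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def replica_query_url(base_url, core, doc_id):
    """Build a select URL that is answered by one core only (distrib=false)."""
    params = urlencode({
        "q": f"id:{doc_id}",
        "fl": "id,num,_version_",
        "distrib": "false",
        "wt": "json",
    })
    return f"{base_url}/{core}/select?{params}"


def fetch_doc(base_url, core, doc_id):
    """Return the document as stored on a single replica, or None if absent."""
    with urlopen(replica_query_url(base_url, core, doc_id)) as resp:
        docs = json.load(resp)["response"]["docs"]
    return docs[0] if docs else None


def stale_replicas(docs_by_replica):
    """Names of replicas whose _version_ is below the highest version seen.

    Replicas that returned no document (None) are skipped in this sketch.
    """
    versions = {
        name: doc["_version_"]
        for name, doc in docs_by_replica.items()
        if doc is not None
    }
    if not versions:
        return []
    newest = max(versions.values())
    return sorted(name for name, v in versions.items() if v < newest)


# Sample values only; a real check would call fetch_doc() per replica.
sample = {
    "solr-1": {"id": "1", "_version_": 1615861335081353216},
    "solr-2": {"id": "1", "_version_": 1615861330901729280},
}
print(stale_replicas(sample))
```

Running `fetch_doc` against each node after every restart in the test would show at which step the follower falls behind.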


   -Jeremy


Re: SolrCloud Replication Failure

Kevin Risden-3
Erick Erickson - I don't have much time to chase this down. Do you think
this is a blocker for 7.6? It seems pretty serious.

Jeremy - This would be a good JIRA to create - we can move the conversation
there to try to get the right people involved.

Kevin Risden



Re: SolrCloud Replication Failure

Erick Erickson
Kevin:

Well, let's certainly raise it as a JIRA; whether it's a blocker or not, I'm
not sure. I _think_ the new LIR (leader-initiated recovery) work done in Solr
7.3 might make it possible to detect this condition, but I'm not totally sure
what to do about it.

So let's say the leader gets an update while a follower is down. (one
leader and one follower for simplicity). Now say the leader dies and
the follower is restarted. What should happen? Should Solr refuse to
start? Would FORCELEADER work if the user was willing to lose data?
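
For reference, this is roughly what the FORCELEADER request would look like. It is a small Python sketch that only builds the Collections API URL; the base URL, collection, and shard names are taken from the test setup in this thread and are assumptions for any other cluster. Whether the call would actually help in this scenario is exactly the open question, so treat it as an illustration of the request, not a fix.

```python
from urllib.parse import urlencode


def force_leader_url(solr_base, collection, shard):
    """Build the Collections API URL for a FORCELEADER request on one shard."""
    params = urlencode({
        "action": "FORCELEADER",
        "collection": collection,
        "shard": shard,
    })
    return f"{solr_base}/admin/collections?{params}"


# Names below come from the test in this thread (collection "test", shard1).
print(force_leader_url("http://solr-2:8082/solr", "test", "shard1"))
# prints http://solr-2:8082/solr/admin/collections?action=FORCELEADER&collection=test&shard=shard1
```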

Let's move the discussion to the JIRA though.
On Tue, Nov 6, 2018 at 10:58 AM Kevin Risden <[hidden email]> wrote:

>
> Erick Erickson - I don't have much time to chase this down. Do you think
> this a blocker for 7.6? It seems pretty serious.
>
> Jeremy - This would be a good JIRA to create - we can move the conversation
> there to try to get the right people involved.
>
> Kevin Risden

Re: SolrCloud Replication Failure

Jeremy Smith
Thanks everyone.  I added SOLR-12969.


Erick - those sound like important questions, but I think this issue is slightly different.  In this case, replication is failing even if the leader never goes down.
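
One way to see the divergence without involving leader election is to query each replica's core directly with `distrib=false` and compare the `_version_` of the document. The following is a minimal sketch of that check, not the test script from the gist. The solr-2 base URL and core name come from the log line in the original post. The helper names are hypothetical, and the two version numbers are the `ourHighest` and `otherHighest` values from that log, used here only as sample data.

```python
from urllib.parse import urlencode


def replica_query_url(base_url, core, doc_id):
    """URL that asks one core for its own copy of a document (no distributed search)."""
    params = {
        "q": f"id:{doc_id}",
        "fl": "id,num,_version_",
        "distrib": "false",
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


def stale_replicas(versions):
    """Return the replicas whose _version_ is lower than the newest one seen."""
    newest = max(versions.values())
    return sorted(name for name, version in versions.items() if version < newest)


# Sample values borrowed from the PeerSync log line (illustrative only).
observed = {
    "solr-1": 1615861335081353216,
    "solr-2": 1615861330901729280,
}
print(replica_query_url("http://solr-2:8082/solr", "test_shard1_replica_n2", "1"))
print(stale_replicas(observed))
```

In a real run, `observed` would be filled from the `_version_` field returned by each replica's response.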

________________________________
From: Erick Erickson <[hidden email]>
Sent: Tuesday, November 6, 2018 2:52:30 PM
To: solr-user
Subject: Re: SolrCloud Replication Failure

Kevin:

Well, let's certainly raise it as a JIRA; whether it's a blocker or not,
I'm not sure. I _think_ the new LIR (leader-initiated recovery) work done
in Solr 7.3 might make it possible to detect this condition, but I'm not
totally sure what to do about it.

So let's say the leader gets an update while a follower is down (one
leader and one follower for simplicity). Now say the leader dies and
the follower is restarted. What should happen? Should Solr refuse to
start? Would FORCELEADER work if the user was willing to lose data?

Let's move the discussion to the JIRA though.

Re: SolrCloud Replication Failure

Erick Erickson
Hmmm, ok. The replication failure could lead to the scenario I
outlined, but that's secondary to the update not getting to
the follower in the first place, as you say.
On Tue, Nov 6, 2018 at 12:19 PM Jeremy Smith <[hidden email]> wrote:

>
> Thanks everyone.  I added SOLR-12969.
>
>
> Erick - those sound like important questions, but I think this issue is slightly different.  In this case, replication is failing even if the leader never goes down.