[jira] [Updated] (SOLR-11484) Possible bug with CloudSolrClient directedUpdates & cached collection state -- TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (SOLR-11484) Possible bug with CloudSolrClient directedUpdates & cached collection state -- TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/SOLR-11484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-11484:
----------------------------
    Attachment: jenkins.thetaphi.20662.txt

I've attached the log from https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20662/

A quick walk through of some of the key bits of logging from testCollectionCreateSearchDelete ...


Test Creates a 2x2 collection, resulting in 4 SolrCores...

{noformat}  
   [junit4]   2> 334246 INFO  (parallelCoreAdminExecutor-1592-thread-1-processing-n:127.0.0.1:42959_solr aecb57e9-be15-41b4-892b-269c3ef4df623724636310642455 CREATE) [n:127.0.0.1:42959_solr    ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&collection.configName=solrCloudCollectionConfig&newCollection=true&collection=testcollection&version=2&replicaType=NRT&async=aecb57e9-be15-41b4-892b-269c3ef4df623724636310642455&coreNodeName=core_node4&name=testcollection_shard1_replica_n3&action=CREATE&numShards=2&shard=shard1&property.solr.directoryFactory=solr.StandardDirectoryFactory&wt=javabin
   [junit4]   2> 334261 INFO  (parallelCoreAdminExecutor-1581-thread-1-processing-n:127.0.0.1:34827_solr aecb57e9-be15-41b4-892b-269c3ef4df623724636321671098 CREATE) [n:127.0.0.1:34827_solr    ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&collection.configName=solrCloudCollectionConfig&newCollection=true&collection=testcollection&version=2&replicaType=NRT&async=aecb57e9-be15-41b4-892b-269c3ef4df623724636321671098&coreNodeName=core_node8&name=testcollection_shard2_replica_n6&action=CREATE&numShards=2&shard=shard2&property.solr.directoryFactory=solr.StandardDirectoryFactory&wt=javabin
   [junit4]   2> 334272 INFO  (parallelCoreAdminExecutor-1579-thread-1-processing-n:127.0.0.1:38927_solr aecb57e9-be15-41b4-892b-269c3ef4df623724636317202530 CREATE) [n:127.0.0.1:38927_solr    ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&collection.configName=solrCloudCollectionConfig&newCollection=true&collection=testcollection&version=2&replicaType=NRT&async=aecb57e9-be15-41b4-892b-269c3ef4df623724636317202530&coreNodeName=core_node7&name=testcollection_shard2_replica_n5&action=CREATE&numShards=2&shard=shard2&property.solr.directoryFactory=solr.StandardDirectoryFactory&wt=javabin
   [junit4]   2> 334272 INFO  (parallelCoreAdminExecutor-1577-thread-1-processing-n:127.0.0.1:36233_solr aecb57e9-be15-41b4-892b-269c3ef4df623724636304239736 CREATE) [n:127.0.0.1:36233_solr    ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&collection.configName=solrCloudCollectionConfig&newCollection=true&collection=testcollection&version=2&replicaType=NRT&async=aecb57e9-be15-41b4-892b-269c3ef4df623724636304239736&coreNodeName=core_node2&name=testcollection_shard1_replica_n1&action=CREATE&numShards=2&shard=shard1&property.solr.directoryFactory=solr.StandardDirectoryFactory&wt=javabin
{noformat}

Note that these initial Cores are:

* testcollection_shard1_replica_n3
* testcollection_shard2_replica_n6
* testcollection_shard2_replica_n5
* testcollection_shard1_replica_n1

...the test then does some searching of this collection, and a few other things, before ultimately deleting this collection, and verifies the delete worked correctly...

{noformat}
   ....
   [junit4]   2> 340530 INFO  (TEST-TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete-seed#[DC2FFDE51D1E4138]) [    ] o.a.s.c.AbstractDistribZkTestBase Wait for collection to disappear - collection: testcollection failOnTimeout:true timeout (sec):330
   [junit4]   1> -
   [junit4]   2> 340530 INFO  (TEST-TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete-seed#[DC2FFDE51D1E4138]) [    ] o.a.s.c.AbstractDistribZkTestBase Collection has disappeared - collection: testcollection
{noformat}

...at which point it recreates the collection, and we get 4 completely new SolrCores...

{noformat}
   [junit4]   2> 340531 INFO  (qtp6784903-4183) [n:127.0.0.1:38927_solr    ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :create with params replicationFactor=2&collection.configName=solrCloudCollectionConfig&maxShardsPerNode=1&name=testcollection&nrtReplicas=2&action=CREATE&numShards=2&property.solr.directoryFactory=solr.StandardDirectoryFactory&wt=javabin&version=2 and sendToOCPQueue=true
   [junit4]   2> 340533 INFO  (OverseerThreadFactory-1587-thread-3-processing-n:127.0.0.1:34827_solr) [n:127.0.0.1:34827_solr    ] o.a.s.c.CreateCollectionCmd Create collection testcollection
   ....
   [junit4]   2> 340948 INFO  (qtp13197426-4184) [n:127.0.0.1:36233_solr    ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&collection.configName=solrCloudCollectionConfig&newCollection=true&collection=testcollection&version=2&replicaType=NRT&coreNodeName=core_node8&name=testcollection_shard2_replica_n6&action=CREATE&numShards=2&shard=shard2&property.solr.directoryFactory=solr.StandardDirectoryFactory&wt=javabin
   [junit4]   2> 340948 INFO  (qtp13767699-4323) [n:127.0.0.1:42959_solr    ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&collection.configName=solrCloudCollectionConfig&newCollection=true&collection=testcollection&version=2&replicaType=NRT&coreNodeName=core_node5&name=testcollection_shard1_replica_n2&action=CREATE&numShards=2&shard=shard1&property.solr.directoryFactory=solr.StandardDirectoryFactory&wt=javabin
   [junit4]   2> 340949 INFO  (qtp18576982-4232) [n:127.0.0.1:34827_solr    ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&collection.configName=solrCloudCollectionConfig&newCollection=true&collection=testcollection&version=2&replicaType=NRT&coreNodeName=core_node3&name=testcollection_shard1_replica_n1&action=CREATE&numShards=2&shard=shard1&property.solr.directoryFactory=solr.StandardDirectoryFactory&wt=javabin
   [junit4]   2> 341020 INFO  (qtp1638616-4379) [n:127.0.0.1:40813_solr    ] o.a.s.h.a.CoreAdminOperation core create command qt=/admin/cores&collection.configName=solrCloudCollectionConfig&newCollection=true&collection=testcollection&version=2&replicaType=NRT&coreNodeName=core_node7&name=testcollection_shard2_replica_n4&action=CREATE&numShards=2&shard=shard2&property.solr.directoryFactory=solr.StandardDirectoryFactory&wt=javabin
{noformat}

Note the new SolrCore names/paths...

* testcollection_shard1_replica_n1
* testcollection_shard1_replica_n2
* testcollection_shard2_replica_n4
* testcollection_shard2_replica_n6

...note in particular that there is *NOT* a "testcollection_shard1_replica_n3" core in the "new" instance of this collection ... that was only the *prior* (instance) of this collection.

Collection creation finishes...

{noformat}
   [junit4]   2> 344142 INFO  (qtp6784903-4183) [n:127.0.0.1:38927_solr    ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={replicationFactor=2&collection.configName=solrCloudCollectionConfig&maxShardsPerNode=1&name=testcollection&nrtReplicas=2&action=CREATE&numShards=2&property.solr.directoryFactory=solr.StandardDirectoryFactory&wt=javabin&version=2} status=0 QTime=3610
   [junit4]   2> 344143 INFO  (TEST-TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete-seed#[DC2FFDE51D1E4138]) [    ] o.a.s.c.AbstractDistribZkTestBase Wait for recoveries to finish - collection: testcollection failOnTimeout:true timeout (sec):330
   ...
   [junit4]   2> 344144 INFO  (TEST-TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete-seed#[DC2FFDE51D1E4138]) [    ] o.a.s.c.AbstractDistribZkTestBase Recoveries finished - collection: testcollection
{noformat}

...now the test does a "*:*" query over the colleciton -- to ensure no docs found...

{noformat}
   [junit4]   2> 344168 INFO  (qtp13767699-4319) [n:127.0.0.1:42959_solr c:testcollection s:shard1 r:core_node5 x:testcollection_shard1_replica_n2] o.a.s.c.S.Request [testcollection_shard1_replica_n2]  webapp=/solr path=/select params={df=text&distrib=false&_stateVer_=testcollection:7&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=https://127.0.0.1:42959/solr/testcollection_shard1_replica_n2/|https://127.0.0.1:34827/solr/testcollection_shard1_replica_n1/&rows=10&version=2&q=*:*&NOW=1507805797249&isShard=true&wt=javabin} hits=0 status=0 QTime=0
   [junit4]   2> 344168 INFO  (qtp1638616-4384) [n:127.0.0.1:40813_solr c:testcollection s:shard2 r:core_node7 x:testcollection_shard2_replica_n4] o.a.s.c.S.Request [testcollection_shard2_replica_n4]  webapp=/solr path=/select params={df=text&distrib=false&_stateVer_=testcollection:7&fl=id&fl=score&shards.purpose=4&start=0&fsv=true&shard.url=https://127.0.0.1:40813/solr/testcollection_shard2_replica_n4/|https://127.0.0.1:36233/solr/testcollection_shard2_replica_n6/&rows=10&version=2&q=*:*&NOW=1507805797249&isShard=true&wt=javabin} hits=0 status=0 QTime=0
   [junit4]   2> 344169 INFO  (qtp18576982-4231) [n:127.0.0.1:34827_solr c:testcollection s:shard1 r:core_node3 x:testcollection_shard1_replica_n1] o.a.s.c.S.Request [testcollection_shard1_replica_n1]  webapp=/solr path=/select params={q=*:*&_stateVer_=testcollection:7&wt=javabin&version=2} hits=0 status=0 QTime=2
{noformat}

...that request was directed to "testcollection_shard1_replica_n1" (a core name that just so happened to exist in both instances of the "testcollection") and that works and we get the correct response of numFound=0.

But the next line in the test/lots is an attempt to add a document -- and this is what fails...

{noformat}
   [junit4]   2> 344173 ERROR (TEST-TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete-seed#[DC2FFDE51D1E4138]) [    ] o.a.s.c.s.i.CloudSolrClient Request to collection testcollection failed due to (404) org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://127.0.0.1:42959/solr/testcollection_shard1_replica_n3: Expected mime type application/octet-stream but got text/html. <html>
   [junit4]   2> <head>
   [junit4]   2> <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
   [junit4]   2> <title>Error 404 </title>
   [junit4]   2> </head>
   [junit4]   2> <body>
   [junit4]   2> <h2>HTTP ERROR: 404</h2>
   [junit4]   2> <p>Problem accessing /solr/testcollection_shard1_replica_n3/update. Reason:
   [junit4]   2> <pre>    Can not find: /solr/testcollection_shard1_replica_n3/update</pre></p>
   [junit4]   2> <hr /><a href="http://eclipse.org/jetty">Powered by Jetty:// 9.3.20.v20170531</a><hr/>
   [junit4]   2> </body>
   [junit4]   2> </html>
   [junit4]   2> , retry? 0
{noformat}

...the key problem being that {{CloudSolrClient.directUpdate}} is using (stale) ClusterState info about the collection to try and route that update directly to "testcollection_shard1_replica_n3" which doesn't exist in the collection -- it hasn't existed since the *last* time this collection existed.

And once this 404 is returned, CloudSolrClient does not (or can not? not sure what the rules are for Updates) make any attempt to retry this update.

The stale collection info and the resulting 404 are ultimately what fails the test...

{noformat}
   [junit4] ERROR   20.3s J1 | TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete <<<
   [junit4]    > Throwable #1: org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at https://127.0.0.1:42959/solr/testcollection_shard1_replica_n3: Expected mime type a
pplication/octet-stream but got text/html. <html>
   [junit4]    > <head>
   [junit4]    > <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
   [junit4]    > <title>Error 404 </title>
   [junit4]    > </head>
   [junit4]    > <body>
   [junit4]    > <h2>HTTP ERROR: 404</h2>
   [junit4]    > <p>Problem accessing /solr/testcollection_shard1_replica_n3/update. Reason:
   [junit4]    > <pre>    Can not find: /solr/testcollection_shard1_replica_n3/update</pre></p>
   [junit4]    > <hr /><a href="http://eclipse.org/jetty">Powered by Jetty:// 9.3.20.v20170531</a><hr/>
   [junit4]    > </body>
   [junit4]    > </html>
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([DC2FFDE51D1E4138:7FD553409AF6AB9D]:0)
   [junit4]    >        at org.apache.solr.client.solrj.impl.CloudSolrClient.directUpdate(CloudSolrClient.java:539)
   [junit4]    >        at org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:993)
   [junit4]    >        at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:862)
   [junit4]    >        at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:793)
   [junit4]    >        at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:178)
   [junit4]    >        at org.apache.solr.client.solrj.request.UpdateRequest.commit(UpdateRequest.java:233)
   [junit4]    >        at org.apache.solr.cloud.TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete(TestCollectionsAPIViaSolrCloudCluster.java:167)
{noformat}


...shouldn't this type of failure cause the (stale) collection info in CloudSolrClient to be refreshed, and then retry the update?


> Possible bug with CloudSolrClient directedUpdates & cached collection state -- TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-11484
>                 URL: https://issues.apache.org/jira/browse/SOLR-11484
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Hoss Man
>         Attachments: jenkins.thetaphi.20662.txt
>
>
> {{TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete}} seems to fail with non-trivial frequency, so I grabbed the logs from a recent failure and starting trying to follow along with the actions to figure out what exactly is happening....
> https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20662/
> {noformat}
>    [junit4] ERROR   20.3s J1 | TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete <<<
>    [junit4]    > Throwable #1: org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at https://127.0.0.1:42959/solr/testcollection_shard1_replica_n3: Expected mime type a
> pplication/octet-stream but got text/html. <html>
>    [junit4]    > <head>
>    [junit4]    > <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
>    [junit4]    > <title>Error 404 </title>
> {noformat}
> The crux of this failure appears to be a genuine bug in how CloudSolrClient uses it's cached ClusterState info when doing (direct) updates.  The key bits seem to be:
> * CloudSolrClient does _something_ (update,query,etc...) with a collection causing the current cluster state for the collection to be cached
> * The actual collection changes such that a Solr node/core no longer exists as part of the collection
> * CloudSolrClient is asked to process an UpdateRequest which triggers the code paths for the {{directUpdate()}} method -- which attempts to route the updates directly to a replica of the appropriate shard using the (cache) collection state info
> * CloudSolrClient (may) attempt to send that UpdateRequest to a node/core that doesn't exist, getting a 404 -- which does not (seem to) trigger a state refresh, or retry to find a correct URL to resend the update to.
> Details to follow in comment....



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]