Negative Core Node Numbers

Chris Ulicny
Hi,

In 7.1, how does Solr determine the numbers that are assigned to the
replicas? I'm familiar with the earlier naming conventions from 6.3, but I
wanted to know if there was supposed to be any connection between the
"_n##" suffix and the number assigned to the "core_node##" name, since they
don't seem to follow the old convention. As an example, here is a node from
clusterstatus for a testcollection with replication factor 2:

"core_node91":{
                "core":"testcollection_shard22_replica_n84",
                "base_url":"http://host:8080/solr",
                "node_name":"host:8080_solr",
                "state":"active",
                "type":"NRT",
                "leader":"true"}

Along the same lines, when creating the testcollection with 200 shards and
replication factor of 2, I am also getting nodes that have negative numbers
assigned to them which looks a lot like an int overflow issue. From the
cluster status:

          "shard157":{
            "range":"47ae0000-48f4ffff",
            "state":"active",
            "replicas":{
              "core_node1675945628":{
                "core":"testcollection_shard157_replica_n-1174535610",
                "base_url":"http://host1:8080/solr",
                "node_name":"host1:8080_solr",
                "state":"active",
                "type":"NRT"},
              "core_node1642259614":{
                "core":"testcollection_shard157_replica_n-1208090040",
                "base_url":"http://host2:8080/solr",
                "node_name":"host2:8080_solr",
                "state":"active",
                "type":"NRT",
                "leader":"true"}}}

This keeps happening even when the collection is successfully deleted (no
directories or files left on disk), the entire cluster is shut down, and the
ZooKeeper chroot path is cleared of all content. The only thing that
happened prior to this cycle was a single failed collection creation, which
seemed to clean itself up properly, after which everything was shut down and
cleaned from ZooKeeper as well.

Is there something else that is keeping track of those values that wasn't
cleared out? Or is this now the expected behavior for the numerical
assignments to replicas?

Thanks,
Chris
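
On the overflow suspicion raised above: Java's `int` is a signed 32-bit type, so a value pushed past 2147483647 wraps around into the negative range. The snippet below is a minimal Python sketch that emulates that wraparound. `to_int32` is a hypothetical helper written for illustration, not anything from Solr, and it does not show that overflow is what actually produced the numbers in this cluster state.

```python
def to_int32(value: int) -> int:
    """Wrap an arbitrary Python int the way a signed 32-bit int would."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    # Values with the top bit set represent negative numbers.
    return value - 0x100000000 if value >= 0x80000000 else value


INT_MAX = 2**31 - 1  # same value as Java's Integer.MAX_VALUE

print(to_int32(INT_MAX))        # 2147483647 (still fits)
print(to_int32(INT_MAX + 1))    # -2147483648 (wrapped around)
print(to_int32(3_000_000_000))  # -1294967296
```

A counter that only ever increments by one would need billions of steps to get there, so large or negative suffixes after a few hundred replicas would point at something other than plain counting.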

Re: Negative Core Node Numbers

Anshum Gupta-3
Hi Chris,

The core node numbers should be cleared out when the collection is deleted. Is that something you see consistently?

P.S. I just tried creating a collection with 1 shard and 200 replicas and saw the core node numbers as expected. On deleting and recreating the collection, I saw that the counter was reset. Just to be clear, I tried this on master.

-Anshum



On Jan 4, 2018, at 12:16 PM, Chris Ulicny <[hidden email]> wrote:

> In 7.1, how does solr determine the numbers that are assigned to the
> replicas?

Re: Negative Core Node Numbers

Chris Ulicny
Thanks Anshum,

They don't seem to be consistently numbered on any particular collection
creation, but the same numbers will be reused (eventually). After about 3
or 4 tries, I got the same numbered replica on the same machine, so
something is being cleared out. The numbers are never consecutive, though:
they start around 1, seem to be relatively sequential with gaps until about
120 or so, and then are all over the place. One other thing that seems to
be consistent on each new collection: the numbers at the end of
"core_node#" never appear as the number at the end of
"testcollection_shard1_replica_n#". Parts of the cluster state are below.

"shard1":{
  "range":"80000000-8146ffff", "state":"active", "replicas":{
    "core_node2":{"core":"testcollection_shard1_replica_n1",
      "base_url":"http://host5:8080/solr", "node_name":"host5:8080_solr",
      "state":"active", "type":"NRT", "leader":"true"},
    "core_node4":{"core":"testcollection_shard1_replica_n3",
      "base_url":"http://host3:8080/solr", "node_name":"host3:8080_solr",
      "state":"active", "type":"NRT"}}},
"shard2":{
  "range":"81470000-828effff", "state":"active", "replicas":{
    "core_node6":{"core":"testcollection_shard2_replica_n5",
      "base_url":"http://host1:8080/solr", "node_name":"host1:8080_solr",
      "state":"active", "type":"NRT"},
    "core_node8":{"core":"testcollection_shard2_replica_n7",
      "base_url":"http://host2:8080/solr", "node_name":"host2:8080_solr",
      "state":"active", "type":"NRT", "leader":"true"}}}
...
"shard170":{
  "range":"58510000-5998ffff", "state":"active", "replicas":{
    "core_node800109264":{"core":"testcollection_shard170_replica_n-2046950790",
      "base_url":"http://host2:8080/solr", "node_name":"host2:8080_solr",
      "state":"active", "type":"NRT", "leader":"true"},
    "core_node766423250":{"core":"testcollection_shard170_replica_n-2080505220",
      "base_url":"http://host4:8080/solr", "node_name":"host4:8080_solr",
      "state":"active", "type":"NRT"}}}
...

Is there a way to view the counter in a deployed environment, or is it only
accessible through debugging Solr?

The setup I've been trying was 200 shards with 2 replicas each, but trying
to create a collection with 1 shard and 200 replicas of it results in the
same situation with abnormal numbers.

A few other details on the setup: 5 solr nodes (v7.1.0), 3 zookeeper nodes
(v3.4.11), Ubuntu 16.04, all hosts (zk & solr) are machines in Google's
Cloud environment.
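
Whether the counter itself can be viewed is left open here, but the numbers it hands out can be audited from the Collections API CLUSTERSTATUS output quoted in this message. The following is a rough Python sketch, not an official tool: `clusterstatus_url`, `fetch_status`, and `replica_numbers` are hypothetical helper names, the sample data is trimmed from the cluster state above, and the response layout (`cluster` > `collections` > `<name>` > `shards`) is assumed to match the JSON form of CLUSTERSTATUS.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import urlopen

SUFFIX = re.compile(r"(-?\d+)$")  # trailing number, with optional minus sign


def clusterstatus_url(base_url: str, collection: str) -> str:
    """Build a CLUSTERSTATUS request URL for one collection."""
    query = urlencode({"action": "CLUSTERSTATUS",
                       "collection": collection, "wt": "json"})
    return f"{base_url}/admin/collections?{query}"


def fetch_status(base_url: str, collection: str) -> dict:
    """Fetch and decode the status (needs a reachable Solr node)."""
    with urlopen(clusterstatus_url(base_url, collection)) as resp:
        return json.load(resp)


def replica_numbers(status: dict, collection: str):
    """Yield (shard, core_node number, replica_n number) per replica."""
    shards = status["cluster"]["collections"][collection]["shards"]
    for shard_name, shard in shards.items():
        for node_name, replica in shard["replicas"].items():
            node_num = int(SUFFIX.search(node_name).group(1))
            core_num = int(SUFFIX.search(replica["core"]).group(1))
            yield shard_name, node_num, core_num


# Sample trimmed from the cluster state quoted in this thread.
SAMPLE_STATUS = {"cluster": {"collections": {"testcollection": {"shards": {
    "shard1": {"replicas": {
        "core_node2": {"core": "testcollection_shard1_replica_n1"},
        "core_node4": {"core": "testcollection_shard1_replica_n3"}}},
    "shard170": {"replicas": {
        "core_node800109264":
            {"core": "testcollection_shard170_replica_n-2046950790"},
        "core_node766423250":
            {"core": "testcollection_shard170_replica_n-2080505220"}}},
}}}}}

for shard, node_num, core_num in replica_numbers(SAMPLE_STATUS, "testcollection"):
    if node_num < 0 or core_num < 0:
        print(f"{shard}: core_node{node_num} / replica_n{core_num}")
```

Against a live cluster, `fetch_status("http://host:8080/solr", "testcollection")` would replace `SAMPLE_STATUS`; the host and port are the placeholders used in this thread.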


On Thu, Jan 4, 2018 at 5:53 PM Anshum Gupta <[hidden email]> wrote:

> Hi Chris,
>
> The core node numbers should be cleared out when the collection is
> deleted. Is that something you see consistently ?
>
> P.S: I just tried creating a collection with 1 shard and 200 replicas and
> saw the core node numbers as expected. On deleting and recreating the
> collection, I saw that the counter was reset. Just to be clear, I tried
> this on master.
>
> -Anshum

Re: Negative Core Node Numbers

Chris Ulicny
After more testing, both the archived source (compiled locally) and the
pre-packaged files from the archive.apache.org site for 7.1.0 keep
producing the same issue with negative core node numbers.

However, if I compile and run the 7.1 branch from github, it does not
produce the negative numbers. When generating a brand new collection with
200 shards with 2 replicas each, the counter ends up at 800 since it seems
the numbers used to suffix "core_node" and
"testcollection_shardX_replica_n" are pulled from the same counter.

I assume there have been some updates to the github branch in the case that
7.1.1 needed to be released, but some change between that and the released
7.1.0 seems to have fixed the counter issue in my scenario.
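
The figure of 800 is what one would expect if every replica draws two values from a single counter: 200 shards x 2 replicas = 400 replicas, and 400 x 2 = 800. The Python sketch below simulates that reading. It is an illustration only: `assign_names` is a hypothetical function, and the order of the two draws (core name first, then core_node) is an assumption chosen to match the small numbers quoted earlier in the thread (`replica_n1` with `core_node2`), not something taken from Solr's source.

```python
from itertools import count


def assign_names(collection: str, num_shards: int, replicas_per_shard: int):
    """Name replicas by drawing twice per replica from one shared counter."""
    counter = count(1)  # hypothetical per-collection counter starting at 1
    last_used = 0
    names = []
    for shard in range(1, num_shards + 1):
        for _ in range(replicas_per_shard):
            core_id = next(counter)   # first draw: "_replica_n" suffix
            node_id = next(counter)   # second draw: "core_node" suffix
            last_used = node_id
            names.append((f"{collection}_shard{shard}_replica_n{core_id}",
                          f"core_node{node_id}"))
    return names, last_used


names, last_used = assign_names("testcollection", 200, 2)
print(names[0])    # ('testcollection_shard1_replica_n1', 'core_node2')
print(names[2])    # ('testcollection_shard2_replica_n5', 'core_node6')
print(last_used)   # 800
```

Under this assumption the core names always receive odd numbers and the core_node names even ones, which would also be consistent with the earlier observation that a "core_node#" number never shows up as a "replica_n#" number.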

On Thu, Jan 4, 2018 at 8:14 PM Chris Ulicny <[hidden email]> wrote:

> They don't seem to be consistently numbered on any particular collection
> creation, but the same numbers will be reused (eventually).

Re: Negative Core Node Numbers

Shawn Heisey-2
On 1/5/2018 1:35 PM, Chris Ulicny wrote:

> After more testing, compiling the archived source and the pre-packaged
> files on the archive.apache.org site for 7.1.0 keep generating the same
> issue with negative core node numbers.
>
> However, if I compile and run the 7.1 branch from github, it does not
> produce the negative numbers. When generating a brand new collection with
> 200 shards with 2 replicas each, the counter ends up at 800 since it seems
> the numbers used to suffix "core_node" and
> "testcollection_shardX_replica_n" are pulled from the same counter.
>
> I assume there have been some updates to the github branch in the case that
> 7.1.1 needed to be released, but some change between that and the released
> 7.1.0 seems to have fixed the counter issue in my scenario.

Here's the shortlog for the 7.1.0 release, showing October 13th as the
last commit date:

https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/tags/releases/lucene-solr/7.1.0

Here's the shortlog for the 7.1 branch, showing December 4th as the last
commit date:

https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=shortlog;h=refs/heads/branch_7_1

There are definitely differences between those code branches.  There are
a few dozen additional commits.  Nothing is jumping out at me as the
change that fixed the problem, but I'm not very familiar with that code.
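
For anyone who wants to go through those additional commits locally, one option is to list everything reachable from the branch but not from the release tag. This is a small Python sketch around `git log`; the ref names are derived from the two shortlog URLs above, while `origin/branch_7_1` (the remote-tracking name in a fresh clone), the helper names, and the repository path are assumptions.

```python
import subprocess

TAG = "releases/lucene-solr/7.1.0"  # from refs/tags/... in the first URL
BRANCH = "origin/branch_7_1"        # assumed remote-tracking name in a clone


def commits_between_cmd(tag: str, branch: str) -> list:
    """Build the git command listing commits on branch that tag lacks."""
    return ["git", "log", "--oneline", f"{tag}..{branch}"]


def commits_between(repo_dir: str, tag: str, branch: str) -> list:
    """Run the command inside a local clone and return one line per commit."""
    result = subprocess.run(commits_between_cmd(tag, branch), cwd=repo_dir,
                            capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


print(" ".join(commits_between_cmd(TAG, BRANCH)))
```

Calling `commits_between("/path/to/lucene-solr", TAG, BRANCH)` in a clone would return the candidate commits to read through; the path is a placeholder.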

The 7.2 version, which is already released, should contain whatever the
fix is.  I can't be certain of that because I do not know which change
fixed it, but virtually all changes that are backported to a minor
version branch are also applied to the stable branch and any newer minor
version branches.

Thanks,
Shawn