SolrCloud on Trunk

SolrCloud on Trunk

Jamie Johnson
I just want to verify some of the features regarding SolrCloud that are
now on trunk:

1) Documents added to the cluster are automatically distributed amongst
the available shards.  (I had seen that Yonik had ported the Murmur
hash, but I didn't see that on trunk.  What is being used, and where
can I look at it?)
2) Document deletes/updates are automatically forwarded to the correct
shard, no matter which shard/replica they are originally sent to.

Also, on the latest trunk, when I run the embedded ZK as described
here (http://wiki.apache.org/solr/SolrCloud2) I keep getting the
following log messages:

12/01/27 23:44:38 INFO server.NIOServerCnxn: Accepted socket
connection from /fe80:0:0:0:0:0:0:1%1:57549
12/01/27 23:44:38 INFO server.NIOServerCnxn: Refusing session request
for client /fe80:0:0:0:0:0:0:1%1:57549 as it has seen zxid 0x179 our
last zxid is 0x10f client must try another server
12/01/27 23:44:38 INFO server.NIOServerCnxn: Closed socket connection
for client /fe80:0:0:0:0:0:0:1%1:57549 (no session established for
client)
12/01/27 23:44:38 INFO server.NIOServerCnxn: Accepted socket
connection from /127.0.0.1:57550
12/01/27 23:44:38 INFO server.NIOServerCnxn: Closed socket connection
for client /127.0.0.1:57550 (no session established for client)
12/01/27 23:44:39 INFO server.NIOServerCnxn: Accepted socket
connection from /0:0:0:0:0:0:0:1%0:57551

I don't actually run with the embedded ZK in production, so I am not
all that worried about this, but I figured it was worth finding out
what is happening.

As always awesome work.

Re: SolrCloud on Trunk

Yonik Seeley
On Fri, Jan 27, 2012 at 11:46 PM, Jamie Johnson <[hidden email]> wrote:
> I just want to verify some of the features in regards to SolrCloud
> that are now on Trunk
>
> documents added to the cluster are automatically distributed amongst
> the available shards (I had seen that Yonik had ported the Murmur
> hash, but I didn't see that on trunk, what is being used and where can
> I look at it?

It's cut'n'pasted into the Hash class,
./solr/solrj/src/java/org/apache/solr/common/util/Hash.java,
which also has a lookup3 variant for Strings.

-Yonik
http://www.lucidimagination.com
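
A minimal, self-contained sketch of the hash-then-route assignment being
discussed - it uses a stand-in FNV-1a hash rather than the MurmurHash/lookup3
code in Hash.java, is not the actual trunk implementation, and the class and
method names are made up for illustration:

// Hypothetical sketch: route a document to a shard by hashing its unique id
// and taking the value modulo the shard count.
public class ShardRouterSketch {

    // Simple 32-bit FNV-1a hash, standing in for the real hash functions.
    static int hash32(String key) {
        int h = 0x811c9dc5;
        for (int i = 0; i < key.length(); i++) {
            h ^= key.charAt(i);
            h *= 0x01000193;
        }
        return h;
    }

    // Map a document id onto one of numShards buckets.
    static int shardFor(String docId, int numShards) {
        return (hash32(docId) & Integer.MAX_VALUE) % numShards; // mask keeps the value non-negative
    }

    public static void main(String[] args) {
        System.out.println("doc42 -> shard " + shardFor("doc42", 4));
    }
}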

Re: SolrCloud on Trunk

Jamie Johnson
Thanks Yonik!  I had not dug deeply into it, but I had expected to find
a class named Murmur, which I did not.

Second question: I know there has been discussion about storing the
shard assignments in ZK (i.e. shard 1 is responsible for hashed values
between 0 and 10, shard 2 is responsible for hashed values between 11
and 20, etc.), but this isn't done yet, right?  So currently the
hashing is based purely on the number of shards, instead of on
assignments calculated the first time you start the cluster (i.e. based
on numShards) that could be adjusted later, right?

On Sat, Jan 28, 2012 at 10:28 AM, Yonik Seeley
<[hidden email]> wrote:

> On Fri, Jan 27, 2012 at 11:46 PM, Jamie Johnson <[hidden email]> wrote:
>> I just want to verify some of the features in regards to SolrCloud
>> that are now on Trunk
>>
>> documents added to the cluster are automatically distributed amongst
>> the available shards (I had seen that Yonik had ported the Murmur
>> hash, but I didn't see that on trunk, what is being used and where can
>> I look at it?
>
> It's cut'n'pasted into the Hash class
> ./solr/solrj/src/java/org/apache/solr/common/util/Hash.java
> that also has a lookup3 variant for Strings.
>
> -Yonik
> http://www.lucidimagination.com

Re: SolrCloud on Trunk

Yonik Seeley
On Sat, Jan 28, 2012 at 3:45 PM, Jamie Johnson <[hidden email]> wrote:
> Second question, I know there are discussion about storing the shard
> assignments in ZK (i.e. shard 1 is responsible for hashed values
> between 0 and 10, shard 2 is responsible for hashed values between 11
> and 20, etc), this isn't done yet right?  So currently the hashing is
> based on the number of shards instead of having the assignments being
> calculated the first time you start the cluster (i.e. based on
> numShards) so it could be adjusted later, right?

Right.  Storing the hash range for each shard/node is something we'll
need in order to dynamically change the number of shards (as opposed to
replicas), so we'll need to start doing it sooner or later.

-Yonik
http://www.lucidimagination.com
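
A rough sketch of the range-to-shard mapping Yonik describes, i.e. what could
eventually be stored per shard in ZooKeeper - purely hypothetical shape, since
per the discussion above nothing like this is stored yet; the shard names and
ranges are made up:

import java.util.TreeMap;

// Hypothetical hash-range -> shard lookup table.
public class ShardRangeTable {
    // key = lower bound (inclusive) of the hash range owned by that shard
    private final TreeMap<Integer, String> ranges = new TreeMap<Integer, String>();

    public void assign(int lowerBound, String shardName) {
        ranges.put(lowerBound, shardName);
    }

    // Find the shard whose range covers the given hash value.
    public String shardFor(int hash) {
        return ranges.floorEntry(hash).getValue();
    }

    public static void main(String[] args) {
        ShardRangeTable table = new ShardRangeTable();
        table.assign(Integer.MIN_VALUE, "shard1"); // shard1 owns [MIN_VALUE, 0)
        table.assign(0, "shard2");                 // shard2 owns [0, MAX_VALUE]
        System.out.println(table.shardFor(-12345)); // shard1
        System.out.println(table.shardFor(67890));  // shard2
    }
}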

Re: SolrCloud on Trunk

Lance Norskog
If this is for load balancing, the usual solution is to use many
small shards, so you can just move one or two without doing any
surgery on indexes.

On Sat, Jan 28, 2012 at 2:46 PM, Yonik Seeley
<[hidden email]> wrote:

> On Sat, Jan 28, 2012 at 3:45 PM, Jamie Johnson <[hidden email]> wrote:
>> Second question, I know there are discussion about storing the shard
>> assignments in ZK (i.e. shard 1 is responsible for hashed values
>> between 0 and 10, shard 2 is responsible for hashed values between 11
>> and 20, etc), this isn't done yet right?  So currently the hashing is
>> based on the number of shards instead of having the assignments being
>> calculated the first time you start the cluster (i.e. based on
>> numShards) so it could be adjusted later, right?
>
> Right.  Storing the hash range for each shard/node is something we'll
> need to dynamically change the number of shards (as opposed to
> replicas), so we'll need to start doing it sooner or later.
>
> -Yonik
> http://www.lucidimagination.com



--
Lance Norskog
[hidden email]

Re: SolrCloud on Trunk

Jamie Johnson
The case is actually any time you need to add another shard.  With the
current implementation, the hashing approach breaks down as soon as you
add a new shard.  Even with many small shards, I think you still have
this issue when you're adding/updating/deleting docs.  I'm definitely
interested in hearing other approaches that would work, though, if
there are any.

On Sat, Jan 28, 2012 at 7:53 PM, Lance Norskog <[hidden email]> wrote:

> If this is to do load balancing, the usual solution is to use many
> small shards, so you can just move one or two without doing any
> surgery on indexes.
> --
> Lance Norskog
> [hidden email]

Re: SolrCloud on Trunk

Andre Bois-Crettez
Consistent hashing seems like a solution to reduce the shuffling of
keys when adding/deleting shards:
http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/

Twitter describes a more flexible sharding scheme in the section
"Gizzard handles partitioning through a forwarding table":
https://github.com/twitter/gizzard
An explicit mapping would make it possible to take advantage of
heterogeneous servers, while still reducing the shuffling of documents
when expanding/shrinking the cluster.

Are there any ideas or progress in this direction, be it in a branch or
in JIRA issues?


Andre
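
A small generic sketch of the consistent-hashing idea from the linked article
(an illustration of the technique, not SolrCloud code; the node names, hash
mixing, and number of points per node are arbitrary):

import java.util.SortedMap;
import java.util.TreeMap;

// Each node is placed at several points on a ring of hash values; a key is
// served by the first node clockwise from its hash. Adding or removing a node
// only moves the keys adjacent to that node's points.
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<Integer, String>();
    private final int pointsPerNode;

    public ConsistentHashRing(int pointsPerNode) {
        this.pointsPerNode = pointsPerNode;
    }

    private int hash(String s) {
        return s.hashCode() * 0x9e3779b1; // cheap mixing; a real ring would use a stronger hash
    }

    public void addNode(String node) {
        for (int i = 0; i < pointsPerNode; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    public void removeNode(String node) {
        for (int i = 0; i < pointsPerNode; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    public String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing r = new ConsistentHashRing(16);
        r.addNode("shardA");
        r.addNode("shardB");
        System.out.println(r.nodeFor("doc42"));
    }
}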


Jamie Johnson wrote:

> The case is actually anytime you need to add another shard.  With the
> current implementation if you need to add a new shard the current
> hashing approach breaks down.  Even with many small shards I think you
> still have this issue when you're adding/updating/deleting docs.  I'm
> definitely interested in hearing other approaches that would work
> though if there are any.

--
André Bois-Crettez

Search technology, Kelkoo
http://www.kelkoo.com/



Re: SolrCloud on Trunk

Jamie Johnson
Very interesting, Andre.  I believe this is in line with the larger
vision: you'd use the hashing algorithm to create the initial splits in
the forwarding table, and then if you needed to add a new shard you'd
split/merge an existing range.  I think creating the algorithm is
probably the easier part (maybe I'm wrong?); the harder part to me
appears to be splitting the index based on the new ranges and then
moving that split to a new core.  I'm aware of the index splitter
contrib, which could be used for this, but I am unaware of where
specifically this is on the roadmap for SolrCloud.  Anyone else have
those details?
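
A toy sketch of the "split an existing range" step, building on the idea of an
explicit range-to-shard table (hypothetical - per the thread this does not
exist in SolrCloud yet, and the names and ranges are made up; it does not use
the index splitter contrib):

import java.util.TreeMap;

// Split one shard's hash range into two sub-ranges so that a new shard/core
// can take over the upper half.
public class RangeSplit {
    public static void main(String[] args) {
        // lower bound of each shard's range -> shard name
        TreeMap<Integer, String> table = new TreeMap<Integer, String>();
        table.put(Integer.MIN_VALUE, "shard1");
        table.put(0, "shard2");

        // split shard2's range [0, MAX_VALUE] at its midpoint
        long lower = 0;
        long upper = Integer.MAX_VALUE;
        int mid = (int) ((lower + upper) / 2);
        table.put(mid + 1, "shard2_split"); // the new core owns (mid, MAX_VALUE]

        System.out.println(table);
    }
}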

On Tue, Feb 28, 2012 at 5:40 AM, Andre Bois-Crettez
<[hidden email]> wrote:

> Consistent hashing seem like a solution to reduce the shuffling of keys
> when adding/deleting shards :
> http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/
>
> Twitter describe a more flexible sharding in section "Gizzard handles
> partitioning through a forwarding table"
> https://github.com/twitter/gizzard
> An explicit mapping could allow to take advantage of heterogeneous
> servers, and still allow for reduced shuffling of document when
> expanding/reducing the cluster.
>
> Are there any ideas or progress in this direction, be it in a branch or
> in JIRA issues ?
>
>
> Andre
>
>
>
> --
> André Bois-Crettez
>
> Search technology, Kelkoo
> http://www.kelkoo.com/
>

Re: SolrCloud on Trunk

Mark Miller

On Feb 28, 2012, at 9:33 AM, Jamie Johnson wrote:

> where specifically this is on the roadmap for SolrCloud.  Anyone
> else have those details?

I think we would like to do this sometime in the near future, but I don't know exactly what time frame it fits into yet. There is a lot to do still, and we also need to get a 4.0 release of both Lucene and Solr out to users soon. It could be in a point release later - but it's open source - it really just depends on who starts doing it and gets it done. I will say it's something I'd like to see done.

With what we have now, one option we have talked about in the past is to just install multiple shards on a single machine - later, you can start up a replica on a new machine when you are ready to grow, and then kill the original shard.

i.e. you could start up 15 shards on a single machine, and then over time migrate shards off that node and onto new hardware. It's as simple as starting up a new replica on the new hardware and removing the core on the machines you want to stop serving that shard from. This would let you expand to a cluster of 15 shards, one per machine, with N replicas (scaling replicas is as simple as starting a new node or stopping an old one).

- Mark Miller
lucidimagination.com
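
For concreteness, the migration Mark describes can be driven with the CoreAdmin
API, roughly along these lines (host, collection, and core names here are made
up, and the exact parameters may differ on current trunk):

http://newhost:8983/solr/admin/cores?action=CREATE&name=shard3_replica2&collection=collection1&shard=shard3
http://oldhost:8983/solr/admin/cores?action=UNLOAD&core=shard3_replica1

The first call starts a new core on the new machine that registers as a replica
of shard3 and syncs from the current leader; once it is active, the second call
removes the old core so the old machine stops serving that shard.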

Re: SolrCloud on Trunk

Jamie Johnson
Mark,

Is there a ticket around doing this?  If the work/design were written
down somewhere, the community might have a better idea of how exactly
we could help.

On Wed, Feb 29, 2012 at 11:21 PM, Mark Miller <[hidden email]> wrote:

>
> On Feb 28, 2012, at 9:33 AM, Jamie Johnson wrote:
>
>> where specifically this is on the roadmap for SolrCloud.  Anyone
>> else have those details?
>
> I think we would like to do this sometime in the near future, but I don't know exactly what time frame fits in yet. There is a lot to do still, and we also need to get a 4 release of both Lucene and Solr out to users soon. It could be in a point release later - but it's open source - it really just depends on what people start doing it and get it done. I will say it's something I'd like to see done.
>
> With what we have now, one option we have talked about in the past was to just install multiple shards on a single machine - later you can start up a replica on a new machine when you are ready to grow and kill the original shard.
>
> i.e. you could startup 15 shards on a single machine, and then over time migrate shards off nodes and onto new hardware. It's as simple as starting up a new replica on the new hardware and removing the core on machines you want to stop serving that shard from. This would let you expand to a 15 shard/machine cluster with N replicas (scaling replicas is as simple as starting a new node or stopping an old one).
>
> - Mark Miller
> lucidimagination.com

Re: SolrCloud on Trunk

Yonik Seeley
On Thu, Mar 1, 2012 at 12:27 AM, Jamie Johnson <[hidden email]> wrote:
> Is there a ticket around doing this?

Around splitting shards?

The easiest thing to consider is just splitting a single shard in two,
reusing some of the existing buffering/replication mechanisms we have:
1) create two new shards to represent each half of the old index
2) make sure leaders are forwarding updates to them and that the new
shards are buffering them
3) do a commit and split the current index
4) proceed with recovery as normal on the two new shards (replicate
the halves, apply the buffered updates)
5) work out some unresolved details, such as how to transition
leadership from the single big shard to the smaller shards - maybe
just handle it like leader failure
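
A toy model of the buffer-then-replay part of steps 2 and 4 (conceptual only -
this is not Solr's update log or recovery code, and the class is made up):

import java.util.ArrayList;
import java.util.List;

// A new sub-shard buffers updates that arrive while its half of the index is
// being copied, then replays them in order once the copy finishes.
public class BufferingSubShard {
    private final List<String> buffered = new ArrayList<String>();
    private boolean recovering = true;

    public void receiveUpdate(String update) {
        if (recovering) {
            buffered.add(update);   // step 2: buffer while the split/replication runs
        } else {
            apply(update);          // normal operation after recovery
        }
    }

    public void finishRecovery() {
        for (String u : buffered) { // step 4: replay the buffered updates in order
            apply(u);
        }
        buffered.clear();
        recovering = false;
    }

    private void apply(String update) {
        System.out.println("applying " + update);
    }
}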

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference,
Boston, May 7-10