Sharding configuration

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

Sharding configuration

Anca Kopetz
Hi,

We have a SolrCloud configuration of 10 servers, no sharding, 20
millions of documents, the index has 26 GB.
As the number of documents has increased recently, the performance of
the cluster decreased.

We thought of sharding the index, in order to measure the latency. What
is the best approach ?
- to use shard splitting and have several sub-shards on the same server
and in the same tomcat instance
- having several shards on the same server but on different tomcat instances
- having one shard on each server (for example 2 shards / 5 replicas on
10 servers)

What's the impact of these 3 configuration on performance ?

Thanks,
Anca

--

Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur.
Reply | Threaded
Open this post in threaded view
|

Re: Sharding configuration

Ramkumar R. Aiyengar
As far as the second option goes, unless you are using a large amount of
memory and you reach a point where a JVM can't sensibly deal with a GC
load, having multiple JVMs wouldn't buy you much. With a 26GB index, you
probably haven't reached that point. There are also other shared resources
at an instance level like connection pools and ZK connections, but those
are tunable and you probably aren't pushing them as well (I would imagine
you are just trying to have only a handful of shards given that you aren't
sharded at all currently).

That leaves single vs multiple machines. Assuming the network isn't a
bottleneck, and given the same amount of resources overall (number of
cores, amount of memory, IO bandwidth times number of machines), it
shouldn't matter between the two. If you are procuring new hardware, I
would say buy more, smaller machines, but if you already have the hardware,
you could serve as much as possible off a machine before moving to a
second. There's nothing which limits the number of shards as long as the
underlying machine has the sufficient amount of parallelism.

Again, this advice is for a small number of shards, if you had a lot more
(hundreds) of shards and significant volume of requests, things start to
become a bit more fuzzy with other limits kicking in.
On 28 Oct 2014 09:26, "Anca Kopetz" <[hidden email]> wrote:

> Hi,
>
> We have a SolrCloud configuration of 10 servers, no sharding, 20
> millions of documents, the index has 26 GB.
> As the number of documents has increased recently, the performance of
> the cluster decreased.
>
> We thought of sharding the index, in order to measure the latency. What
> is the best approach ?
> - to use shard splitting and have several sub-shards on the same server
> and in the same tomcat instance
> - having several shards on the same server but on different tomcat
> instances
> - having one shard on each server (for example 2 shards / 5 replicas on
> 10 servers)
>
> What's the impact of these 3 configuration on performance ?
>
> Thanks,
> Anca
>
> --
>
> Kelkoo SAS
> Société par Actions Simplifiée
> Au capital de € 4.168.964,30
> Siège social : 8, rue du Sentier 75002 Paris
> 425 093 069 RCS Paris
>
> Ce message et les pièces jointes sont confidentiels et établis à
> l'attention exclusive de leurs destinataires. Si vous n'êtes pas le
> destinataire de ce message, merci de le détruire et d'en avertir
> l'expéditeur.
>
Reply | Threaded
Open this post in threaded view
|

RE: Sharding configuration

wmartinusa

Informational only. FYI

 

Machine parallelism has been empirically proven to be application dependent.

See DaCapo benchmarks (lucene indexing and lucene searching) use in  http://dx.doi.org/10.1145/2479871.2479901

 

" Parallelism profiling and wall-time prediction for multi-threaded applications" 2013.

 

FYI:

 

 

 

 

-----Original Message-----
From: Ramkumar R. Aiyengar [mailto:[hidden email]]
Sent: Tuesday, October 28, 2014 3:44 PM
To: [hidden email]
Subject: Re: Sharding configuration

 

As far as the second option goes, unless you are using a large amount of memory and you reach a point where a JVM can't sensibly deal with a GC load, having multiple JVMs wouldn't buy you much. With a 26GB index, you probably haven't reached that point. There are also other shared resources at an instance level like connection pools and ZK connections, but those are tunable and you probably aren't pushing them as well (I would imagine you are just trying to have only a handful of shards given that you aren't sharded at all currently).

 

That leaves single vs multiple machines. Assuming the network isn't a bottleneck, and given the same amount of resources overall (number of cores, amount of memory, IO bandwidth times number of machines), it shouldn't matter between the two. If you are procuring new hardware, I would say buy more, smaller machines, but if you already have the hardware, you could serve as much as possible off a machine before moving to a second. There's nothing which limits the number of shards as long as the underlying machine has the sufficient amount of parallelism.

 

Again, this advice is for a small number of shards, if you had a lot more

(hundreds) of shards and significant volume of requests, things start to become a bit more fuzzy with other limits kicking in.

On 28 Oct 2014 09:26, "Anca Kopetz" <[hidden email]> wrote:

 

> Hi,

> 

> We have a SolrCloud configuration of 10 servers, no sharding, 20

> millions of documents, the index has 26 GB.

> As the number of documents has increased recently, the performance of

> the cluster decreased.

> 

> We thought of sharding the index, in order to measure the latency.

> What is the best approach ?

> - to use shard splitting and have several sub-shards on the same

> server and in the same tomcat instance

> - having several shards on the same server but on different tomcat

> instances

> - having one shard on each server (for example 2 shards / 5 replicas

> on

> 10 servers)

> 

> What's the impact of these 3 configuration on performance ?

> 

> Thanks,

> Anca

> 

> --

> 

> Kelkoo SAS

> Société par Actions Simplifiée

> Au capital de € 4.168.964,30

> Siège social : 8, rue du Sentier 75002 Paris

> 425 093 069 RCS Paris

> 

> Ce message et les pièces jointes sont confidentiels et établis à

> l'attention exclusive de leurs destinataires. Si vous n'êtes pas le

> destinataire de ce message, merci de le détruire et d'en avertir

> l'expéditeur.

> 

Reply | Threaded
Open this post in threaded view
|

Re: Sharding configuration

Anca Kopetz
In reply to this post by Ramkumar R. Aiyengar
Hi,

We did some tests with 4 shards / 4 different tomcat instances on the
same server and the average latency was smaller than the one when having
only one shard.
We tested also é shards on different servers and the performance results
were also worse.

It seems that the sharding does not make any difference for our index in
terms of latency gains.

Thanks for your response,
Anca

On 10/28/2014 08:44 PM, Ramkumar R. Aiyengar wrote:

> As far as the second option goes, unless you are using a large amount of
> memory and you reach a point where a JVM can't sensibly deal with a GC
> load, having multiple JVMs wouldn't buy you much. With a 26GB index, you
> probably haven't reached that point. There are also other shared resources
> at an instance level like connection pools and ZK connections, but those
> are tunable and you probably aren't pushing them as well (I would imagine
> you are just trying to have only a handful of shards given that you aren't
> sharded at all currently).
>
> That leaves single vs multiple machines. Assuming the network isn't a
> bottleneck, and given the same amount of resources overall (number of
> cores, amount of memory, IO bandwidth times number of machines), it
> shouldn't matter between the two. If you are procuring new hardware, I
> would say buy more, smaller machines, but if you already have the hardware,
> you could serve as much as possible off a machine before moving to a
> second. There's nothing which limits the number of shards as long as the
> underlying machine has the sufficient amount of parallelism.
>
> Again, this advice is for a small number of shards, if you had a lot more
> (hundreds) of shards and significant volume of requests, things start to
> become a bit more fuzzy with other limits kicking in.
> On 28 Oct 2014 09:26, "Anca Kopetz"<[hidden email]>  wrote:
>
>> Hi,
>>
>> We have a SolrCloud configuration of 10 servers, no sharding, 20
>> millions of documents, the index has 26 GB.
>> As the number of documents has increased recently, the performance of
>> the cluster decreased.
>>
>> We thought of sharding the index, in order to measure the latency. What
>> is the best approach ?
>> - to use shard splitting and have several sub-shards on the same server
>> and in the same tomcat instance
>> - having several shards on the same server but on different tomcat
>> instances
>> - having one shard on each server (for example 2 shards / 5 replicas on
>> 10 servers)
>>
>> What's the impact of these 3 configuration on performance ?
>>
>> Thanks,
>> Anca
>>
>> --
>>
>> Kelkoo SAS
>> Société par Actions Simplifiée
>> Au capital de € 4.168.964,30
>> Siège social : 8, rue du Sentier 75002 Paris
>> 425 093 069 RCS Paris
>>
>> Ce message et les pièces jointes sont confidentiels et établis à
>> l'attention exclusive de leurs destinataires. Si vous n'êtes pas le
>> destinataire de ce message, merci de le détruire et d'en avertir
>> l'expéditeur.
>>

Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur.
Reply | Threaded
Open this post in threaded view
|

Re: Sharding configuration

Shawn Heisey-2
On 10/30/2014 4:32 AM, Anca Kopetz wrote:
> We did some tests with 4 shards / 4 different tomcat instances on the
> same server and the average latency was smaller than the one when having
> only one shard.
> We tested also é shards on different servers and the performance results
> were also worse.
>
> It seems that the sharding does not make any difference for our index in
> terms of latency gains.

That statement is confusing, because if latency goes down, that's good,
not worse.

If you're going to put multiple shards on one server, it should be done
with one solr/tomcat instance, not multiple.  One instance is perfectly
capable of dealing with many shards, and has a lot less overhead.  The
SolrCloud collection create command would need the maxShardsPerNode
parameter.

In order to see a gain in performance from multiple shards per server,
the server must have a lot of CPUs and the query rate must be fairly
low.  If the query rate is high, then all the CPUs will be busy just
handling simultaneous queries, so putting multiple shards per server
will probably slow things down.  When query rate is low, multiple CPUs
can handle each shard query simultaneously, speeding up the overall query.

Thanks,
Shawn

Reply | Threaded
Open this post in threaded view
|

Re: Sharding configuration

Anca Kopetz
Hi,

You are right, it is a mistake in my phrase, for the tests with 4
shards/ 4 instances,  the latency was worse (therefore *bigger*) than
for the tests with one shard.

In our case, the query rate is high.

Thanks,
Anca

On 10/30/2014 03:48 PM, Shawn Heisey wrote:

> On 10/30/2014 4:32 AM, Anca Kopetz wrote:
>> We did some tests with 4 shards / 4 different tomcat instances on the
>> same server and the average latency was smaller than the one when having
>> only one shard.
>> We tested also é shards on different servers and the performance results
>> were also worse.
>>
>> It seems that the sharding does not make any difference for our index in
>> terms of latency gains.
> That statement is confusing, because if latency goes down, that's good,
> not worse.
>
> If you're going to put multiple shards on one server, it should be done
> with one solr/tomcat instance, not multiple.  One instance is perfectly
> capable of dealing with many shards, and has a lot less overhead.  The
> SolrCloud collection create command would need the maxShardsPerNode
> parameter.
>
> In order to see a gain in performance from multiple shards per server,
> the server must have a lot of CPUs and the query rate must be fairly
> low.  If the query rate is high, then all the CPUs will be busy just
> handling simultaneous queries, so putting multiple shards per server
> will probably slow things down.  When query rate is low, multiple CPUs
> can handle each shard query simultaneously, speeding up the overall query.
>
> Thanks,
> Shawn
>

Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

Ce message et les pièces jointes sont confidentiels et établis à l'attention exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce message, merci de le détruire et d'en avertir l'expéditeur.
Reply | Threaded
Open this post in threaded view
|

Re: Sharding configuration

Erick Erickson
This is not too surprising. There are additional hops necessary for a
cloud setup. This is the sequence, let's say there are 4 shards and the
rows parameter on the query is 10 and you're sorting by score

node1 receives request.
node1 sends the request out to each shard
node1 receives the top 10 doc Ids back with the  score (note, not the
_contents_).
node1 sorts the 4 lists of 10 docs into the final top 10.
node1 then requests the actual docs from the nodes that they reside on
node1 then gets the results back and assembles them into a final list
node1 then returns the list to the client.

Contrast this with a single shard
node1 receives the request
node1 finds the top 10 docs locally
node1 return the docs to the client

You should only resort to sharding when you have too many docs
to fit in a single shard (and give you acceptable search times). If
all your docs fit comfortably on a single machine, you can _still_ use
SolrCloud, just with a single shard. This configuration deals with all
the replication, NRT processing, self-repair when nodes go up and
down and all that, but since there's no second trip to get the docs
from shards your query performance won't be affected.

And using SolrCloud with a single shard will essentially scale linearly
as you add nodes for queries.

Best,
Erick


On Thu, Oct 30, 2014 at 8:29 AM, Anca Kopetz <[hidden email]> wrote:

> Hi,
>
> You are right, it is a mistake in my phrase, for the tests with 4
> shards/ 4 instances,  the latency was worse (therefore *bigger*) than
> for the tests with one shard.
>
> In our case, the query rate is high.
>
> Thanks,
> Anca
>
>
> On 10/30/2014 03:48 PM, Shawn Heisey wrote:
>>
>> On 10/30/2014 4:32 AM, Anca Kopetz wrote:
>>>
>>> We did some tests with 4 shards / 4 different tomcat instances on the
>>> same server and the average latency was smaller than the one when having
>>> only one shard.
>>> We tested also é shards on different servers and the performance results
>>> were also worse.
>>>
>>> It seems that the sharding does not make any difference for our index in
>>> terms of latency gains.
>>
>> That statement is confusing, because if latency goes down, that's good,
>> not worse.
>>
>> If you're going to put multiple shards on one server, it should be done
>> with one solr/tomcat instance, not multiple.  One instance is perfectly
>> capable of dealing with many shards, and has a lot less overhead.  The
>> SolrCloud collection create command would need the maxShardsPerNode
>> parameter.
>>
>> In order to see a gain in performance from multiple shards per server,
>> the server must have a lot of CPUs and the query rate must be fairly
>> low.  If the query rate is high, then all the CPUs will be busy just
>> handling simultaneous queries, so putting multiple shards per server
>> will probably slow things down.  When query rate is low, multiple CPUs
>> can handle each shard query simultaneously, speeding up the overall query.
>>
>> Thanks,
>> Shawn
>>
>
> Kelkoo SAS
> Société par Actions Simplifiée
> Au capital de € 4.168.964,30
> Siège social : 8, rue du Sentier 75002 Paris
> 425 093 069 RCS Paris
>
> Ce message et les pièces jointes sont confidentiels et établis à l'attention
> exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de ce
> message, merci de le détruire et d'en avertir l'expéditeur.
Reply | Threaded
Open this post in threaded view
|

Re: Sharding configuration

Ramkumar R. Aiyengar
On 30 Oct 2014 23:46, "Erick Erickson" <[hidden email]> wrote:
>
> This configuration deals with all
> the replication, NRT processing, self-repair when nodes go up and
> down and all that, but since there's no second trip to get the docs
> from shards your query performance won't be affected.

More or less.. Vaguely recall that you still would need to add a
shortCircuit parameter to the url in such a case to avoid a second trip. I
might be wrong here but I do recall wondering why that wasn't the default..

>
> And using SolrCloud with a single shard will essentially scale linearly
> as you add nodes for queries.
>
> Best,
> Erick
>
>
> On Thu, Oct 30, 2014 at 8:29 AM, Anca Kopetz <[hidden email]>
wrote:

> > Hi,
> >
> > You are right, it is a mistake in my phrase, for the tests with 4
> > shards/ 4 instances,  the latency was worse (therefore *bigger*) than
> > for the tests with one shard.
> >
> > In our case, the query rate is high.
> >
> > Thanks,
> > Anca
> >
> >
> > On 10/30/2014 03:48 PM, Shawn Heisey wrote:
> >>
> >> On 10/30/2014 4:32 AM, Anca Kopetz wrote:
> >>>
> >>> We did some tests with 4 shards / 4 different tomcat instances on the
> >>> same server and the average latency was smaller than the one when
having
> >>> only one shard.
> >>> We tested also é shards on different servers and the performance
results
> >>> were also worse.
> >>>
> >>> It seems that the sharding does not make any difference for our index
in

> >>> terms of latency gains.
> >>
> >> That statement is confusing, because if latency goes down, that's good,
> >> not worse.
> >>
> >> If you're going to put multiple shards on one server, it should be done
> >> with one solr/tomcat instance, not multiple.  One instance is perfectly
> >> capable of dealing with many shards, and has a lot less overhead.  The
> >> SolrCloud collection create command would need the maxShardsPerNode
> >> parameter.
> >>
> >> In order to see a gain in performance from multiple shards per server,
> >> the server must have a lot of CPUs and the query rate must be fairly
> >> low.  If the query rate is high, then all the CPUs will be busy just
> >> handling simultaneous queries, so putting multiple shards per server
> >> will probably slow things down.  When query rate is low, multiple CPUs
> >> can handle each shard query simultaneously, speeding up the overall
query.

> >>
> >> Thanks,
> >> Shawn
> >>
> >
> > Kelkoo SAS
> > Société par Actions Simplifiée
> > Au capital de € 4.168.964,30
> > Siège social : 8, rue du Sentier 75002 Paris
> > 425 093 069 RCS Paris
> >
> > Ce message et les pièces jointes sont confidentiels et établis à
l'attention
> > exclusive de leurs destinataires. Si vous n'êtes pas le destinataire de
ce
> > message, merci de le détruire et d'en avertir l'expéditeur.
Reply | Threaded
Open this post in threaded view
|

Re: Sharding configuration

Ramkumar R. Aiyengar
In reply to this post by Shawn Heisey-2
On 30 Oct 2014 14:49, "Shawn Heisey" <[hidden email]> wrote:

> In order to see a gain in performance from multiple shards per server,
> the server must have a lot of CPUs and the query rate must be fairly
> low.  If the query rate is high, then all the CPUs will be busy just
> handling simultaneous queries, so putting multiple shards per server
> will probably slow things down.  When query rate is low, multiple CPUs
> can handle each shard query simultaneously, speeding up the overall query.

Except that your query latency isn't always CPU bound, there's a
significant IO bound portion as well. I wouldn't go so far as to say that
will large query volumes you shouldn't use multiple shards -- finally comes
down to how many shards a machine can handle under peak load, it could
depend on CPU/IO/GC pressure.. We have multiple shards on a machine under
heavy query load for example. The only real way is to benchmark this and
see..

> Thanks,
> Shawn
>