Decision on Number of shards and collection

14 messages
Decision on Number of shards and collection

neotorand
Hi Team,
First of all, I want to take this opportunity to thank you all for creating a
beautiful place where people can explore, learn, and debate.

I have been agonizing over this decision for a couple of days.

When I create a SolrCloud ecosystem, I need to decide on the number of
shards and collections. What are the best practices for making these decisions?

I believe heterogeneous data can be indexed into the same collection, and I
can have multiple shards so that the index is partitioned. So what is the need
for a second collection? Presumably, when the collection grows I should look
at more collections, but what exactly is that size? What KPI drives the
decision to have more collections? Any pointers or links to best practices?

When should I go for multiple shards? When the shard size grows, right?
What is that size, and how do I benchmark it?

I am sorry if this question has already been asked, but I have googled all
over: Quora, Stack Overflow, Lucidworks.

Regards,
Neo





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

Re: Decision on Number of shards and collection

Emir Arnautović
Hi Neo,
Shard size determines query latency, so you split your index when queries become too slow. Distributed search comes with some overhead, so oversharding is not the way to go either. There is no hard rule for the best numbers, but here are some thoughts on how to approach this: http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/





Re: Decision on Number of shards and collection

neotorand
Hi Emir,
Thanks a lot for your reply.
So when I design a Solr ecosystem, I should start with a rough guess at the
number of shards and increase that number to improve performance. What is
the accepted/ideal response time? There should be a trade-off between
response time and the number of shards as the data keeps growing.

I agree that we split our index when response time increases. So what could
that response-time threshold, or query latency, be?

Thanks again!


Regards
Neo






Re: Decision on Number of shards and collection

Shawn Heisey-2
In reply to this post by neotorand
On 4/11/2018 4:15 AM, neotorand wrote:
> I believe heterogeneous data can be indexed to same collection and i can
> have multiple shards for the index to be partitioned.So whats the need of a
> second collection?. yes when collection size grows i should look for more
> collection.what exactly that size is? what KPI drives the decision of having
> more collection?Any pointers or links for best practice.

There are no hard rules.  Many factors affect these decisions.

https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Creating multiple collections should be done when there is a logical or
business reason for keeping different sets of data separate from each
other.  If there's never any need for people to query all the data at
once, then it might make sense to use separate collections.  Or you
might want to put them together just for convenience, and use data in
the index to filter the results to only the information that the user is
allowed to access.

> when should i go for multiple shards?
> yes when shard size grows.Right? whats the size and how do i benchmark.

Some indexes function really well with 300 million documents or more per
shard.  Other indexes struggle with less than a million per shard.  It's
impossible to give you any specific number.  It depends on a bunch of
factors.

If query rate is very high, then you want to keep the shard count low. 
Using one shard might not be possible due to index size, but it should
be as low as you can make it.  You're also going to want to have a lot
of replicas to handle the load.

If query rate is extremely low, then sharding the index can actually
*improve* performance, because there will be idle CPU capacity that can
be used for the subqueries.
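A toy model (my own simplification, not from the message above, with made-up numbers) of why sharding helps at low query rates but hurts at high ones: when CPUs are idle, N shards split the per-query work roughly N ways at the cost of a fixed merge overhead; once every core is busy, the extra subqueries just add cost.

```python
def toy_latency_ms(work_ms, shards, overhead_ms=10.0, busy=False):
    """Toy model of distributed-search latency (illustrative only).

    If CPUs are idle (low query rate), per-query work is split across
    shards and runs in parallel, plus a fixed merge overhead. If CPUs
    are already saturated (high query rate), the subqueries cannot run
    in parallel, so the fan-out only adds overhead.
    """
    if busy:
        return work_ms + shards * overhead_ms   # fan-out just adds cost
    return work_ms / shards + overhead_ms       # parallel subqueries help

# Low query rate: 4 shards cut an 800 ms query to ~210 ms.
# High query rate: the same fan-out makes the query slower, not faster.
```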

Thanks,
Shawn


Re: Decision on Number of shards and collection

Abhi Basu
The BKM I have read so far (trying to find the source) says 50 million
docs/shard performs well. I have found this in my recent tests as well. But
of course it depends on index structure, etc.


--
Abhi Basu

Re: Decision on Number of shards and collection

Erick Erickson
50M is a ballpark number I use as a place to _start_ getting a handle
on capacity. It's useful solely to answer the "is it bigger than a breadbox
and smaller than a house" question. It's totally meaningless without
testing.

Say I'm talking to a client and we have no data. Some are scared
that their 10M docs will require lots of hardware. Saying "I usually
expect to see 50M docs on a node" gives them
some confidence that it's not going to require a massive hardware
investment and they can go forward with a PoC.

OTOH I have other clients saying "We have 100B documents" and
I have to say "You could be talking 200 nodes" which gives them
incentive to do a PoC to get a hard number.

I do recommend you keep adding (perhaps synthetic) docs to your
node until it tips over. Finding your installation falls over at, say, 50M
docs means you need to start taking action beforehand. OTOH if you
load 150M docs on it and still function OK you can breathe a lot
easier...
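The "load until it tips over" test can be sketched as below. `index_batch` and `measure_latency_ms` are stand-ins for posting synthetic docs to a real node and timing a fixed query set against it; here they are replaced by a made-up latency curve so the sketch runs on its own.

```python
def find_shard_capacity(index_batch, measure_latency_ms,
                        batch_size=1_000_000, max_latency_ms=200.0):
    """Keep adding (synthetic) documents until query latency crosses
    the budget; return the last doc count that still met it."""
    docs = 0
    while True:
        index_batch(batch_size)       # post a batch of synthetic docs
        docs += batch_size
        if measure_latency_ms(docs) > max_latency_ms:
            return docs - batch_size  # the node "tipped over" this batch

# Stand-in harness: a fabricated latency curve that grows with the
# square of the document count (milliseconds).
fake_index = lambda n: None
fake_latency = lambda docs: 2.0 * (docs / 1e6) ** 2

capacity = find_shard_capacity(fake_index, fake_latency)
```

Against a real cluster, the measurement step would run your actual query mix and the returned count is the per-shard number you plan around.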

Best,
Erick




Re: Decision on Number of shards and collection

Emir Arnautović
In reply to this post by neotorand
Hi,
Only you can tell what query latency is acceptable (I can only tell you the ideal one: it is 0 :)
Usually you start testing with a single shard and keep adding documents while measuring query latency. When you get close to the maximum allowed latency, you have your shard size. Then you estimate the number of documents in the index now and in the future, and divide that number by the shard size to get the number of shards. Next you test how many shards you can host per node. Finally, you multiply the number of shards by the number of replicas you plan to have and divide by the number of shards per node to estimate the number of nodes.
If you are not happy with the numbers, you change some assumption, e.g. the node type, and redo the tests.
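The last two steps of this estimate boil down to a little arithmetic, sketched here with example inputs (in practice, the shard size and shards-per-node figures come from your own tests):

```python
import math

def estimate_cluster(total_docs, docs_per_shard, replicas, shards_per_node):
    """Turn a measured shard capacity into shard and node counts."""
    shards = math.ceil(total_docs / docs_per_shard)   # index size / shard size
    shard_copies = shards * replicas                  # every replica counts
    nodes = math.ceil(shard_copies / shards_per_node)
    return shards, nodes

# Example: 500M docs, a measured 50M-doc shard size, 2 replicas of
# each shard, and room for 4 shard copies per node.
shards, nodes = estimate_cluster(500_000_000, 50_000_000, 2, 4)
```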

HTH,
Emir





Re: Decision on Number of shards and collection

SOLR4189
In reply to this post by neotorand
I advise you to read the book Solr in Action. To answer your question you
need to take into account the server resources that you have (CPU, RAM, and
disk), the index size, and the average size of a single doc.




Re: Decision on Number of shards and collection

neotorand
Thanks everyone for your beautiful explanations and valuable time.

Thanks Emir for the nice link
(http://www.od-bits.com/2018/01/solrelasticsearch-capacity-planning.html).
Thanks Shawn for
https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

When should we have more collections?
- We have a business reason to keep the data in separate collections.
- We don't need to query all the data at once.

When should we have more shards?
- Define an acceptable latency.
- Keep adding documents to a shard until you reach that latency. That
  defines the shard size (SS).
- Get the size of all the data to be indexed (TS).
- numShards = TS / SS
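That last formula, with a ceiling so a partial shard still gets allocated (the sizes below are illustrative, in GB):

```python
import math

def num_shards(total_size, shard_size):
    """numShards = TS / SS, rounded up so leftover data gets a shard."""
    return math.ceil(total_size / shard_size)

# e.g. 275 GB of data with a tested 40 GB shard size needs 7 shards.
```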

One quick question, @Shawn: if I have data in more than one collection, can
I still query it all at once? I think yes, as I read on the Solr site. What
are the pros and cons of a single vs. multiple collections?

I have gone through "Estimating Memory and Storage for Lucene/Solr" from
Lucidworks
(https://lucidworks.com/2011/09/14/estimating-memory-and-storage-for-lucenesolr/).

@SOLR4189, I will go through the book and get back to you. Thanks.

Time is too short to explore this long-lived open source technology.

Regards,
Neo




Re: Decision on Number of shards and collection

neotorand
In reply to this post by SOLR4189
Emir,
I read in the link you shared that
"Shard cannot contain more than 2 billion documents since Lucene is using
integer for internal IDs."

In which Java class of the Solr/Lucene implementation can this limit be found?

Regards
Neo




Re: Decision on Number of shards and collection

Shawn Heisey-2
On 4/12/2018 4:57 AM, neotorand wrote:
> I read from the link you shared that
> "Shard cannot contain more than 2 billion documents since Lucene is using
> integer for internal IDs."
>
> In which Java class of the Solr/Lucene implementation can this limit be found?

The 2 billion limit  is a *hard* limit from Lucene.  It's not in Solr. 
It's pretty much the only hard limit that Lucene actually has - there's
a workaround for everything else.  Solr can overcome this limit for a
single index by sharding the index into multiple physical indexes across
multiple servers, which is more automated in SolrCloud than in
standalone mode.

The 2 billion limit per individual index can't be raised. Lucene uses an
"int" datatype to hold the internal ID everywhere it's used.  Java
numeric types are signed, which means that the maximum number a 32-bit
data type can hold is 2147483647.  This is the value returned by the
Java constant Integer.MAX_VALUE.  A little bit is subtracted from that
value to obtain the maximum it will attempt to use, to be absolutely
sure it can't go over.
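The arithmetic here, in plain numbers (the exact headroom is a Lucene implementation detail; per LUCENE-5843 the cap is defined as Integer.MAX_VALUE minus a small constant, 128):

```python
# Java's int is a signed 32-bit two's-complement value, so its maximum
# (Integer.MAX_VALUE) is 2^31 - 1.
INT_MAX = 2**31 - 1

# Lucene stays a little below that to be absolutely sure it can't go
# over; LUCENE-5843 set the per-index cap to Integer.MAX_VALUE - 128.
LUCENE_MAX_DOCS = INT_MAX - 128
```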

https://issues.apache.org/jira/browse/LUCENE-5843

Raising the limit is theoretically possible, but not without *MAJOR*
surgery to an extremely large amount of Lucene's code. The risk of bugs
when attempting that change is *VERY* high -- it could literally take
months to find them all and fix them.

The two most popular search engines using Lucene are Solr and
elasticsearch. Both of these packages can overcome the 2 billion limit
with sharding.

Summary: The 2 billion document limit can be frustrating, but since an
index that large on a single machine is most likely not going to perform
well and should be split across several machines, there's almost no
value to raising the limit and risking a large number of software bugs.

Thanks,
Shawn


Re: Decision on Number of shards and collection

neotorand
Hi Shawn,
Thanks for the long explanation.
So the 2 billion limit can be overcome by using shards.

Now coming back to collections: unless we have a logical or business reason,
we should not go for more than one collection.

Let's say I have 5 different entities, and they have 10, 20, 30, 40, and 50
attributes (columns) to be indexed/stored. If I store them in a single
collection, is there any way empty space gets created? Put another way: if I
store heterogeneous data items in a single collection, is memory utilized
poorly through the creation of empty holes?

What are the pros and cons of a single vs. multiple collections?

Thanks, team, for spending your valuable time to clarify.

Regards,
Neo






Re: Decision on Number of shards and collection

Erick Erickson
Having documents without fields doesn't matter much.

Solr (well, Lucene actually) is pretty efficient about this. It
handles thousands of different fields, although I have to say
that when you have thousands of fields it's usually time to revisit
the design. It looks like your total field count is quite reasonable.

The biggest bit for single vs. multiple is if you want to select
documents of one type based on some criteria of another type (think
join in DB terms). It's much easier with a single collection.
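For example, that kind of cross-document selection can use Solr's join query parser when both types live in one collection. The sketch below just builds the request query string; the collection and field names (`catalog`, `id`, `author_id`, `type`) are hypothetical:

```python
from urllib.parse import urlencode

# Select "book" documents based on a criterion of their "author"
# documents in the same collection: {!join} maps the author's id
# field onto the book's author_id field.
params = urlencode({
    "q": "{!join from=id to=author_id}type:author AND country:NO",
    "fq": "type:book",
})
request_path = "/solr/catalog/select?" + params
```

With the data split into two collections, this single-request selection is not available and the join has to happen client-side.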

That said, if you start to use Solr "just like a database" then you
might also want to revisit your architecture ;)

Best,
Erick


Re: Decision on Number of shards and collection

Shawn Heisey-2
In reply to this post by neotorand
On 4/13/2018 1:44 AM, neotorand wrote:
> Let's say I have 5 different entities, and they have 10, 20, 30, 40, and 50
> attributes (columns) to be indexed/stored. If I store them in a single
> collection, is there any way empty space gets created? Put another way: if I
> store heterogeneous data items in a single collection, is memory utilized
> poorly through the creation of empty holes?

If a document doesn't have some of the fields your schema is capable of
addressing, no space or memory is consumed for the missing fields.

> What are the pros and cons of single vs Multiple.

If you have a single collection for different kinds of data, then you do
get *some* economies of scale in the total index size.  Whether that
means a significant size reduction or a small size reduction depends on
your data.  The downside to one collection: If you have 500000 of each
kind of document and combine five of them into one collection, then
every query must look through 2.5 million documents instead of 500000
documents. These are both small numbers for Solr, but the larger index
is still going to take more time to search.  If there are any possible
issues with security, you'll need to include one or more fields with
every document with information about the type of document so that
you're able to filter results according to the access privileges of the
user making a query.
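That access filter is typically a filter query appended server-side from the user's privileges, so the user's own query string can never widen it. A sketch, with a hypothetical `acl_groups` field:

```python
from urllib.parse import urlencode

def secured_params(user_query, allowed_groups):
    """Append a filter query the user never controls, restricting
    results to documents tagged with one of their access groups."""
    fq = "acl_groups:(" + " OR ".join(allowed_groups) + ")"
    return urlencode({"q": user_query, "fq": fq})

params = secured_params("title:solr", ["hr", "finance"])
```

Because `fq` results are cached independently of `q`, the per-group filter is also cheap to reuse across queries.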

With multiple collections, searching on each one is going to be faster
than searching on a combined collection. Whether or not it's enough of a
difference to matter will depend on how much data is involved, the
nature of that data, and the nature of your queries.  But as Erick
mentioned, you'll have less capability with Solr's join functionality --
assuming you even need that functionality.

Thanks,
Shawn