Need help on handling large size of index.

Need help on handling large size of index.

Modassar Ather-2
Hi,

Currently we have an index of size 3.5 TB. The index is distributed across
12 shards under two cores, and the size of the index on each shard is almost
equal.
We do delta indexing every week and then optimise the index.

The server configuration is as follows.

   - Solr Version  : 6.5.1
   - AWS instance type : r5a.16xlarge
   - CPU(s)  : 64
   - RAM  : 512GB
   - EBS size  : 7 TB (For indexing as well as index optimisation.)
   - IOPs  : 30000 (For faster index optimisation)


Can you please help me with the following questions?

   - What is the ideal index size per shard?
   - The optimisation takes a lot of time and IOPs to complete. Will
   increasing the number of shards help in reducing the optimisation time and
   IOPs?
   - We are planning to reduce each shard's index size to 30GB, so the entire
   3.5 TB index would be distributed across more shards, in this case almost
   70+ shards. Will this help?
   - Will adding so many new shards increase the search response time, and
   if so, by roughly how much?
   - If we have to increase the number of shards, should we do it on a single
   larger server or on multiple smaller servers?


Kindly share your thoughts on how best we can use Solr with such a large
index size.

Best,
Modassar
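
For readers less familiar with SolrCloud: the shard count is fixed when a collection is
created through the Collections API, so going from 12 to 70+ shards generally means
creating a new collection and reindexing (or splitting shards with SPLITSHARD). A minimal,
hedged sketch of the creation call follows; the host, collection and config names are
placeholders, not values from this thread.

    # Hedged sketch: creating a collection with more shards via the Collections API.
    import requests

    SOLR = "http://localhost:8983/solr"

    params = {
        "action": "CREATE",
        "name": "docs_v2",                     # hypothetical new collection
        "collection.configName": "docs_conf",  # hypothetical config set
        "numShards": 72,                       # the ~70+ shards being considered
        "replicationFactor": 1,
        "maxShardsPerNode": 6,                 # Solr 6.x parameter to spread shards across nodes
    }
    resp = requests.get(f"{SOLR}/admin/collections", params=params, timeout=300)
    resp.raise_for_status()
    print(resp.json().get("responseHeader"))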

Re: Need help on handling large size of index.

Phill Campbell
In my world your index size is common.

Optimal Index size: Depends on what you are optimizing for. Query Speed? Hardware utilization?
Optimizing the index is something I never do. We live with about 28% deletes. You should check your configuration for your merge policy.
I run 120 shards, and I am currently redesigning for 256 shards.
Increased sharding has helped reduce query response time, but surely there is a point where the collation of results starts to be the bottleneck.
I run the 120 shards on 90 r4.4xlarge instances with a replication factor of 3.

The things missing are:
What does your schema look like? I index around 120 fields per document.
What do your queries look like? Mine are so varied that caching never helps; the same query rarely comes through.
My system takes continuous updates; yours does not.

It is really up to you to experiment.

If you follow the development pattern of Design By Use (DBU), the first thing you do for Solr, and even for SQL, is to come up with your queries first. Then design the schema. Then figure out how to distribute it for performance.

Oh, another thing: are you concerned about availability? Do you have a replication factor > 1? Do you run those replicas in a different region for safety?
How many zookeepers are you running and where are they?

Lots of questions.

Regards
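
The merge-policy check Phill suggests can be done against a running node with the Config
API. A small sketch, assuming a hypothetical collection name; the exact JSON layout differs
between Solr versions.

    # Hedged sketch: inspect the effective merge policy settings on a live collection.
    import requests

    SOLR = "http://localhost:8983/solr"
    collection = "docs"  # hypothetical

    cfg = requests.get(f"{SOLR}/{collection}/config", timeout=60).json()
    index_cfg = cfg.get("config", {}).get("indexConfig", {})
    print("mergePolicyFactory:", index_cfg.get("mergePolicyFactory"))
    print("mergeScheduler:", index_cfg.get("mergeScheduler"))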


Re: Need help on handling large size of index.

Shawn Heisey-2
In reply to this post by Modassar Ather-2
On 5/20/2020 11:43 AM, Modassar Ather wrote:
> Can you please help me with following few questions?
>
>     - What is the ideal index size per shard?

We have no way of knowing that.  A size that works well for one index
use case may not work well for another, even if the index size in both
cases is identical.  Determining the ideal shard size requires
experimentation.

https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

>     - The optimisation takes lot of time and IOPs to complete. Will
>     increasing the number of shards help in reducing the optimisation time and
>     IOPs?

No, changing the number of shards will not help with the time required
to optimize, and might make it slower.  Increasing the speed of the
disks won't help either.  Optimizing involves a lot more than just
copying data -- it will never use all the available disk bandwidth of
modern disks.  SolrCloud optimizes the shard replicas that make up a full
collection sequentially, not simultaneously.

>     - We are planning to reduce each shard index size to 30GB and the entire
>     3.5 TB index will be distributed across more shards. In this case to almost
>     70+ shards. Will this help?

Maybe.  Maybe not.  You'll have to try it.  If you increase the number
of shards without adding additional servers, I would expect things to
get worse, not better.

> Kindly share your thoughts on how best we can use Solr with such a large
> index size.

Something to keep in mind -- memory is the resource that makes the most
difference in performance.  Buying enough memory to get decent
performance out of an index that big would probably be very expensive.
You should probably explore ways to make your index smaller.  Another
idea is to split things up so the most frequently accessed search data
is in a relatively small index and lives on beefy servers, and data used
for less frequent or data-mining queries (where performance doesn't
matter as much) can live on less expensive servers.

Thanks,
Shawn
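
To put Shawn's memory point in rough numbers: the sketch below estimates how much of the
index the OS page cache can hold, assuming the whole 3.5 TB sits on the single 512 GB node
described earlier and a hypothetical 31 GB Solr heap (neither assumption is stated in the
thread).

    # Back-of-the-envelope estimate; the heap size is an invented assumption.
    index_bytes = 3.5 * 1024**4      # 3.5 TB of index files
    ram_bytes = 512 * 1024**3        # 512 GB of RAM on the node
    heap_bytes = 31 * 1024**3        # hypothetical Solr heap size

    page_cache = ram_bytes - heap_bytes
    print(f"Fraction of index the OS page cache can hold: {page_cache / index_bytes:.1%}")
    # Roughly 13%; most reads for rarely queried terms will go to the EBS volume, not RAM.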

Re: Need help on handling large size of index.

Modassar Ather-2
Thanks Phill for your response.

*Optimal Index size: Depends on what you are optimizing for. Query Speed?
Hardware utilization?*
We are optimising it for query speed. What I understand is that even if we
tune the merge policy, a large amount of disk will still be required for the
bigger segment merges. Please correct me if I am wrong.

*Optimizing the index is something I never do. We live with about 28%
deletes. You should check your configuration for your merge policy.*
About 10-20% of our updates are deletes. We have no merge policy set in the
configuration as we do a full optimisation after indexing.

*Increased sharding has helped reduce query response time, but surely there
is a point where the collation of results starts to be the bottleneck.*
The query response time is my concern. I understand the aggregation of
results may increase the search response time.

*What does your schema look like? I index around 120 fields per document.*
The schema has a combination of text and string fields. None of the fields
except the id field is stored. We also have around 120 fields. A few of them
have docValues enabled.

*What do your queries look like? Mine are so varied that caching never
helps; the same query rarely comes through.*
Our search queries are a combination of proximity, nested proximity and
wildcard terms most of the time. A query can be very complex, with hundreds
of wildcard and proximity terms in it. Different grouping options are also
enabled on the search results. And the search queries vary a lot.

*Oh, another thing: are you concerned about availability? Do you have a
replication factor > 1? Do you run those replicas in a different region for
safety? How many zookeepers are you running and where are they?*
As of now we do not have any replication. We are not using a ZooKeeper
ensemble but would like to move to one soon.

Best,
Modassar
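
As an illustration of the query shape described above (not an actual query from this
system), a proximity-plus-wildcard search with result grouping might look like the sketch
below; the field, collection and host names are invented.

    # Hedged illustration of a proximity + wildcard query with grouping.
    import requests

    SOLR = "http://localhost:8983/solr"
    collection = "docs"  # hypothetical

    params = {
        "q": 'body:("fuel cell"~5 AND electrolyt*) OR title:(hydrogen* AND "storage tank"~3)',
        "rows": 10,
        "group": "true",             # result grouping, as mentioned above
        "group.field": "family_id",  # hypothetical docValues string field
    }
    resp = requests.get(f"{SOLR}/{collection}/select", params=params, timeout=60)
    print(resp.json().get("responseHeader", {}).get("QTime"), "ms")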


Re: Need help on handling large size of index.

Modassar Ather-2
Thanks Shawn for your response.

We have seen optimisation speed up with a higher number of IOPs. Without the
extra IOPs the optimisation took around 15-20 hours, whereas the same index
took 5-6 hours to optimise with the higher IOPs.
The extra IOPs were never fully used, other than a couple of spikes in usage,
so I am not able to understand how the increased IOPs make so much of a
difference.
Can you please help me understand what optimisation involves? Is it mostly
RAM/IOPs?

Search response time is very important. Please advise how much effect
increasing the number of shards across extra servers may have on search
response time.

Best,
Modassar
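
For context, an optimise (force merge) rewrites the documents of the merged segments into
new segments, which is why it is dominated by disk I/O rather than CPU. A hedged sketch of
a partial merge, which rewrites less data than merging all the way down to one segment; the
host and collection names are placeholders.

    # Hedged sketch: ask for a partial merge instead of a single-segment optimise.
    import requests

    SOLR = "http://localhost:8983/solr"
    collection = "docs"  # hypothetical

    resp = requests.get(
        f"{SOLR}/{collection}/update",
        params={"optimize": "true", "maxSegments": 10, "waitSearcher": "true"},
        timeout=None,  # a forced merge on a large shard can run for hours
    )
    print(resp.json().get("responseHeader"))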


Re: Need help on handling large size of index.

Erick Erickson
Please consider _not_ optimizing. It’s kind of a misleading name anyway, and on the
version of Solr you’re using it may have unintended consequences; see:

https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
and
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

There are situations where optimizing makes sense, but far too often people think
it’s A Good Thing (based almost entirely on the name, who _wouldn’t_ want an
optimized index?) without measuring, leading to tons of work to no real benefit.

Best,
Erick
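
A lighter-weight alternative often discussed in this context is an expungeDeletes commit,
which merges only segments carrying deleted documents instead of rewriting the whole index.
A hedged sketch with placeholder names; the exact merge behaviour varies across Solr and
Lucene versions.

    # Hedged sketch: reclaim space from deletes without a full optimise.
    import requests

    SOLR = "http://localhost:8983/solr"
    collection = "docs"  # hypothetical

    resp = requests.get(
        f"{SOLR}/{collection}/update",
        params={"commit": "true", "expungeDeletes": "true"},
        timeout=None,
    )
    print(resp.json().get("responseHeader"))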


Re: Need help on handling large size of index.

Phill Campbell
In reply to this post by Modassar Ather-2
The optimal size for a shard of the index is by definition whatever works best on the hardware with the JVM heap that is in use.
More shards mean a smaller index per shard, as you already know.

I spent months changing the sharding, the JVM heap, and the GC settings before taking the system live.
RAM is important, and I run with enough to allow the entire index to fit in RAM. From my understanding Solr relies on the operating system to memory-map the index files. I might be wrong.
I experimented with less RAM and SSD drives and found that was another way to get the performance I needed. Since RAM is cheaper, I chose that approach.

Again, we never optimize. When we have to recover we rebuild the index by spinning up new machines and using a massive EMR (MapReduce) job to force the data into the system. It takes about 3 hours; Solr can ingest data at an amazing rate. Then we do a blue/green switchover.

Query time, in my experience with my environment, is improved with more sharding and additional hardware, not just more sharding on the same hardware.

My fields are not stored either, except ID. There are some fields that are indexed and have docValues, and those are used for sorting and facets. My queries can have any number of wildcards as well, but my fields’ data lengths are maybe a maximum of 100 characters, so proximity searching is not too bad. I tokenize and index everything. I do not expand terms at query time to get broader results; I index the alternatives and let the indexer do what it does best.

If you are running in SolrCloud mode and you are using the embedded ZooKeeper, I would change that. Solr and ZK are very chatty with each other; run ZK on machines in proximity to Solr.

Regards
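
The blue/green switchover Phill describes is commonly done with a collection alias: clients
query the alias, and CREATEALIAS repoints it at the freshly rebuilt collection in a single
step. A minimal sketch with invented names; this is one way to do it, not necessarily
Phill's exact setup.

    # Hedged sketch: atomically repoint a query alias at the new collection.
    import requests

    SOLR = "http://localhost:8983/solr"

    resp = requests.get(
        f"{SOLR}/admin/collections",
        params={"action": "CREATEALIAS", "name": "docs", "collections": "docs_green"},
        timeout=60,
    )
    resp.raise_for_status()
    # Clients keep querying /solr/docs/select; only the alias target changes.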


Re: Need help on handling large size of index.

Modassar Ather-2
Thanks Erick and Phill.

We index data once a week, which is why we do the optimisation; it has helped
produce faster query results. I will experiment with a few segments on the
current hardware.
The thing I am not clear about is why there is so much difference in
optimisation time with extra IOPs vs. no extra IOPs, although there is no
constant high usage of the extra IOPs other than a couple of spikes during
optimisation.
Optimisation on a different datacenter machine of the same configuration with
SSDs used to take 4-5 hours. That time is comparable to the r5a.16xlarge with
the extra 30000 IOPs.

Best,
Modassar
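
One hedged way to see why sustained disk throughput matters even when the IOPs graph shows
only brief spikes: a full merge reads and rewrites roughly the whole index, so wall-clock
time is bounded by sustained MB/s rather than peak IOPs. The throughput figures below are
illustrative assumptions, not measurements of this system.

    # Illustrative arithmetic only; the MB/s values are assumptions.
    index_gb = 3584           # ~3.5 TB rewritten once by a full merge
    io_gb = index_gb * 2      # read the old segments plus write the new ones

    for label, mb_per_s in [("modest sustained throughput", 250),
                            ("high sustained throughput", 1000)]:
        hours = io_gb * 1024 / mb_per_s / 3600
        print(f"{label}: ~{hours:.1f} hours")
    # Prints roughly 8 hours vs 2 hours -- a ratio similar to the 15-20 vs 5-6 hours reported above.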


Re: Need help on handling large size of index.

Phill Campbell
Maybe your problems are in AWS land.

