NRT Merge Load on NAS SSD (Cloud) Advice

NRT Merge Load on NAS SSD (Cloud) Advice

Karl Stoney

Hi all.

I’m looking for some opinions on how best to configure merges to run optimally on GCP SSDs (network-attached). For context: we have a 9-node NRT Solr 8.8.1 SolrCloud cluster, and each node has an index between 25 and 35GB in size, depending on the current merge state / deleted docs. The index is both heavy write and heavy read, so we’re typically always merging (which is somewhat fine).

Now the SSDs that we have are 512GB, and on GCP they scale with the number of vCPUs and the amount of RAM. The disks we have are therefore rated for:

  • Sustained read IOPS: 15k
  • Sustained write IOPS: 15k
  • Sustained read throughput: 250 MB/s
  • Sustained write throughput: 250 MB/s

Both read and write can be sustained in parallel at the peak.

Now what we observe, as you can see from this graph, is that we typically have a mean write throughput of 16-20 MB/s (way below our peak), but we’re also peaking above 250 MB/s, which is causing us to get write throttled:

[graph not included: disk write throughput over time]

So really what I believe we need (if possible) is a configuration that is less “bursty” and more sustained over a longer duration. As these are network-attached disks, they suffer from initial I/O latency, but sustained throughput is high.

I’ve also graphed the merge statistics; as you can see, at any given time we have a maximum of 3 concurrent minor merges running, with the occasional major. P95 on the minor merges is typically around 2 minutes, but occasionally (correlating with a throttle on the graph above) we see a minor merge taking 12-15 minutes.

[graph not included: concurrent merge count and merge durations]

Our index policy looks like this:

 

    <ramBufferSizeMB>512</ramBufferSizeMB>

    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
      <int name="maxMergedSegmentMB">5000</int>
      <int name="deletesPctAllowed">30</int>
    </mergePolicyFactory>

    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxThreadCount">10</int>
      <int name="maxMergeCount">15</int>
      <bool name="ioThrottle">true</bool>
    </mergeScheduler>

    <mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer"/>

I feel like I’d be guessing which of these settings might help the scenario I describe above, which is somewhat fine; I can experiment and measure. But the feedback loop is relatively slow, so I wanted to lean on others’ experience/input first. My instinct is to perhaps lower `maxThreadCount`, but seeing as we only ever peak at 3 in-progress merges, it feels like I’d have to go low (2, or even 1), which is on par with spindle disks, which these aren’t.
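
If it helps frame suggestions: the kind of change I’d trial first is simply a lower-concurrency scheduler, something along these lines (the values here are guesses on my part, not something we’ve validated):

    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <!-- fewer concurrent merge threads, trading burstiness for duration -->
      <int name="maxThreadCount">2</int>
      <!-- keep a deep backlog so indexing threads aren't stalled waiting on merges -->
      <int name="maxMergeCount">15</int>
      <bool name="ioThrottle">true</bool>
    </mergeScheduler>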

 

Thanks in advance for any help

 


Re: NRT Merge Load on NAS SSD (Cloud) Advice

Jan Høydahl / Cominvent
Karl,

Your screenshots were lost at my end; perhaps they did not make it to the list. You might consider sharing graphics through an external service.
What do you mean by "heavy write"? Can you quantify your cluster in terms of e.g.
- Number of nodes and spec of each node
- Number of shards and replicas
- Number of docs in total and per shard
- Update rate, and how you do commits?

Jan



Re: NRT Merge Load on NAS SSD (Cloud) Advice

Karl Stoney
Hello Jan!
Thanks for your reply.

    - Number of nodes and spec of each node

9 nodes, each 24 vCPU / 70GB RAM, with a 28GB heap and the rest left for the OS. Disks are 768GB SSDs now (going from 512GB to 768GB made little difference really).

    - Number of shards and replicas

1 shard, 9 replicas, i.e. each node holds all the data (but consequently indexes all the data too). The logic being that, combined with shards.preference=replica.location:local, nodes wouldn't need to proxy reads.

    - Number of docs in total and per shard

1.6 million docs @ 20GB (without deleted docs).

    - Update rate, and how you do commits?

Update rates vary throughout the day, but range from 20 ops/sec to 300 ops/sec. Commits are done using autoCommit on a 1 min interval and softCommit on a 15 min interval.
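
In solrconfig.xml terms that's roughly the following (quoting from memory, so the exact values may differ slightly):

    <autoCommit>
      <!-- hard commit every minute: flushes the tlog, does not open a new searcher -->
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <!-- soft commit every 15 minutes: makes changes visible to searches -->
      <maxTime>900000</maxTime>
    </autoSoftCommit>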




Re: NRT Merge Load on NAS SSD (Cloud) Advice

Jan Høydahl / Cominvent
>    - Update rate, and how you do commits?
>
> Update rates vary throughout the day, but range from 20 ops/sec to 300 ops/sec.  Commits are done using autoCommit on a 1 min interval and softCommit on a 15 min interval.

This means you never explicitly commit from the client? But you autoCommit with openSearcher=false every minute to flush the transaction log, and then autoSoftCommit every 15 min to make changes visible?

I cannot see why these settings would cause constant merging, unless you commit more frequently. Your documents are large, so the RAM buffer will fill up quickly and cause a flush; perhaps increasing ramBufferSizeMB would lower the number of flushes and merges.

Have you considered using PULL replicas for reading? Then you could tailor the hardware on those servers to only serve reads, and they would replicate the index from the leader. You'd have e.g. 2x NRT + 7x PULL.
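
For a new collection that could look something like this (collection and config names are just placeholders):

    /admin/collections?action=CREATE&name=yourCollection
        &numShards=1&nrtReplicas=2&pullReplicas=7
        &collection.configName=yourConfig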

Jan

Re: NRT Merge Load on NAS SSD (Cloud) Advice

Karl Stoney
The documents are pretty large, yes: 650 fields, circa 20KB/document, so at peak (300/sec) that's circa 6MB/sec. ramBufferSizeMB is 512, so we'd be averaging one segment flush every 90 seconds (ish)?

>    This means you never explicitly commit from the client? But you autoCommit with openSearcher=false every minute to flush the transaction log, and then autoSoftCommit every 15 min to make changes visible?

Yes, this is correct.

>    Have you considered using PULL replicas for reading? Then you could tailor the hardware on those servers to only serve reads, and they would replicate the index from the leader. You'd have e.g. 2x NRT + 7x PULL.

I did, but we unfortunately rely on RealTimeGet, so we need NRT.




Re: NRT Merge Load on NAS SSD (Cloud) Advice

Karl Stoney
We've lowered our autoCommit from 1 min to 3 min and that's helped quite a lot with the number of small segments being constantly merged, and has lowered the overall load on Solr. We will continue to monitor. We're tempted to go to 5 minutes, but the size of the tlogs would then be a bit uncomfortable (at 3 mins under peak write load they're about 1.5GB each).
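
For clarity, the only change was the hard commit interval (the soft commit is still 15 minutes), i.e. roughly:

    <autoCommit>
      <!-- was 60000 (1 min); now 3 min between hard commits -->
      <maxTime>180000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>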




Re: NRT Merge Load on NAS SSD (Cloud) Advice

Karl Stoney
I felt the need to follow up on this for others reading.
We experimented with switching `ioThrottle` to `false` in the ConcurrentMergeScheduler settings, and on face value all our problems have gone away.
Merges are significantly faster (average minor merge time is down from 9 minutes to 2) and we're consistently using more of our disks' available IO. As a result we have fewer concurrent merges in flight at any given time.
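
For anyone wanting to replicate this: the only change to the scheduler config was the ioThrottle flag, everything else is as per the original post:

    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxThreadCount">10</int>
      <int name="maxMergeCount">15</int>
      <!-- was true; disables Lucene's adaptive merge IO rate limiting -->
      <bool name="ioThrottle">false</bool>
    </mergeScheduler>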

My concern was that this would impact search performance; however, we rely heavily on caches (filterCache and the IO cache), so apart from just after a softCommit (every 15 mins) we have very little disk read activity at all. Somehow this has actually improved our p95 search latencies significantly during periods of heavy indexing (previously we would see a 30-50% increase at p95 during heavy update periods, and now we're seeing 5% at most).

I'd love for anyone else with experience of running Solr in the cloud who has used this option to speak up, as I can't help but feel I'm missing something critical here; it almost seems too good to be true.

Thanks
Karl

