Solr performance on EC2 linux


Solr performance on EC2 linux

Jeff Wartes

tldr: Recently, I tried moving an existing solrcloud configuration from a local datacenter to EC2. Performance was roughly 1/10th what I’d expected, until I applied a bunch of linux tweaks.

This should’ve been a straight port: one datacenter server -> one EC2 node. Solr 5.4, SolrCloud, Ubuntu Xenial. Nodes were sized in both cases such that the entire index could be cached in memory, and the JVM settings were identical in both places. I applied what should’ve been a comfortable load to the EC2 cluster, and everything exploded. I had to back the rate down to something close to 10% of what I had been getting in the datacenter before latency improved.
Looking around, I was interested to note that under load, user-time CPU usage was being shadowed by an almost equal amount of system CPU time. This was not IOWait, but system time. Strace showed a bunch of time being spent in futex and restart_syscall, but I couldn’t see where to go from there.
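As a sketch of how that user/system split can be observed without extra tooling (plain /proc reads; the strace invocation in the comment is the summary mode I'd reach for, and the pgrep pattern is an assumption about how Solr was launched):

```shell
# The "cpu" line of /proc/stat splits cumulative CPU time into user, nice,
# system, idle, iowait, ... (in jiffies); sampling it under load shows system
# time rivaling user time, and shows that it is distinct from iowait.
awk '/^cpu /{printf "user=%s system=%s iowait=%s\n", $2, $4, $6}' /proc/stat
# To attribute the system time to specific syscalls (futex, restart_syscall,
# ...), attach strace in summary mode to the Solr JVM for a while (needs root;
# the pgrep pattern is a guess at the launcher):
#   sudo strace -c -f -p "$(pgrep -f start.jar | head -1)"
```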

Interestingly, a coworker experimenting with an Elasticsearch (ES 5.x, so a much more recent release) implementation of the same index was not seeing this high-system-time behavior on EC2, and was getting throughput consistent with our general expectations.

Eventually, we came across this: http://www.brendangregg.com/blog/2015-03-03/performance-tuning-linux-instances-on-ec2.html
In direct opposition to the author’s intent (something about taking expired medication), we applied these settings blindly to see what happened. The difference was breathtaking. The system time usage disappeared, and I could apply load at, and even a little above, my expected rates, well within my latency goals.

There are a number of settings involved, and we haven’t isolated for sure which ones made the biggest difference, but my guess at the moment is that it’s the change of clocksource. I think this would be consistent with the observed system time. Note however that using the “tsc” clocksource on EC2 is generally discouraged, because it’s possible to get backwards clock drift.
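For anyone who wants to check or flip the clocksource themselves, a hedged sketch (the sysfs paths are standard on Linux; persisting via GRUB is the usual approach, stated here as an assumption about your boot setup):

```shell
# Which clocksource is the kernel using, and which does it consider usable?
cs=/sys/devices/system/clocksource/clocksource0
cat "$cs/current_clocksource"      # on EC2/Xen instances this is typically "xen"
cat "$cs/available_clocksource"    # e.g. "xen tsc hpet acpi_pm"
# Switch at runtime (needs root; reverts on reboot):
#   echo tsc > /sys/devices/system/clocksource/clocksource0/current_clocksource
# Persist by adding clocksource=tsc to GRUB_CMDLINE_LINUX and running update-grub.
```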

I’m writing this for a few reasons:

1. The performance difference was so crazy that I feel it should be broader knowledge.

2. Is anyone aware of anything that changed in Lucene between 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from this? If the clocksource is the issue, the implication is that Solr was making far more system calls like gettimeofday, which the EC2 (Xen) hypervisor doesn’t service in userspace.

3. Has anyone run Solr with the “tsc” clocksource and encountered any concrete issues?


Re: Solr performance on EC2 linux

Erick Erickson
Well, 6.4.0 had a pretty severe performance issue, so if you were using
that release you might see this; 6.4.2 is the most recent 6.4 release.
But I have no clue how changing Linux settings would alter that, and I
sure can't square that issue with you having such different
performance between local and EC2....

But thanks for telling us about this! It's totally baffling.

Erick


Re: Solr performance on EC2 linux

mganeshs
In reply to this post by Jeff Wartes
We use Solr 6.2 on EC2 instances with CentOS 6.2, and we don't see any difference in performance between EC2 and our local environment.

Re: Solr performance on EC2 linux

Jeff Wartes
In reply to this post by Erick Erickson
I’d like to think I helped a little with the metrics upgrade that got released in 6.4, so I was already watching that and I’m aware of the resulting performance issue.
This was 5.4 though, patched with https://github.com/whitepages/SOLR-4449 - an index we’ve been running for some time now.

Mganeshs’s comment that he doesn’t see a difference on EC2 with Solr 6.2 lends some additional strength to the thought that something changed between Lucene 5.4 and 6.2 (which is used in ES 5), but of course it’s all still pretty anecdotal.




Re: Solr performance on EC2 linux

Shawn Heisey-2
In reply to this post by Jeff Wartes
On 4/28/2017 10:09 AM, Jeff Wartes wrote:
> tldr: Recently, I tried moving an existing solrcloud configuration from a local datacenter to EC2. Performance was roughly 1/10th what I’d expected, until I applied a bunch of linux tweaks.

How very strange.  I knew virtualization would have overhead, possibly
even measurable overhead, but that's insane.  Running on bare metal is
always better if you can do it.  I would be curious what would happen on
your original install if you applied similar tuning to it.  Would you
see a speedup there?

> Interestingly, a coworker playing with a ElasticSearch (ES 5.x, so a much more recent release) alternate implementation of the same index was not seeing this high-system-time behavior on EC2, and was getting throughput consistent with our general expectations.

That's even weirder.  ES 5.x will likely be using Points field types for
numeric fields, and although those are faster than what Solr currently
uses, I doubt it could explain that difference.  The implication here is
that the ES systems are running with stock EC2 settings, not the tuned
settings ... but I'd like you to confirm that.  Same Java version as
with Solr?  IMHO, Java itself is more likely to cause issues like you
saw than Solr.

> I’m writing this for a few reasons:
>
> 1.       The performance difference was so crazy I really feel like this should really be broader knowledge.

Definitely agree!  I would be very interested in learning which of the
tunables you changed were major contributors to the improvement.  If it
turns out that Solr's code is sub-optimal in some way, maybe we can fix it.

> 2.       If anyone is aware of anything that changed in Lucene between 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from this? If it’s the clocksource that’s the issue, there’s an implication that Solr was using tons more system calls like gettimeofday that the EC2 (xen) hypervisor doesn’t allow in userspace.

I had not considered the performance regression in 6.4.0 and 6.4.1 that
Erick mentioned.  Were you still running Solr 5.4, or was it a 6.x version?

=============

Specific thoughts on the tuning:

The noatime option is very good to use.  I also use nodiratime on my
systems.  Turning these off can have *massive* impacts on disk
performance.  If these are the source of the speedup, then the machine
doesn't have enough spare memory.

I'd be wary of the "nobarrier" mount option.  If the underlying storage
has battery-backed write caches, or is SSD without write caching, it
wouldn't be a problem.  Here's info about the "discard" mount option; I
don't know whether it applies to your Amazon storage:

       discard/nodiscard
              Controls whether ext4 should issue discard/TRIM commands to the
              underlying block device when blocks are freed.  This is useful
              for SSD devices and sparse/thinly-provisioned LUNs, but it is
              off by default until sufficient testing has been done.
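Pulling the mount options discussed above together, an /etc/fstab line looks roughly like this. This is a hedged sketch: the device and mount point are placeholder assumptions, and whether nobarrier and discard are appropriate depends on the storage, per the caveats above.

```shell
# /etc/fstab sketch -- device and mount point are illustrative placeholders.
# noatime stops access-time writes on reads; nodiratime is the directory
# analogue; nobarrier and discard carry the storage-dependent caveats above.
/dev/xvdf  /var/solr/data  ext4  defaults,noatime,nodiratime,nobarrier,discard  0 2
```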

The network tunables would have more of an effect in a distributed
environment like EC2 than they would on a LAN.
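As an illustration of what such network tunables look like as an /etc/sysctl.d fragment: the keys below are real Linux sysctls, but the values are illustrative assumptions, not taken from this thread.

```shell
# /etc/sysctl.d/90-tuning.conf -- illustrative values, not a prescription.
net.core.somaxconn = 1024               # deeper accept queue for bursty connections
net.core.netdev_max_backlog = 5000      # per-NIC receive queue before the kernel drops
net.ipv4.tcp_max_syn_backlog = 8192     # half-open connection queue depth
net.ipv4.tcp_slow_start_after_idle = 0  # don't reset cwnd on idle keepalive conns
# Apply with: sudo sysctl --system
```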

Thanks,
Shawn


Re: Solr performance on EC2 linux

JohnB
It's also very important to consider the type of EC2 instance you are
using...

We settled on the R4.2XL...  The R series is labeled "High-Memory"

Which instance type did you end up using?


Re: Solr performance on EC2 linux

Jeff Wartes
I tried a few variations of various things before we found and tried that linux/EC2 tuning page, including:
  - EC2 instance type: r4, c4, and i3
  - Ubuntu version: Xenial and Trusty
  - EBS vs local storage
  - Stock openjdk vs Zulu openjdk (Recent java8 in both cases - I’m aware of the issues with early java8 versions and I’m not using G1)

Most of those attempts were to help reduce differences between the data center and the EC2 cluster. In all cases I re-indexed from scratch. I got the same very high system-time symptom in all cases. With the Linux changes in place, we settled on r4/Xenial/EBS/Stock.

Again, this was a slightly modified Solr 5.4 (I added backup requests, and two memory allocation rate tweaks that have long since been merged into mainline - released in 6.2, I think; I can dig up the JIRA numbers if anyone’s interested). I’ve never used Solr 6.x in production, though.
The only reason I mentioned 6.x at all is because I’m aware that ES 5.x is based on Lucene 6.2. I don’t believe my coworker spent any time on tuning his ES setup, although I think he did try G1.

I definitely do want to binary-search those settings until I understand better what exactly did the trick.
The problem is the long cycle time per test, but hopefully in the next couple of weeks.





Re: Solr performance on EC2 linux

Chris Hostetter-3
In reply to this post by Jeff Wartes

: tldr: Recently, I tried moving an existing solrcloud configuration from
: a local datacenter to EC2. Performance was roughly 1/10th what I’d
: expected, until I applied a bunch of linux tweaks.

How many total nodes in your cluster?  How many of them running ZooKeeper?

Did you observe the heavy increase in system time CPU usage on all nodes,
or just the ones running zookeeper?

I ask because if your speculation is correct and it is an issue of
clocksource, then perhaps ZK is where the majority of those system calls
are happening, and perhaps that's why you didn't see any similar heavy
system CPU load in ES?  

(Then again: at the lowest levels "lucene" really shouldn't care about
anything clock related at all. Any "time"-related code would live at the
Solr level ... hmmm.)


-Hoss
http://www.lucidworks.com/

Re: Solr performance on EC2 linux

Walter Underwood
You might want to measure the single-CPU performance of your EC2 instance. The last time I checked, my MacBook was twice as fast as the EC2 instance I was using.
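A crude way to make that comparison, as a hedged sketch: time a fixed chunk of single-threaded work in plain shell arithmetic, so it runs anywhere without extra tooling. The absolute number is meaningless; only the ratio between machines matters.

```shell
# Time a fixed amount of single-threaded CPU work; compare wall time across hosts.
start=$(date +%s%N)   # GNU date nanosecond timestamp (Linux)
i=0
while [ "$i" -lt 500000 ]; do i=$((i+1)); done
end=$(date +%s%N)
echo "500000 iterations in $(( (end - start) / 1000000 )) ms on a single core"
```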

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)



Re: Solr performance on EC2 linux

Will Martin
In reply to this post by Jeff Wartes
Ubuntu 16.04 LTS - Xenial (HVM)

Is this your Xenial version?






Re: Solr performance on EC2 linux

Jeff Wartes
In reply to this post by Walter Underwood
I started with the same three-node, 15-shard configuration I’d been using, in an RF=1 cluster. (The index is almost 700G, so this takes three r4.8xlarge’s if I want it entirely memory-resident.) I eventually dropped down to a 1/3rd-size index on a single node (so 5 shards, 100M docs each) so I could test configurations more quickly. The system time usage was present on all Solr nodes regardless. I adjusted for the difference in CPU count on the EC2 nodes when I picked my load-testing rates.

ZooKeeper is a separate cluster on separate nodes. It is NOT co-located with Solr, although it’s dedicated exclusively to Solr’s use.

I specify a timeout on all queries and, as mentioned, use SOLR-4449, so there’s possibly an argument that I’m making a lot more timing-related calls than most. There’s nothing particularly exotic there, though: just another ExecutorService, and you’ll never get a backup request in an RF=1 cluster, because there’s no alternate replica to try.


On 5/1/17, 6:28 PM, "Walter Underwood" <[hidden email]> wrote:

    Might want to measure the single CPU performance of your EC2 instance. The last time I checked, my MacBook was twice as fast as the EC2 instance I was using.
   
    wunder
    Walter Underwood
    [hidden email]
    https://linkprotect.cudasvc.com/url?a=http://observer.wunderwood.org/&c=E,1,L0yDngRyy1MwN7dh5tRFW86sVcn6tcLZH4c03j0EdQSsGBMn0SLDqeB_sHQjB4DdbRMOLka5MnyeXnKS_CEUEv4qIgU5wuyhZBMHciVoH6e8uo7KGr09mXTtDw,,&typo=0  (my blog)
   
   
    > On May 1, 2017, at 6:24 PM, Chris Hostetter <[hidden email]> wrote:
    >
    >
    > : tldr: Recently, I tried moving an existing solrcloud configuration from
    > : a local datacenter to EC2. Performance was roughly 1/10th what I’d
    > : expected, until I applied a bunch of linux tweaks.
    >
    > How many total nodes in your cluster?  How many of them running ZooKeeper?
    >
    > Did you observe the heavy increase in system time CPU usage on all nodes,
    > or just the ones running zookeeper?
    >
    > I ask because if your speculation is correct and it is an issue of
    > clocksource, then perhaps ZK is where the majority of those system calls
    > are happening, and perhaps that's why you didn't see any similar heavy
    > system CPU load in ES?  
    >
    > (Then again: at the lowest levels "lucene" really shouldn't care about
    > anything clock related at all. Any "time" related code would live in the
    > Solr level ... hmmm.)
    >
    >
    > -Hoss
    > http://www.lucidworks.com/
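If the clocksource speculation is right, the cheapest sanity check is to ask the kernel which clocksource is in use. A small probe, assuming the standard Linux sysfs path (on EC2 xen instances this typically reports "xen" unless it has been switched, e.g. to the discouraged "tsc"):

```java
// Sketch: report which kernel clocksource the JVM's timing calls will hit.
// Linux-specific sysfs path; falls back gracefully elsewhere.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ClocksourceCheck {
    static String currentClocksource() {
        try {
            return new String(Files.readAllBytes(Paths.get(
                    "/sys/devices/system/clocksource/clocksource0/current_clocksource")))
                    .trim();
        } catch (IOException e) {
            return "unknown (not Linux, or sysfs unavailable)";
        }
    }

    public static void main(String[] args) {
        System.out.println("clocksource: " + currentClocksource());
    }
}
```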
   
   


Re: Solr performance on EC2 linux

Jeff Wartes
In reply to this post by Will Martin
Yes, that’s the Xenial I tried. Ubuntu 16.04.2 LTS.

On 5/1/17, 7:22 PM, "Will Martin" <[hidden email]> wrote:

    Ubuntu 16.04 LTS - Xenial (HVM)
   
    Is this your Xenial version?
   
   
   
   
    On 5/1/2017 6:37 PM, Jeff Wartes wrote:
    > I tried a few variations of various things before we found and tried that linux/EC2 tuning page, including:
    >    - EC2 instance type: r4, c4, and i3
    >    - Ubuntu version: Xenial and Trusty
    >    - EBS vs local storage
    >    - Stock openjdk vs Zulu openjdk (Recent java8 in both cases - I’m aware of the issues with early java8 versions and I’m not using G1)
    >
    > Most of those attempts were to help reduce differences between the data center and the EC2 cluster. In all cases I re-indexed from scratch. I got the same very high system-time symptom in all cases. With the linux changes in place, we settled on r4/Xenial/EBS/Stock.
    >
    > Again, this was a slightly modified Solr 5.4, (I added backup requests, and two memory allocation rate tweaks that have long since been merged into mainline - released in 6.2 I think. I can dig up the jira numbers if anyone’s interested) I’ve never used Solr 6.x in production though.
    > The only reason I mentioned 6.x at all is because I’m aware that ES 5.x is based on Lucene 6.2. I don’t believe my coworker spent any time on tuning his ES setup, although I think he did try G1.
    >
    > I definitely do want to binary-search those settings until I understand better what exactly did the trick.
    > The problem is the long cycle time per test, but hopefully I’ll get to it in the next couple of weeks.
    >
    >
    >
    > On 5/1/17, 7:26 AM, "John Bickerstaff" <[hidden email]> wrote:
    >
    >      It's also very important to consider the type of EC2 instance you are
    >      using...
    >      
    >      We settled on the R4.2XL...  The R series is labeled "High-Memory"
    >      
    >      Which instance type did you end up using?
    >      
    >      On Mon, May 1, 2017 at 8:22 AM, Shawn Heisey <[hidden email]> wrote:
    >      
    >      > On 4/28/2017 10:09 AM, Jeff Wartes wrote:
    >      > > tldr: Recently, I tried moving an existing solrcloud configuration from
    >      > a local datacenter to EC2. Performance was roughly 1/10th what I’d
    >      > expected, until I applied a bunch of linux tweaks.
    >      >
    >      > How very strange.  I knew virtualization would have overhead, possibly
    >      > even measurable overhead, but that's insane.  Running on bare metal is
    >      > always better if you can do it.  I would be curious what would happen on
    >      > your original install if you applied similar tuning to that.  Would you
    >      > see a speedup there?
    >      >
    >      > > Interestingly, a coworker playing with a ElasticSearch (ES 5.x, so a
    >      > much more recent release) alternate implementation of the same index was
    >      > not seeing this high-system-time behavior on EC2, and was getting
    >      > throughput consistent with our general expectations.
    >      >
    >      > That's even weirder.  ES 5.x will likely be using Points field types for
    >      > numeric fields, and although those are faster than what Solr currently
    >      > uses, I doubt it could explain that difference.  The implication here is
    >      > that the ES systems are running with stock EC2 settings, not the tuned
    >      > settings ... but I'd like you to confirm that.  Same Java version as
    >      > with Solr?  IMHO, Java itself is more likely to cause issues like you
    >      > saw than Solr.
    >      >
    >      > > I’m writing this for a few reasons:
    >      > >
    >      > > 1.       The performance difference was so crazy I really feel like this
    >      > should really be broader knowledge.
    >      >
    >      > Definitely agree!  I would be very interested in learning which of the
    >      > tunables you changed were major contributors to the improvement.  If it
    >      > turns out that Solr's code is sub-optimal in some way, maybe we can fix it.
    >      >
    >      > > 2.       If anyone is aware of anything that changed in Lucene between
    >      > 5.4 and 6.x that could explain why Elasticsearch wasn’t suffering from
    >      > this? If it’s the clocksource that’s the issue, there’s an implication that
    >      > Solr was using tons more system calls like gettimeofday that the EC2 (xen)
    >      > hypervisor doesn’t allow in userspace.
    >      >
    >      > I had not considered the performance regression in 6.4.0 and 6.4.1 that
    >      > Erick mentioned.  Were you still running Solr 5.4, or was it a 6.x version?
    >      >
    >      > =============
    >      >
    >      > Specific thoughts on the tuning:
    >      >
    >      > The noatime option is very good to use.  I also use nodiratime on my
    >      > systems.  Turning these off can have *massive* impacts on disk
    >      > performance.  If these are the source of the speedup, then the machine
    >      > doesn't have enough spare memory.
    >      >
    >      > I'd be wary of the "nobarrier" mount option.  If the underlying storage
    >      > has battery-backed write caches, or is SSD without write caching, it
    >      > wouldn't be a problem.  Here's info about the "discard" mount option, I
    >      > don't know whether it applies to your amazon storage:
    >      >
    >      >        discard/nodiscard
    >      >               Controls whether ext4 should issue discard/TRIM
    >      >               commands to the underlying block device when blocks
    >      >               are freed. This is useful for SSD devices and
    >      >               sparse/thinly-provisioned LUNs, but it is off by
    >      >               default until sufficient testing has been done.
    >      >
    >      > The network tunables would have more of an effect in a distributed
    >      > environment like EC2 than they would on a LAN.
    >      >
    >      > Thanks,
    >      > Shawn
    >      >
    >      >
    >      
    >
   
   


Re: Solr performance on EC2 linux

Chris Hostetter-3
In reply to this post by Jeff Wartes

: I specify a timeout on all queries, ....

Ah -- ok, yeah -- you mean using "timeAllowed" correct?

If the root issue you were seeing is in fact clocksource related,
then using timeAllowed would probably be a significant compounding
factor there since it would involve a lot of time checks in a single
request (even w/o any debugging enabled)

(did your coworker's experiments with ES use any sort of equivalent
timeout feature?)





-Hoss
http://www.lucidworks.com/

Re: Solr performance on EC2 linux

Walter Underwood
Hmm, has anyone measured the overhead of timeAllowed? We use it all the time.

If nobody has, I’ll run a benchmark with and without it.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)


> On May 2, 2017, at 9:52 AM, Chris Hostetter <[hidden email]> wrote:
>
>
> : I specify a timeout on all queries, ....
>
> Ah -- ok, yeah -- you mean using "timeAllowed" correct?
>
> If the root issue you were seeing is in fact clocksource related,
> then using timeAllowed would probably be a significant compounding
> factor there since it would involve a lot of time checks in a single
> request (even w/o any debugging enabled)
>
> > (did your coworker's experiments with ES use any sort of equivalent
> timeout feature?)
>
>
>
>
>
> -Hoss
> http://www.lucidworks.com/


Re: Solr performance on EC2 linux

Tomás Fernández Löbbe
I remember seeing some performance impact (even when not using it) and it
was attributed to the calls to System.nanoTime. See SOLR-7875 and SOLR-7876
(fixed for 5.3 and 5.4). Those two Jiras fix the impact when timeAllowed is
not used, but I don't know if there were more changes to improve the
performance of the feature itself. The problem was that System.nanoTime may
be called too many times on indices with many different terms. If this is
the problem Jeff is seeing, a small degradation of System.nanoTime could
have a big impact.
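A rough way to see whether System.nanoTime itself is expensive on a given box is a micro-benchmark along these lines (single run, no proper JIT warmup, so treat the result as an order of magnitude only):

```java
// Rough probe of what a single System.nanoTime() call costs on this box.
// If each call drops from the vDSO to a real syscall (as can happen under
// the xen clocksource), the per-call figure balloons.
public class NanoTimeCost {
    static double nsPerCall(int calls) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += System.nanoTime(); // the call under test
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.print(""); // defeat dead-code elimination
        return (double) elapsed / calls;
    }

    public static void main(String[] args) {
        System.out.printf("~%.1f ns per System.nanoTime() call%n", nsPerCall(5_000_000));
    }
}
```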

Tomás

On Tue, May 2, 2017 at 10:23 AM, Walter Underwood <[hidden email]>
wrote:

> Hmm, has anyone measured the overhead of timeAllowed? We use it all the
> time.
>
> If nobody has, I’ll run a benchmark with and without it.
>
> wunder
> Walter Underwood
> [hidden email]
> http://observer.wunderwood.org/  (my blog)
>
>
> > On May 2, 2017, at 9:52 AM, Chris Hostetter <[hidden email]>
> wrote:
> >
> >
> > : I specify a timeout on all queries, ....
> >
> > Ah -- ok, yeah -- you mean using "timeAllowed" correct?
> >
> > If the root issue you were seeing is in fact clocksource related,
> > then using timeAllowed would probably be a significant compounding
> > factor there since it would involve a lot of time checks in a single
> > request (even w/o any debugging enabled)
> >
    > > (did your coworker's experiments with ES use any sort of equivalent
> > timeout feature?)
> >
> >
> >
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
>
>

Re: Solr performance on EC2 linux

Jeff Wartes

It’s presumably not a small degradation - this guy very recently suggested it’s 77% slower:
https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/

The other reason that blog post is interesting to me is that his benchmark utility showed the work of entering the kernel as high system time, which is also what I was seeing.

I really want to go back and try some more tests, including (now) disabling the timeAllowed param in my query corpus.
I think I’m still a few weeks of higher priority issues away from that though.
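One way to approximate that test offline is to time the same loop with and without a per-iteration clock check, which is roughly the kind of work timeAllowed adds inside the collectors. A sketch only; the real impact depends on the clocksource and on how many docs and terms a query actually visits:

```java
// Back-of-envelope: identical counting loops, one with a per-iteration
// System.nanoTime() "deadline check" that never fires, so both paths do
// the same arithmetic work. Illustrative numbers only.
public class TimeCheckOverhead {
    static long work(int n, boolean checkClock) {
        long deadline = Long.MAX_VALUE; // never expires
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc += i;
            if (checkClock && System.nanoTime() > deadline) break;
        }
        return acc;
    }

    public static void main(String[] args) {
        int n = 5_000_000;
        work(n, true); work(n, false); // crude warmup
        long t0 = System.nanoTime(); work(n, false); long plain = System.nanoTime() - t0;
        t0 = System.nanoTime(); work(n, true); long checked = System.nanoTime() - t0;
        System.out.println("plain=" + plain + "ns  with-clock-checks=" + checked + "ns");
    }
}
```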


On 5/2/17, 1:45 PM, "Tomás Fernández Löbbe" <[hidden email]> wrote:

    I remember seeing some performance impact (even when not using it) and it
    was attributed to the calls to System.nanoTime. See SOLR-7875 and SOLR-7876
    (fixed for 5.3 and 5.4). Those two Jiras fix the impact when timeAllowed is
    not used, but I don't know if there were more changes to improve the
    performance of the feature itself. The problem was that System.nanoTime may
    be called too many times on indices with many different terms. If this is
    the problem Jeff is seeing, a small degradation of System.nanoTime could
    have a big impact.
   
    Tomás
   
    On Tue, May 2, 2017 at 10:23 AM, Walter Underwood <[hidden email]>
    wrote:
   
    > Hmm, has anyone measured the overhead of timeAllowed? We use it all the
    > time.
    >
    > If nobody has, I’ll run a benchmark with and without it.
    >
    > wunder
    > Walter Underwood
    > [hidden email]
    > http://observer.wunderwood.org/  (my blog)
    >
    >
    > > On May 2, 2017, at 9:52 AM, Chris Hostetter <[hidden email]>
    > wrote:
    > >
    > >
    > > : I specify a timeout on all queries, ....
    > >
    > > Ah -- ok, yeah -- you mean using "timeAllowed" correct?
    > >
    > > If the root issue you were seeing is in fact clocksource related,
    > > then using timeAllowed would probably be a significant compounding
    > > factor there since it would involve a lot of time checks in a single
    > > request (even w/o any debugging enabled)
    > >
    > > (did your coworker's experiments with ES use any sort of
    > > equivalent timeout feature?)
    > >
    > >
    > >
    > >
    > >
    > > -Hoss
    > > http://www.lucidworks.com/
    >
    >
   


Re: Solr performance on EC2 linux

Rick Leir-2
+Walter test it

Jeff,
How much CPU does the EC2 hypervisor use? I have heard 5%, but that is for a normal workload, and it is mostly consumed during system calls or context switches. So it is quite understandable that frequent time calls would take a bigger bite in the AWS cloud compared to bare metal. Sorry, my words are mostly conjecture so please ignore. Cheers -- Rick

On May 3, 2017 2:35:33 PM EDT, Jeff Wartes <[hidden email]> wrote:

>
>It’s presumably not a small degradation - this guy very recently
>suggested it’s 77% slower:
>https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/
>
>The other reason that blog post is interesting to me is that his
>benchmark utility showed the work of entering the kernel as high system
>time, which is also what I was seeing.
>
>I really want to go back and try some more tests, including (now)
>disabling the timeAllowed param in my query corpus.
>I think I’m still a few weeks of higher priority issues away from that
>though.
>
>
>On 5/2/17, 1:45 PM, "Tomás Fernández Löbbe" <[hidden email]>
>wrote:
>
>I remember seeing some performance impact (even when not using it) and it
>was attributed to the calls to System.nanoTime. See SOLR-7875 and SOLR-7876
>(fixed for 5.3 and 5.4). Those two Jiras fix the impact when timeAllowed is
>not used, but I don't know if there were more changes to improve the
>performance of the feature itself. The problem was that System.nanoTime may
>be called too many times on indices with many different terms. If this is
>the problem Jeff is seeing, a small degradation of System.nanoTime could
>have a big impact.
>
>    Tomás
>
>On Tue, May 2, 2017 at 10:23 AM, Walter Underwood <[hidden email]>
>    wrote:
>
>    > Hmm, has anyone measured the overhead of timeAllowed? We use it all the
>    > time.
>    >
>    > If nobody has, I’ll run a benchmark with and without it.
>    >
>    > wunder
>    > Walter Underwood
>    > [hidden email]
>    > http://observer.wunderwood.org/  (my blog)
>    >
>    >
>    > > On May 2, 2017, at 9:52 AM, Chris Hostetter <[hidden email]> wrote:
>    > >
>    > >
>    > > : I specify a timeout on all queries, ....
>    > >
>    > > Ah -- ok, yeah -- you mean using "timeAllowed" correct?
>    > >
>    > > If the root issue you were seeing is in fact clocksource related,
>    > > then using timeAllowed would probably be a significant compounding
>    > > factor there since it would involve a lot of time checks in a single
>    > > request (even w/o any debugging enabled)
>    > >
>    > > (did your coworker's experiments with ES use any sort of equivalent
>    > > timeout feature?)
>    > >
>    > >
>    > > -Hoss
>    > > http://www.lucidworks.com/
>    >
>    >

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com

Re: Solr performance on EC2 linux

Walter Underwood
Already have a Jira issue for next week. I have a script to run prod logs against a cluster. I’ll be testing a four shard by two replica cluster with 17 million docs and very long queries. We are working on getting the 95th percentile under one second, so we should exercise the timeAllowed feature.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)


> On May 3, 2017, at 3:53 PM, Rick Leir <[hidden email]> wrote:
>
> +Walter test it
>
> Jeff,
> How much CPU does the EC2 hypervisor use? I have heard 5%, but that is for a normal workload, and it is mostly consumed during system calls or context switches. So it is quite understandable that frequent time calls would take a bigger bite in the AWS cloud compared to bare metal. Sorry, my words are mostly conjecture so please ignore. Cheers -- Rick
>
> On May 3, 2017 2:35:33 PM EDT, Jeff Wartes <[hidden email]> wrote:
>>
>> It’s presumably not a small degradation - this guy very recently
>> suggested it’s 77% slower:
>> https://blog.packagecloud.io/eng/2017/03/08/system-calls-are-much-slower-on-ec2/
>>
>> The other reason that blog post is interesting to me is that his
>> benchmark utility showed the work of entering the kernel as high system
>> time, which is also what I was seeing.
>>
>> I really want to go back and try some more tests, including (now)
>> disabling the timeAllowed param in my query corpus.
>> I think I’m still a few weeks of higher priority issues away from that
>> though.
>>
>>
>> On 5/2/17, 1:45 PM, "Tomás Fernández Löbbe" <[hidden email]>
>> wrote:
>>
>> I remember seeing some performance impact (even when not using it) and it
>> was attributed to the calls to System.nanoTime. See SOLR-7875 and SOLR-7876
>> (fixed for 5.3 and 5.4). Those two Jiras fix the impact when timeAllowed is
>> not used, but I don't know if there were more changes to improve the
>> performance of the feature itself. The problem was that System.nanoTime may
>> be called too many times on indices with many different terms. If this is
>> the problem Jeff is seeing, a small degradation of System.nanoTime could
>> have a big impact.
>>
>> Tomás
>>
>> On Tue, May 2, 2017 at 10:23 AM, Walter Underwood <[hidden email]>
>> wrote:
>>
>>> Hmm, has anyone measured the overhead of timeAllowed? We use it all the
>>> time.
>>>
>>> If nobody has, I’ll run a benchmark with and without it.
>>>
>>> wunder
>>> Walter Underwood
>>> [hidden email]
>>> http://observer.wunderwood.org/  (my blog)
>>>
>>>
>>>> On May 2, 2017, at 9:52 AM, Chris Hostetter <[hidden email]> wrote:
>>>>
>>>>
>>>> : I specify a timeout on all queries, ....
>>>>
>>>> Ah -- ok, yeah -- you mean using "timeAllowed" correct?
>>>>
>>>> If the root issue you were seeing is in fact clocksource related,
>>>> then using timeAllowed would probably be a significant compounding
>>>> factor there since it would involve a lot of time checks in a single
>>>> request (even w/o any debugging enabled)
>>>>
>>>> (did your coworker's experiments with ES use any sort of equivalent
>>>> timeout feature?)
>>>>
>>>>
>>>> -Hoss
>>>> http://www.lucidworks.com/
>>>
>>>
>>
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com