Solrcloud performance issues

Vijay Sekhri
Hi Erick,
We have the following configuration for our SolrCloud cluster:

   1. 10 shards
   2. 15 replicas per shard
   3. 9 GB of index per shard
   4. About 90 million documents in total
   5. 2 collections: search1 serves live traffic and search2 is used for
   indexing. We swap the collections when indexing finishes
   6. 150 hosts, each running 2 JVMs: one for the search1 collection and
   the other for the search2 collection
   7. Each JVM has 12 GB of heap assigned to it, while the host has 50 GB
   in total
   8. Each host has 16 processors
   9. Linux XXXXXXX 2.6.32-431.5.1.el6.x86_64 #1 SMP Wed Feb 12 00:41:43
   UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
   10. We have two ways to index data:
      1. Bulk indexing. All 90 million docs are pumped in from 14 parallel
      processes (on 14 different client hosts). This is done on the
      collection that is not serving live traffic
      2. Incremental indexing. Only delta changes (ranging from 100K to
      5 million docs) every two hours. This is done on the collection that
      is also serving live traffic
   11. The live collection handles around 300 requests per second
   12. Hard commit is every 30 seconds with openSearcher=false, and soft
   commit is every 15 minutes. We have tried a lot of different settings
   here, BTW.
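
For reference, the figures in the list above imply some rough totals. This is only a back-of-the-envelope sketch: it assumes that "15 replicas per shard" counts the leader (which matches 10 x 15 = 150 hosts), that every host holds one core of each collection, and that both collections are about the same size. The variable names are illustrative.

```python
# Back-of-the-envelope sizing from the numbers in the list above.
# Assumption: "15 replicas per shard" includes the leader, so each
# collection has 10 * 15 = 150 cores, one per host.
shards = 10
replicas_per_shard = 15
total_docs = 90_000_000
index_gb_per_shard = 9
heap_gb_per_jvm = 12
jvms_per_host = 2
host_ram_gb = 50

cores_per_collection = shards * replicas_per_shard
docs_per_shard = total_docs // shards
# Every document is indexed once on each copy of its shard.
index_ops_for_bulk_load = total_docs * replicas_per_shard
heap_gb_per_host = heap_gb_per_jvm * jvms_per_host
ram_left_for_os_cache_gb = host_ram_gb - heap_gb_per_host
# Assumption: one 9 GB core per collection on each host.
index_gb_per_host = index_gb_per_shard * jvms_per_host

print("cores per collection:", cores_per_collection)
print("docs per shard:", docs_per_shard)
print("index operations for one bulk load:", index_ops_for_bulk_load)
print("heap per host (GB):", heap_gb_per_host)
print("RAM left for the OS page cache (GB):", ram_left_for_os_cache_gb)
print("index on disk per host (GB):", index_gb_per_host)
```

Under these assumptions a full bulk load means 1.35 billion index operations across the cluster, and each host has about 26 GB outside the JVM heaps to cache roughly 18 GB of index.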




Now we have two issues with indexing.
1) Solr cannot keep up with bulk indexing when the replicas are also
active. We concluded this by trying 2, then 4, and then 15 replicas. As
the number of replicas increases, the bulk indexing time increases almost
exponentially.
We seem to have encountered the same issue reported here:
https://issues.apache.org/jira/browse/SOLR-6816
It gets to the point where the Solr cluster takes 300 seconds to index
even 100 docs. It starts off indexing 100 docs in 55 milliseconds, slows
down gradually, and within an hour and a half it can no longer keep up.
Our workaround is to stop all the replicas, do the bulk indexing, and
bring the replicas back up one by one. This sort of defeats the purpose of
SolrCloud, but we can live with it because bulk indexing happens on the
collection that is not serving live traffic. However, we would love to
have a solution from SolrCloud itself, such as an API to stop replication
and start it again at the end of indexing.

2) This issue is related to soft commits during incremental indexing.
Incremental indexing is done on the same collection that serves live
traffic at 300 requests per second. Everything is fine except when a soft
commit happens. Each time a soft commit (autoSoftCommit in
solrconfig.xml) happens, which BTW is at almost the same time throughout
the cluster, response times spike and throughput drops to almost 150 TPS.
The spike lasts for 2 minutes and then recurs at exactly the soft commit
interval. We have monitored the logs and found a direct correlation
between when the soft commit happens and when the response time tanks.
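
One way to check this kind of correlation is to compare latencies inside and outside a window after each soft commit. The sketch below is hypothetical: the function name, the timestamps, and the latency values are all made up for illustration, and only the 2-minute window comes from the description above.

```python
from statistics import mean

def split_by_commit_window(samples, commit_times, window_s=120):
    """Split (timestamp_s, latency_ms) samples into latencies recorded within
    window_s seconds after any soft commit and latencies recorded otherwise."""
    inside, outside = [], []
    for ts, latency in samples:
        in_window = any(0 <= ts - c < window_s for c in commit_times)
        (inside if in_window else outside).append(latency)
    return inside, outside

# Made-up data for illustration: soft commits at 900 s and 1800 s
# (a 15-minute interval) and a few latency samples around them.
commit_times = [900, 1800]
samples = [
    (0, 40), (300, 42), (600, 38),
    (900, 85), (960, 80), (1020, 41),
    (1500, 39),
    (1800, 90), (1860, 78), (1920, 40),
]

inside, outside = split_by_commit_window(samples, commit_times)
print(f"mean latency within 2 min of a soft commit: {mean(inside):.2f} ms")
print(f"mean latency otherwise: {mean(outside):.2f} ms")
```

With real data, the samples would come from request logs and the commit times from the Solr logs.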

The latter issue is quite disturbing, because this collection serves live
traffic and we cannot sustain these periodic degradations. We have played
around with different soft commit settings: intervals ranging from 2
minutes to 30 minutes; auto-warming half the cache, the full cache, and
only 10% of it; running warm-up queries on every new searcher, and running
no warm-up queries at all. All of these settings yield the same result.
Whenever a soft commit happens, the response time tanks and throughput
decreases. The difference is almost 50% in response times and 50% in
throughput.


Our workaround for this problem is to also do the incremental delta
indexing on the collection that is not serving live traffic and swap when
it is done. As you can see, this also defeats the purpose of SolrCloud. We
cannot do bulk indexing because the replicas cannot keep up, and we cannot
do incremental indexing because of the soft commit performance.

Is there a way to keep the cluster from doing soft commits all at the same
time, or a way to keep soft commits from causing this degradation?
We are open to any ideas at this point.






--
*********************************************
Vijay Sekhri
*********************************************

Re: Solrcloud performance issues

Timothy Potter
Hi Vijay,


We're working on SOLR-6816 ... would love for you to be a test site for any
improvements we make ;-)

Curious if you've experimented with changing the mergeFactor to a higher
value, such as 25 and what happens if you set soft-auto-commits to
something lower like 15 seconds? Also, make sure your indexing clients are
not sending hard-commits as well, i.e. just rely on auto-commits.
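
As a rough illustration of the settings mentioned above, the following sketch renders the relevant solrconfig.xml fragments for a given set of values. The helper name `commit_settings` is hypothetical; the element names are the ones used in Solr 4.x-era solrconfig.xml files (newer versions configure merging differently), so check them against the version you run.

```python
import xml.etree.ElementTree as ET

def commit_settings(hard_commit_ms, soft_commit_ms, merge_factor):
    """Return the <indexConfig> and <updateHandler> fragments as XML strings."""
    index_config = ET.Element("indexConfig")
    ET.SubElement(index_config, "mergeFactor").text = str(merge_factor)

    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})
    auto_commit = ET.SubElement(handler, "autoCommit")
    ET.SubElement(auto_commit, "maxTime").text = str(hard_commit_ms)
    # Hard commits flush to disk but do not open a new searcher.
    ET.SubElement(auto_commit, "openSearcher").text = "false"
    auto_soft = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(auto_soft, "maxTime").text = str(soft_commit_ms)

    return [ET.tostring(e, encoding="unicode") for e in (index_config, handler)]

# Values from this thread: 30 s hard commit, 15 s soft commit, mergeFactor 25.
fragments = commit_settings(30_000, 15_000, 25)
for fragment in fragments:
    print(fragment)
```

The printed fragments are meant to be merged by hand into the existing `<indexConfig>` and `<updateHandler>` sections, not to replace the whole file.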

re: "When the number of replicas increases the bulk indexing time increase
almost exponentially" ... ugh ... I'm wondering what your CPU utilization
and thread counts are? The leader sends updates to all replicas in
parallel, so it shouldn't make a huge difference whether you have 1
replica or 15 (probably a little more overhead with 15, but not
exponential for sure) ... what are the threads waiting on when this huge
slowdown occurs? jstack -l <PID> should give you some idea.
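
A quick way to summarize such a dump is to count threads per state. This is a minimal sketch that assumes the usual HotSpot jstack layout, where each thread has a `java.lang.Thread.State:` line; the function name and the abbreviated sample dump (thread names, ids) are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "   java.lang.Thread.State: WAITING (parking)".
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State: (\w+)", re.MULTILINE)

def tally_thread_states(jstack_text):
    """Count threads per state in the text output of jstack."""
    return Counter(STATE_RE.findall(jstack_text))

# Abbreviated, made-up dump for illustration only.
sample_dump = '''
"qtp-21" #21 prio=5 tid=0x01 nid=0x1 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"qtp-22" #22 prio=5 tid=0x02 nid=0x2 runnable
   java.lang.Thread.State: RUNNABLE
"qtp-23" #23 prio=5 tid=0x03 nid=0x3 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)
"qtp-24" #24 prio=5 tid=0x04 nid=0x4 waiting on condition
   java.lang.Thread.State: WAITING (parking)
'''

states = tally_thread_states(sample_dump)
for state, count in states.most_common():
    print(state, count)
```

For a real dump, save the output of jstack -l <PID> to a file and pass its contents to the function. A large and growing number of WAITING or BLOCKED threads during the slowdown would point at what to inspect next.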

Lastly, do you have GC logging enabled and have you ruled out GC pauses
causing the big slow down?

On Thu, Feb 12, 2015 at 4:07 PM, Vijay Sekhri <[hidden email]> wrote:

> [...]

Re: Solrcloud performance issues

Otis Gospodnetić
In reply to this post by Vijay Sekhri
Hi,

Did you say you have 150 servers in this cluster?  And 10 shards for just
90M docs?  If so, 150 hosts sounds like too many given all the other
numbers I see here.  I'd love to see some metrics, e.g. what happens with
disk IO around those commits?  How about GC time/size info?  Are the JVM
memory pools full-ish, and is the CPU jumping like crazy?  Can you share
more info to give us a more complete picture of your system? SPM for Solr
<http://sematext.com/spm/> will help if you don't already capture these
types of things.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Feb 12, 2015 at 11:07 AM, Vijay Sekhri <[hidden email]>
wrote:

> [...]

Re: Solrcloud performance issues

longsan
In reply to this post by Vijay Sekhri
Why do you use 15 replicas?

The more replicas you have, the slower indexing gets.