NRT vs TLOG bulk indexing performances

Dominique Bejean
Hi,

I ran some bulk indexing benchmarks to compare performance and resource
usage for NRT versus TLOG replicas.

Environment:
* SolrCloud with 4 Solr nodes (8 GB RAM, 4 GB heap)
* 1 collection with 2 shards x 2 replicas (all NRT or all TLOG)
* 1 core per Solr server

Indexing 10,000,000 documents from a single JSON file with the bin/post script
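
For readers who want to reproduce the two layouts, here is a minimal sketch of how the collections could be created through the Collections API. The node address and the collection names `bench_nrt` and `bench_tlog` are assumptions for illustration and do not come from the benchmark itself; `nrtReplicas` and `tlogReplicas` are the per-type replica counts accepted by the CREATE action.

```python
from urllib.parse import urlencode

# Assumed address of one Solr node; adjust to your cluster.
SOLR_URL = "http://localhost:8983/solr"


def create_collection_url(name, replica_type, num_shards=2, replicas_per_shard=2):
    """Build a Collections API CREATE URL where every replica has one type."""
    # The Collections API takes one count parameter per replica type.
    count_param = {"NRT": "nrtReplicas", "TLOG": "tlogReplicas"}[replica_type]
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        count_param: replicas_per_shard,
    }
    return f"{SOLR_URL}/admin/collections?{urlencode(params)}"


# Hypothetical collection names, one per benchmark run.
nrt_url = create_collection_url("bench_nrt", "NRT")
tlog_url = create_collection_url("bench_tlog", "TLOG")
print(nrt_url)
print(tlog_url)
```

The sketch only builds the URLs; sending them (for example with curl) is left out so that nothing touches a cluster by accident.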

If I compare NRT vs TLOG indexing, I see :

For collection created with all replicas as NRT

* Indexing time : 22 minutes
* GC times : identical on all nodes
* GC count : identical on all nodes
* Heap size : identical on all nodes
* CPU Load / CPU usage : identical on all nodes

For collection created with all replicas as TLOG

* Indexing time : 34 minutes
* GC times : identical on all nodes
* GC count : identical on all nodes
* Heap size : identical on all nodes
* CPU Load / CPU usage : identical on the leaders, divided by 4 on the
non-leader TLOG replicas


The conclusion seems to be that by using TLOG:

* You save CPU resources on non-leader nodes at index time
* The JVM heap and GC are the same
* Indexing performance is significantly worse with TLOG
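
To put a number on the slowdown, the two indexing times reported above (22 and 34 minutes) can be compared directly. This is only arithmetic on the posted figures; the function name is illustrative.

```python
def relative_increase(baseline_minutes, other_minutes):
    """Return how much longer `other` took than `baseline`, in percent."""
    return (other_minutes - baseline_minutes) / baseline_minutes * 100.0


nrt_minutes = 22   # all-NRT run reported above
tlog_minutes = 34  # all-TLOG run reported above

increase = relative_increase(nrt_minutes, tlog_minutes)
print(f"The TLOG run took {increase:.1f}% longer than the NRT run")
```

With these two figures the TLOG run comes out roughly 55% longer than the NRT run.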

I am disappointed with TLOG mode because of the much slower indexing time
and the unchanged JVM heap / GC usage.

Are these results consistent with what we should expect?
What could explain the poor batch indexing performance in TLOG mode?

I have Grafana graphs for all these metrics during the tests.

Regards.

Dominique

Re: NRT vs TLOG bulk indexing performances

Jörn Franke
Which Solr version are you using, and how many times did you repeat the test?

Re: NRT vs TLOG bulk indexing performances

Dominique Bejean
Hi Jörn,

I am using version 8.2.
I repeated the test twice for each mode.
I restarted the Solr nodes and deleted and recreated an empty collection
each time.

Regards.

Dominique


On Fri, 25 Oct 2019 at 09:20, Jörn Franke <[hidden email]> wrote:

> Which Solr version are you using and how often you repeated the test?

Re: NRT vs TLOG bulk indexing performances

Shawn Heisey-2
In reply to this post by Dominique Bejean
On 10/25/2019 1:16 AM, Dominique Bejean wrote:
> For collection created with all replicas as NRT
>
> * Indexing time : 22 minutes

<snip>

> For collection created with all replicas as TLOG
>
> * Indexing time : 34 minutes

NRT indexes simultaneously on all replicas.  So when indexing is done on
one, it is also done on all the others.

PULL and non-leader TLOG replicas must copy the index from the leader.
The leader will do the indexing and the other replicas will copy the
completed index from the leader.  This takes time.  If the index is
large, it can take a LOT of time, especially if the disks or network are
slow.  TLOG replicas can become leader and PULL replicas cannot.

What I would do personally is set two replicas for each shard to TLOG
and all the rest to PULL.  When a TLOG replica is acting as leader, it
will function exactly like an NRT replica.
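
As a rough sketch of that layout, PULL replicas could be added to an existing TLOG collection with the Collections API ADDREPLICA action and its `type` parameter. The node address, the collection name `bench_tlog`, and the default shard names `shard1` and `shard2` are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed address of one Solr node; adjust to your cluster.
SOLR_URL = "http://localhost:8983/solr"


def add_pull_replica_urls(collection, shards):
    """Build one ADDREPLICA URL per shard, each requesting a PULL replica."""
    urls = []
    for shard in shards:
        params = {
            "action": "ADDREPLICA",
            "collection": collection,
            "shard": shard,
            "type": "pull",  # replica type: nrt, tlog or pull
        }
        urls.append(f"{SOLR_URL}/admin/collections?{urlencode(params)}")
    return urls


# Hypothetical collection with two shards that already hold TLOG replicas.
pull_urls = add_pull_replica_urls("bench_tlog", ["shard1", "shard2"])
for url in pull_urls:
    print(url)
```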

> The conclusion seems to be that by using TLOG :
>
> * You save CPU resources on non leaders nodes at index time
> * The JVM Heap and GC are the same
> * Indexing performance ares really less with TLOG

Java works in such a way that it will always eventually allocate and use
the entire max heap that it is allowed.  It is not always possible to
determine how much heap is truly needed, though analyzing large GC logs
will sometimes reveal that info.

Non-leader replicas will probably require less heap if they are TLOG or
PULL.  I cannot say how much less, that will be something that has to be
determined.  Those replicas will also use less CPU.

With newer Solr versions, you can ask SolrCloud to prefer PULL replicas
for querying, so queries will be targeted to those replicas, unless they
all go down, in which case it will go to non-preferred replica types.  I
do not know how to do this, I only know that it is possible.

Thanks,
Shawn

Re: NRT vs TLOG bulk indexing performances

Dominique Bejean
Shawn,

So, as I understand it, while a non-leader TLOG replica is copying the
index from the leader, the leader stops indexing.
A one-shot heavy bulk indexing job should therefore be much more affected
than continuous light indexing.

Regards.

Dominique


Re: NRT vs TLOG bulk indexing performances

Ere Maijala
In reply to this post by Shawn Heisey-2
Shawn Heisey wrote on 25.10.2019 at 14:54:
> With newer Solr versions, you can ask SolrCloud to prefer PULL replicas
> for querying, so queries will be targeted to those replicas, unless they
> all go down, in which case it will go to non-preferred replica types.  I
> do not know how to do this, I only know that it is possible.
It's controlled by the shards.preference parameter. Docs:

https://lucene.apache.org/solr/guide/8_2/distributed-requests.html#shards-preference-parameter

It also allows one to prefer certain replica locations. This could be
useful e.g. if you want to avoid the indexing server handling queries.
It can also be used to prefer local replicas to minimize network access.
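
A minimal sketch of what such requests could look like, assuming a node at `localhost:8983` and a hypothetical collection named `bench_tlog`. The `replica.type` and `replica.location` properties are the ones described in the linked documentation; everything else is illustrative.

```python
from urllib.parse import urlencode

# Assumed address of one Solr node; adjust to your cluster.
SOLR_URL = "http://localhost:8983/solr"


def select_url(collection, query, preference):
    """Build a /select URL that carries a shards.preference value."""
    params = {"q": query, "shards.preference": preference}
    return f"{SOLR_URL}/{collection}/select?{urlencode(params)}"


# Prefer PULL replicas, then TLOG replicas, before any other type.
pull_first = select_url("bench_tlog", "*:*", "replica.type:PULL,replica.type:TLOG")
# Prefer replicas hosted on the node that receives the request.
local_first = select_url("bench_tlog", "*:*", "replica.location:local")
print(pull_first)
print(local_first)
```

The preference is only a hint: if no replica of the preferred kind is available, the request still goes to the other replicas.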

--Ere

--
Ere Maijala
Kansalliskirjasto / The National Library of Finland

Re: NRT vs TLOG bulk indexing performances

Erick Erickson
In reply to this post by Dominique Bejean
I’m also surprised that you see a slowdown; it’s worth investigating.

Let’s take the NRT case with only a leader. I’ve seen the NRT indexing time increase when even a single follower was added (30-40% in this case). We believed that the issue was the time the leader sat waiting around for the follower to acknowledge receipt of the documents. Also note that these were very short documents.

You’d still pay that price with more than one TLOG replica. But again, I’d expect the two times to be roughly equivalent.

Indexing does not stop during index replication. That said, if you commit very frequently, you’ll be pushing lots of info around the network. Was your CPU running hot in the TLOG case or idling? If idling, then Solr isn’t getting fed fast enough. Perhaps there’s increased network traffic with the TLOG replicas replicating changed segments and that’s slowing down ingestion?

It’d be interesting to index to NRT, leader-only and also a single TLOG collection.


Best,
Erick

Re: NRT vs TLOG bulk indexing performances

Erick Erickson
In reply to this post by Dominique Bejean
"I understand that while non leader TLOG is copying the index from
leader, the leader stop indexing"

This _better_ not be happening. If you can demonstrate this let’s open a JIRA.

> On Oct 25, 2019, at 8:28 AM, Dominique Bejean <[hidden email]> wrote:
>
> I understand that while non leader TLOG is copying the index from
> leader, the leader stop indexing

Re: NRT vs TLOG bulk indexing performances

Dominique Bejean
In reply to this post by Erick Erickson
Hi,

Thank you, Erick, for your response.

My documents are small. Here is a sample CSV file:
http://gofile.me/2dlpH/66hv2NPhJ

In the TLOG case, the CPU is neither running hot nor idling.

On leaders:

   - 1m load average between 1.5 and 2.5 (4 CPU cores)
   - CPU % between 20% and 50%, with an average of 30%
   - CPU I/O wait % average: 2.5

On followers:

   - 1m load average between 0.5 and 2.0 (4 CPU cores)
   - CPU % between 5% and 35%, with an average of 15%
   - CPU I/O wait % average: 2.0


I ran more tests. The difference is not always as large as in my first
tests:

   - One shard, leader only (NRT or TLOG): 36 minutes
   - All NRT: between 23 and 27 minutes
   - All TLOG: between 28 and 34 minutes


I also changed the autoCommit maxTime from 15000 to 30000 in order to get
the 28 minutes in TLOG mode.
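
One way to apply such a change without editing solrconfig.xml by hand is the Config API `set-property` command. This is a sketch under that assumption; the post does not say how the setting was actually changed. The payload would be sent as a POST to the collection's `/config` endpoint.

```python
import json


def autocommit_payload(max_time_ms):
    """Build a Config API body that sets the hard autoCommit interval."""
    return json.dumps(
        {"set-property": {"updateHandler.autoCommit.maxTime": max_time_ms}}
    )


# 30000 ms is the value reported above; 15000 ms was the earlier setting.
payload = autocommit_payload(30000)
print(payload)
```

A longer hard commit interval means fewer commits during the bulk load, which is in line with the earlier remark that frequent commits push more data around the network.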

With one shard and no replica, creating the collection as NRT or as TLOG
gives the same indexing time and the same CPU usage.

My impression is that using TLOG replicas increases indexing time by 10%
to 20%, depending on the autoCommit maxTime setting.

Regards

Dominique


On Fri, 25 Oct 2019 at 15:46, Erick Erickson <[hidden email]> wrote:

> I’m also surprised that you see a slowdown; it’s worth investigating.