question about updates to shard leaders only

Bernd Fehling
Hi list,

while moving from single-core master/slave to SolrCloud (multi core/node
with leader/replica), I want to change my SolrJ loading, because
ConcurrentUpdateSolrClient isn't cloud aware and this has a performance
impact.
I want to use CloudSolrClient with LBHttpSolrClient, and updates
should go only to shard leaders.

Question: what is the difference between sendUpdatesOnlyToShardLeaders
and sendDirectUpdatesToShardLeadersOnly?

Regards,
Bernd

Re: question about updates to shard leaders only

Erick Erickson
You may not need to deal with any of this.

The default CloudSolrClient call creates a new LBHttpSolrClient for
you. So unless you're doing something custom with any LBHttpSolrClient
you create, you don't need to create one yourself.

Second, the default for CloudSolrClient.add() is to split the list of
documents you provide into sub-lists, each holding the docs destined
for a particular shard, and to send each sub-list to that shard's leader.

Do the defaults not work for you?

Best,
Erick


Re: question about updates to shard leaders only

Mark Miller-3
It's been a while since I've been in this deeply, but it should be
something like:

sendUpdatesOnlyToShardLeaders will select the leaders for each shard as the
load-balanced targets for updates. The updates may not go to the *right*
leader, but only leaders will be chosen; followers (non-leader
replicas) will not be part of the load-balanced server list.

sendDirectUpdatesToShardLeadersOnly is the same in that followers are not
part of the mix, but in addition updates are sent directly to the right
leader, as long as the right hashing field is specified (id by default).
We hash the id client side and know where it should end up.

Ideally, you want sendDirectUpdatesToShardLeadersOnly set to true and
configured with the correct id field.
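
The difference between the two settings can be pictured with a toy model. The sketch below is not SolrJ code: `shardFor` uses `String.hashCode()` as a stand-in for Solr's real hash-range routing, the round-robin choice is an assumption about how the load balancer picks a leader, and all class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of client-side routing; NOT SolrJ and NOT Solr's real hash function. */
class LeaderRoutingSketch {

    /** Stand-in for Solr's hash-range lookup: maps a document id to a shard index. */
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    /**
     * Models sendDirectUpdatesToShardLeadersOnly: split a batch into one sub-list
     * per shard so each sub-list can go straight to the leader that owns it.
     */
    static Map<Integer, List<String>> splitByShard(List<String> docIds, int numShards) {
        Map<Integer, List<String>> perShard = new LinkedHashMap<>();
        for (String id : docIds) {
            perShard.computeIfAbsent(shardFor(id, numShards), k -> new ArrayList<>()).add(id);
        }
        return perShard;
    }

    /**
     * Models sendUpdatesOnlyToShardLeaders: the whole batch goes to one leader
     * (picked round-robin here), which may not own every document in it.
     */
    static int roundRobinLeader(int requestNumber, int numShards) {
        return Math.floorMod(requestNumber, numShards);
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("doc-1", "doc-2", "doc-3", "doc-4", "doc-5", "doc-6");
        System.out.println("direct routing: " + splitByShard(batch, 3));
        System.out.println("leader-only, request 0 goes to shard " + roundRobinLeader(0, 3));
    }
}
```

In the direct case the client needs the id field to compute the target shard; in the leader-only case any leader will do, and the receiving leader has to forward documents it does not own.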

- Mark


Re: question about updates to shard leaders only

Bernd Fehling
OK, I now have CloudSolrClient running with SolrJ, but it seems
a bit slower than ConcurrentUpdateSolrClient.
This was not expected.
The logs show that CloudSolrClient sends the docs only to the leaders.

So the only advantage of CloudSolrClient is that it is "Cloud aware"?

With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
With CloudSolrClient I get only about 1200 docs/sec.

The system monitoring shows that with CloudSolrClient all nodes and cores
are under heavy load. I thought that only the leaders are under load
until any commit and then replicate to the other replicas.
And that the replicas which are not leaders have capacity to answer search requests.

I think I still don't get the advantage of CloudSolrClient?

Regards,
Bernd




Re: question about updates to shard leaders only

Bernd Fehling
Thanks, solved, performance is good now.

Regards,
Bernd


Re: question about updates to shard leaders only

Erick Erickson
What did you do to solve your performance problem?

Batching updates is one thing that helps performance.

> I thought that only the leaders are under load
> until any commit and then replicate to the other replicas.

True if (and only if) you're using PULL or TLOG replicas.
When using the default NRT replicas, every replica indexes
the docs; it doesn't matter whether it is the leader or a follower.
That's required for NRT. Using CloudSolrClient has no bearing
on that functionality.

Best,
Erick


Re: question about updates to shard leaders only

Bernd Fehling
Hi Erick,

yes indeed, batching solved it.
I used ConcurrentUpdateSolrClient with a queue size of 10000, but
CloudSolrClient doesn't have this feature.
I have built my own queue now.

Ah!!! So I am obviously using the default NRT replicas but don't actually need them, because
I don't have any NRT data to index. A latency of several hours is OK for me.
Currently I'm testing with a 3x3 core-cluster (3 servers, 3 cores per server).

I also tested with a 3x3 node-cluster (3 servers, 3 nodes per server), which performed
better, with less influence from garbage collection.

I have to read more about PULL and TLOG replicas, how to set them up and so on.
If it is too complex I will stay with NRT; indexing runs during the night anyway.
Thanks for pointing this out.

Regards,
Bernd



Re: question about updates to shard leaders only

Erick Erickson
You might find this useful:

https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

One tricky bit: Assuming docs have a random distribution amongst
shards, you should batch so at least 100 docs go to each _shard_. You
can see from the link that the speedup is mostly going from 1 to 100.
So if you have 5 shards, I'd create batches of at least 500. That was
a fairly simple test with stupid-simple docs. Large complicated
documents wouldn't show the same curve.
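
The sizing rule above is simple arithmetic. A minimal sketch, assuming documents spread evenly across shards (the class and method names are made up for illustration):

```java
/** Back-of-the-envelope batch sizing, assuming an even spread of docs across shards. */
class BatchSizing {

    /** Smallest client-side batch that gives each shard at least docsPerShard documents. */
    static int minBatchSize(int numShards, int docsPerShard) {
        return numShards * docsPerShard;
    }

    /** Approximate number of documents each shard receives from one batch. */
    static int docsPerShard(int batchSize, int numShards) {
        return batchSize / numShards;  // integer division, rounds down
    }

    public static void main(String[] args) {
        System.out.println(minBatchSize(5, 100));   // prints 500
        System.out.println(docsPerShard(1000, 3));  // prints 333
    }
}
```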

Setup for PULL and TLOG isn't hard, just specify the number of TLOG or
PULL replicas you want at collection creation time. NOTE: this is only
available in Solr 7.x. See:
https://lucene.apache.org/solr/guide/7_3/shards-and-indexing-data-in-solrcloud.html#types-of-replicas
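
As a rough sketch of what specifying replica types at creation time can look like, the snippet below builds a Collections API CREATE URL with the `nrtReplicas`, `tlogReplicas` and `pullReplicas` parameters from the Solr 7 reference guide. The base URL, collection name and replica counts are placeholders, and the collection name is assumed to be URL-safe.

```java
/** Builds a Collections API CREATE URL; host, collection name and counts are placeholders. */
class CreateCollectionUrl {

    static String createUrl(String solrBaseUrl, String collection, int numShards,
                            int tlogReplicas, int pullReplicas) {
        return solrBaseUrl + "/admin/collections?action=CREATE"
                + "&name=" + collection
                + "&numShards=" + numShards
                + "&nrtReplicas=0"                  // no NRT replicas in this sketch
                + "&tlogReplicas=" + tlogReplicas   // per shard; one of them becomes leader
                + "&pullReplicas=" + pullReplicas;  // per shard; these only copy the index
    }

    public static void main(String[] args) {
        System.out.println(createUrl("http://localhost:8983/solr", "mycollection", 3, 2, 1));
    }
}
```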

About creating your own queue, mine usually look like this:
List<SolrInputDocument> list = new ArrayList<>();
while (more docs) {          // pseudocode: loop while input remains
  list.add(new_doc);
  if (list.size() > X) {     // X = batch size
      client.add(list);
      list.clear();
  }
}

Not exactly a sophisticated queue ;).....


Re: question about updates to shard leaders only

Bernd Fehling


Am 15.05.2018 um 14:33 schrieb Erick Erickson:
> You might find this useful:
>
> https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

I have seen that already and can confirm it.
From my observations on a 3x3 cluster with 3 servers and my hardware:
- have at least 6 CPUs on each server to keep search performance during NRT indexing
- I tried batch/queue sizes between 100 and 10000.
--- with a batch size of 100 and nearly even distribution across 3
     shards I get about 33 docs per update per shard.
--- with a batch size of 1000 I get about 333 docs per update per shard
--- with a batch size of 10000 it can go up to 3333 docs per shard

Yes, the last is "it can go up to" because that size is obviously too high
and I get lots of smaller updates "FROMLEADER". So somewhere between
1000 and 10000 is the best size for my 3x3 cluster with my hardware.

Another observation in a 3x3 cluster: a multi-node setup (3 JVM instances of 4G per
server [3 nodes]) outperforms a multi-core setup (1 JVM instance of 12G per
server [3 cores]) because of the Java GC impact in the multi-core setup.
A multi-node setup at 60qps has nearly the same performance as a multi-core setup at 30qps.


>
> One tricky bit: Assuming docs have a random distribution amongst
> shards, you should batch so at least 100 docs go to each _shard_. You
> can see from the link that the speedup is mostly going from 1 to 100.
> So if you have 5 shards, I'd create batches of at least 500. That was
> a fairly simple test with stupid-simple docs. Large complicated
> documents wouldn't show the same curve.
>
> Setup for PULL and TLOG isn't hard, just specify the number of TLOG or
> PULL replicas you want at collection creation time. NOTE: this is only
> on Solr 7x. See:
> https://lucene.apache.org/solr/guide/7_3/shards-and-indexing-data-in-solrcloud.html#types-of-replicas

Unfortunately I'm still on Solr 6.4.2 and therefore have to stay with NRT.

>
> About creating your own queue, mine usually look like
> List<SolrInputDocument> list...
> while (more docs) {
>    list.add(new_doc);
>    if (list.size > X) {
>        client.add(list);
>        list.clear();
>    }
> }

Yes, mine looks similar, a recursive file traverser with for-loop over files.
But don't forget a final client.add(list) after the while-loop ;-)

--
*************************************************************
Bernd Fehling                    Bielefeld University Library
Dipl.-Inform. (FH)                LibTec - Library Technology
Universitätsstr. 25                  and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060       bernd.fehling(at)uni-bielefeld.de
           https://www.ub.uni-bielefeld.de/~befehl/

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************

Re: question about updates to shard leaders only

Shawn Heisey-2
On 5/15/2018 12:12 AM, Bernd Fehling wrote:
> OK, I have the CloudSolrClient with SolrJ now running but it seams
> a bit slower compared to ConcurrentUpdateSolrClient.
> This was not expected.
> The logs show that CloudSolrClient send the docs only to the leaders.
>
> So the only advantage of CloudSolrClient is that it is "Cloud aware"?
>
> With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
> With CloudSolrClient I get only about 1200 docs/sec.

ConcurrentUpdateSolrClient internally puts all indexing requests on a
queue and then can use multiple threads to do parallel indexing in the
background.  The design of the client has one big disadvantage -- it
returns control to your program immediately (before indexing actually
begins) and always indicates success.  All indexing errors are
swallowed.  They are logged, but the calling program is never informed
that any errors have occurred.

Like all other SolrClient implementations, CloudSolrClient is
thread-safe, but it is not multi-threaded unless YOU create multiple
threads that all use the same client object.  Full error handling is
possible with this client.  It is also fully cloud aware, adding and
removing Solr servers as the SolrCloud changes, without needing to be
reconfigured or recreated.
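
One way to get parallelism back while keeping error reporting is to share a single thread-safe client across a small thread pool. The sketch below uses only the JDK: `BatchSink` is a hypothetical stand-in for a call such as `client.add(batch)`, and the pool size is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class ParallelBatchSender {

    /** Hypothetical stand-in for a thread-safe client call such as client.add(batch). */
    interface BatchSink {
        void send(List<String> batch) throws Exception;
    }

    /**
     * Sends every batch through one shared sink from a fixed thread pool and
     * returns how many batches failed, so no error is silently swallowed.
     */
    static int sendAll(List<List<String>> batches, BatchSink sink, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> pending = new ArrayList<>();
        for (List<String> batch : batches) {
            pending.add(pool.submit(() -> {
                sink.send(batch);
                return null;
            }));
        }
        pool.shutdown();  // no new tasks; submitted ones still run to completion
        int failed = 0;
        for (Future<?> result : pending) {
            try {
                result.get();  // rethrows the sink's exception wrapped in ExecutionException
            } catch (ExecutionException e) {
                failed++;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for batches", e);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("d1", "d2"), Arrays.asList("d3", "d4"), Arrays.asList("d5"));
        AtomicInteger docsSent = new AtomicInteger();
        int failed = sendAll(batches, batch -> docsSent.addAndGet(batch.size()), 2);
        System.out.println(docsSent.get() + " docs sent, " + failed + " batches failed");
    }
}
```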

Thanks,
Shawn


Re: question about updates to shard leaders only

Erick Erickson
> But don't forget a final client.add(list) after the while-loop ;-)

Ha! But only "if (list.size() > 0)"

And then there was the memorable time I forgot the "list.clear()" when
I sent the batch and wondered why my indexing progress got slower and
slower...

Not to mention the time I re-used the same SolrInputDocument that got
bigger and bigger and bigger.....

Not to mention the other zillion screw-ups I've managed to perpetrate
in my career.... "Who wrote this stupid code? Oh, wait, it was me.
DON'T LOOK!!!"...
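
Putting the pitfalls above together (flush the last partial batch, but only if it is non-empty, and never let the batch list keep growing), a batching loop might look like the sketch below. It uses only the JDK. The `sink` parameter is a hypothetical stand-in for a call such as `client.add(batch)`, and starting a fresh list after each send is one alternative to `list.clear()`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class BatchingLoader {

    /**
     * Feeds docs to the sink in batches of batchSize and returns the number of
     * batches sent. The sink stands in for a call such as client.add(batch).
     */
    static <T> int load(Iterator<T> docs, int batchSize, Consumer<List<T>> sink) {
        List<T> batch = new ArrayList<>(batchSize);
        int batchesSent = 0;
        while (docs.hasNext()) {
            batch.add(docs.next());
            if (batch.size() >= batchSize) {
                sink.accept(batch);
                batchesSent++;
                // Start a fresh list so the batch cannot grow without bound.
                batch = new ArrayList<>(batchSize);
            }
        }
        if (!batch.isEmpty()) {  // final partial batch, sent only when non-empty
            sink.accept(batch);
            batchesSent++;
        }
        return batchesSent;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("d1", "d2", "d3", "d4", "d5");
        int sent = load(docs.iterator(), 2, batch -> System.out.println("sending " + batch));
        System.out.println(sent + " batches");
    }
}
```

With five documents and a batch size of 2, the loop sends two full batches and then one final batch holding the single remaining document.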

Astronomy anecdote....

Dale Vrabeck...was at a party with [Rudolph] Minkowski and Dale said
he’d heard about the astronomer who had exposed a plate all night and
then put it in the hypo first. Minkowski said, “It was three nights,
and it was me.”


Re: question about updates to shard leaders only

Mark Miller-3
Yeah, basically ConcurrentUpdateSolrClient is a shortcut to getting
multi-threaded bulk API updates out of the single-threaded, single-update API.
The downsides to this are: it is not cloud aware (you have to point it at
a server), you have to add special code to see if there are any errors, you
don't get any fine-grained error information back, and you still basically have
to break up updates into batches of success/fail units, but with fewer
guard rails.

If you want to bulk load, it usually makes much more sense to use the bulk
API on CloudSolrClient and treat the whole group of updates as a single
success/fail unit.

- Mark
