is ConcurrentUpdateSolrClient.Builder thread safe?

Bernd Fehling
Hi list,

after some strange search results I was trying to locate the problem
and it turned out that it starts with bulk loading with SolrJ
and ConcurrentUpdateSolrClient.Builder with several threads.

I assume that ConcurrentUpdateSolrClient.Builder is _NOT_ thread safe
with regard to the docs sent to the indexer?

It feels like documents with the same doc_id are not always indexed
in the order they are sent to the indexer. It behaves like some kind of random generator.

Example:
file LR00010.xml
<doc>
  <str name="id">my_uniq_id_1234</str>
  <date name="date">2017-03-28T23:21:40Z</date>
  ...

file LR01000.xml
<doc>
  <str name="id">my_uniq_id_1234</str>
  <date name="date">2017-04-26T00:42:10Z</date>
  ...


The files are in the same subdir.
They are loaded, processed, and sent to the indexer in ascending natural order.
LR00010.xml is handled way before LR01000.xml.

But the result is that sometimes the older doc of LR00010.xml is in the index
and the newer doc from LR01000.xml is marked as deleted, and sometimes the
newer doc of LR01000.xml is in the index and the older doc from LR00010.xml
is marked as deleted.

Has anyone seen this?

I could try ConcurrentUpdateSolrClient.Builder with only one thread and
see if the problem still exists.

Regards
Bernd



Re: is ConcurrentUpdateSolrClient.Builder thread safe?

Shawn Heisey-2
On 1/10/2018 8:33 AM, Bernd Fehling wrote:
> after some strange search results I was trying to locate the problem
> and it turned out that it starts with bulk loading with SolrJ
> and ConcurrentUpdateSolrClient.Builder with several threads.
>
> I assume that ConcurrentUpdateSolrClient.Builder is _NOT_ thread safe
> with regard to the docs sent to the indexer?

Why would you need the Builder to be threadsafe?

The actual client object (ConcurrentUpdateSolrClient) should be
perfectly threadsafe, but the Builder probably isn't, and I can't think
of any reason to try and use it with multiple threads.  In a
well-constructed program, you will use the Builder exactly once, in an
initialization thread, and then have all the indexing threads use the
client object that the Builder creates.

I hope you're aware that the concurrent client swallows all indexing
errors and does not tell your program about them.

Thanks,
Shawn


Re: is ConcurrentUpdateSolrClient.Builder thread safe?

Bernd Fehling
Hi Shawn,

from your answer I see that you are obviously not using ConcurrentUpdateSolrClient.
I didn't say that I use ConcurrentUpdateSolrClient in multiple threads.
I said that ConcurrentUpdateSolrClient.Builder has a method, "withThreadCount",
which sets how many threads empty the client's queue.
This is useful for bulk loading huge data volumes or replaying a backup into the index.

As I can see on the indexer with infoStream, there are _no_ indexing errors.

I have now tried several times with one thread and everything was fine.
The newer docs replaced the older docs (which were marked deleted) in the index.
With a "threadCount" greater than 1 for emptying the queue there are problems with
ConcurrentUpdateSolrClient.

This will never pass a Jepsen test and I call it _NOT_ thread safe.

I haven't looked into the code yet to see whether the queue is FIFO; if it is not,
that would be stupid.

Regards
Bernd



Re: is ConcurrentUpdateSolrClient.Builder thread safe?

Shawn Heisey-2
On 1/11/2018 12:05 AM, Bernd Fehling wrote:
> This will never pass a Jepsen test and I call it _NOT_ thread safe.
>
> I haven't looked into the code yet, to see if the queue is FIFO, otherwise
> this would be stupid.

I was not thinking about order of operations when I said that the client
was threadsafe.  I meant that one client object can be used
simultaneously by multiple threads without anything getting
cross-contaminated within the program.

If you are absolutely reliant on operations happening in a precise
order, such that a document could get indexed in one request and then
replaced (or updated) with a later request, you should not use the
concurrent client.  You could define it with a single thread, but if you
do that, then the concurrent client doesn't work any faster than the
standard client.

When a concurrent client is built, it creates the specified number of
processing threads.  When updates are sent, they are added to an
internal queue.  The processing threads will handle requests from the
queue as long as the queue is not empty.
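
The queue-and-workers model described above can be sketched with plain JDK classes. This is not SolrJ code: `QueueDrainSketch`, `drain`, and the string "requests" are hypothetical stand-ins, and the real client's internals may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.LinkedBlockingQueue;

class QueueDrainSketch {
    /** Hypothetical stand-in for the client: threadCount workers drain one shared queue. */
    static List<String> drain(List<String> requests, int threadCount) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(requests);
        Queue<String> completed = new ConcurrentLinkedQueue<>();
        Thread[] workers = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            workers[i] = new Thread(() -> {
                String request;
                // Each worker keeps taking requests until the queue is empty.
                while ((request = queue.poll()) != null) {
                    completed.add(request); // a real worker would send the request to Solr here
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while draining", e);
            }
        }
        // Completion order; equal to submission order only when threadCount is 1.
        return new ArrayList<>(completed);
    }
}
```

With one worker the completion order matches the submission order. With more workers, every request is still handled exactly once, but nothing in this model fixes the order in which they complete.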

Those threads will process the requests they have been assigned
simultaneously.  Although I'm sure that each thread pulls requests off
the queue in a FIFO manner, I have a scenario for you to consider.  This
scenario is not just an intellectual exercise, it is the kind of thing
that can easily happen in the wild.

Let's say that when document X is initially indexed, it is at position
997 in a batch of 1000 documents.  Then two update requests later, the
new version of document X is at position 2 in another batch of 1000
documents.

If there are at least three threads in the concurrent client, those
update requests may begin execution at nearly the same time.  In that
situation, Solr is likely to index document X in the request added later
before it indexes document X in the request added earlier, resulting in
outdated information ending up in the index.
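
The scenario can be replayed as a small deterministic timeline. This is a simplified model, not Solr's behavior: it assumes one document is applied per tick, all workers run at the same speed, a batch is applied in document order, and the last write for an id wins. `BatchTimelineSketch` and `survivingBatch` are hypothetical names, and `threadCount` must be at least 1.

```java
import java.util.PriorityQueue;

class BatchTimelineSketch {
    /**
     * positionOfX[b] is the 1-based position of document X in batch b, or 0 if absent.
     * Returns the index of the batch whose copy of X is written last and therefore
     * survives, or -1 if X never appears.
     */
    static int survivingBatch(int threadCount, int batchSize, int[] positionOfX) {
        // Each worker is represented by the tick at which it becomes free.
        PriorityQueue<Long> freeAt = new PriorityQueue<>();
        for (int i = 0; i < threadCount; i++) {
            freeAt.add(0L);
        }
        long latestWrite = -1;
        int survivor = -1;
        for (int b = 0; b < positionOfX.length; b++) { // batches leave the queue in FIFO order
            long start = freeAt.poll();                // the earliest free worker takes the batch
            if (positionOfX[b] > 0) {
                long writeTick = start + positionOfX[b];
                if (writeTick > latestWrite) {
                    latestWrite = writeTick;
                    survivor = b;
                }
            }
            freeAt.add(start + batchSize);
        }
        return survivor;
    }

    public static void main(String[] args) {
        int[] positions = {997, 0, 2}; // X at position 997 in batch 0 and position 2 in batch 2
        System.out.println(survivingBatch(1, 1000, positions)); // 2: the newer copy wins
        System.out.println(survivingBatch(3, 1000, positions)); // 0: the stale copy wins
    }
}
```

Under these assumptions the queue is strictly FIFO in both runs; only the number of batches in flight at once changes the outcome.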

The same thing can happen even with a non-concurrent client when it is
used in a multi-threaded manner.

Preserving order of operations cannot be guaranteed if there are
multiple threads.  It could be possible to add some VERY sophisticated
synchronization capabilities, but writing code to do that would be very
difficult, and it wouldn't be trivial to use either.

Thanks,
Shawn

Re: is ConcurrentUpdateSolrClient.Builder thread safe?

Bernd Fehling
To sum it up, there is no way to do bulk loading in Solr, because the order
of operations is not preserved.
Solr can only support bulk loading if you really have unique data, right?

By the way, the queue used is java.util.concurrent.BlockingQueue.
Changing that to ArrayBlockingQueue (to force FIFO) would not really help, I guess,
because the bottleneck is not reading the content from the filesystem but
analyzing and indexing.

Are there any other options for bulk loading?

You say "If there are at least three threads in the concurrent client...", but
would two threads work?

How do other users do bulk loading from archived backups while preserving the order?
I can't believe that I'm the only one on earth with this need.

Regards
Bernd



Re: is ConcurrentUpdateSolrClient.Builder thread safe?

Shawn Heisey-2
On 1/11/2018 1:38 AM, Bernd Fehling wrote:
> To sum it up, there is no way to do bulk loading in Solr, because the order
> of operations is not preserved.
> Solr can only support bulk loading if you really have unique data, right?

Bulk loading implies that every document is inserted exactly once and
that there are no other operations, like updates or deletes.  If there
are other operations, then in my mind, it's not bulk loading.
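
If a change log can be collapsed beforehand so that each id appears only once, the load fits that definition and the order between threads no longer matters. A minimal sketch of that preprocessing step, assuming the log fits in memory and contains only adds (no deletes); `LatestVersionSketch`, `Doc`, and `latestPerId` are hypothetical names, not part of SolrJ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LatestVersionSketch {
    /** Hypothetical record: a document id plus an opaque payload. */
    static final class Doc {
        final String id;
        final String payload;

        Doc(String id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    /** Keeps only the last occurrence of each id; the input must be in log order. */
    static List<Doc> latestPerId(List<Doc> changeLog) {
        Map<String, Doc> latest = new LinkedHashMap<>();
        for (Doc doc : changeLog) {
            latest.put(doc.id, doc); // a later entry overwrites an earlier one with the same id
        }
        return new ArrayList<>(latest.values());
    }
}
```

Applied to the two files from the first message, only the 2017-04-26 version of my_uniq_id_1234 would remain to be sent.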

> By the way, the queue used is java.util.concurrent.BlockingQueue.
> Changing that to ArrayBlockingQueue (to force FIFO) would not really help, I guess.

Correct, the issue is that updates are processed simultaneously.  Making
absolutely sure that removal is FIFO wouldn't make any difference.
Although I think that the current implementation is probably just as
FIFO as the array implementation.

> You say "If there are at least three threads in the concurrent client...", but
> would two threads work?

The thread count of three was specific to the exact scenario I
described, where update 1 contains the initial indexing and update 3
(two updates later) contains the new version.  If it were update 1 and
update 7, then there would need to be a thread count of seven to see the
problem.
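
That rule of thumb can be turned into a rough pre-flight check: an id is at risk when it appears in two update requests that are fewer than threadCount requests apart. This is only a sketch of the reasoning above, assuming requests of similar duration and that order within a single request is preserved; `ReorderRiskSketch` and `idsAtRisk` are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ReorderRiskSketch {
    /** Returns ids that occur in two batches fewer than threadCount batches apart. */
    static Set<String> idsAtRisk(List<List<String>> batches, int threadCount) {
        Map<String, Integer> lastSeen = new HashMap<>();
        Set<String> atRisk = new TreeSet<>();
        for (int b = 0; b < batches.size(); b++) {
            for (String id : batches.get(b)) {
                Integer previous = lastSeen.put(id, b);
                // Update 1 and update 3 are two apart, so three threads can run them together.
                if (previous != null && previous != b && b - previous < threadCount) {
                    atRisk.add(id);
                }
            }
        }
        return atRisk;
    }
}
```

An empty result does not prove the load is safe, since real requests vary in duration; a non-empty result shows where reordering is plausible.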

> How do other users do bulk loading from archived backups while preserving the order?
> I can't believe that I'm the only one on earth with this need.

If the backup is a log of changes rather than an info dump, then the
only reliable way you could guarantee correct operation is to do the
indexing with one thread.  But then indexing will be slower, possibly a
LOT slower.

Thanks,
Shawn