Possible synchronization bug in Solr reader

Possible synchronization bug in Solr reader

Bram Biesbrouck-2
Hi all,

I think I might have discovered a synchronization bug when ingesting a lot
of data into Solr, but want to check with the specialists first ;-)

I'm using a small custom-written map/reduce framework that boots
20-something threads to do some heavy data-preparation processing. When
this processing is done, the results of these threads are gathered in a
reduce step, where they are ingested into an (embedded) Solr instance. To
maximize throughput, I'm ingesting the data in parallel in a couple of
threads of their own and this is where I run into a synchronization error.

As with all synchronization bugs, it only happens some of the time and is
hard to debug, but I think I managed to put my finger on the root cause
(I'm using Solr 8.3):

In class org.apache.lucene.index.CodecReader, an NPE is thrown on line 84:
getFieldsReader().visitDocument(docID, visitor);

The issue is that the getFieldsReader() getter is backed by a ThreadLocal
(more specifically,
org.apache.lucene.index.SegmentCoreReaders.fieldsReaderLocal) that seems to
be released (set to null) automatically somewhere and then read afterwards,
without any synchronization between the two.
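
To make the failure mode concrete, here is a minimal sketch in plain Java.
It is not Lucene code: FakeReader, READER, getReader and describe are
hypothetical names made up for illustration. It only shows that a getter
backed by a ThreadLocal returns null on any thread whose slot is empty or
has been cleared, so a chained call on the result fails with an NPE.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ThreadLocalGetterSketch {
    // Hypothetical stand-in for a per-thread reader; not a Lucene class.
    static final class FakeReader {
        String visit(int docId) {
            return "doc-" + docId;
        }
    }

    // No initialValue(): get() returns null until set() runs on the calling thread.
    static final ThreadLocal<FakeReader> READER = new ThreadLocal<>();

    static FakeReader getReader() {
        return READER.get();
    }

    // Chained call in the style of getFieldsReader().visitDocument(...).
    static String describe(int docId) {
        try {
            return getReader().visit(docId);
        } catch (NullPointerException e) {
            return "NPE";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        READER.set(new FakeReader());
        System.out.println(describe(1)); // slot is filled on the main thread

        // A different thread has its own, still empty, slot.
        AtomicReference<String> other = new AtomicReference<>();
        Thread t = new Thread(() -> other.set(describe(2)));
        t.start();
        t.join();
        System.out.println(other.get());

        // Clearing the slot on this thread makes the getter return null again.
        READER.remove();
        System.out.println(describe(3));
    }
}
```

Because every thread has its own slot, a lock shared between threads would
not change the outcome here; the null has to come from something that
empties the slot (or the value behind it) for the very thread that reads it.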

I don't think I should set any resource locks of my own, since I'm only
using the SolrJ API and the /update endpoint.

I know this is quite a low-level question, but could anyone point me in the
right direction to further investigate this issue? I.e., what could be the
reason the reader is released out-of-sync?

best,

b.

Re: Possible synchronization bug in Solr reader

Bram Biesbrouck-2
Please allow me to answer my own question.
I was using the ThreadLocalCleaner
<https://github.com/apache/sling-org-apache-sling-commons-threads/blob/master/src/main/java/org/apache/sling/commons/threads/impl/ThreadLocalCleaner.java>
class from the Apache Sling project, which is a very useful (but dangerous)
tool.
Bottom line: it doesn't cope well with weak references in ThreadLocals, such
as those in Lucene's CloseableThreadLocal class, which Solr relies on.
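
For readers who want to see why weak references make this fragile, here is
a simplified sketch of the general pattern: the ThreadLocal holds only a
WeakReference, and a separate map holds the strong reference. This is not
Lucene's actual implementation and not a reproduction of what
ThreadLocalCleaner does; the class name and simulateExternalCleanup() are
hypothetical, and the explicit clear() merely stands in for the garbage
collector reclaiming a value that is no longer strongly reachable.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class WeakThreadLocalSketch<T> {
    // The per-thread slot only holds a weak reference to the value.
    private final ThreadLocal<WeakReference<T>> slot = new ThreadLocal<>();
    // Strong references that keep the values alive while they are in use.
    private final Map<Thread, T> hardRefs = new ConcurrentHashMap<>();

    public void set(T value) {
        slot.set(new WeakReference<>(value));
        hardRefs.put(Thread.currentThread(), value);
    }

    public T get() {
        WeakReference<T> ref = slot.get();
        // The slot can still be present while its referent is already gone.
        return ref == null ? null : ref.get();
    }

    // Hypothetical: models outside interference with the bookkeeping.
    void simulateExternalCleanup() {
        hardRefs.remove(Thread.currentThread());
        WeakReference<T> ref = slot.get();
        if (ref != null) {
            ref.clear(); // stands in for GC collecting the weakly held value
        }
    }

    public static void main(String[] args) {
        WeakThreadLocalSketch<String> local = new WeakThreadLocalSketch<>();
        local.set("reader");
        System.out.println(local.get()); // prints "reader"
        local.simulateExternalCleanup();
        System.out.println(local.get()); // prints "null"
    }
}
```

The point is only that the slot and the value it points to can get out of
step: once the strong reference disappears, get() can return null even
though the thread-local entry itself is still there.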

b.


On Tue, Nov 19, 2019 at 4:34 PM Bram Biesbrouck <
[hidden email]> wrote:
