Solrj : ConcurrentUpdateSolrClient based on QueueSize and Time

Santosh Narayan
Hi all,
I'm using ConcurrentUpdateSolrClient to push data into Solr. Currently, I'm
initializing it as follows:

ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient.Builder(serverUrl)
    .withThreadCount(100)
    .withQueueSize(50)
    .build();

This works fine when 50 requests come in within a short span of time (a
few seconds). The problem is when there aren't many requests. If few
requests arrive for, say, 5 minutes, the queue size may not reach 50
and the data is not sent to the Solr server. Is there a way to add
another condition, so that data is sent either when the queue size reaches
50 or after a timeout of 60 seconds, whichever comes first? That way, when
there are fewer requests, the records that came in would still get pushed
to the server every 60 seconds.

Thanks in advance for your guidance.

Re: Solrj : ConcurrentUpdateSolrClient based on QueueSize and Time

Shawn Heisey
On 2/21/2018 1:21 AM, Santosh Narayan wrote:

> I'm using ConcurrentUpdateSolrClient to push data into Solr. Currently, I'm
> initializing it as follows:
>
> ConcurrentUpdateSolrClient client = new ConcurrentUpdateSolrClient.Builder(serverUrl)
>     .withThreadCount(100)
>     .withQueueSize(50)
>     .build();
>
> This works fine when 50 requests come in within a short span of time (a
> few seconds). The problem is when there aren't many requests. If few
> requests arrive for, say, 5 minutes, the queue size may not reach 50
> and the data is not sent to the Solr server. Is there a way to add
> another condition, so that data is sent either when the queue size reaches
> 50 or after a timeout of 60 seconds, whichever comes first? That way, when
> there are fewer requests, the records that came in would still get pushed
> to the server every 60 seconds.

The client should begin processing requests as soon as they are added,
not when the queue fills up.  If you're seeing something different, then
either there's a bug in ConcurrentUpdateSolrClient or your code is doing
something very unusual.  Can you share the rest of the code using that
client object?  What version of SolrJ are you using?

Thanks,
Shawn


Re: Solrj : ConcurrentUpdateSolrClient based on QueueSize and Time

Santosh Narayan
Hi Shawn,
Maybe it is my understanding of the documentation. As per the
JavaDoc, ConcurrentUpdateSolrClient buffers all added documents and
writes them into open HTTP connections.

So I thought that this class would buffer documents on the client side
until the QueueSize is reached and then send all the cached documents
together in one HTTP request. Is this not the case?


Re: Solrj : ConcurrentUpdateSolrClient based on QueueSize and Time

Shawn Heisey-2
On 2/21/2018 7:41 AM, Santosh Narayan wrote:
> Maybe it is my understanding of the documentation. As per the
> JavaDoc, ConcurrentUpdateSolrClient buffers all added documents and
> writes them into open HTTP connections.
>
> So I thought that this class would buffer documents on the client side
> until the QueueSize is reached and then send all the cached documents
> together in one HTTP request. Is this not the case?

That's not how it's designed.

What ConcurrentUpdateSolrClient does differently than HttpSolrClient or
CloudSolrClient is return control immediately to your program when you
send an update, and begin processing that update in the background.  If
you send a LOT of updates very quickly, then the queue will get larger,
and will typically be processed in parallel by multiple threads.  The
client won't wait for the queue to fill.  Processing of the first update
you send should begin right after you add it.
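
As a rough model of the behaviour described above (not SolrJ's actual implementation), the following plain-Java sketch hands each document to a background worker as soon as `add` is called, without waiting for any queue to fill. The class `BackgroundSender`, its `transport` callback, and the semaphore used to bound pending documents are assumptions made purely for illustration; only the ideas of a queue size, a thread count, and a `blockUntilFinished` call echo the real client.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.function.Consumer;

/** Hypothetical stand-in for a concurrent update client; not SolrJ code. */
final class BackgroundSender implements AutoCloseable {
  private final int queueSize;
  private final Semaphore slots;            // one permit per pending document
  private final ExecutorService workers;    // background sender threads
  private final Consumer<String> transport; // assumed "send to server" callback

  BackgroundSender(int queueSize, int threadCount, Consumer<String> transport) {
    if (queueSize <= 0 || threadCount <= 0) {
      throw new IllegalArgumentException("queueSize and threadCount must be positive");
    }
    this.queueSize = queueSize;
    this.slots = new Semaphore(queueSize);
    this.transport = transport;
    // Daemon threads so a forgotten close() cannot keep the JVM alive.
    this.workers = Executors.newFixedThreadPool(threadCount, r -> {
      Thread t = new Thread(r, "background-sender");
      t.setDaemon(true);
      return t;
    });
  }

  /** Hands the document to a worker and returns; blocks only while queueSize documents are pending. */
  void add(String doc) {
    slots.acquireUninterruptibly();
    workers.execute(() -> {
      try {
        transport.accept(doc); // processing starts as soon as a worker is free
      } finally {
        slots.release();
      }
    });
  }

  /** Waits until every document handed to add() so far has been processed. */
  void blockUntilFinished() {
    slots.acquireUninterruptibly(queueSize);
    slots.release(queueSize);
  }

  @Override
  public void close() {
    blockUntilFinished();
    workers.shutdown();
  }

  public static void main(String[] args) {
    List<String> received = new CopyOnWriteArrayList<>();
    try (BackgroundSender sender = new BackgroundSender(50, 4, received::add)) {
      sender.add("doc-1"); // a single document, far below queueSize
      sender.blockUntilFinished();
      System.out.println("received " + received.size() + " of 1");
    }
  }
}
```

In this model the queue size only limits how many documents may be waiting at once; a single document is still sent right away, which is the point made above.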

Something to consider:  Because control is returned to your program
immediately, and the response is always a success, your program will
never be informed about any problems with your adds when you use the
concurrent client.  The concurrent client is a great choice for initial
bulk indexing, because it offers multi-threaded indexing without any
need to handle the threads yourself.  But you don't get any kind of
error handling.
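
To make the trade-off concrete, here is a small plain-Java sketch of a fire-and-forget sender. It is an illustration under stated assumptions, not SolrJ internals: `AsyncAdder`, `transport`, and `onError` are hypothetical names. The caller always sees success, and a failure only becomes visible if a callback is wired in. As far as I know, the closest hook in SolrJ is the overridable `handleError(Throwable)` method on ConcurrentUpdateSolrClient, but check the Javadoc for your version.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical fire-and-forget sender; illustrates the trade-off only. */
final class AsyncAdder implements AutoCloseable {
  private final ExecutorService worker;
  private final Consumer<String> transport;  // assumed "send to server" callback
  private final Consumer<Throwable> onError; // assumed error hook

  AsyncAdder(Consumer<String> transport, Consumer<Throwable> onError) {
    this.transport = transport;
    this.onError = onError;
    this.worker = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "async-adder");
      t.setDaemon(true);
      return t;
    });
  }

  /** Always returns true: the real outcome is only known later, on the worker thread. */
  boolean add(String doc) {
    worker.execute(() -> {
      try {
        transport.accept(doc);
      } catch (RuntimeException e) {
        onError.accept(e); // the only place a failure becomes visible
      }
    });
    return true;
  }

  /** Stops accepting work and waits (up to 10 seconds) for queued documents. */
  @Override
  public void close() {
    worker.shutdown();
    try {
      worker.awaitTermination(10, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    List<Throwable> errors = new CopyOnWriteArrayList<>();
    Consumer<String> failing = doc -> {
      throw new IllegalStateException("rejected: " + doc);
    };
    AsyncAdder adder = new AsyncAdder(failing, errors::add);
    boolean accepted = adder.add("bad-doc"); // reports success although the send will fail
    adder.close();
    System.out.println("accepted=" + accepted + ", errors=" + errors.size());
  }
}
```

If per-document error handling matters, a synchronous client call inside your own threads is the simpler route, since exceptions then reach the calling code directly.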

Thanks,
Shawn

Reply | Threaded
Open this post in threaded view
|

Re: Solrj : ConcurrentUpdateSolrClient based on QueueSize and Time

Santosh Narayan
Thanks for the explanation Shawn. Very helpful. I think I got misled by the
JavaDoc text for
*ConcurrentUpdateSolrClient.Builder.withQueueSize*
    /**
     * The number of documents to batch together before sending to Solr. If not set, this defaults to 10.
     */
    public Builder withQueueSize(int queueSize) {
      if (queueSize <= 0) {
        throw new IllegalArgumentException("queueSize must be a positive integer.");
      }
      this.queueSize = queueSize;
      return this;
    }




Re: Solrj : ConcurrentUpdateSolrClient based on QueueSize and Time

Jason Gerlowski
My apologies Santosh.  I added that comment a few releases back based
on a misunderstanding I've only recently been disabused of.  I will
correct it.

Anyway, Shawn's explanation above is correct.  The queueSize parameter
doesn't control batching, as he clarified.  Sorry for the trouble.

Best,

Jason
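
Since queueSize does not batch, the "50 documents or 60 seconds, whichever comes first" behaviour asked about at the top of the thread would have to live in application code. Below is a minimal plain-Java sketch of one way to do that. `SizeOrTimeBatcher` and its `sink` callback are hypothetical names, not part of SolrJ; in real use the sink might pass each batch to a SolrClient's `add(Collection<SolrInputDocument>)`, which is left out here so the example has no dependencies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper: flushes at maxBatch items or every flushMillis, whichever comes first. */
final class SizeOrTimeBatcher<T> implements AutoCloseable {
  private final int maxBatch;
  private final Consumer<List<T>> sink; // assumed "send this batch" callback
  private final List<T> buffer = new ArrayList<>();
  private final ScheduledExecutorService timer;

  SizeOrTimeBatcher(int maxBatch, long flushMillis, Consumer<List<T>> sink) {
    if (maxBatch <= 0 || flushMillis <= 0) {
      throw new IllegalArgumentException("maxBatch and flushMillis must be positive");
    }
    this.maxBatch = maxBatch;
    this.sink = sink;
    this.timer = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "batch-flush-timer");
      t.setDaemon(true);
      return t;
    });
    // Time-based trigger: push whatever has accumulated, even a partial batch.
    timer.scheduleAtFixedRate(() -> {
      try {
        flush();
      } catch (RuntimeException e) {
        // Swallowed so one failed send does not cancel later flushes;
        // real code should log or retry here.
      }
    }, flushMillis, flushMillis, TimeUnit.MILLISECONDS);
  }

  /** Size-based trigger: flush as soon as the buffer holds maxBatch items. */
  synchronized void add(T item) {
    buffer.add(item);
    if (buffer.size() >= maxBatch) {
      flush();
    }
  }

  /** Sends the buffered items as one batch. The lock is held during the send, which keeps the sketch simple. */
  synchronized void flush() {
    if (buffer.isEmpty()) {
      return;
    }
    List<T> batch = new ArrayList<>(buffer);
    buffer.clear();
    sink.accept(batch);
  }

  /** Stops the timer and sends any remaining items. */
  @Override
  public void close() {
    timer.shutdownNow();
    flush();
  }

  public static void main(String[] args) {
    List<List<String>> sent = new CopyOnWriteArrayList<>();
    try (SizeOrTimeBatcher<String> batcher = new SizeOrTimeBatcher<String>(3, 60_000L, sent::add)) {
      batcher.add("a");
      batcher.add("b");
      batcher.add("c"); // third item triggers a size-based flush
      batcher.add("d"); // left over; flushed by close()
    }
    System.out.println("batches sent: " + sent);
  }
}
```

The timer here fires at a fixed interval rather than 60 seconds after the first buffered document. That matches the "pushed every 60 seconds" wording in the question, but it is a design choice rather than the only reasonable one.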


Re: Solrj : ConcurrentUpdateSolrClient based on QueueSize and Time

Santosh Narayan
Thanks Jason. Hope this can be fixed in the next update of SolrJ.


