Performance of /export requests

Justin Sweeney
Hi,

We are currently working on a project where we are making heavy use of
/export in Solr in order to stream data back. We have an index with about
16 fields that are all docvalues fields and any number of them may be
requested to be streamed in results. Our index has ~450 million documents
spread across 10 shards.

We are creating a CloudSolrStream and when we call CloudSolrStream.open()
we see that call being slower than we had hoped. For some queries, that
call can take 800 ms. What we found interesting was that doing the same
request repeatedly resulted in the same time of 800 ms, which seems to
indicate that /export does not take advantage of caching or there is
something else at play.
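
One way to check whether repeated calls really cost the same is to time the same operation several times and compare the first run with the median. The sketch below is only a harness using the standard Java library; the class name `OpenTiming` is made up for illustration, and the sleeping task is a placeholder standing in for code that constructs, opens and closes the stream.

```java
import java.util.Arrays;

class OpenTiming {
    // Run the task `runs` times and return each elapsed time in milliseconds.
    static long[] timeMillis(Runnable task, int runs) {
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return elapsed;
    }

    // Median of the measurements (the upper median when the count is even).
    static long median(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) {
        // Placeholder work (assumption): replace with the real open()/close() calls.
        Runnable task = () -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        long[] times = timeMillis(task, 5);
        System.out.println("first=" + times[0] + " ms, median=" + median(times) + " ms");
    }
}
```

If the first run and the median are close, as described above, the repeated requests are not getting cheaper after the first one.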

I’m starting to dig through the code to better understand, but I wanted to
reach out to see what sort of expectations we should have here and if there
is anything we can do to increase performance of these requests.

We are currently using Solr 5, but we’ve also tried with Solr 7 and seen
similar results. If I can provide any additional information, please let me
know.

Thank you!
Justin

Re: Performance of /export requests

Toke Eskildsen-2
Justin Sweeney <[hidden email]> wrote:

[Index: 10 shards, 450M docs]

> We are creating a CloudSolrStream and when we call CloudSolrStream.open()
> we see that call being slower than we had hoped. For some queries, that
> call can take 800 ms. [...]

As far as I can see in the code, CloudSolrStream.open() opens streams against the relevant shards and checks whether there is a result. The last step is important, as it means the first batch of tuples must be calculated in the shards. Streaming works internally by sliding a window of 30K tuples through the result set in each shard, so open() results in (up to) 30K tuples being calculated per shard. On the other hand, getting the first 30K tuples should be very fast after open().
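
To put rough numbers on this, here is a small back-of-the-envelope model in plain Java. It only encodes the description above (each shard calculates up to one window of tuples before open() returns) and is not taken from the Solr source. The class name `ExportWindowModel` and the even spread of 45M documents per shard are assumptions for illustration.

```java
import java.util.Arrays;

class ExportWindowModel {
    // Tuples one shard calculates for its first batch: a full window,
    // or fewer if the shard has fewer matching documents.
    static long firstBatch(long shardHits, int window) {
        return Math.min(shardHits, window);
    }

    // Total tuples calculated across all shards before open() can return.
    static long tuplesComputedAtOpen(long[] hitsPerShard, int window) {
        long total = 0;
        for (long hits : hitsPerShard) {
            total += firstBatch(hits, window);
        }
        return total;
    }

    public static void main(String[] args) {
        int window = 30_000;                 // window size mentioned in the thread
        long[] matchAll = new long[10];      // 10 shards
        Arrays.fill(matchAll, 45_000_000L);  // assumed even split of ~450M docs
        System.out.println(tuplesComputedAtOpen(matchAll, window)); // prints 300000
    }
}
```

Under these assumptions a match-all query makes the cluster calculate 300,000 tuples during open(), even if the client only reads a few thousand afterwards.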

> We are currently using Solr 5, but we’ve also tried with Solr 7 and seen
> similar results.

Solr 7 has a performance regression for export, or rather a regression for DocValues that is very visible when using export (see https://issues.apache.org/jira/browse/SOLR-13013), so I would expect it to be slower than Solr 5. You could try Solr 8, where this regression should be mitigated somewhat.

- Toke Eskildsen

Re: Performance of /export requests

Joel Bernstein
Can you share the sort criteria and search query? The main strategy for
improving performance of the export handler is adding more shards. This is
different than with typical distributed search, where deep paging issues
get worse as you add more shards. With the export handler if you double the
shards you double the pushing power. There are no deep paging drawbacks to
adding more shards.

On Sat, May 11, 2019 at 2:17 PM Toke Eskildsen <[hidden email]> wrote:

> [...]

Re: Performance of /export requests

Justin Sweeney
Thanks for the quick response. We are generally seeing exports from Solr 5
and 7 to be roughly the same, but I’ll check out Solr 8.

Joel - We are generally sorting on a tlong field, and criteria can vary from
searching everything (*:*) to searching on a combination of a few tint and
string types.

All of our 16 fields are docvalues. Is there any performance degradation as
the number of docvalues fields increases or should that not have an impact?
Also, is the 30k sliding window configurable? In many cases we are
streaming back a few thousand tuples, maybe up to 10k, and then cutting off
the stream. If we could configure the size of that window, could that speed
things up some?

Thanks again for the info.

On Sat, May 11, 2019 at 2:38 PM Joel Bernstein <[hidden email]> wrote:

> [...]

Re: Performance of /export requests

Joel Bernstein
Your query and sort criteria sound like they should be fast.

In general, if you are cutting off the stream at 10K, don't use the /export
handler. Use the /select handler; it will be faster for sure. The reason
for the 30K sliding window was that it maximized throughput over a long
export (many millions of documents). If you're not doing a long export, then
the export handler is likely not the most efficient approach.
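
As a rough illustration of the two options, the sketch below builds both request URLs with the standard Java library. The host, the collection name (`events`) and the field names (`id`, `ts_l`) are hypothetical. `q`, `sort`, `fl` and `rows` are standard Solr request parameters, and /export expects both `sort` and `fl` to be present.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class SolrRequestUrls {
    // Encode parameters, in insertion order, as a URL query string.
    static String query(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8) + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Bounded request: /select returns at most `rows` sorted documents.
    static String selectUrl(String collectionUrl, String q, String sort, String fl, int rows) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", q);
        p.put("sort", sort);
        p.put("fl", fl);
        p.put("rows", String.valueOf(rows));
        return collectionUrl + "/select?" + query(p);
    }

    // Full export: /export streams the whole sorted result set, so there is no rows parameter.
    static String exportUrl(String collectionUrl, String q, String sort, String fl) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", q);
        p.put("sort", sort);
        p.put("fl", fl);
        return collectionUrl + "/export?" + query(p);
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr/events"; // hypothetical collection URL
        System.out.println(selectUrl(base, "*:*", "ts_l asc", "id,ts_l", 10000));
        System.out.println(exportUrl(base, "*:*", "ts_l asc", "id,ts_l"));
    }
}
```

Keeping `fl` limited to the fields a given request actually needs applies to both handlers.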

Each field being exported slows down the export handler, and 16 is a lot of
fields to export. Again, the only way to increase the performance of exporting
16 fields is to add more shards.

Are you exporting with Streaming Expressions?

On Sun, May 12, 2019 at 8:44 AM Justin Sweeney <[hidden email]> wrote:

> [...]