Streaming and large resultsets


Streaming and large resultsets

Lanny Ripple
We've recently upgraded our SolrCloud (16 shards, 2 replicas) to 6.6.1 on
our way to 7 and I'm getting surprising /stream results.

In one example I /select (wt=csv) and /stream [using
search(...,wt=javabin)] with a query that gives a resultset size of 541
tuples.  The select comes back in under a second.  The stream takes 70
seconds.  Should I expect this much difference?

I then /select and /stream over a query with a resultset size of 3.5M
documents.  The select takes 14 minutes.  The stream takes just under 7
minutes using `curl`.  When I use SolrJ I get

org.apache.http.TruncatedChunkException: Truncated chunk (expected size: 32768; actual size: 13830)
        at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:200)
        at org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:215)
        at org.apache.http.impl.io.ChunkedInputStream.close(ChunkedInputStream.java:316)
        at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
        at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
        ...

I found a reference to this being caused by a timeout of the HTTP session in
CloudSolrStream but couldn't find a bug in Jira on the topic.  Digging
around in the source (yay OSS) I found that I could get hold of the
CloudSolrClient and raise the socket timeout (SOTimeout), so that's working
now.
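In case it's useful to anyone else, the rough idea (a sketch, not a
drop-in; it assumes the SolrClientCache(HttpClient) constructor available
in 6.x and uses our collection and field names) is to hand the stream a
SolrClientCache built on an HttpClient with a larger socket timeout:

  import org.apache.http.config.SocketConfig;
  import org.apache.http.impl.client.CloseableHttpClient;
  import org.apache.http.impl.client.HttpClients;
  import org.apache.solr.client.solrj.io.SolrClientCache;
  import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
  import org.apache.solr.client.solrj.io.stream.StreamContext;
  import org.apache.solr.common.params.ModifiableSolrParams;

  public class StreamWithTimeout {
    public static void main(String[] args) throws Exception {
      // HttpClient with a 1-hour socket (SO) timeout instead of the default.
      CloseableHttpClient httpClient = HttpClients.custom()
          .setDefaultSocketConfig(SocketConfig.custom()
              .setSoTimeout(3600 * 1000)
              .build())
          .build();

      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("q", "outs:tokenV");
      params.set("fl", "id,token");
      params.set("sort", "id asc");
      params.set("qt", "/export");

      // zkHost and collection name are from our setup; adjust as needed.
      CloudSolrStream stream = new CloudSolrStream(
          "zk1:2181,zk2:2181,zk3:2181", "graph-october", params);

      // The StreamContext's SolrClientCache hands out the HTTP clients the
      // stream uses, so the timeout above applies to every shard connection.
      // Assumption: SolrClientCache(HttpClient) constructor, as in recent 6.x.
      StreamContext context = new StreamContext();
      context.setSolrClientCache(new SolrClientCache(httpClient));
      stream.setStreamContext(context);

      // open()/read()/close() as usual from here.
    }
  }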

The documentation describes /stream as "returning data as soon as
available" but there seems to be a HUGE startup latency.  Any thoughts on
how to reduce that?

Re: Streaming and large resultsets

Joel Bernstein
Can you post the exact streaming query you are using? The size of the index
and the field types will help in understanding the issue as well. Also, are
you seeing different performance behavior after the upgrade, or are you
testing streaming for the first time on 6.6.1?

When using the /export handler to stream, the results are pulled from the
DocValues caches, which have in-memory structures that need to be built. If
your query returns a large number of string fields you could see a delay as
those caches are built for the first time or rebuilt following a commit.
You can use static warming queries to warm the fields in the background
following a commit. Once the caches are built, subsequent queries should
start streaming immediately. The 70 second response time with only 541
tuples sounds like it might be caused by the caches being rebuilt.

In general, though, the /export handler will slow down as you add more
fields to the field list and sort list. But if you post your query, along
with the types of the fields in the field list, I can give you an idea of
the kind of performance I would expect.

Joel Bernstein
http://joelsolr.blogspot.com/


Re: Streaming and large resultsets

Lanny Ripple
Happy to do so.  I am testing streams for the first time, so we don't have
any 5.x experience.  The collection I'm testing was loaded after going to
6.6.1, fixing up the solrconfig for lucene_version, and removing the
/export clause.  The indexes run 57G per replica.  We are using 64G hosts
with 48G heaps using G1GC, but this isn't our only large collection.  This
morning, as I'm running these, our cluster is quiet.  I realize some of the
performance we are seeing comes down to our data size, so I'm not expecting
any silver bullets.

We are storing 858M documents that are basically

id: String
token: String
outs: String[]
outsCount: Int

All stored=true, docvalues=true.

The `outs` reference a select number of tokens (1.26M).  Here are current
percentiles of our outsCount

`outsCount`
50%       12
85%      127
98%      937
99.9% 16,284

I'll show the /stream query here, but I'm setting up the same thing in
SolrJ.  I'm going to name our small result set "tokenA" and our large one
"tokenV".

  search(graph-october,
    q="outs:tokenA",
    fl="id,token",
    sort="id asc",
    qt="/export",
    wt=javabin)

I've placed this in file /tmp/expr and invoke with

  curl -sSN -m 3600 --data-urlencode expr@/tmp/expr
http://host/solr/graph-october/stream

The large resultset query replaces "tokenA" with "tokenV".

My /select query is

  curl -sSN -m 3600 -d wt=csv -d rows=10000 -d q="outs:tokenA" -d
fl="id,token" http://host/solr/graph-october/select

Here's the behavior I'm seeing (times are from running the commands back to
back):

/select tokenA 0.296s
/search tokenA 8.787s
/search tokenA 8.513s
/select tokenA 0.292s

I was going to get some "tokenV" times here but curl is hitting "Truncated
chunk" errors.

I can pick a "tokenB" which hasn't been run yet but which, from the
originating data, I know has a resultset size of about 5800.  Times are:

/search tokenB 11.486s
/search tokenB 0.960s
/select tokenB 2.534s
/select tokenB 2.117s
/search tokenB 1.208s

So that's closer to the behavior I'd expect and desire.  Back to "tokenA".

/search tokenA 0.289s
/search tokenA 0.249s
/select tokenA 0.342s
/select tokenA 0.329s
/search tokenA 11.848s

Here's "tokenV" out of a solrj couting job.

_run1_
Token tokenV has 3497203 outs.
Time to first read: PT37.577S
Runtime: PT1M43.822S

_run2_
Token tokenV has 3497203 outs.
Time to first read: PT18.764S
Runtime: PT1M16.655S

So hmm.  Given this I'd hazard there's nothing unexpected about /stream and
the problem is the load on our cluster.

So I thought I'd run "tokenA" through the count job.

_run1_
Token tokenA has 513* outs.
Time to first read: PT12.105S
Runtime: PT12.114S

* The correct count.  In my last post I also counted the CSV header (514)
and then transposed the digits.  :/

_run2_
Token tokenA has 513 outs.
Time to first read: PT7.85S
Runtime: PT7.86S

_run3_
/select tokenA 0.229s

_run4_
Token tokenA has 513 outs.
Time to first read: PT11.485S
Runtime: PT11.495S

And we are back to not nearly as fast.

Here's my counting code.
https://gist.github.com/LannyRipple/8e47769155c766b33bf37bcf901b17e9
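For the record, the core of it is just a read loop that notes when the
first tuple arrives. A rough sketch of its shape (the gist has the real
code; this assumes a CloudSolrStream wired up against the same expression):

  import java.time.Duration;
  import java.time.Instant;
  import org.apache.solr.client.solrj.io.Tuple;
  import org.apache.solr.client.solrj.io.stream.CloudSolrStream;

  public class CountOuts {
    // Rough shape of the counting job; the gist is the real version.
    static void countOuts(CloudSolrStream stream, String token) throws Exception {
      Instant start = Instant.now();
      Instant firstRead = null;
      long count = 0;
      stream.open();
      try {
        while (true) {
          Tuple tuple = stream.read();
          if (firstRead == null) {
            firstRead = Instant.now();   // "Time to first read"
          }
          if (tuple.EOF) {
            break;
          }
          count++;
        }
      } finally {
        stream.close();
      }
      System.out.println("Token " + token + " has " + count + " outs.");
      System.out.println("Time to first read: " + Duration.between(start, firstRead));
      System.out.println("Runtime: " + Duration.between(start, Instant.now()));
    }
  }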


Re: Streaming and large resultsets

Joel Bernstein
In my experience this should be very fast:

 search(graph-october,
    q="outs:tokenA",
    fl="id,token",
    sort="id asc",
    qt="/export",
    wt=javabin)


When the DocValues caches are statically warmed for the two output fields,
I would expect somewhere around 500,000 docs per second exported from a
single node.

You have sixteen shards, which would give you 16 times that throughput. But
of course the data is being sent back through a single aggregator node, so
your overall throughput is only as fast as the aggregator node can process
the results.

This does not explain the slowness that you are seeing. I see a couple of
possible reasons:

1) The memory on the system is not tuned optimally. You have allocated a
large amount of memory to the heap and are not leaving enough memory for
the OS filesystem cache. Lucene DocValues rely on the OS filesystem cache
for the DocValues caches, so I would bump the heap size down considerably.

2) Are you indexing while querying at all? If you are, you would need to be
statically warming the DocValues cache for the id field, which is used for
sorting. Following each commit there is a top-level DocValues cache that is
rebuilt for sorting on string fields. If you use a static warming query, it
will warm the cache before the new searcher goes live for searches. I would
also pause indexing if possible and run queries only, to see how things
behave without indexing.

3) Try running a query directly against the /export handler on each node.
Possibly one of your nodes is slow for some reason and that is causing the
entire query to respond slowly.
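
For example, something along these lines hits a single core's /export
handler through SolrJ's SolrStream (the core name here is a guess at your
naming; check the admin UI for the real shard/replica core names):

  import org.apache.solr.client.solrj.io.Tuple;
  import org.apache.solr.client.solrj.io.stream.SolrStream;
  import org.apache.solr.common.params.ModifiableSolrParams;

  public class ExportOneNode {
    public static void main(String[] args) throws Exception {
      ModifiableSolrParams params = new ModifiableSolrParams();
      params.set("q", "outs:tokenV");
      params.set("fl", "id,token");
      params.set("sort", "id asc");
      params.set("qt", "/export");

      // Point at one core (replica) rather than the collection so only that
      // node's /export handler is exercised. Adjust host and core name.
      SolrStream stream = new SolrStream(
          "http://host1:8983/solr/graph-october_shard1_replica1", params);

      long count = 0;
      long start = System.currentTimeMillis();
      stream.open();
      try {
        while (true) {
          Tuple tuple = stream.read();
          if (tuple.EOF) break;
          count++;
        }
      } finally {
        stream.close();
      }
      System.out.println(count + " tuples in "
          + (System.currentTimeMillis() - start) + " ms");
    }
  }

Timing each node the same way should make a slow node stand out.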





Joel Bernstein
http://joelsolr.blogspot.com/


Re: Streaming and large resultsets

Lanny Ripple
First, Joel, thanks for your help on this.

1) I have to admit we really haven't done much system tuning recently (the
last round was before DocValues, for sure).  We'll go through another
tuning round.

2) At the time I ran these numbers this morning we were not indexing.  We
build this collection once a month and then client jobs can update it.  I
was watching our job queue and there were no jobs running at that time.
It's possible someone else was querying against other collections but they
wouldn't have been updating this collection.

3) I'll try /export on each node.  We're pretty cookie-cutter, with all
nodes being the same and configuration controlled with Puppet.  We collect
system metrics into a Graphite dashboard and no host looks out of sorts
relative to the others.  That said, I wouldn't be surprised if a node was
out of whack.

Thanks again.
  -ljr


Re: Streaming and large resultsets

Susmit Shukla
Hi Lanny,

For long-running streaming queries with many shards and huge resultsets,
SolrJ's default settings for HTTP max connections / max connections per
host may not be enough. If you are using the worker collection (/stream),
it relies on SolrClientCache to dispense HTTP clients, and that cache uses
the default limits. It could be useful to turn on debug logging and check.
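
For example, a sketch of the idea (it assumes you are constructing the
SolrClientCache yourself and that the SolrClientCache(HttpClient)
constructor is available, as in recent 6.x; the numbers are illustrative):

  import org.apache.http.impl.client.CloseableHttpClient;
  import org.apache.http.impl.client.HttpClients;
  import org.apache.solr.client.solrj.io.SolrClientCache;
  import org.apache.solr.client.solrj.io.stream.StreamContext;

  public class TunedClientCache {
    public static StreamContext buildContext() {
      // Allow enough concurrent connections for many shards' worth of
      // long-lived /export streams (HttpClient's defaults are much lower).
      CloseableHttpClient httpClient = HttpClients.custom()
          .setMaxConnPerRoute(32)
          .setMaxConnTotal(128)
          .build();

      StreamContext context = new StreamContext();
      // Assumption: SolrClientCache(HttpClient) constructor.
      context.setSolrClientCache(new SolrClientCache(httpClient));
      return context;
    }
  }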

Thanks,
Susmit
