Under-utilization during streaming expression execution


Gus Heck
Hi Folks,

I'm looking for ideas on how to speed up processing for a streaming
expression. I can't post the full details because it's customer-related,
but the structure is shown here: https://imgur.com/a/98sENVT What that does
is take the results of two queries, join them, and push the result back
into the collection as a new (denormalized) doc. The second (hash) join
just updates a field that distinguishes the new docs from either of the old
docs, so it's hashing exactly one value and thus isn't a performance
concern (if there were a good way to tell select to modify only one field
and keep all the rest without listing the fields explicitly, it wouldn't be
needed).
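For concreteness, the overall shape is roughly the following (collection,
field, and query names here are placeholders, since I can't share the real
ones, and the hashJoin that stamps the marker field is elided):

```
update(collection1, batchSize=500,
  innerJoin(
    search(collection1, q="type:parent", fl="joinKey,fieldA", sort="joinKey asc", qt="/export"),
    search(collection1, q="type:child",  fl="joinKey,fieldB", sort="joinKey asc", qt="/export"),
    on="joinKey"
  )
)
```

Both search streams sort on the join key, which innerJoin requires in order
to merge them.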


When I run it across a test index, the two queries match 1377364 and
5146620 docs, and the result is that it inserts 4742322 new documents in
~10 minutes. This seems pretty spiffy, except this test index is ~1/1000 of
the real index... so obviously I want to find *at least* a factor of 10
improvement. So far I've managed a factor of about 3, getting it down to
slightly over 200 seconds, by programmatically building the queries to
partition on a set of percentiles from a stats query on one of the fields
(a floating point number with good distribution). But this stops helping
beyond 10-12 splits on my 50 node cluster; scaling up to split across all
50 nodes brings things back to ~400 seconds.
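Concretely, the partitioning just turns each single query into N
range-restricted copies run as separate expressions; with a made-up field
name and made-up percentile boundaries, something like:

```
q="type:parent AND score_f:[* TO 1.7}"
q="type:parent AND score_f:[1.7 TO 3.2}"
q="type:parent AND score_f:[3.2 TO *]"
```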

The CPU utilization on the machines mostly stabilizes around 30-50%, and
disk metrics don't look bad (the disk idle stat in AWS stays over 90%). I'm
still trying to get a good handle on network numbers, but I'm guessing that
I'm either network-limited or there's contention somewhere inside Solr (no,
I haven't put a profiler on it yet).

Here's the interesting bit. I happen to know that the join key in the
leftJoin is the key used for document routing, so we're only joining up
with documents on the same node. Furthermore, the generated id is a
concatenation of these ids with a value from one of the fields, and should
also route to the same node... Is there any way to make the whole
expression run locally on each node, to avoid throwing the data back and
forth across the network needlessly?

Any other ideas for making this go another factor of 2-3 faster?

-Gus
Re: Under-utilization during streaming expression execution

Joel Bernstein
You can run it in parallel and that should help quite a bit. But a really
large batch job is better done like this:

https://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html
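The gist of the linked approach is to wrap the expression in parallel() and
add partitionKeys to the underlying searches, so each worker streams only
its own hash partition. Roughly (using placeholder names, not the actual
expression):

```
parallel(collection1, workers=20, sort="joinKey asc",
  update(collection1, batchSize=500,
    innerJoin(
      search(collection1, q="type:parent", fl="joinKey,fieldA", sort="joinKey asc", partitionKeys="joinKey", qt="/export"),
      search(collection1, q="type:child",  fl="joinKey,fieldB", sort="joinKey asc", partitionKeys="joinKey", qt="/export"),
      on="joinKey"
    )
  )
)
```

Each worker hashes on joinKey, so matching tuples from both searches land
on the same worker.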

Joel Bernstein
http://joelsolr.blogspot.com/


On Thu, Feb 14, 2019 at 6:10 PM Gus Heck <[hidden email]> wrote:

Re: Under-utilization during streaming expression execution

Joel Bernstein
Use large batches, fetch() instead of hashJoin, and lots of parallel
workers.
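In other words, stream only the driving side in parallel and let fetch()
pull the other side's fields in batches; a rough sketch with placeholder
names:

```
parallel(collection1, workers=20, sort="joinKey asc",
  update(collection1, batchSize=1000,
    fetch(collection1,
      search(collection1, q="type:parent", fl="id,joinKey,fieldA", sort="joinKey asc", partitionKeys="joinKey", qt="/export"),
      fl="fieldB",
      on="joinKey=joinKey",
      batchSize=250
    )
  )
)
```

fetch() does a batched lookup per block of tuples rather than hashing an
entire stream into memory, which keeps the workers streaming.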

Joel Bernstein
http://joelsolr.blogspot.com/


On Fri, Feb 15, 2019 at 7:48 PM Joel Bernstein <[hidden email]> wrote:
