cursorMark and shards? (6.6.2)


cursorMark and shards? (6.6.2)

Walter Underwood
I tried to get fancy and dump our content with one thread per shard, but it did distributed search anyway. I specified the shard using the “shards” param and set distrib=false.

Is this a bug or expected behavior in 6.6.2? I did not see it mentioned in the docs.

It is working fine with a single thread and distributed search. Should have followed the old Kernighan and Plauger rule, “Make it right before you make it faster.”

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

Re: cursorMark and shards? (6.6.2)

Erick Erickson
Not sure whether cursorMark respects distrib=false, although I can easily see there being “complications” here.

Hmmm, whenever I try to use distrib=false, I usually fire the query at the specific replica rather than use the shards parameter. IDK whether that’ll make any difference.

https://node:port/solr/collection1_shard1_replica1/query?distrib=false…

You could also make it simpler...

q=id:{last_id_from_last_packet TO *]&rows=some_reasonable_number&sort=id asc&distrib=false

That doesn’t pay the penalty of having a huge start param.

Again, I’d use the specific replica rather than shards parameter, but just because I’ve never tried it with shards….
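
For illustration, a rough, untested Python sketch of that id-range loop against a single core (the base URL, core name, and use of the requests library are assumptions here, not something from this thread):

import json
import requests

CORE_URL = "http://node:8983/solr/collection1_shard1_replica1/select"  # hypothetical core
ROWS = 1000

def dump_core(out_path):
    # Keyset-style paging: sort by id and restart each page just past the last id seen,
    # so there is no cursorMark and no deep "start" offset.
    last_id = None
    with open(out_path, "w") as out:
        while True:
            # Quote the id inside the range in case it contains reserved characters.
            q = "id:[* TO *]" if last_id is None else 'id:{"%s" TO *]' % last_id
            params = {"q": q, "sort": "id asc", "rows": ROWS, "fl": "*",
                      "wt": "json", "distrib": "false"}
            docs = requests.get(CORE_URL, params=params).json()["response"]["docs"]
            if not docs:
                break
            for doc in docs:
                out.write(json.dumps(doc) + "\n")
            last_id = docs[-1]["id"]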

Best,
Erick

Re: cursorMark and shards? (6.6.2)

Walter Underwood
I’ll give that a shot.

Not sure if range queries work on a UUID field, but I have thought of segmenting the ID space and running parallel queries on those.

Right now it is sucking over 1.6 million docs per hour, so that is bearable. Making it 4X or 16X faster would be nice, though.
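
For what it’s worth, splitting a hex id space into 16 non-overlapping range queries could be sketched like this (untested; the field name is an assumption):

HEX = "0123456789abcdef"

# Half-open ranges on the leading hex digit: id:[0 TO 1}, id:[1 TO 2}, ..., id:[f TO *]
segments = []
for i, low in enumerate(HEX):
    high = HEX[i + 1] if i + 1 < len(HEX) else "*"
    bracket = "}" if high != "*" else "]"
    segments.append("id:[%s TO %s%s" % (low, high, bracket))

Prefix queries (id:0* through id:f*) give the same partition, assuming every id starts with a hex digit.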

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

Re: cursorMark and shards? (6.6.2)

Walter Underwood

> On Feb 10, 2020, at 2:24 PM, Walter Underwood <[hidden email]> wrote:
>
> Not sure if range queries work on a UUID field, ...

A search for id:0* took 260 ms, so it looks like they work just fine. I’ll try separate queries for 0-f.

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

Re: cursorMark and shards? (6.6.2)

Erick Erickson
Any field that’s unique per doc would do, but yeah, that’s usually an ID.

Hmmm, I don’t see why separate queries for 0-f are necessary if you’re firing
at individual replicas. Each replica should have multiple UUIDs that start with 0-f.

Unless I misunderstand and you’re just firing off, say, 16 threads at the entire
collection rather than individual shards, which would work too. But for individual
shards, I think you need to look for all possible IDs...

Erick

Re: cursorMark and shards? (6.6.2)

Michael Gibney
Possibly worth mentioning, although it might not be appropriate for
your use case: if the fields you're interested in are configured with
docValues, you could use streaming expressions (or directly handle
thread-per-shard connections to the /export handler) and get
everything in a single shot without paging of any kind. (I'm actually
working on something of this nature now; though not quite ready for
prime time, it's reliably exporting 68 million records to a 24G
compressed zip archive in 23 minutes -- 24 shards).
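
As a hedged sketch of the per-shard /export idea (every fl field and the sort field would need docValues; the URL and field names below are placeholders):

import requests

# /export streams the entire sorted result set for one core -- no paging, no cursorMark.
EXPORT_URL = "http://node:8983/solr/collection1_shard1_replica1/export"  # hypothetical core
params = {"q": "*:*", "sort": "id asc", "fl": "id,title,body", "wt": "json"}

resp = requests.get(EXPORT_URL, params=params)
resp.raise_for_status()
# For tens of millions of docs you'd want an incremental JSON parser (e.g. ijson)
# or a streaming expression; this simple version loads the whole response into memory.
for doc in resp.json()["response"]["docs"]:
    print(doc["id"])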

Re: cursorMark and shards? (6.6.2)

Walter Underwood
In reply to this post by Erick Erickson
I’ll back up a bit, since it is sort of an X/Y problem.

I have an index with four shards and 17 million documents. I want to dump all the docs in JSON, label each one with a classifier, then load them back in with the labels. This is a one-time (or rare) bootstrap of the classified data. This will unblock testing and relevance work while we get the classifier hooked into the indexing pipeline.

Because I’m dumping all the fields, we can’t rely on docValues.

It is OK if it takes a few hours.

Right now, it is running about 1.7 Mdoc/hour, so roughly 10 hours. That is 16 threads searching id:0* through id:f*, fetching 1000 rows each time, using cursorMark and distributed search. Median response time is 10 s. CPU usage is about 1%.

It is all pretty grubby and it seems like there could be a better way.
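
For reference, each per-prefix cursorMark loop described above might look roughly like this in Python (untested sketch; the URL and the requests library are assumptions, and a real run would hand one dump_prefix call per hex digit to a thread or process pool):

import json
import requests

SOLR_URL = "http://node:8983/solr/collection1/select"  # hypothetical collection URL

def dump_prefix(prefix, out_path, rows=1000):
    # Walk one slice of the id space (e.g. id:0*) with cursorMark.
    cursor = "*"
    with open(out_path, "w") as out:
        while True:
            params = {"q": "id:%s*" % prefix,
                      "sort": "id asc",        # cursorMark needs a sort that includes the uniqueKey
                      "rows": rows, "fl": "*", "wt": "json",
                      "cursorMark": cursor}
            body = requests.get(SOLR_URL, params=params).json()
            for doc in body["response"]["docs"]:
                out.write(json.dumps(doc) + "\n")
            next_cursor = body["nextCursorMark"]
            if next_cursor == cursor:          # cursor stopped moving: this slice is done
                break
            cursor = next_cursor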

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

Re: cursorMark and shards? (6.6.2)

Tim Casey
Walter,

When you do the query, what is the sort of the results?

tim

Re: cursorMark and shards? (6.6.2)

Walter Underwood
sort=“id asc”

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

Re: cursorMark and shards? (6.6.2)

Erick Erickson
Wow, that’s pretty horrible performance.

Yeah, I was conflating a couple of things here. Now it’s clear.

If you specify rows=1, what do you get in response time? I’m wondering if
your time is spent just assembling the response rather than searching. You’d
have to have massive docs for that to be the case, kind of a shot in the dark.
The assembly step requires that the docs be read off disk, decompressed, and then
transmitted, but 10 seconds is ridiculous for that. I’m starting to wonder about
being I/O bound, either disk-wise or network-wise, but I’m pretty sure you’ve already
thought about that.

You are transmitting things around your servers given your statement that you
are seeing the searches distributed, which is also a waste, but again I wouldn’t
expect it to be that bad.

Hmmm, quick thing to check: What are the QTimes reported? Those are
exclusive of assembling the return packet. If they were a few milliseconds and
your response back at the client was 10s, that’d be a clue.

Best,
Erick

Re: cursorMark and shards? (6.6.2)

Walter Underwood
Good questions. Here are the QTimes for rows=1000. They look pretty reasonable. I’d blame the slowness on the VPN connection, but the median response time of 10,000 msec is measured at the server.

The client is in Python, using wt=json. Average document size in JSON is 5132 bytes. The system should not be IO bound, but I’ll check. The instances have 31 GB of memory, shards are 40 GB on SSD. I don’t think I set up JVM monitoring on this cluster, so I can’t see if the GC is thrashing.

2020-02-11T08:51:33 INFO QTime=401
2020-02-11T08:51:34 INFO QTime=612
2020-02-11T08:51:34 INFO QTime=492
2020-02-11T08:51:35 INFO QTime=513
2020-02-11T08:51:36 INFO QTime=458
2020-02-11T08:51:36 INFO QTime=419
2020-02-11T08:51:46 INFO QTime=477
2020-02-11T08:51:47 INFO QTime=479
2020-02-11T08:51:47 INFO QTime=457
2020-02-11T08:51:50 INFO QTime=553
2020-02-11T08:51:50 INFO QTime=658
2020-02-11T08:51:52 INFO QTime=379

wunder
Walter Underwood
[hidden email]
http://observer.wunderwood.org/  (my blog)

Re: cursorMark and shards? (6.6.2)

Erick Erickson
Curiouser and curiouser. So two possibilities are just the time it takes to assemble the packet and the time it takes to send it back. Three more experiments then.

1> change the returned doc to return a single docValues=true field. My claim: the response will be very close to the 400-600 ms range back at the client. This is kind of a sanity check; it should collect the field to return without visiting the stored data, decompressing it, etc.

2> change the returned doc to return a single docValues=false stored=true field. That’ll exercise the whole fetch-from-disk-and-decompress cycle, because all the stored values for a doc need to be decompressed if you access even one. If that comes back in, say, < 1 second, then my guess would be that the speed issue is either GC thrashing or your network. If it’s in the same 10s range, then I’d be looking at GC and the like.

3> change the returned rows to, say, 100 while still returning all the fields. If you see a pretty linear relationship between the number of docs and the response time, then at least we know where to dig.
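
As a rough sketch, the three experiments could be run from the Python client like this (the URL and field names are placeholders; "title" stands in for whichever stored, non-docValues field you pick):

import requests

SOLR_URL = "http://node:8983/solr/collection1/select"  # hypothetical
base = {"q": "id:0*", "sort": "id asc", "rows": 1000, "wt": "json", "cursorMark": "*"}

experiments = {
    "1_single_docvalues_field": dict(base, fl="id"),     # assumes id is docValues=true
    "2_single_stored_field":    dict(base, fl="title"),  # placeholder stored=true, docValues=false field
    "3_smaller_pages":          dict(base, rows=100),    # all fields, 100 rows instead of 1000
}

for name, params in experiments.items():
    resp = requests.get(SOLR_URL, params=params)
    qtime = resp.json()["responseHeader"]["QTime"]
    print(name, "QTime(ms)=", qtime, "wall(s)=", resp.elapsed.total_seconds())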

But the 1% CPU utilization makes me suspect transmission. If it were read-from-disk-and-decompress I’d expect more CPU due to the decompression phase unless you have a very slow disk.

So apparently CursorMark was a red herring?

It is _vaguely_ possible that bumping the documentCache higher might help, but frankly I doubt it. That would help if you had a situation where you were having to re-access the stored data from disk for several different components in a single query, but I don’t think that pertains. But then I’m surprised you’re seeing this at all, so what I think, as an author I read had a character say, “ain’t worth a goldfish fart” in the face of the evidence.

Best,
Erick
