Iterating BinaryDocValues

Iterating BinaryDocValues

Joel Bernstein

Hi,

I'm looking for a faster way to perform large-scale docId -> BytesRef lookups for BinaryDocValues.

I'm finding that I can't get the performance that I need from the random access seek in the BinaryDocValues interface.

I'm wondering if sequentially scanning the docValues would be a faster approach. I have a BitSet of matching docs, so if I sequentially moved through the docValues I could test each one against that bitset.

Wondering whether that approach would be faster for bulk extracts, and how tricky it would be to add an iterator to the BinaryDocValues interface?

Thanks,
Joel 
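The access pattern Joel describes can be sketched with plain JDK types (a byte[][] standing in for the doc-values column and java.util.BitSet for the matching docs; this is a hypothetical illustration of the iteration order, not the Lucene API):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/**
 * Sketch of the proposed bulk extract: walk the values column in docId
 * order (no random seeks) and test each docId against the BitSet of
 * matching docs. A byte[][] stands in for the on-disk column.
 */
public class SequentialScanSketch {

  static List<byte[]> scan(byte[][] column, BitSet matches) {
    List<byte[]> out = new ArrayList<>();
    for (int doc = 0; doc < column.length; doc++) { // strictly sequential
      if (matches.get(doc)) {
        out.add(column[doc]);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    byte[][] column = { {1}, {2}, {3}, {4} };
    BitSet matches = new BitSet();
    matches.set(1);
    matches.set(3);
    System.out.println(scan(column, matches).size()); // prints 2
  }
}
```

Because doc only ever increments, reads walk the column in order, which is the property that makes the OS page cache and read-ahead effective; BitSet.nextSetBit(int) would additionally let the loop skip directly between matches.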

Re: Iterating BinaryDocValues

Mikhail Khludnev
Joel,

I tried hacking it up directly, but found no free gain there. The only thing I can suggest is to try reusing the bytes in https://github.com/apache/lucene-solr/blame/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesProducer.java#L401. Right now it allocates a new byte[] on every call, which, besides GC pressure, can also hurt memory access locality. Could you try fixing that memory waste and repeating the performance test?
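The suggested reuse can be sketched with a grow-only buffer (a simplified stand-in, NOT the actual Lucene45DocValuesProducer code): instead of allocating a fresh byte[] per lookup, the backing array is reallocated only when a value is longer than anything seen so far.

```java
/**
 * Simplified stand-in for the codec's read path illustrating byte[] reuse.
 * The backing array is grown only on demand instead of per lookup.
 */
public class ReusableBuffer {
  public byte[] bytes = new byte[16]; // reused across lookups
  public int length;

  /** Copies {@code value} into this buffer, reusing the backing array when it fits. */
  public void fillFrom(byte[] value) {
    if (value.length > bytes.length) {
      bytes = new byte[value.length]; // allocate only on growth
    }
    System.arraycopy(value, 0, bytes, 0, value.length);
    length = value.length;
  }
}
```

Over a bulk extract of millions of short values, allocation then happens only a handful of times (on growth) instead of once per document, which is where the GC and locality savings would come from.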

Have a good hack!


--
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

[hidden email]

Re: Iterating BinaryDocValues

Michael McCandless
Going sequentially should help, if the pages are not hot (in the OS's IO cache).

You can also use a different DVFormat, e.g. Direct, but this holds all
bytes in RAM.

Mike McCandless

http://blog.mikemccandless.com



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Iterating BinaryDocValues

Mikhail Khludnev
FWIW,

A micro benchmark shows a 4% gain from reusing the incoming BytesRef.bytes on short binary doc values (Test2BBinaryDocValues.testVariableBinary() with an mmap directory).
I wonder why it doesn't read into the incoming bytes: https://github.com/apache/lucene-solr/blame/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesProducer.java#L401




Re: Iterating BinaryDocValues

Mikhail Khludnev
Don't you think it's worth raising a JIRA issue about those 'new byte[]' allocations? I can provide a patch if you wish.



Re: Iterating BinaryDocValues

Adrien Grand
Hi Mikhail,

These new byte[] allocations are done on purpose, because the contract of the BytesRef class is that it is only a reference to bytes stored somewhere, not a buffer. So it is always legal to change the offset, length, and bytes reference, but the content of the byte[] array may only be changed if you own the BytesRef object. For example, it is legal for a BytesRefIterator to use a private BytesRef as a mutable buffer and expose it in calls to next(), but it is illegal to modify the content of the byte[] in BinaryDocValues.get, since the bytes there are provided by consumers of the API.

If we wanted to remove these byte[] creations, I think we would need either:
 1. to use a byte buffer instead of a BytesRef object, or
 2. to make the BytesRef objects private to the BinaryDocValues instances and have a "BytesRef get(int docId)" method instead of "void get(int docId, BytesRef ref)".

Although option 1 would avoid these byte[] creations, it would make the paged-bytes-based implementations a bit slower (e.g. MemoryDocValuesFormat exposes binary doc values through paged bytes), because they would need to copy bytes instead of just changing pointers. With option 2, the BinaryDocValues object would no longer be thread-safe by nature, but I don't think this is an issue, as we already store them in thread-locals.
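The contract can be illustrated with a minimal stand-in for BytesRef (hypothetical simplified classes, not Lucene's): the producer legally repoints the ref at storage it owns, never writes into the caller's array, and a consumer that wants to keep the value copies it out itself.

```java
/** Minimal stand-in for Lucene's BytesRef: a view over bytes, not a buffer. */
class Ref {
  byte[] bytes;
  int offset;
  int length;
}

/** Hypothetical producer/consumer pair demonstrating the ownership rules. */
public class BytesRefContract {
  private static final byte[] STORAGE = {10, 20, 30, 40};

  /** Legal: change bytes/offset/length to point at shared storage. */
  static void get(int docId, Ref ref) {
    ref.bytes = STORAGE; // repoint; writing into ref.bytes would be illegal
    ref.offset = docId;
    ref.length = 1;
  }

  /** Consumer-side copy for callers that need to retain the bytes. */
  static byte[] copyOf(Ref ref) {
    byte[] copy = new byte[ref.length];
    System.arraycopy(ref.bytes, ref.offset, copy, 0, ref.length);
    return copy;
  }
}
```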





--
Adrien



Re: Iterating BinaryDocValues

Mikhail Khludnev
Adrien,

Thanks for your answer! You highlighted a point I had completely missed. Once again, noob questions prove quite useful.

On Fri, Jan 10, 2014 at 10:09 AM, Adrien Grand <[hidden email]> wrote:
Hi Mikhail,

These new byte[] are done on purpose because the contract on the
BytesRef class is that it is only a reference to bytes stored
somewhere, not a buffer. So it is always legal to change offset,
length and the bytes reference, but the content of the byte[] array
may only be changed if you own the BytesRef object.
This aspect of the usage contract isn't obvious at all; it's worth spelling out this rule in the javadocs!
 
So for example, it
is legal for a BytesRefIterator to use a private BytesRef as a mutable
buffer and to expose it in calls to next() but it is illegal to modify
the content of the byte[] in BinaryDocValues.get since the bytes are
provided by consumers of the API.

If we would like to be able to remove these byte[] creations, I think
we would either need:
 1. to use a byte buffer instead of a BytesRef object,
 2. or to make the BytesRef objects private to the BinaryDocValues
instances and have a "BytesRef get(int docId)" method instead of "void
get(int docId, BytesRef ref)"

Although option 1 would avoid these byte[] creations, it would make
the paged-bytes-based implementations a bit slower (eg.
MemoryDocValuesFormat exposes binary doc values through paged bytes)
because they would need to copy bytes instead of just changing
pointers. With option 2, the BinaryDocValues object wouldn't be
thread-safe by nature anymore, but I don't think this is an issue as
we already store them in thread-locals.

Agreed, option 1 is not appealing, but I also feel uncomfortable relying on ThreadLocals. I started from what Joel suggested (a column-reading API), but later rejected it as unnecessary. It might make sense to look at it again.

public abstract class BinaryDocValues {
  .....
  /**
   * The purpose is efficient sequential memory access.
   * @throws IOException if anything sad happens
   */
  public BytesRefIterator getColumn(final DocIdSetIterator docIds) throws IOException {
    return new BytesRefIterator() {

      private final BytesRef ref = new BytesRef();

      @Override
      public BytesRef next() throws IOException {
        int doc = docIds.nextDoc();
        if (doc == DocIdSetIterator.NO_MORE_DOCS) {
          return null;
        }
        get(doc, ref);
        return ref;
      }
    };
  }
}

It's a concrete default implementation backed by the current get() method, so we wouldn't need to change all existing implementations; particular codecs could provide a more efficient iterator.

Anyway, whether you like getColumn() or not, if you raise an issue I can work on a patch.




Re: Iterating BinaryDocValues

Adrien Grand
I don't think we should add such a method. Doc values are commonly
read from collectors, so why do we need a method that works on top of
a DocIdSetIterator? I'm also curious how specialized implementations
could make this method faster than the default implementation?

--
Adrien



Re: Iterating BinaryDocValues

Mikhail Khludnev
> I don't think we should add such a method. Doc values are commonly
> read from collectors, so why do we need a method that works on top of
> a DocIdSetIterator?
In collect(docNum), collectors can access doc values via the currently available get(), but that's not the most efficient way to reach them. Let me refer to Shai's talk in Dublin: if you need to calculate DV facets for a few fields (my assumption), the best way is to store a bitset, then loop over the DV fields, and for every field scan its column by looping over that docset, rather than looping over docs. Solr works the same way: the old UnInvertedField for sure, and I guess the recent DV facets too. So such a column-scan method makes sense from a performance standpoint.
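The field-major ("column scan") order described above can be sketched like this (plain long[] columns standing in for doc-values fields; hypothetical, not Solr's actual faceting code):

```java
import java.util.BitSet;

/**
 * Field-major aggregation: for each field, scan that field's column over
 * the matching docset, rather than touching every field per doc in turn.
 * Each inner loop then reads one column sequentially.
 */
public class ColumnScan {

  /** columns[field][doc] -> value; returns per-field sums over the docset. */
  static long[] sumPerField(long[][] columns, BitSet docs) {
    long[] sums = new long[columns.length];
    for (int field = 0; field < columns.length; field++) {       // outer loop: field
      long[] column = columns[field];
      for (int doc = docs.nextSetBit(0); doc >= 0; doc = docs.nextSetBit(doc + 1)) {
        sums[field] += column[doc];                              // sequential within one column
      }
    }
    return sums;
  }
}
```

The doc-major alternative would interleave reads across all columns for every document, losing the sequential locality within each column.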

> I'm also curious how specialized implementations
> could make this method faster than the default implementation?
It can own the byte[] and write into it (without a ThreadLocal). Let me provide you a scratch later.

btw,
I repeated my micro benchmark with the random() call removed from the loop:

..
reuse: true took:117944 ms
...
reuse: false took:131507 ms

It seems the gain from reusing byte[] is about 10% (I understand this is an incorrect impl; let's think about how to improve it). Let me attach the sampler screenshots. It's an mmap directory and the 4.5 codec.

Joel, this answers your concern: the seek is not hot at all.

I'm attaching the test and impl diff if you want to reproduce the measurement.




Attachments: docvalues base reuse false.png (83K), doc values reuse true.png (100K), doc values reuse byteref benchmark.patch (11K)

Re: Iterating BinaryDocValues

Mikhail Khludnev
In reply to this post by Adrien Grand
Adrien,

Please find the bulkGet() scratch attached. It's an ugly copy-paste that just reuses the BytesRef, which provides a 10% gain.
...
bulkGet took:101630 ms
...
get took:114422 ms




Attachment: doc values bulkget.patch (13K)

Re: Iterating BinaryDocValues

Robert Muir
The problem is actually simpler to solve.

Look at the comment in the code; it tells you why it is this way:

          // NOTE: we could have one buffer, but various consumers (e.g. FieldComparatorSource)
          // assume "they" own the bytes after calling this!

That is what we should fix. There is no need to add bulk APIs or change the public API in any way (other than the javadocs).

We just move the cloning out of the codec and require the consumer to do it, the same as TermsEnum and other APIs. The codec part is extremely simple here; it's even the way I had it initially.

But at the time (and even still now) this comes with some risk of bugs, so initially I removed the reuse and went with a more conservative approach.
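The TermsEnum-style reuse Robert describes can be sketched as follows (a hypothetical simplified iterator, not the actual codec change): the producer returns the same mutable buffer from every call, and a consumer that wants to retain the bytes clones them itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Producer-side reuse: next() always returns the SAME array, overwritten
 * with the next value. Retaining consumers must copy; others get the
 * allocation-free fast path.
 */
public class ReuseIterator {
  private final byte[][] values;
  private final byte[] shared = new byte[1]; // single reused buffer
  private int doc = -1;

  ReuseIterator(byte[][] values) {
    this.values = values;
  }

  /** Returns the shared buffer filled with the next value, or null when exhausted. */
  byte[] next() {
    if (++doc >= values.length) return null;
    shared[0] = values[doc][0];
    return shared;
  }

  public static void main(String[] args) {
    ReuseIterator it = new ReuseIterator(new byte[][] { {7}, {8}, {9} });
    List<byte[]> kept = new ArrayList<>();
    for (byte[] b = it.next(); b != null; b = it.next()) {
      kept.add(Arrays.copyOf(b, b.length)); // consumer clones to retain
    }
    System.out.println(kept.size()); // prints 3
  }
}
```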




Re: Iterating BinaryDocValues

Shai Erera
I agree with Robert. We should leave cloning BytesRefs to whoever needs it, and not penalize everyone else who doesn't. I must say I didn't know I could "own" those BytesRefs, and I clone them whenever I need to; I think I was bitten by one of the other APIs, so I assumed returned BytesRefs are not "mine" across all the APIs.

Shai





Re: Iterating BinaryDocValues

Robert Muir
Yeah, I don't think it's from newer docvalues-using code like yours, Shai.

Instead, the problems I had doing this are historical: e.g. the field cache pointed to large arrays, and consumers were lazy about it, knowing that their reference pointed to bytes that would remain valid across invocations.

We just have to remove these assumptions. I don't apologize for not doing this; as you show, it's some small % improvement (which we should go and get back!), but I went with safety first initially rather than bugs.




Reply | Threaded
Open this post in threaded view
|

Re: Iterating BinaryDocValues

Robert Muir


On Fri, Jan 10, 2014 at 2:54 PM, Robert Muir <[hidden email]> wrote:
Yeah, i dont think its from newer docvalues-using code like yours shai.

instead the problems i had doing this are historical, because e.g. fieldcache pointed to large arrays and consumers were lazy about it, knowing that there reference pointed to bytes that would remain valid across invocations.

we just have to remove these assumptions. I don't apologize for not doing this, as you show, its some small % improvement (which we should go and get back!), but i went with safety first initially rather than bugs.



On Fri, Jan 10, 2014 at 2:50 PM, Shai Erera <[hidden email]> wrote:
I agree with Robert. We should leave cloning BytesRefs to whoever needs that, and not penalize everyone else who don't need it. I must say I didn't know I can "own" those BytesRefs and I clone them whenever I need to. I think I was bitten by one of the other APIs, so I assumed returned BytesRefs are not "mine" across all the APIs.

Shai


On Fri, Jan 10, 2014 at 9:46 PM, Robert Muir <[hidden email]> wrote:
The problem is really simpler to solve, actually.

Look at the comments in the code; they tell you why it is this way:

          // NOTE: we could have one buffer, but various consumers (e.g. FieldComparatorSource)
          // assume "they" own the bytes after calling this!

That is what we should fix. There is no need to make bulk APIs or even change the public API in any way (other than javadocs).

We just move the cloning out of the codec and require the consumer to do it, the same as TermsEnum or other APIs. The codec part is extremely simple here; it's even the way I had it initially.

But at the time (and even still now) this comes with some risk of bugs, so initially I removed the reuse and went with a more conservative approach.
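To illustrate that split (buffer reuse in the codec, cloning by the consumer), here is a minimal sketch. The `BytesRef` and `ReusingBinaryDocValues` classes below are simplified stand-ins for illustration only, not the real Lucene types:

```java
import java.util.Arrays;

// Simplified stand-ins for illustration -- NOT the real Lucene classes.
class BytesRef {
    byte[] bytes = new byte[0];
    int offset;
    int length;

    // Analogous to Lucene's BytesRef.deepCopyOf(): a private copy of the bytes.
    static BytesRef deepCopyOf(BytesRef other) {
        BytesRef copy = new BytesRef();
        copy.bytes = Arrays.copyOfRange(other.bytes, other.offset, other.offset + other.length);
        copy.offset = 0;
        copy.length = other.length;
        return copy;
    }
}

class ReusingBinaryDocValues {
    private final byte[][] values;                   // per-document payloads
    private final BytesRef scratch = new BytesRef(); // one shared buffer, no per-call allocation

    ReusingBinaryDocValues(byte[][] values) {
        this.values = values;
    }

    // Points the shared scratch at the doc's bytes instead of allocating.
    // The returned ref is only valid until the next get(); a consumer that
    // needs to keep the bytes must deep-copy them itself.
    BytesRef get(int docId) {
        scratch.bytes = values[docId];
        scratch.offset = 0;
        scratch.length = values[docId].length;
        return scratch;
    }
}
```

A consumer that wants to hold on to a value calls `BytesRef.deepCopyOf(dv.get(doc))` explicitly, while scan-and-discard consumers pay no allocation at all.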


On Fri, Jan 10, 2014 at 2:36 PM, Mikhail Khludnev <[hidden email]> wrote:
Adrien,

Please find the bulkGet() scratch. It's an ugly copy-paste that just reuses the BytesRef, which provides a 10% gain.
...
bulkGet took:101630 ms
...
get took:114422 ms



On Fri, Jan 10, 2014 at 8:58 PM, Adrien Grand <[hidden email]> wrote:
I don't think we should add such a method. Doc values are commonly
read from collectors, so why do we need a method that works on top of
a DocIdSetIterator? I'm also curious how specialized implementations
could make this method faster than the default implementation?






Re: Iterating BinaryDocValues

Joel Bernstein
In reply to this post by Robert Muir
I'll provide a little more context. I'm working on bulk extracting BinaryDocValues. My initial performance test was with in-memory BinaryDocValues, but I think the end game is actually disk-based BinaryDocValues.

I was able to perform around 1 million docId->BytesRef lookups per second with in-memory BinaryDocValues. Since I need to get the values for multiple fields for each document, this bogs down pretty quickly.

I'm wondering if there is a way to increase this throughput. Since filling a BytesRef is pretty fast, I was assuming it was the seek that was taking the time, but I didn't verify this. The first thing that came to mind was iterating the docValues in such a way that the next docValue could be loaded without a seek. But I haven't dug into how the BinaryDocValues are formatted, so I'm not sure whether this would help. There could also be something else besides the seek limiting the throughput.
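A rough sketch of that sequential idea: walk doc ids in order, test each against the BitSet of matches, and collect values in one forward pass. The array-backed storage here is a hypothetical stand-in for the codec's on-disk layout:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class SequentialDocValuesScan {
    // One forward pass over per-document values; only docs set in the
    // BitSet of matches are collected. With a disk-based codec, a pass in
    // docId order could read the values file sequentially, with no seeks.
    static List<byte[]> collectMatching(byte[][] values, BitSet matches) {
        List<byte[]> out = new ArrayList<>();
        for (int doc = matches.nextSetBit(0); doc >= 0 && doc < values.length;
             doc = matches.nextSetBit(doc + 1)) {
            out.add(values[doc]);
        }
        return out;
    }
}
```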






 

Joel Bernstein
Search Engineer at Heliosearch







Re: Iterating BinaryDocValues

Robert Muir
What are you doing with the data?








Re: Iterating BinaryDocValues

Joel Bernstein
Bulk extracting full unsorted result sets from Solr: you give Solr a query and it dumps the full result in a single call. The result-set streaming is in place, but the throughput is not as good as I would like.

Joel Bernstein
Search Engineer at Heliosearch









Re: Iterating BinaryDocValues

Joel Bernstein
In reply to this post by Robert Muir
For the test I ran, I just timed the number of docId->BytesRef lookups I could do in a second.
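For reference, the shape of such a timing loop might be as follows. This is a hypothetical micro-benchmark against an array stand-in, not the real BinaryDocValues:

```java
class LookupBench {
    // Counts docId -> bytes lookups completed in roughly runNanos, then
    // normalizes to lookups per second. The values array stands in for
    // whatever get(docId) implementation is being measured.
    static long lookupsPerSecond(byte[][] values, long runNanos) {
        long count = 0;
        long sink = 0; // consume each value so the JIT can't drop the load
        long start = System.nanoTime();
        while (System.nanoTime() - start < runNanos) {
            for (byte[] value : values) {
                sink += value.length;
                count++;
            }
        }
        long elapsedNanos = System.nanoTime() - start;
        if (sink < 0) throw new AssertionError(); // keep 'sink' observable
        return count * 1_000_000_000L / elapsedNanos;
    }
}
```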

Joel Bernstein
Search Engineer at Heliosearch


On Fri, Jan 10, 2014 at 3:41 PM, Robert Muir <[hidden email]> wrote:
Are you sure it's not the wire serialization causing the bottleneck (e.g. converting to UTF-8 strings and back, network traffic, JSON encoding, etc.)?











Re: Iterating BinaryDocValues

Robert Muir
What are you storing in the binaryDV?












Re: Iterating BinaryDocValues

Joel Bernstein
The text value for the field. It's a generic bulk-extract tool, so it's whatever the user loads.

Joel Bernstein
Search Engineer at Heliosearch

