[jira] Created: (LUCENE-2886) Adaptive Frame Of Reference

classic Classic list List threaded Threaded
30 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
Adaptive Frame Of Reference
----------------------------

                 Key: LUCENE-2886
                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Codecs
            Reporter: Renaud Delbru
             Fix For: 4.0


We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].

[1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Renaud Delbru updated LUCENE-2886:
----------------------------------

    Attachment: lucene-afor.tar.gz

tarball containing a maven project with source code and unit tests for:
- AFOR1
- AFOR2
- FOR
- PFOR Non Compulsive
- Simple64
- a basic tool for debugging IntBlock codecs.

It includes also the lucene-1458 snapshot dependencies that are necessary to compile the code and run the tests.

> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [jira] Created: (LUCENE-2886) Adaptive Frame Of Reference

Paul Elschot
In reply to this post by Tim Allison (Jira)
Any idea on how this compares to the vector split encoding here:
http://puma.isti.cnr.it/publichtml/section_cnr_isti/cnr_isti_2010-TR-016.html
?

Regards,
Paul Elschot

On Monday 24 January 2011 19:32:44 Renaud Delbru (JIRA) wrote:

> Adaptive Frame Of Reference
> ----------------------------
>
>                  Key: LUCENE-2886
>                  URL: https://issues.apache.org/jira/browse/LUCENE-2886
>              Project: Lucene - Java
>           Issue Type: New Feature
>           Components: Codecs
>             Reporter: Renaud Delbru
>              Fix For: 4.0
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
>
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [jira] Created: (LUCENE-2886) Adaptive Frame Of Reference

Renaud Delbru
Hi Paul,

This is a good question. The two methods, i.e., VSE and AFOR, are very
similar. The two methods can be considered as an extension of FOR to
make it less sensitive to outliers by adapting the encoding to the value
distribution. To achieve this, the two methods are encoding a list of
values by
- partitioning it into "frames" (or sequence of consecutive integers) of
variable lengths,
- encoding each frame using a different "bit frame" (the minimum number
of bits required to encode any integer in the frame, and still be able
to distinguish them)
- relying on algorithms to automatically fi nd a good list partitioning.

Apart from the minor differences in the implementation design (that I
will discuss later), the main difference is that VSE is optimised for
achieving a high compression rate and a fast decompression but
disregards the efficiency of compression, while AFOR is optimised for
achieving a high compression rate, a fast decompression but also a fast
compression speed. VSE is using a Dynamic Programming method to find the
*optimal partitioning* of a list (optimal in term of compression rate).
While this approach provides a higher compression rate than the one
proposed in AFOR, the complexity of such a partitioning algorithm is O(n
* k), with the term n being the number of values and the term k the size
of the larger frame, which might greatly impact the compression
performance. In AFOR, we use instead a local optimisation algorithm that
is less effective in term of compression rate but faster to compute.

In term of implementation details, there is a few differences.
1) VSE allows frames of length 1, 2, 4, 6, 8, 12, 16 and 32. The current
implementation of AFOR restrict the length of a frame to be a multiple
of 8 to to be aligned with the start and end of a byte boundary (and
also to minimise the number of loop-unrolled highly-optimised routines).
More precisely, AFOR-2 use three frame lengths: 8, 16 and 32.
2) To allow the *optimal partitioning* of a list, the original
implementation of VSE needs to operate on the full list. On the
contrary, AFOR has been developed to operate on small subsets of the
list, so that AFOR can be applied during incremental construction of the
compressed list (it does not require the full list, but works on small
block of 32 or more integers). However, we can think of applying VSE on
small subset, as in AFOR. In this case, VSE does not compute the optimal
partition of a list, but only the optimal partition of the subset of the
list.

VSE and AFOR encodes a frame in a similar way: first, a header (1 byte)
which provides the bit frame and the frames length, then the encoded frame.

So, as you can see, in essence, the two models are very similar. For the
background, I know well Fabrizio Silvestri (co-author of VSE), and he
was my PhD thesis examiner (the AFOR compression scheme is a chapter of
my thesis). The funny thing is that we come up with these two models at
the same time, this summer, without knowing we were working on something
similar ;o). However, he was more lucky than I am to publish his
findings before me.

I hope this answers to your question.
Feel free to ask if you have any other questions,
Regards,
--
Renaud Delbru

On 24/01/11 22:02, Paul Elschot wrote:

> Any idea on how this compares to the vector split encoding here:
> http://puma.isti.cnr.it/publichtml/section_cnr_isti/cnr_isti_2010-TR-016.html
> ?
>
> Regards,
> Paul Elschot
>
> On Monday 24 January 2011 19:32:44 Renaud Delbru (JIRA) wrote:
>> Adaptive Frame Of Reference
>> ----------------------------
>>
>>                   Key: LUCENE-2886
>>                   URL: https://issues.apache.org/jira/browse/LUCENE-2886
>>               Project: Lucene - Java
>>            Issue Type: New Feature
>>            Components: Codecs
>>              Reporter: Renaud Delbru
>>               Fix For: 4.0
>>
>>
>> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
>> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
>> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
>>
>> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [jira] Created: (LUCENE-2886) Adaptive Frame Of Reference

Renaud Delbru
-- sorry, resending it as I don't know what happens to the layout of the
previous one

Hi Paul,

This is a good question. The two methods, i.e., VSE and AFOR, are very
similar. The two methods can be considered as an extension of FOR to
make it less sensitive to outliers by adapting the encoding to the value
distribution. To achieve this, the two methods are encoding a list of
values by
- partitioning it into "frames" (or sequence of consecutive integers) of
variable lengths,
- encoding each frame using a different "bit frame" (the minimum number
of bits required to encode any integer in the frame, and still be able
to distinguish them)
- relying on algorithms to automatically find a good list partitioning.

Apart from the minor differences in the implementation design (that I
will discuss later), the main difference is that VSE is optimised for
achieving a high compression rate and a fast decompression but
disregards the efficiency of compression, while AFOR is optimised for
achieving a high compression rate, a fast decompression but also a fast
compression speed. VSE is using a Dynamic Programming method to find the
*optimal partitioning* of a list (optimal in term of compression rate).
While this approach provides a higher compression rate than the one
proposed in AFOR, the complexity of such a partitioning algorithm is O(n
* k), with the term n being the number of values and the term k the size
of the larger frame, which might greatly impact the compression
performance. In AFOR, we use instead a local optimisation algorithm that
is less effective in term of compression rate but faster to compute.

In term of implementation details, there is a few differences.
1) VSE allows frames of length 1, 2, 4, 6, 8, 12, 16 and 32. The current
implementation of AFOR restrict the length of a frame to be a multiple
of 8 to to be aligned with the start and end of a byte boundary (and
also to minimise the number of loop-unrolled highly-optimised routines).
More precisely, AFOR-2 use three frame lengths: 8, 16 and 32.
2) To allow the *optimal partitioning* of a list, the original
implementation of VSE needs to operate on the full list. On the
contrary, AFOR has been developed to operate on small subsets of the
list, so that AFOR can be applied during incremental construction of the
compressed list (it does not require the full list, but works on small
block of 32 or more integers). However, we can think of applying VSE on
small subset, as in AFOR. In this case, VSE does not compute the optimal
partition of a list, but only the optimal partition of the subset of the
list.

VSE and AFOR encodes a frame in a similar way: first, a header (1 byte)
which provides the bit frame and the frames length, then the encoded frame.

So, as you can see, in essence, the two models are very similar. For the
background, I know well Fabrizio Silvestri (co-author of VSE), and he
was my PhD thesis examiner (the AFOR compression scheme is a chapter of
my thesis). The funny thing is that we come up with these two models at
the same time, this summer, without knowing we were working on something
similar ;o). However, he was more lucky than I am to publish his
findings before me.

I hope this answers to your question.
Feel free to ask if you have any other questions,
Regards,
--
Renaud Delbru

On 25/01/11 12:24, Renaud Delbru wrote:

> Hi Paul,
>
> This is a good question. The two methods, i.e., VSE and AFOR, are very
> similar. The two methods can be considered as an extension of FOR to
> make it less sensitive to outliers by adapting the encoding to the
> value distribution. To achieve this, the two methods are encoding a
> list of values by
> - partitioning it into "frames" (or sequence of consecutive integers)
> of variable lengths,
> - encoding each frame using a different "bit frame" (the minimum
> number of bits required to encode any integer in the frame, and still
> be able to distinguish them)
> - relying on algorithms to automatically find a good list partitioning.
>
> Apart from the minor differences in the implementation design (that I
> will discuss later), the main difference is that VSE is optimised for
> achieving a high compression rate and a fast decompression but
> disregards the efficiency of compression, while AFOR is optimised for
> achieving a high compression rate, a fast decompression but also a
> fast compression speed. VSE is using a Dynamic Programming method to
> find the *optimal partitioning* of a list (optimal in term of
> compression rate). While this approach provides a higher compression
> rate than the one proposed in AFOR, the complexity of such a
> partitioning algorithm is O(n * k), with the term n being the number
> of values and the term k the size of the larger frame, which might
> greatly impact the compression performance. In AFOR, we use instead a
> local optimisation algorithm that is less effective in term of
> compression rate but faster to compute.
>
> In term of implementation details, there is a few differences.
> 1) VSE allows frames of length 1, 2, 4, 6, 8, 12, 16 and 32. The
> current implementation of AFOR restrict the length of a frame to be a
> multiple of 8 to to be aligned with the start and end of a byte
> boundary (and also to minimise the number of loop-unrolled
> highly-optimised routines). More precisely, AFOR-2 use three frame
> lengths: 8, 16 and 32.
> 2) To allow the *optimal partitioning* of a list, the original
> implementation of VSE needs to operate on the full list. On the
> contrary, AFOR has been developed to operate on small subsets of the
> list, so that AFOR can be applied during incremental construction of
> the compressed list (it does not require the full list, but works on
> small block of 32 or more integers). However, we can think of applying
> VSE on small subset, as in AFOR. In this case, VSE does not compute
> the optimal partition of a list, but only the optimal partition of the
> subset of the list.
>
> VSE and AFOR encodes a frame in a similar way: first, a header (1
> byte) which provides the bit frame and the frames length, then the
> encoded frame.
>
> So, as you can see, in essence, the two models are very similar. For
> the background, I know well Fabrizio Silvestri (co-author of VSE), and
> he was my PhD thesis examiner (the AFOR compression scheme is a
> chapter of my thesis). The funny thing is that we come up with these
> two models at the same time, this summer, without knowing we were
> working on something similar ;o). However, he was more lucky than I am
> to publish his findings before me.
>
> I hope this answers to your question.
> Feel free to ask if you have any other questions,
> Regards,
> --
> Renaud Delbru


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2886:
--------------------------------

    Attachment: LUCENE-2886_simple64.patch

I pulled out the simple64 implementation here and adapted it to the bulkpostings branch.

Thanks for uploading this code Renaud, its great and the code is easy to work with. I hope to get some more of the codecs you wrote into the branch for testing.

I changed a few things that helped in benchmarking:
* the decoder uses relative gets instead of absolute
* we write #longs in the block header instead of #bytes (as its always long aligned, but smaller numbers)

But mainly, for this one I think we should change it to be a VariableIntBlock codec... right now it packs 128 integers into as few longs as possible, but this is wasteful for two reasons: it has to write a per-block byte header, and also wastes bits (e.g. in the case of a block of 128 1's).

With variableintblock, we could do this differently, e.g. read a fixed number of longs per-block (say 4 longs), and our block would then be variable between 4 and 240 integers depending upon data.


> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2886:
---------------------------------------

    Attachment: LUCENE-2886_simple64_varint.patch

New patch, including Robert's patch, and also adding Simple64 as a VarInt codec.  We badly need more testing of the VarInt cases, so it's great Simple64 came along (thanks Renaud!).

All tests pass w/ Simple64VarInt codec.

> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990143#comment-12990143 ]

Michael McCandless commented on LUCENE-2886:
--------------------------------------------

OK I committed the two new Simple64 codecs (to bulk branch).

> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990439#comment-12990439 ]

Robert Muir commented on LUCENE-2886:
-------------------------------------

We are still not using 2 of the simple64 selectors, but in the simple64 paper it discusses
using these wasted bits to represent 120 and 240 "all ones"... I think we should do this?


> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990509#comment-12990509 ]

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

Hi Michael, Robert,
great to hear that the code is useful, looking forward to see some benchmark.
I think the VarIntBlock approach is a good idea. Concerning the two unused "frame" codes, it will not cost too much to add them. This might be useful for the frequency inverted lists. However, I am not sure they will be used that much. In our experiments, we had a version of AFOR allowing frames of size 8, 16 and 32 integers with allOnes and allZeros. The gain was very minimal, in the order to 0.x% index size reduction, because these cases were occurring very rarely. But, this is still better than nothing. However, in the case of simple64, we are not talking about small frame (up to 32 integers), but frame of 120 to 240 integers. Therefore, I expect to see a drop of probability to encounter 120 or 240 consecutive ones. Maybe we can use them for more clever configurations such as
- inter-leaved sequences of 1 bit and 2 bits integers
- inter-leaved sequences of 2 bits and 3 bits integers
or something like this.
The best will be to do some tests to see which new configurations will make sense, like how many times a allOnes config is selected, or other configs, and choose which one to add.

> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Issue Comment Edited: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990509#comment-12990509 ]

Renaud Delbru edited comment on LUCENE-2886 at 2/4/11 10:42 AM:
----------------------------------------------------------------

Hi Michael, Robert,
great to hear that the code is useful, looking forward to see some benchmark.
I think the VarIntBlock approach is a good idea. Concerning the two unused "frame" codes, it will not cost too much to add them. This might be useful for the frequency inverted lists. However, I am not sure they will be used that much. In our experiments, we had a version of AFOR allowing frames of size 8, 16 and 32 integers with allOnes and allZeros. The gain was very minimal, in the order to 0.x% index size reduction, because these cases were occurring very rarely. But, this is still better than nothing. However, in the case of simple64, we are not talking about small frame (up to 32 integers), but frame of 120 to 240 integers. Therefore, I expect to see a drop of probability to encounter 120 or 240 consecutive ones. Maybe we can use them for more clever configurations such as
- inter-leaved sequences of 1 bit and 2 bits integers
- inter-leaved sequences of 2 bits and 3 bits integers
or something like this.
The best will be to do some tests to see which new configurations will make sense, like how many times a allOnes config is selected, or other configs, and choose which one to add. But this can be tedious task with only a limited benefit.

      was (Author: renaud.delbru):
    Hi Michael, Robert,
great to hear that the code is useful, looking forward to see some benchmark.
I think the VarIntBlock approach is a good idea. Concerning the two unused "frame" codes, it will not cost too much to add them. This might be useful for the frequency inverted lists. However, I am not sure they will be used that much. In our experiments, we had a version of AFOR allowing frames of size 8, 16 and 32 integers with allOnes and allZeros. The gain was very minimal, in the order to 0.x% index size reduction, because these cases were occurring very rarely. But, this is still better than nothing. However, in the case of simple64, we are not talking about small frame (up to 32 integers), but frame of 120 to 240 integers. Therefore, I expect to see a drop of probability to encounter 120 or 240 consecutive ones. Maybe we can use them for more clever configurations such as
- inter-leaved sequences of 1 bit and 2 bits integers
- inter-leaved sequences of 2 bits and 3 bits integers
or something like this.
The best will be to do some tests to see which new configurations will make sense, like how many times a allOnes config is selected, or other configs, and choose which one to add.
 

> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990534#comment-12990534 ]

Robert Muir commented on LUCENE-2886:
-------------------------------------

bq. Hi Michael, Robert,
great to hear that the code is useful, looking forward to see some benchmark.
I think the VarIntBlock approach is a good idea. Concerning the two unused "frame" codes, it will not cost too much to add them. This might be useful for the frequency inverted lists. However, I am not sure they will be used that much.

In the case of 240 1's, i was surprised to see this selector was used over 2% of the time
for the gov collection's doc file?

But still, for the all 1's case I'm not actually thinking about unstructured text so much...
in this case I am thinking about metadata fields and more structured data?


> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990535#comment-12990535 ]

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

{quote}
In the case of 240 1's, i was surprised to see this selector was used over 2% of the time
for the gov collection's doc file?
{quote}
our results were performed on the wikipedia dataset and blogs dataset. I don;t know what was our selection rate, I was just referring to the gain in overall compression rate.

{quote}
But still, for the all 1's case I'm not actually thinking about unstructured text so much...
in this case I am thinking about metadata fields and more structured data?
{quote}

Yes, this makes sense. In the context of SIREn (kind of simple xml node based inverted index) which is meant for indexing semi-structured data, the difference was more observable (mainly on the frequency and position files, as well as other structure node files).
This might be also useful on the document id file for very common terms (maybe for certain type of facets, with a very few number of values covering a large portion of the document collection).

> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990538#comment-12990538 ]

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 (AFOR-3 refers to AFOR-2 with special code for allOnes frames), was able to beat Rice on two datasets, and S-64 on one (but it was very close to Rice on the others):

DBpedia dataset: (structured version of wikipedia)

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames Dataset:

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Sindice Dataset: Very heterogeneous dataset containing hundred of thousands of web dataset

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to entity id (similar to doc id), Att and Val are structural node ids.

> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Issue Comment Edited: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990538#comment-12990538 ]

Renaud Delbru edited comment on LUCENE-2886 at 2/4/11 12:05 PM:
----------------------------------------------------------------

Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 (AFOR-3 refers to AFOR-2 with special code for allOnes frames), was able to beat Rice on two datasets, and S-64 on one (but it was very close to Rice on the others):

DBpedia dataset: (structured version of wikipedia)

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames Dataset:

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.216|0.142|0.143|0.143|0.143|0.929|

Sindice Dataset: Very heterogeneous dataset containing hundred of thousands of web dataset

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to entity id (similar to doc id), Att and Val are structural node ids.

      was (Author: renaud.delbru):
    Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 (AFOR-3 refers to AFOR-2 with special code for allOnes frames), was able to beat Rice on two datasets, and S-64 on one (but it was very close to Rice on the others):

DBpedia dataset: (structured version of wikipedia)

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.246|0.043|0.141|0.065|0.180|0.816|
|AFOR-2|0.229|0.039|0.132|0.059|0.167|0.758|
|AFOR-3|0.229|0.031|0.131|0.054|0.159|0.736|
|FOR|0.315|0.061|0.170|0.117|0.216|1.049|
|PFOR|0.317|0.044|0.155|0.070|0.205|0.946|
|Rice|0.240|0.029|0.115|0.057|0.152|0.708|
|S-64|0.249|0.041|0.133|0.062|0.171|0.791|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Geonames Dataset:

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|0.129|0.023|0.058|0.025|0.025|0.318|
|AFOR-2|0.123|0.023|0.057|0.024|0.024|0.307|
|AFOR-3|0.114|0.006|0.056|0.016|0.008|0.256|
|FOR|0.150|0.021|0.065|0.025|0.023|0.349|
|PFOR|0.154|0.019|0.057|0.022|0.023|0.332|
|Rice|0.133|0.019|0.063|0.029|0.021|0.327|
|S-64|0.147|0.021|0.058|0.023|0.023|0.329|
|VByte|0.264|0.162|0.222|0.222|0.245|1.335|

Sindice Dataset: Very heterogeneous dataset containing hundred of thousands of web dataset

||Method||Ent||Frq||Att||Val||Pos||Total||
|AFOR-1|2.578|0.395|0.942|0.665|1.014|6.537|
|AFOR-2|2.361|0.380|0.908|0.619|0.906|6.082|
|AFOR-3|2.297|0.176|0.876|0.530|0.722|5.475|
|FOR|3.506|0.506|1.121|0.916|1.440|8.611|
|PFOR|3.221|0.374|1.153|0.795|1.227|7.924|
|Rice|2.721|0.314|0.958|0.714|0.941|6.605|
|S-64|2.581|0.370|0.917|0.621|0.908|6.313|
|VByte|3.287|2.106|2.411|2.430|2.488|15.132|

Here, Ent refers to entity id (similar to doc id), Att and Val are structural node ids.
 

> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990543#comment-12990543 ]

Robert Muir commented on LUCENE-2886:
-------------------------------------

Thanks for those numbers Renaud... yes the cases you see in e.g. Geonames
was one example of what I was thinking: in general you might say someone
should be using "omitTFAP" to omit freqs and positions for these fields,
but they might not be able to do this, if they want to support e.g. phrase
queries like "washington hill". So if we can pack long streams of 1s with
freqs and positions I think this is probably useful for a lot of people.

Additionally for the .doc, i see its smaller in the AFOR-3 case too. Is
your "Ent" basically a measure of doc deltas? I'm confused exactly
what it is :) Because I would think if you take e.g. Geonames, the place
names in the dataset are not in random order but actually "batched" by
country for example, so you would have long streams of docdelta=1 for
country=Germany's postings.

I'm not saying we could rely upon this, but i do think in general lots
of people's docs aren't in completely random order, and its probably
common to see long streams of docdelta=1 in structured data for this reason?




> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990548#comment-12990548 ]

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

{quote}
So if we can pack long streams of 1s with
freqs and positions I think this is probably useful for a lot of people.
{quote}
Yes, if the overhead is minimal, it might not be an issue in certain cases.

{quote}
Additionally for the .doc, i see its smaller in the AFOR-3 case too. Is
your "Ent" basically a measure of doc deltas? I'm confused exactly
what it is
{quote}

Yes, Ent is jsut a delta representation of the id of the entity (which can be considered as the document id). It is just that I have changed the name of the concept, as SIREn is manipulating principally entity and not document. In my case, an entity is just a set of attribute-value pairs, similarly to a document in Lucene.

{quote}
Because I would think if you take e.g. Geonames, the place
names in the dataset are not in random order but actually "batched" by
country for example, so you would have long streams of docdelta=1 for
country=Germany's postings.
{quote}
I checked, and Geonames dataset was alphabetically sorted by url names:
http://sws.geonames.org/1/
http://sws.geonames.org/10/
...
as well as dbpedia and sindice.

So, yes, this might have (good) consequences on the docdelta list for certain datasets such as geonames. And especially when indexing semi-structured data, as the schema of the data in one dataset is generally identical across entities/documents. therefore it is likely to see long runs of 1 for certain terms or schema terms.

> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990555#comment-12990555 ]

Michael McCandless commented on LUCENE-2886:
--------------------------------------------

I test perf of BulkVInt vs Simple64VarInt (Linux, 10M wikipedia index, NIOFSDir, multi-seg, java -server -Xbatch -Xmx2g -Xms2g):

||Query||QPS bulkvint||QPS simple64varint||Pct diff||||
|+united +states|19.46|16.47|{color:red}-15.4%{color}|
|"united states"|11.92|10.09|{color:red}-15.3%{color}|
|unit~0.5|17.35|16.96|{color:red}-2.2%{color}|
|"united states"~3|5.70|5.72|{color:green}0.3%{color}|
|united~0.6|8.24|8.31|{color:green}0.8%{color}|
|doctitle:.*[Uu]nited.*|5.36|5.48|{color:green}2.1%{color}|
|united~0.75|12.45|12.75|{color:green}2.4%{color}|
|unit~0.7|27.37|28.08|{color:green}2.6%{color}|
|spanFirst(unit, 5)|156.48|162.78|{color:green}4.0%{color}|
|united states|15.35|16.07|{color:green}4.7%{color}|
|spanNear([unit, state], 10, true)|42.84|46.11|{color:green}7.6%{color}|
|states|48.13|52.03|{color:green}8.1%{color}|
|unit*|32.80|37.07|{color:green}13.0%{color}|
|doctimesecnum:[10000 TO 60000]|12.38|14.16|{color:green}14.4%{color}|
|uni*|18.66|21.53|{color:green}15.3%{color}|
|u*d|9.66|11.17|{color:green}15.7%{color}|
|un*d|19.00|22.04|{color:green}16.0%{color}|
|+nebraska +states|101.66|141.83|{color:green}39.5%{color}|


> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990560#comment-12990560 ]

Michael McCandless commented on LUCENE-2886:
--------------------------------------------

Same as last perf test, except MMapDir instead:

||Query||QPS bulkvint||QPS simple64varint||Pct diff||||
|"united states"|12.81|9.95|{color:red}-22.3%{color}|
|"united states"~3|6.04|5.35|{color:red}-11.4%{color}|
|+united +states|19.18|17.25|{color:red}-10.1%{color}|
|united~0.6|10.35|10.25|{color:red}-0.9%{color}|
|united states|16.33|16.29|{color:red}-0.2%{color}|
|doctimesecnum:[10000 TO 60000]|14.18|14.24|{color:green}0.4%{color}|
|unit*|36.69|37.20|{color:green}1.4%{color}|
|united~0.75|14.71|14.92|{color:green}1.4%{color}|
|unit~0.5|22.15|22.55|{color:green}1.8%{color}|
|uni*|21.27|21.73|{color:green}2.2%{color}|
|spanFirst(unit, 5)|166.18|171.96|{color:green}3.5%{color}|
|doctitle:.*[Uu]nited.*|6.24|6.46|{color:green}3.6%{color}|
|unit~0.7|34.18|35.49|{color:green}3.8%{color}|
|spanNear([unit, state], 10, true)|47.16|49.53|{color:green}5.0%{color}|
|states|50.71|53.52|{color:green}5.5%{color}|
|un*d|21.51|22.96|{color:green}6.7%{color}|
|u*d|11.18|12.22|{color:green}9.3%{color}|
|+nebraska +states|127.16|161.43|{color:green}26.9%{color}|


> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2886) Adaptive Frame Of Reference

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990568#comment-12990568 ]

Renaud Delbru commented on LUCENE-2886:
---------------------------------------

Hi Michael,
the first results are not that impressive.
* Could you tell me what is BulkVInt ? Is it the simple VInt codec implemented on top of the Bulk branch ?
* What the difference between '+united +states' and '+nebraska +states' ? Is nebraska a low frequency term ?

> Adaptive Frame Of Reference
> ----------------------------
>
>                 Key: LUCENE-2886
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2886
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Codecs
>            Reporter: Renaud Delbru
>             Fix For: 4.0
>
>         Attachments: LUCENE-2886_simple64.patch, LUCENE-2886_simple64_varint.patch, lucene-afor.tar.gz
>
>
> We could test the implementation of the Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
> I am providing the source code of its implementation. Some work needs to be done, as this implementation is working on the old lucene-1458 branch.
> I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and of Simple64 (simple family codec working on 64bits word) that has been used in the experiments in [1].
> [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

12