[jira] Created: (LUCENE-2426) change sort order to binary order

JIRA jira@apache.org
change sort order to binary order
---------------------------------

                 Key: LUCENE-2426
                 URL: https://issues.apache.org/jira/browse/LUCENE-2426
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 4.0.0
            Reporter: Robert Muir
             Fix For: 4.0.0


Since flexible indexing, terms are represented as byte[], but for backwards-compatibility reasons they are not sorted as byte[]; instead they are sorted as if they were char[] (UTF-16 code unit order).

I think it's time to look at sorting terms as byte[]. This would yield the following improvements:
* Terms are more opaque by default: they are byte[] and sort as byte[]. I think this would make Lucene friendlier to customizations.
* Numerics and collation are then free to use their own encoding (the full byte) rather than avoiding certain bits to remain compatible with char[] sort order.
* Automaton gets simpler because, as in LUCENE-2265, it uses byte[] too, and currently has special hacks because terms are sorted as char[].
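
To illustrate the difference between the two orders, here is a minimal Java sketch. It is not Lucene code; the class and method names are made up for this example. It compares one BMP character above the surrogate range with one supplementary character, first as UTF-16 code units via String.compareTo and then as unsigned UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

class TermOrderDemo {
    // Compares two byte arrays as unsigned bytes, i.e. plain binary order.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // A term that is a strict prefix of another sorts first.
        return a.length - b.length;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String bmp = "\uFB01";         // U+FB01, a single UTF-16 code unit
        String supp = "\uD835\uDD04";  // U+1D504, encoded as a surrogate pair

        // char[] order: the high surrogate 0xD835 is less than 0xFB01,
        // so the supplementary character sorts first.
        System.out.println(supp.compareTo(bmp) < 0);                  // prints true

        // byte[] order: U+FB01 starts with 0xEF and U+1D504 with 0xF0,
        // so the BMP character sorts first.
        System.out.println(compareBytes(utf8(bmp), utf8(supp)) < 0);  // prints true
    }
}
```

The two orders only disagree when a supplementary character is compared against a BMP character in the range U+E000 to U+FFFF; for all other text they agree.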


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[jira] Commented: (LUCENE-2426) change sort order to binary order

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863137#action_12863137 ]

Robert Muir commented on LUCENE-2426:
-------------------------------------

By the way: as mentioned above, as far as numerics and collation go,
both of these today avoid any of the parts of Unicode that are sensitive to such a sort order change.

So these are already "backwards compatible" in the sense that numeric fields or
collated fields will sort the same way in either UTF-8/UTF-32 byte[] order or UTF-16 char[] order.



[jira] Commented: (LUCENE-2426) change sort order to binary order

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863146#action_12863146 ]

Yonik Seeley commented on LUCENE-2426:
--------------------------------------

big +1
the more we get to pure bytes, the better IMO.


[jira] Commented: (LUCENE-2426) change sort order to binary order

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863154#action_12863154 ]

Robert Muir commented on LUCENE-2426:
-------------------------------------

I think most apps will be unaffected by this change (if the preflex index converter can sort the terms in binary, too).

But we need to look out for some traps:
* Things that use String.compareTo are dangerous, as it uses UTF-16 code unit order (e.g. I see a binary search with this in FieldCache).
* In general, assuming a term can be a String at all is problematic with byte[] terms, e.g. if numerics want to use the full byte.
So we should think about changing Term, too.

The best way to avoid problems is to stick with byte[] as much as possible and try to avoid using String for terms...
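
The binary search trap can be sketched as follows. This is an illustration only, not the FieldCache code; the class, the comparator and the sample terms are assumptions made for the example. The array is sorted in code point (binary) order, so a search that relies on the natural String order can miss a term that is actually present.

```java
import java.util.Arrays;
import java.util.Comparator;

class BinarySearchTrap {
    // Orders strings by Unicode code point, which matches UTF-8 byte order.
    static final Comparator<String> CODE_POINT_ORDER = (a, b) -> {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(i);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
        }
        // All shared code points are equal, so the shorter string sorts first.
        return Integer.compare(a.length(), b.length());
    };

    // Sample terms, already sorted in code point (binary) order:
    // U+0061, U+FB01, U+1D504.
    static final String[] TERMS = {"a", "\uFB01", "\uD835\uDD04"};

    public static void main(String[] args) {
        String key = "\uD835\uDD04";

        // Natural String order compares UTF-16 code units, which does not
        // match how TERMS is sorted, so the lookup fails.
        System.out.println(Arrays.binarySearch(TERMS, key) >= 0);              // prints false

        // Searching with the same order the array was sorted in finds the term.
        System.out.println(Arrays.binarySearch(TERMS, key, CODE_POINT_ORDER)); // prints 2
    }
}
```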


[jira] Commented: (LUCENE-2426) change sort order to binary order

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863297#action_12863297 ]

Michael McCandless commented on LUCENE-2426:
--------------------------------------------

Big +1 too :)

For FieldCache, we need to do LUCENE-2380 (creates a BytesRef field cache) and switch Lucene to use it -- I'll add a dependency.


[jira] Updated: (LUCENE-2426) change sort order to binary order

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2426:
---------------------------------------

    Attachment: LUCENE-2426.patch

Checkpointing my current state here... it should compile but tests are probably failing from the mods in preflex codec.

> change sort order to binary order
> ---------------------------------
>
>                 Key: LUCENE-2426
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2426
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 3.1
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-2426.patch
>
>
> Since flexible indexing, terms are now represented as byte[], but for backwards compatibility reasons, they are not sorted as byte[], but instead as if they were char[].
> I think it's time to look at sorting terms as byte[]... this would yield the following improvements:
> * terms are more opaque by default, they are byte[] and sort as byte[]. I think this would make Lucene friendlier to customizations.
> * numerics and collation are then free to use their own encoding (full byte) rather than avoiding the use of certain bits to remain compatible with char[] sort order.
> * automaton gets simpler because as in LUCENE-2265, it uses byte[] too, and has special hacks because terms are sorted as char[]

[jira] Updated: (LUCENE-2426) change sort order to binary order

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2426:
--------------------------------

    Attachment: LUCENE-2426_automaton.patch

> change sort order to binary order
> ---------------------------------
>
>                 Key: LUCENE-2426
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2426
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 3.1
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-2426.patch, LUCENE-2426_automaton.patch
>
>
> Since flexible indexing, terms are now represented as byte[], but for backwards compatibility reasons, they are not sorted as byte[], but instead as if they were char[].
> I think it's time to look at sorting terms as byte[]... this would yield the following improvements:
> * terms are more opaque by default, they are byte[] and sort as byte[]. I think this would make Lucene friendlier to customizations.
> * numerics and collation are then free to use their own encoding (full byte) rather than avoiding the use of certain bits to remain compatible with char[] sort order.
> * automaton gets simpler because as in LUCENE-2265, it uses byte[] too, and has special hacks because terms are sorted as char[]

[jira] Updated: (LUCENE-2426) change sort order to binary order

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2426:
---------------------------------------

    Attachment: LUCENE-2426.patch

Attached patch, changing term sort order to Unicode code point order!  All tests pass.  I fixed the preflex codec to seek around surrogates, and then back again, so that preflex indices also sort properly; it's rather hairy... I added a new randomized test that writes a preflex segment (just the terms dict) with random terms and then asserts the order.
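
As a rough illustration of why binary order and Unicode code point order are the same thing for UTF-8 terms, here is a small randomized check in Java. It is not the test from the patch (it does not write a preflex segment); the class and method names are hypothetical, and it assumes Java 9 or later for Arrays.compareUnsigned and Arrays.compare.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class RandomTermOrderCheck {
    // Builds a random term of 1 to 5 valid code points; surrogates are skipped
    // so that every term encodes to well-formed UTF-8.
    static String randomTerm(Random random) {
        int length = 1 + random.nextInt(5);
        StringBuilder sb = new StringBuilder();
        int added = 0;
        while (added < length) {
            int cp = random.nextInt(Character.MAX_CODE_POINT + 1);
            if (cp < Character.MIN_SURROGATE || cp > Character.MAX_SURROGATE) {
                sb.appendCodePoint(cp);
                added++;
            }
        }
        return sb.toString();
    }

    // Sorts the same random terms by unsigned UTF-8 bytes and by code point
    // and reports whether both sorts produce the same sequence.
    static boolean ordersAgree(long seed, int count) {
        Random random = new Random(seed);
        List<String> byBytes = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            byBytes.add(randomTerm(random));
        }
        List<String> byCodePoints = new ArrayList<>(byBytes);

        byBytes.sort((a, b) -> Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8)));
        byCodePoints.sort((a, b) -> Arrays.compare(
                a.codePoints().toArray(),
                b.codePoints().toArray()));

        return byBytes.equals(byCodePoints);
    }

    public static void main(String[] args) {
        System.out.println(ordersAgree(42L, 1000));
    }
}
```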

> change sort order to binary order
> ---------------------------------
>
>                 Key: LUCENE-2426
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2426
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 3.1
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-2426.patch, LUCENE-2426.patch, LUCENE-2426_automaton.patch
>
>
> Since flexible indexing, terms are now represented as byte[], but for backwards compatibility reasons, they are not sorted as byte[], but instead as if they were char[].
> I think it's time to look at sorting terms as byte[]... this would yield the following improvements:
> * terms are more opaque by default, they are byte[] and sort as byte[]. I think this would make Lucene friendlier to customizations.
> * numerics and collation are then free to use their own encoding (full byte) rather than avoiding the use of certain bits to remain compatible with char[] sort order.
> * automaton gets simpler because as in LUCENE-2265, it uses byte[] too, and has special hacks because terms are sorted as char[]

[jira] Commented: (LUCENE-2426) change sort order to binary order

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880974#action_12880974 ]

Robert Muir commented on LUCENE-2426:
-------------------------------------

How to deal with Term?

I really don't like that Term.compareTo uses String.compareTo. For example, MultiTermQuery uses this in TopTermsBooleanQueryRewrite for comparing terms in its priority queue.

I don't think it should block this patch either, but we should at least open a second issue to figure out what to do about this.
Term needs to either go away or use BytesRef with the codec's comparator in cases like this; otherwise some things like FuzzyQuery will be technically wrong (I should add a test for this too, I think).



[jira] Commented: (LUCENE-2426) change sort order to binary order

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881220#action_12881220 ]

Michael McCandless commented on LUCENE-2426:
--------------------------------------------

> How to deal with Term?

Maybe we should keep it, but do a hard cutover of its .text from String to BytesRef, and also change its .compareTo to compare text by Unicode code point order?

I agree we should do this as a follow-on issue; in fact I think another issue is already open.

Note, though, that field names still sort by UTF-16 (String.compareTo) order.
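
A minimal sketch of what such a cutover might look like, in plain Java. ByteTerm is a hypothetical class invented for this example, not Lucene's Term, and it uses a raw byte[] in place of BytesRef to stay self-contained. The field name is compared with String.compareTo, and the text is compared as unsigned bytes, which for UTF-8 is Unicode code point order.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a Term whose text is stored as bytes.
final class ByteTerm implements Comparable<ByteTerm> {
    final String field;
    final byte[] bytes;

    ByteTerm(String field, String text) {
        this.field = field;
        this.bytes = text.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public int compareTo(ByteTerm other) {
        // Field names keep UTF-16 (String.compareTo) order.
        int cmp = field.compareTo(other.field);
        if (cmp != 0) {
            return cmp;
        }
        // Term text is compared as unsigned bytes (binary order).
        int n = Math.min(bytes.length, other.bytes.length);
        for (int i = 0; i < n; i++) {
            int diff = (bytes[i] & 0xFF) - (other.bytes[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return bytes.length - other.bytes.length;
    }

    public static void main(String[] args) {
        ByteTerm bmp = new ByteTerm("body", "\uFB01");         // U+FB01
        ByteTerm supp = new ByteTerm("body", "\uD835\uDD04");  // U+1D504
        // Within the same field, the lower code point sorts first.
        System.out.println(bmp.compareTo(supp) < 0);  // prints true
    }
}
```

A real implementation would also need equals and hashCode consistent with compareTo, and would have to decide whether the comparator is fixed or supplied by the codec, as discussed above.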


[jira] Resolved: (LUCENE-2426) change sort order to binary order

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-2426.
----------------------------------------

    Resolution: Fixed

