[jira] Created: (LUCENE-2514) Change Term to use bytes


[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885062#action_12885062 ]

Michael McCandless commented on LUCENE-2514:
--------------------------------------------

+1

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-MTQPagedBytes.patch, LUCENE-2514-MTQPagedBytes.patch, LUCENE-2514-MTQPagedBytes.patch, LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> In LUCENE-2426, the sort order was changed to code point order.
> Unfortunately, Term still uses String internally and, more importantly, its compareTo() uses the wrong order (UTF-16).
> So MultiTermQuery etc. (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes, such as numerics, instead of using
> strange string encodings.
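
To make the ordering problem in the description concrete, here is a minimal JDK-only sketch. It is not taken from any of the attached patches; the class name `TermOrderDemo` and the helpers `compareUnsigned` and `utf8` are made up for illustration. It compares one character from the upper end of the Basic Multilingual Plane with one supplementary character, first with String.compareTo() (UTF-16 code unit order) and then by unsigned UTF-8 bytes, which agrees with code point order.

```java
import java.nio.charset.StandardCharsets;

class TermOrderDemo {
    // Compare two byte arrays as sequences of unsigned bytes.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                 // U+FF5E, a single UTF-16 code unit
        String supplementary = "\uD834\uDD1E"; // U+1D11E, stored as a surrogate pair

        // UTF-16 order: 0xFF5E is compared with the high surrogate 0xD834,
        // so the BMP character sorts AFTER the supplementary one.
        System.out.println(Integer.signum(bmp.compareTo(supplementary)));

        // Code point order: U+FF5E sorts BEFORE U+1D11E.
        System.out.println(Integer.signum(
            Integer.compare(bmp.codePointAt(0), supplementary.codePointAt(0))));

        // Unsigned UTF-8 byte order gives the same answer as code point order.
        System.out.println(Integer.signum(
            compareUnsigned(utf8(bmp), utf8(supplementary))));
    }
}
```

The two orders only disagree when a character in the range U+E000 to U+FFFF meets a supplementary character, which is presumably why the bug is easy to miss in tests that use mostly ASCII terms.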

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12885065#action_12885065 ]

Robert Muir commented on LUCENE-2514:
-------------------------------------

I don't have a real computer for a few days, so take it if you want!


[jira] Assigned: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler reassigned LUCENE-2514:
-------------------------------------

    Assignee: Uwe Schindler

I'll take it and will commit it tomorrow.


[jira] Updated: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2514:
----------------------------------

    Attachment: LUCENE-2514.patch

Committed this patch in revision 960484.

I'm keeping this open, as more improvements may be added (e.g. TermRangeQuery).


[jira] Updated: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2514:
--------------------------------

    Attachment: LUCENE-2514_qp.patch

Now that Term uses bytes, and token streams can encode terms to bytes however they want with TermToBytesRefAttribute, it makes sense for the query parsers to consume bytes like the indexer does, and to build terms without an intermediate String.

This way non-Unicode terms (e.g. collation keys) work as expected.
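
As a rough illustration of why an intermediate String is a problem for non-Unicode terms, the following JDK-only sketch pushes term bytes through a String and back. It does not use Lucene or the query parser code from the patch; the class `StringRoundTripDemo`, the helper `viaString`, and the two-byte "binary term" are assumptions invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StringRoundTripDemo {
    // Decode term bytes as UTF-8 text and encode them again, as a
    // String-based code path might do between analysis and Term creation.
    static byte[] viaString(byte[] termBytes) {
        String text = new String(termBytes, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // An ordinary Unicode term survives the round trip unchanged.
        byte[] unicodeTerm = "lucene".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(unicodeTerm, viaString(unicodeTerm)));

        // Hypothetical binary term: 0xC3 must be followed by a continuation
        // byte in UTF-8, so this sequence is not valid UTF-8 and the decoder
        // substitutes a replacement character for the malformed byte.
        byte[] binaryTerm = {(byte) 0xC3, (byte) 0x28};
        System.out.println(Arrays.equals(binaryTerm, viaString(binaryTerm)));
    }
}
```

The first comparison succeeds and the second fails, so a term built through the String path would no longer match the bytes that were indexed. Passing the bytes straight through avoids that mismatch.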

This patch updates the query parsers, except for contrib/queryparser (a bigger change that will require API changes) and the range query building in AnalyzingQueryParser (we need to fix TermRangeQuery first).

All tests pass.


[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890509#action_12890509 ]

Robert Muir commented on LUCENE-2514:
-------------------------------------

I'd like to commit this queryparser patch tomorrow if no one objects. Then I think we should look at range query, etc.


[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890630#action_12890630 ]

Michael McCandless commented on LUCENE-2514:
--------------------------------------------

+1 to commit

This would also mean the BOCU-1 encoding could be used as a drop-in with QueryParser for basic (Term, Phrase) queries, right?


[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890646#action_12890646 ]

Robert Muir commented on LUCENE-2514:
-------------------------------------

bq. This would also mean the BOCU-1 encoding could be used drop-in w/ QueryParser for basic (Term, Phrase) queries right?

Yes, they should then work (or there is a bug!)


[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12890740#action_12890740 ]

Robert Muir commented on LUCENE-2514:
-------------------------------------

Committed LUCENE-2514_qp.patch in revision 966254.


[jira] Updated: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2514:
--------------------------------

    Attachment: LUCENE-2514_collatedrange.patch

In order to move forward with collation-keys-as-bytes and other improvements, we need to fix TermRangeQuery.
But this is difficult while the String-only collation support is mixed into the byte-order TermRangeQuery...

As discussed previously on this issue, here is a patch that splits the collation support out into a separate CollatedTermRangeQuery/Filter.



[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895579#action_12895579 ]

Robert Muir commented on LUCENE-2514:
-------------------------------------

By the way, I was thinking it would be nice to move this slow CollatedTermRangeQuery stuff either out of Lucene altogether or at least into contrib/queries.

We could make things even better by removing QueryParser's get/setRangeCollator methods.
In their place, it could have something like a boolean 'analyzeRangeQueries'.
It could then analyze the endpoints (producing byte collation keys) and use a regular, fast term range query.

I think it's good to support collation order for people who want it, but we should make it easy to do things the fast way.
Right now we make it easy to do things the slow way and hard to do them fast.
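
The "analyze the endpoints into byte collation keys" idea can be sketched with the JDK's own java.text.Collator. This is only an illustration under stated assumptions: it does not use Lucene's collation analyzers, and the class `CollationKeyOrderDemo` and its helpers are invented here. The point it shows is that once strings are turned into collation key bytes, a plain unsigned byte comparison reproduces the collator's order, so no Collator is needed at comparison time.

```java
import java.text.Collator;
import java.util.Locale;

class CollationKeyOrderDemo {
    // Compare two byte arrays as sequences of unsigned bytes.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Turn a string into its binary collation key for the given collator.
    static byte[] key(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray();
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        String a = "apple";
        String b = "Banana";

        // Plain String order puts "Banana" first because 'B' (0x42) < 'a' (0x61).
        System.out.println(Integer.signum(a.compareTo(b)));

        // The collator puts "apple" first ...
        System.out.println(Integer.signum(collator.compare(a, b)));

        // ... and so does a byte comparison of the two collation keys.
        System.out.println(Integer.signum(
            compareUnsigned(key(collator, a), key(collator, b))));
    }
}
```

With keys like these indexed as terms, a range query only has to compare bytes per term, which is the "fast way" referred to above.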



[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895624#action_12895624 ]

Michael McCandless commented on LUCENE-2514:
--------------------------------------------

bq. by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene altogether or at least into contrib/queries.

+1

I agree we have it backwards now.  The "obvious" approach should be the performant one.


[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895630#action_12895630 ]

Uwe Schindler commented on LUCENE-2514:
---------------------------------------

bq. by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene altogether or at least into contrib/queries.

+1

By the way, the BytesRef vs. String problem is not yet solved for the core TermRangeQuery (TRQ). I would prefer to handle it as in NumericRangeQuery/FieldCacheRangeFilter (NRQ/FCRF), with static factory methods. Then it's also consistent across all the range query classes.


[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895674#action_12895674 ]

Steven Rowe commented on LUCENE-2514:
-------------------------------------

bq. by the way, i was thinking it would be nice to really move this slow collatedtermrangequery stuff either out of lucene altogether or at least into contrib/queries.

+1


[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895700#action_12895700 ]

Robert Muir commented on LUCENE-2514:
-------------------------------------

bq. By the way, the problem BytesRef vs. String is not yet solved for core TRQ. I would prefer to do it like for NRQ/FCRF with static factory methods. Then it's also consistent across all RQ parts.

Yes, I know! I was leaving this for you, but if you have no time, I can take care of it.
Once that is done too, I think I can finally commit LUCENE-2551!

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>            Assignee: Uwe Schindler
>         Attachments: LUCENE-2514-MTQPagedBytes.patch, LUCENE-2514-MTQPagedBytes.patch, LUCENE-2514-MTQPagedBytes.patch, LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514_collatedrange.patch, LUCENE-2514_qp.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[jira] Updated: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2514:
--------------------------------

    Attachment: LUCENE-2514_collatedrange.patch

Just checkpointing progress; here's my latest patch.

Here I moved the slow functionality (range, sort) out of core and into contrib/queries.
So TermRangeQuery just does byte comparison, nothing fancy.
Additionally, TermRangeQuery's API is changed to be more like NumericRangeQuery's, with newStringRange and newByteRange.

TODO:
* QP's newRangeQuery args should be changed to BytesRef, and newRangeQuery should build ranges with newByteRange.
* contrib/qp needs a new attribute and some other work, and some of the other query parsers need more changes too.
* Need to add tests (the ones I removed from core) for SlowCollatedRangeQuery and friends.
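
The factory style mentioned above can be sketched roughly as follows. This is a hypothetical stand-in written in plain Java, not the class from the patch: the name `ByteRangeSketch`, the parameter lists, the private constructor, and the `accepts` method are all assumptions made for illustration. Only the factory names `newStringRange` and `newByteRange` come from the comment. The sketch stores bounds as bytes, converts string bounds to UTF-8 in the string factory, and decides membership with a plain unsigned byte comparison.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a byte-based term range; NOT Lucene's TermRangeQuery. */
final class ByteRangeSketch {
    private final byte[] lower;          // null means the lower bound is open
    private final byte[] upper;          // null means the upper bound is open
    private final boolean includeLower;
    private final boolean includeUpper;

    // In this sketch, callers go through the named factories below.
    private ByteRangeSketch(byte[] lower, byte[] upper,
                            boolean includeLower, boolean includeUpper) {
        this.lower = lower;
        this.upper = upper;
        this.includeLower = includeLower;
        this.includeUpper = includeUpper;
    }

    // Factory for callers that already have raw bytes (for example, encoded numerics).
    static ByteRangeSketch newByteRange(byte[] lower, byte[] upper,
                                        boolean includeLower, boolean includeUpper) {
        return new ByteRangeSketch(lower, upper, includeLower, includeUpper);
    }

    // Factory for callers with strings; bounds are converted to UTF-8 once, up front.
    static ByteRangeSketch newStringRange(String lower, String upper,
                                          boolean includeLower, boolean includeUpper) {
        return new ByteRangeSketch(utf8(lower), utf8(upper), includeLower, includeUpper);
    }

    private static byte[] utf8(String s) {
        return s == null ? null : s.getBytes(StandardCharsets.UTF_8);
    }

    // True if the term falls inside the range under unsigned byte order.
    boolean accepts(byte[] term) {
        if (lower != null) {
            int c = compare(term, lower);
            if (c < 0 || (c == 0 && !includeLower)) {
                return false;
            }
        }
        if (upper != null) {
            int c = compare(term, upper);
            if (c > 0 || (c == 0 && !includeUpper)) {
                return false;
            }
        }
        return true;
    }

    // Lexicographic comparison treating each byte as unsigned.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Open-ended range: everything strictly greater than "b".
        ByteRangeSketch range = ByteRangeSketch.newStringRange("b", null, false, true);
        System.out.println(range.accepts("c".getBytes(StandardCharsets.UTF_8))); // prints true
        System.out.println(range.accepts("b".getBytes(StandardCharsets.UTF_8))); // prints false
    }
}
```

One reason to prefer distinctly named factories over overloads: in Java, two overloads that differ only in taking `String` versus `byte[]` bounds make a call with literal `null` bounds ambiguous to the compiler, whereas `newStringRange(null, ...)` and `newByteRange(null, ...)` each resolve to a single method.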



[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900661#action_12900661 ]

Michael McCandless commented on LUCENE-2514:
--------------------------------------------

Patch looks good Robert!

I like the TermRangeQuery/Filter.newStringRange static factory.  You need to add the Slow prefix to the class names in MIGRATE.txt, and also mention that the slow collated comparator has moved.


[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900663#action_12900663 ]

Uwe Schindler commented on LUCENE-2514:
---------------------------------------

Yah, the factory makes it easy for new users to create string ranges (as one expects, like with NRQ, NRF, FCRF), but as the query itself works on BytesRef, its ctor takes BytesRef. With the static factory, you also avoid the compile errors that occur when you pass null as a bound.


[jira] Issue Comment Edited: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900663#action_12900663 ]

Uwe Schindler edited comment on LUCENE-2514 at 8/20/10 7:00 AM:
----------------------------------------------------------------

Yah, the factory makes it easy for new users to create string ranges (as one expects, like with NRQ, NRF, FCRF), but as the query itself works on BytesRef, its ctor takes BytesRef. With the static factory, you also avoid the compile errors that occur when you pass null as a bound.

Do we need a LessString() or should we completely remove the useless static factories named Less in the filter?



[jira] Commented: (LUCENE-2514) Change Term to use bytes

Hudson (Jira)
In reply to this post by Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900673#action_12900673 ]

Robert Muir commented on LUCENE-2514:
-------------------------------------

bq. I like the TermRangeQuery/Filter.newStringRange static factory. You need to add the Slow prefix to the class names in MIGRATE.txt, and also mention that the slow collated comparator has moved.

Yes, I forgot to put this on my TODO list!

bq. or should we completely remove the useless static factories named Less in the filter?

+1. I didn't see these used anywhere. If we aren't going to remove them, then we should at least deprecate them, IMO.
