[jira] Created: (LUCENE-2514) Change Term to use bytes

classic Classic list List threaded Threaded
61 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
Change Term to use bytes
------------------------

                 Key: LUCENE-2514
                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
             Project: Lucene - Java
          Issue Type: Task
          Components: Search
    Affects Versions: 4.0
            Reporter: Robert Muir


in LUCENE-2426, the sort order was changed to codepoint order.

unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
So MultiTermQuery, etc (especially its priority queues) are currently wrong.

By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
strange string encodings.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2514:
--------------------------------

    Attachment: LUCENE-2514.patch

attached is one option: use bytesref behind the scenes but also support String ctors like we do today.

i tried the 'hard cutover' mike suggested, but this is a massive change and I think typically users will just be using String.

personally I don't see the harm in supporting Strings this way, any perf-sensitive stuff creating a lot of Term objects (e.g. MultiTermQuery) should be using BytesRef anyway.

one test fails: the preflex TestSurrogates... does this test create terms with unpaired surrogates? If so this would explain the failures I think.

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882239#action_12882239 ]

Michael McCandless commented on LUCENE-2514:
--------------------------------------------

bq. one test fails: the preflex TestSurrogates... does this test create terms with unpaired surrogates? If so this would explain the failures I think.

Oh man not that one!!

It is not supposed to create unpaired surrogates (it uses _TestUtil.makeRandomUnicodeString).  I'll dig...

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882241#action_12882241 ]

Robert Muir commented on LUCENE-2514:
-------------------------------------

its my fault. preflex codec still uses Term.compareTo (and this is expected) in several places.

so, i need to add back UTF8SortedAsUTF16Comparator to bytesref, and add a utf-16 compare to Term that uses it for the old behavior (or accomplish the equiv logic somewhere else)

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2514:
--------------------------------

    Attachment: LUCENE-2514.patch

i tried to fix the preflex problem like this, but it didnt work... maybe something (sort or other) is still relying on the old utf-16 compareTo() that i didnt find????

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882252#action_12882252 ]

Michael McCandless commented on LUCENE-2514:
--------------------------------------------

Ahh sorry I do indeed seek to an unpaired high surrogate -- easy to fix (pair it up w/ minimum low surrogate).

But, there's another problem, which is that the pre-flex codec uses Term.compareTo, and it needs that to be based on UTF16.  We can just fix the preflex codec to use its own (UTF8inUTF16order) comparator.  I *think* once we fix those two the test should pass again (crossing fingers...).

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-2514:
---------------------------------------

    Attachment: LUCENE-2514-surrogates-dance.patch

Attached patch to fix (I think) the unpaired surrogate passed to Term... not yet tested.

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2514:
--------------------------------

    Attachment: LUCENE-2514.patch

ok i combined my patch with yours, and fixed a typo (thats what i get for copy/paste)...

i think it might be ok

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882272#action_12882272 ]

Yonik Seeley commented on LUCENE-2514:
--------------------------------------

Looks good!
Might want to just add a doc on the Term constructor that the provided BytesRef is not copied, and should not be modified after construction (i.e. if you create a Term w/ the BytesRef provided by a TermsEnum or something, you're in trouble).

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882276#action_12882276 ]

Robert Muir commented on LUCENE-2514:
-------------------------------------

{quote}
Might want to just add a doc on the Term constructor that the provided BytesRef is not copied, and should not be modified after construction (i.e. if you create a Term w/ the BytesRef provided by a TermsEnum or something, you're in trouble).
{quote}

Yeah, I agree. I'd also like to improve some of the perf sensitive spots to never create strings at all before committing (e.g. MultiTermQuery rewrites).


> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2514:
--------------------------------

    Attachment: LUCENE-2514.patch

here's an updated patch with yonik's suggested note (might need wording changes).

Also i started converting some various code to use the bytes(), queries and such...


> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882300#action_12882300 ]

Robert Muir commented on LUCENE-2514:
-------------------------------------

Mike, can you review some of the preflex changes to bytes() in the patch?
I wonder if in some of these places we even need 'scratchBytesRef' at all now...

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882312#action_12882312 ]

Michael McCandless commented on LUCENE-2514:
--------------------------------------------

preflex changes look good.

I think with this we can eliminate the scratchBytesRef entirely in PreFlexFields and instead just use termEnum.term().bytes()!


> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882315#action_12882315 ]

Uwe Schindler commented on LUCENE-2514:
---------------------------------------

Robert: I can take MTQ tomorrow. I think we can remove the whole backwards stuff from MTQ and change completely to BytesRef (internally). This makes the steps TermsEnum (bytes) -> TermCollector -> TermQuery which converts all the time simplier. The collector abstract class in the MTQ rewrites will be much nicer.
I can also remove the rest of pre-BoostAttribute stuff from TopTermsRewrite.

I will go to sleep now, tomorrow more...

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882318#action_12882318 ]

Robert Muir commented on LUCENE-2514:
-------------------------------------

Uwe thanks, I would prefer if you did MTQ too.

I agree it should completely use BytesRef... i think it should also not create new Terms even until rewriting.

For example, currently the priority queue in TopTerms does BytesRef -> String conversion and creates a new Term for each add, but this might be entirely useless as it could fall off the pq, so i think its ScoreTerm or whatever should not hold term at all but just bytesref.

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882321#action_12882321 ]

Uwe Schindler commented on LUCENE-2514:
---------------------------------------

bq. For example, currently the priority queue in TopTerms does BytesRef -> String conversion and creates a new Term for each add, but this might be entirely useless as it could fall off the pq, so i think its ScoreTerm or whatever should not hold term at all but just bytesref

Exactly! We removed support for TermEnum (without s), so field name is never null. You can always take the field from the MTQ when building TermQueries. And for that we create the Term using new Term(field, BytesRef) or with the non-interning placeholder (see also below). This makes MTQ much simplier, I started to do it...

By the way: we could remove all String interning for field names now? We don't compare fields anymore?

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882323#action_12882323 ]

Yonik Seeley commented on LUCENE-2514:
--------------------------------------

bq. By the way: we could remove all String interning for field names now? We don't compare fields anymore?

Yeah, I noticed that too... I think so.

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882325#action_12882325 ]

Michael McCandless commented on LUCENE-2514:
--------------------------------------------

We also need to fix FieldCache/TermRangeQuery, since they now take separate String upper/lower.  We could just add corresponding BytesRef methods?

bq. By the way: we could remove all String interning for field names now? We don't compare fields anymore?

We should be careful with this!

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882326#action_12882326 ]

Robert Muir commented on LUCENE-2514:
-------------------------------------

bq. We also need to fix FieldCache/TermRangeQuery, since they now take separate String upper/lower. We could just add corresponding BytesRef methods?

I'll fix these. I didnt notice them since they don't use Term, but String directly. The patch is also generally incomplete in other ways... mainly I am searching on uses of Term.text() and such to ensure I do not introduce performance regressions.

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-2514) Change Term to use bytes

Jorge Spinsanti (Jira)
In reply to this post by Jorge Spinsanti (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-2514:
----------------------------------

    Attachment: LUCENE-2514.patch

Here robert's patch with MTQ changed.

It currently still uses placeholderTerms to not need to intern every time. If we remove string interning from Term, we can replace this by simple new Term() in MTQ.

I delayed cloning of BytesRef until the BytesRef is put into a TermQuery or PQ or whenever it is set aside. But it no longer clones it e.g. if the term is never accepted by the PQ. Also the PQ reuses its ScoreTerm instances and so, the term bytes are simply copied over :-)

I also removed a Java 1.6 interface override - the Generics Policeman gives a ticket! I don't understand where those come from, Java 1.6 should also fail to compile as the ant build uses -source 1.5...?

> Change Term to use bytes
> ------------------------
>
>                 Key: LUCENE-2514
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2514
>             Project: Lucene - Java
>          Issue Type: Task
>          Components: Search
>    Affects Versions: 4.0
>            Reporter: Robert Muir
>         Attachments: LUCENE-2514-surrogates-dance.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch, LUCENE-2514.patch
>
>
> in LUCENE-2426, the sort order was changed to codepoint order.
> unfortunately, Term is still using string internally, and more importantly its compareTo() uses the wrong order [utf-16].
> So MultiTermQuery, etc (especially its priority queues) are currently wrong.
> By changing Term to use bytes, we can also support terms encoded as bytes such as numerics, instead of using
> strange string encodings.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

1234