[jira] Created: (LUCENE-510) IndexOutput.writeString() should write length in bytes

classic Classic list List threaded Threaded
49 messages Options
123
Reply | Threaded
Open this post in threaded view
|

Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Yonik Seeley-2
On Wed, Mar 26, 2008 at 6:06 PM, Michael McCandless
<[hidden email]> wrote:

> Yonik Seeley <[hidden email]> wrote:
>
>  >  Hmmm, can't we always do it by unicode code point?
>  >  When do we need UTF-16 order?
>
>  In theory, we can.  I think the sort order doesn't matter much, as
>  long as everyone (writers & readers) agree what it is.  I think
>  unicode code point order is more "standards compliant" too.
>
>  A big benefit is then we could leave things (eg TermBuffer and maybe
>  eventually Term, FieldCache) as UTF8 bytes and save on the conversion
>  cost when reading.
>
>  But I don't think Java provides a way to do this comparison?  However
>  it's not hard to implement your own:
>
>   http://www.icu-project.org/docs/papers/utf16_code_point_order.html

Not sure I follow... you just do a byte-by-byte comparison right?  For
ASCII, this should be slightly faster (same number of comparisons,
less memory space and hence less cache space overall).

>  But then I worried about how much slower that code is than
>  String.compareTo, and, I found alot of places where innocent compareTo
>  or < or > needed to be changed to this method call.  Field name
>  comparisons would have to be fixed too.  Then for backwards
>  compatibility all of these places that do comparisons would have to
>  fallback to the Java way when interacting with an older segment.

Oh... older segments.  Yeah, I was speaking "theoretically".

>  I think we can still explore this?  It just seemed way too big to
>  glomm into the already-big changes in LUCENE-510.

Yeah, I was thinking of some of this more along the lines of Lucene 3.

A term could contain a byte array instead of a String.  A String
constructor would convert to UTF8 and then do lookups in the index
(simple byte comparisons, no charset encoding).  A byte constructor
for Term would also be allowed.  Things like TermEnumerators would
keep everything in bytes, the tii would be in bytes, etc.

One could also think about ways to directly index bytes too.
Is it all worth it?  I really don't know.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Michael McCandless-2

Yonik Seeley wrote:

> On Wed, Mar 26, 2008 at 6:06 PM, Michael McCandless
> <[hidden email]> wrote:
>> Yonik Seeley <[hidden email]> wrote:
>>
>>>  Hmmm, can't we always do it by unicode code point?
>>>  When do we need UTF-16 order?
>>
>>  In theory, we can.  I think the sort order doesn't matter much, as
>>  long as everyone (writers & readers) agree what it is.  I think
>>  unicode code point order is more "standards compliant" too.
>>
>>  A big benefit is then we could leave things (eg TermBuffer and maybe
>>  eventually Term, FieldCache) as UTF8 bytes and save on the  
>> conversion
>>  cost when reading.
>>
>>  But I don't think Java provides a way to do this comparison?  
>> However
>>  it's not hard to implement your own:
>>
>>   http://www.icu-project.org/docs/papers/utf16_code_point_order.html
>
> Not sure I follow... you just do a byte-by-byte comparison right?  For
> ASCII, this should be slightly faster (same number of comparisons,
> less memory space and hence less cache space overall).

Sorry, you're right: if you're working with byte[] at the time, a  
byte by byte comparison of UTF8 gives you the same order as unicode  
code point.

It's when you need to compare a String or char[] to one another, or  
to a UTF8 byte[], that you need that code.

>>  But then I worried about how much slower that code is than
>>  String.compareTo, and, I found alot of places where innocent  
>> compareTo
>>  or < or > needed to be changed to this method call.  Field name
>>  comparisons would have to be fixed too.  Then for backwards
>>  compatibility all of these places that do comparisons would have to
>>  fallback to the Java way when interacting with an older segment.
>
> Oh... older segments.  Yeah, I was speaking "theoretically".

Yeah.

>>  I think we can still explore this?  It just seemed way too big to
>>  glomm into the already-big changes in LUCENE-510.
>
> Yeah, I was thinking of some of this more along the lines of Lucene 3.
> A term could contain a byte array instead of a String.  A String
> constructor would convert to UTF8 and then do lookups in the index
> (simple byte comparisons, no charset encoding).  A byte constructor
> for Term would also be allowed.  Things like TermEnumerators would
> keep everything in bytes, the tii would be in bytes, etc.

Yup.

> One could also think about ways to directly index bytes too.

Right, DocumentsWriter could hold its terms in byte[] and save time/
space when terms are ascii.

> Is it all worth it?  I really don't know.

Right, that's where I started to wonder.  It felt very much like I  
was "going against the grain of Java" as the changes started to pile  
up ...

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595426#action_12595426 ]

Mark Miller commented on LUCENE-510:
------------------------------------

I suspect that this issue is not reading old indexes perfectly. I am very suspicious that this patch caused what I thought was index corruption the other day. I didn't notice it was a format change and should have paid more attention.

I seem to be able to replicate the issue so hopefully I will have more to report soon.

If my thoughts are unlikely, please let me know.

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595526#action_12595526 ]

Michael McCandless commented on LUCENE-510:
-------------------------------------------

Whoa, I think you are correct Mark!

On inspecting my changes here, I think the bulk-merging of stored fields is to blame.  Specifically, when we bulk merge the stored fields we fail to check whether the segments being merged are the pre-UTF8 format.  And so that code bulk-copies stored fields in the older format into a file that claims it's using the newer format.

This only affects trunk, not 2.3.

Thanks for being such a brave early-adopter trunk tester, Mark.  And, sorry :(

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Reopened: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reopened LUCENE-510:
---------------------------------------


> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595559#action_12595559 ]

Mark Miller commented on LUCENE-510:
------------------------------------

No problem Michael. Brave/Foolish, I certainly am one. Can't keep myself away from these fresh issues :) My fault for not understanding the release process really though. I thought I was roughly getting 2.3.2 ahead of time. Live and learn :) I got a little lax inspecting all the issues for someone grabbing from trunk...once all my tests pass I usually feel pretty good about it. Time to add some upgrade index tests :)

Thanks so much for checking into this so quick - certainly saves me some time today.



> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595565#action_12595565 ]

Michael McCandless commented on LUCENE-510:
-------------------------------------------

bq. Time to add some upgrade index tests
In fact Lucene already has TestBackwardsCompatibility ... so my first step is to add a test case there that shows this bug...

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595575#action_12595575 ]

Michael McCandless commented on LUCENE-510:
-------------------------------------------

OK, I've fixed TestBackwardsCompatibility to catch this, and then fixed merging of stored fields to work properly.  Will commit shortly...

Thanks again Mark!

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-510.
---------------------------------------

    Resolution: Fixed

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

123