[jira] Created: (LUCENE-510) IndexOutput.writeString() should write length in bytes


[jira] Created: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
IndexOutput.writeString() should write length in bytes
------------------------------------------------------

         Key: LUCENE-510
         URL: http://issues.apache.org/jira/browse/LUCENE-510
     Project: Lucene - Java
        Type: Improvement
  Components: Store  
    Versions: 2.1    
    Reporter: Doug Cutting
     Fix For: 2.1


We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:

http://www.mail-archive.com/java-dev@.../msg01970.html

We must increment the file format number to indicate this change.  At least the format number in the segments file should change.

I'm targeting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).
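
For readers new to the distinction: a Java String's length() counts UTF-16 code units, whereas the number of bytes written depends on the encoding. The sketch below is only an illustration. StringLengthDemo and its methods are names made up for this note, not Lucene code, and it assumes standard UTF-8 as the on-disk encoding.

```java
import java.nio.charset.StandardCharsets;

class StringLengthDemo {
    /** Number of UTF-16 code units, i.e. what String.length() reports. */
    static int charLength(String s) {
        return s.length();
    }

    /** Number of bytes the string occupies when encoded as standard UTF-8. */
    static int utf8ByteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // ASCII only, one accented Latin letter, three CJK characters
        String[] samples = { "lucene", "caf\u00e9", "\u65e5\u672c\u8a9e" };
        for (String s : samples) {
            System.out.println(s + ": chars=" + charLength(s)
                    + ", utf8Bytes=" + utf8ByteLength(s));
        }
    }
}
```

The two counts agree for pure ASCII and diverge as soon as a character needs more than one byte. With a byte-count prefix, a reader that wants to skip a string can jump past it without decoding; with a character-count prefix it generally has to decode each character to find where the string ends.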

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
    [ http://issues.apache.org/jira/browse/LUCENE-510?page=comments#action_12378519 ]

Marvin Humphrey commented on LUCENE-510:
----------------------------------------

The following patch...

  * Changes Lucene to use bytecounts as the prefix to all written Strings
  * Changes Lucene to write standard UTF-8 rather than Modified UTF-8
  * Adds the new test classes MockIndexOutput and TestIndexOutput
  * Increases the number of tests in TestIndexInput

It also slows Lucene down -- indexing takes a speed hit of around 20%.  It would be possible to submit a patch with a smaller impact on performance, but this one is already over 700 lines long, and its goal is to achieve standard UTF-8 compliance and modify the definition of Lucene strings as simply and reliably as possible.  Optimization patches can now be submitted which build upon this one.
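
As a rough illustration of the difference between the two encodings named above, the sketch below compares standard UTF-8 (String.getBytes) with the JDK's modified UTF-8 as emitted by DataOutputStream.writeUTF. It is not taken from the patch: Utf8Variants is a hypothetical helper, and using writeUTF as a stand-in for Lucene's own string writer is an assumption made for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Variants {
    /** Bytes of the string in standard UTF-8. */
    static byte[] standard(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Bytes of the string in the JDK's modified UTF-8, obtained from
     * DataOutputStream.writeUTF with its 2-byte length header removed.
     */
    static byte[] modified(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String nul = "\u0000";               // U+0000
        String clef = "\uD834\uDD1E";        // U+1D11E, outside the BMP
        System.out.println("U+0000:  standard=" + standard(nul).length
                + " modified=" + modified(nul).length);
        System.out.println("U+1D11E: standard=" + standard(clef).length
                + " modified=" + modified(clef).length);
    }
}
```

For U+0000 the standard form is one byte and the modified form is two; for a supplementary character such as U+1D11E the standard form is four bytes and the modified form is six, because each surrogate is encoded separately. For everything else in the Basic Multilingual Plane the two encodings produce the same bytes.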

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/




[jira] Updated: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)
     [ http://issues.apache.org/jira/browse/LUCENE-510?page=all ]

Marvin Humphrey updated LUCENE-510:
-----------------------------------

    Attachment: strings.diff


Re: [jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Tatu Saloranta
In reply to this post by Nick Burch (Jira)
--- "Marvin Humphrey (JIRA)" <[hidden email]> wrote:

...
> It also slows Lucene down -- indexing takes around a
> 20% speed hit.  It would be possible to submit a
> patch which had a smaller impact on performance, but
> this one is already over 700 lines long, and its
> goal is to achieve standard UTF-8 compliance and
> modify the definition of Lucene strings as simply
> and reliably as possible.  Optimization patches can
> now be submitted which build upon this one.

I'm quite sure that the UTF-8 decoding loop can be
improved quite a bit after merging in the patch, so the
eventual performance hit is probably lower (assuming
this is a hot spot). Using a tighter inner loop for
single-byte values can give a significant boost (up to
a 50% speedup compared to the default UTF-8 decoder
that JDK 1.5 ships with).
In this case, it's probably best to isolate the hot
spot and measure the impact of changes there while
working on this part, since otherwise it may be hard
to measure the direct impact. Then measure the total
effect when integrating the change.

That is to say, I wouldn't worry too much about the
initial hit; much or most of it can be optimized away
quite soon, just as you suggested.
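
A minimal sketch of the kind of tight inner loop for single-byte values
mentioned above. It is an illustration only, not code from the patch or
from the JDK: FastPathUtf8Decoder is a hypothetical name, it assumes
well-formed standard UTF-8 and does no validation, and whether it is
actually faster would have to be measured as described.

```java
import java.nio.charset.StandardCharsets;

class FastPathUtf8Decoder {
    /**
     * Decodes len bytes of well-formed standard UTF-8 starting at off.
     * Malformed input is not detected.
     */
    static String decode(byte[] bytes, int off, int len) {
        // UTF-16 output never needs more code units than there are input bytes.
        char[] out = new char[len];
        int end = off + len;
        int i = off;
        int o = 0;
        while (i < end) {
            // Tight inner loop: copy a run of single-byte (ASCII) values.
            while (i < end && bytes[i] >= 0) {
                out[o++] = (char) bytes[i++];
            }
            if (i >= end) {
                break;
            }
            int b = bytes[i++] & 0xFF;
            if (b < 0xE0) {
                // Two-byte sequence.
                out[o++] = (char) (((b & 0x1F) << 6) | (bytes[i++] & 0x3F));
            } else if (b < 0xF0) {
                // Three-byte sequence.
                out[o++] = (char) (((b & 0x0F) << 12)
                        | ((bytes[i++] & 0x3F) << 6)
                        | (bytes[i++] & 0x3F));
            } else {
                // Four-byte sequence: becomes a surrogate pair in UTF-16.
                int cp = ((b & 0x07) << 18)
                        | ((bytes[i++] & 0x3F) << 12)
                        | ((bytes[i++] & 0x3F) << 6)
                        | (bytes[i++] & 0x3F);
                out[o++] = Character.highSurrogate(cp);
                out[o++] = Character.lowSurrogate(cp);
            }
        }
        return new String(out, 0, o);
    }

    public static void main(String[] args) {
        byte[] data = "na\u00efve text".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(data, 0, data.length));
    }
}
```

The multi-byte branches are only reached when the ASCII run ends, so
mostly-ASCII text spends nearly all its time in the inner loop.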

-+ Tatu +-



[jira] Updated: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)
     [ http://issues.apache.org/jira/browse/LUCENE-510?page=all ]

Marvin Humphrey updated LUCENE-510:
-----------------------------------

    Attachment: SortExternal.java
                TestSortExternal.java

Greets,

I've ported KinoSearch's external sorting module to Java, along with its tests.  This class is the linchpin for the KinoSearch merge model, as it allows serialized postings to be dumped into a sort pool of effectively unlimited size.
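
For anyone unfamiliar with the general technique, here is a toy sketch of an external sort: items are buffered, each full buffer is sorted and set aside as a "run", and the runs are merged at the end with a priority queue. This is not the attached SortExternal.java and makes no claim about how it works internally. ExternalSortSketch is a hypothetical class, and the runs are kept in memory purely to keep the example short, where a real implementation would write them to disk.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSortSketch {
    private final int maxBuffered;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> runs = new ArrayList<>();

    ExternalSortSketch(int maxBuffered) {
        this.maxBuffered = maxBuffered;
    }

    /** Adds an item; once the buffer is full it is sorted and flushed as a run. */
    void feed(String item) {
        buffer.add(item);
        if (buffer.size() >= maxBuffered) {
            flushRun();
        }
    }

    private void flushRun() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> run = new ArrayList<>(buffer);
        Collections.sort(run);
        runs.add(run);   // stand-in for writing the sorted run to a temp file
        buffer.clear();
    }

    /** One cursor per run: the current head element plus the rest of the run. */
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    /** Merges all runs into a single sorted list. */
    List<String> sortedOutput() {
        flushRun();
        PriorityQueue<Cursor> queue = new PriorityQueue<>();
        for (List<String> run : runs) {
            queue.add(new Cursor(run.iterator()));   // flushed runs are never empty
        }
        List<String> out = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor c = queue.poll();
            out.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();
                queue.add(c);   // re-insert with its new head
            }
        }
        return out;
    }
}
```

Only one buffer's worth of items is sorted at a time, and the merge needs just one head element per run, which is why the pool size is bounded by disk rather than memory once the runs live in files.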

At some point, I'll submit patches implementing the KinoSearch merge model in Lucene.  I'm reasonably confident that it will more than make up for the index-time performance hit caused by using bytecounts as string headers.

Thematically, this class belongs in org.apache.lucene.util, and that's where I've put it for now.  The classes that will use it are in org.apache.lucene.index, so if it stays in util, it will have to be public.  However, it shouldn't be part of Lucene's documented public API.  The process by which Lucene's docs are generated is not clear to me, so access control advice would be appreciated.

There are a number of other areas where this code could stand review, especially considering my relatively limited experience using Java.  I'd single out exception handling and thread safety, but of course anything else is fair game.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



[jira] Assigned: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned LUCENE-510:
--------------------------------------

    Assignee: Grant Ingersoll


[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462114 ]

Grant Ingersoll commented on LUCENE-510:
----------------------------------------

Hi Marvin,

This patch no longer applies cleanly to trunk. Care to update it?

Thanks,
Grant


[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462122 ]

Chuck Williams commented on LUCENE-510:
---------------------------------------

Has an improvement been made to eliminate the reported 20% indexing hit?  That would be a big price to pay.

To me the performance benefits in algorithms that scan for selected fields (e.g., FieldsReader.doc() with a FieldSelector) are much more important than standard UTF-8 compliance.

A 20% hit seems surprising.  The pre-scan over the string to be written shouldn't cost much compared to the cost of tokenizing and indexing that string (assuming it is in an indexed field).

In case it is relevant, I had a related issue in my bulk updater: a vint required at the beginning of a record by the Lucene index format was not known until the end of the record had been reached.  I solved this with a fixed-length vint whose size was estimated up front and revised if necessary after the whole record was processed.  The vint representation still works if more bytes than necessary are written.
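
A minimal sketch of the padded-vint idea described above, assuming the usual VInt layout (7 data bits per byte, low-order group first, high bit meaning "more bytes follow"). PaddedVInt is a hypothetical class written for this note; it is not the bulk updater's code and not part of Lucene.

```java
class PaddedVInt {
    /**
     * Writes a non-negative value in exactly `width` bytes using a VInt-style
     * layout: 7 data bits per byte, low-order group first, high bit set on
     * every byte except the last.
     */
    static byte[] writePadded(int value, int width) {
        if (value < 0 || width < 1 || width > 5) {
            throw new IllegalArgumentException("unsupported value or width");
        }
        // Five bytes always hold a non-negative int; narrower widths may not.
        if (width < 5 && (value >>> (7 * width)) != 0) {
            throw new IllegalArgumentException("value does not fit in " + width + " bytes");
        }
        byte[] out = new byte[width];
        for (int i = 0; i < width - 1; i++) {
            // Continuation bit is set even when the remaining bits are zero.
            out[i] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out[width - 1] = (byte) (value & 0x7F);
        return out;
    }

    /** Ordinary VInt read loop; it needs no knowledge of the padding. */
    static int read(byte[] in, int off) {
        byte b = in[off++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in[off++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] slot = writePadded(5, 4);   // reserve 4 bytes for a small value
        System.out.println(slot.length + " bytes, decodes to " + read(slot, 0));
    }
}
```

The reader simply follows continuation bits, so the extra high-order groups of zeros contribute nothing to the decoded value. The cost is a few wasted bytes whenever the reserved width is larger than the value needs.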



[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462138 ]

Yonik Seeley commented on LUCENE-510:
-------------------------------------

I'd like to see everything kept as bytes for as long as possible (right up into Term).
A nice bonus would be to reduce the size of things like the FieldCache, and to allow true binary data.



[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462285 ]

Marvin Humphrey commented on LUCENE-510:
----------------------------------------

Grant... At the moment I am completely consumed by the task of getting a devel release of KinoSearch version 0.20 out the door.  Once that is taken care of, I will be glad to update this patch, and to explore how to compensate for the performance hit it causes.

Chuck... If bytecount-based strings are adopted, standard UTF-8 probably comes along for the ride.  There's actually a 1-2% performance gain to be had using standard over modified because of simplified conditionals.  What holds us back is backwards compatibility -- but we'll already have broken backwards compatibility with the bytecounts.  However, I no longer have a strong objection to using Modified UTF-8 (for Lucene, that is -- Modified UTF-8 would be a deal-breaker for Lucy), so if somewhere along the way we find a compelling reason to stick with Modified UTF-8, so be it.

If bytecount-based strings get adopted, it will be because they hold up on their own merits.  They're required for the KinoSearch merge model; once KS 0.20 is out, I'll port the new benchmarking code, and then we can study the numbers and assess whether the significant effort needed to fit that algorithm into Lucene would be worthwhile.

Yonik... yes, I agree.  Even better for indexing time, leave postings in serialized form for the entire indexing session.  :)


Updated: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-510:
--------------------------------------

    Fix Version/s:     (was: 2.1)
                   2.2


[jira] Updated: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated LUCENE-510:
-----------------------------------

    Fix Version/s:     (was: 2.2)


[jira] Assigned: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned LUCENE-510:
--------------------------------------

    Assignee:     (was: Grant Ingersoll)

I don't have time at the moment.


[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557937#action_12557937 ]

Michael Busch commented on LUCENE-510:
--------------------------------------

I think it makes total sense to change this. And this issue seems to be
very popular with 5 votes.

Mike, you've done so much performance & indexing work recently. Are
you interested in taking this?


[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557953#action_12557953 ]

Michael McCandless commented on LUCENE-510:
-------------------------------------------

Yup, I'll take this!


[jira] Assigned: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-510:
-----------------------------------------

    Assignee: Michael McCandless

Re: [jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Michael Busch
In reply to this post by Nick Burch (Jira)
Cool!

Michael McCandless (JIRA) wrote:

>     [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557953#action_12557953 ]
>
> Michael McCandless commented on LUCENE-510:
> -------------------------------------------
>
> Yup, I'll take this!

[jira] Updated: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-510:
--------------------------------------

    Attachment: LUCENE-510.patch

Attached patch.

I modernized Marvin's original patch and added full backwards
compatibility to it so that old indices can be opened for reading or
writing.  New segments are written in the new format.

All tests pass.  I think it's close, but I need to run performance
tests now to measure the impact on indexing throughput.

I think future optimizations can carry the byte[] further, e.g., into
Term and FieldCache, as Yonik mentioned.  We could also fix
DocumentsWriter to use byte[] for its terms storage, which would
improve RAM efficiency for single-byte (ASCII) content.

I also updated the TestBackwardsCompatibility testcase to properly
test non-ASCII terms.
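
For readers following the thread, here is a minimal sketch of the difference
between a char-count prefix and a byte-count prefix.  It is not code from the
patch: the class and method names are hypothetical, it writes to a plain
ByteArrayOutputStream rather than an IndexOutput, and it assumes standard
UTF-8 for both layouts to keep the comparison simple.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not Lucene's IndexOutput. */
final class ByteLengthStrings {

    /** Writes a variable-length int: 7 bits per byte, high bit set on all but the last byte. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Old-style layout: the prefix counts Java chars. */
    static byte[] encodeWithCharLength(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, s.length());
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    /** New-style layout: the prefix counts encoded bytes. */
    static byte[] encodeWithByteLength(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String term = "caf\u00e9"; // 4 Java chars, 5 UTF-8 bytes
        System.out.println("char-count prefix: " + encodeWithCharLength(term)[0]);
        System.out.println("byte-count prefix: " + encodeWithByteLength(term)[0]);
    }
}
```

For pure ASCII terms the two prefixes are identical, which is why only
non-ASCII terms exercise the format change.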



[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576205#action_12576205 ]

Grant Ingersoll commented on LUCENE-510:
----------------------------------------

So, with this we should be able to skip ahead in the FieldsReader, right?  I will try to update your patch with that.  Should improve lazy loading, etc.

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576256#action_12576256 ]

Michael McCandless commented on LUCENE-510:
-------------------------------------------

Yes, exactly.  But I think the current patch already does this, i.e., it uses seek instead of skipChars when the fdt file is in the new format.
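
To illustrate why a byte-count prefix allows seeking, here is a rough sketch
that skips one stored string both ways.  It is not the patch's FieldsReader
code: the class, the helper names, the single-byte length prefix, and the
trailing 0x7F sentinel are all assumptions made for the example, and it only
handles chars that encode to one, two, or three bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not Lucene's FieldsReader. */
final class SkipStringSketch {

    /** Byte-count prefix: skipping is a single position update (a seek). */
    static void skipByBytes(ByteBuffer in) {
        int byteLen = in.get() & 0xFF; // one-byte prefix, for brevity
        in.position(in.position() + byteLen);
    }

    /** Char-count prefix: every lead byte must be inspected to learn the char's width. */
    static void skipByChars(ByteBuffer in) {
        int charLen = in.get() & 0xFF; // one-byte prefix, for brevity
        for (int i = 0; i < charLen; i++) {
            int lead = in.get() & 0xFF;
            if ((lead & 0x80) == 0) {
                // one-byte char, nothing more to consume
            } else if ((lead & 0xE0) == 0xC0) {
                in.get(); // two-byte char
            } else {
                in.get(); // three-byte char
                in.get();
            }
        }
    }

    /** Builds prefix + UTF-8 bytes + a 0x7F sentinel marking the next field. */
    static ByteBuffer record(String s, int prefix) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(utf8.length + 2);
        buf.put((byte) prefix).put(utf8).put((byte) 0x7F);
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        String s = "caf\u00e9\u4e2d"; // 5 chars, 8 UTF-8 bytes
        ByteBuffer byBytes = record(s, s.getBytes(StandardCharsets.UTF_8).length);
        skipByBytes(byBytes);
        ByteBuffer byChars = record(s, s.length());
        skipByChars(byChars);
        // Both readers should now be positioned on the sentinel.
        System.out.println(byBytes.get() == 0x7F);
        System.out.println(byChars.get() == 0x7F);
    }
}
```

Both methods land on the same position, but the char-count version has to
touch every character of the skipped field, which is the cost lazy loading
would avoid.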

