[jira] Created: (LUCENE-510) IndexOutput.writeString() should write length in bytes

classic Classic list List threaded Threaded
49 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576316#action_12576316 ]

Grant Ingersoll commented on LUCENE-510:
----------------------------------------

Cool.  I just downloaded and applied, but hadn't looked at it yet.  :-)

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-510:
--------------------------------------

    Attachment: LUCENE-510.take2.patch

New rev of the patch.  I think it's ready to commit.  I'll wait a few
days.

I made some performance improvements by factoring out a new
UnicodeUtil class that does not allocate new objects for every
conversion to/from UTF8.

One new issue I fixed is the handling of invalid UTF-16 strings.
Specifically if the UTF16 text has invalid surrogate pairs, UTF-8 is
unable to represent it (unlike the current modified UTF-8 Lucene
format).  I changed DocumentsWriter & UnicodeUtil to substitute the
replacement char U+FFFD for such invalid surrogate characters.  This
affects terms, stored String fields and term vectors.

Indexing performance has a small slowdown (3.5%); details are below.

Unfortunately, time to enumerate terms was more affected.  I made a
simple test that enumerates all terms from the index (= ~3.3 million
terms) created below:

  public class TestTermEnum {
    public static void main(String[] args) throws Exception {
      IndexReader r = IndexReader.open(args[0]);
      TermEnum terms = r.terms();
      int count = 0;
      long t0 = System.currentTimeMillis();
      while(terms.next())
        count++;
      long t1 = System.currentTimeMillis();
      System.out.println(count + " terms in " + (t1-t0) + " millis");
      r.close();
    }
  }

On trunk with current index format this takes 3104 msec (best of 5).
With the patch with UTF8 index format it takes 3443 msec = 10.9%
slower.  I don't see any further ways to make this faster.

Details on the indexing performance test:

  analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
 
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
 
  docs.file=/Volumes/External/lucene/wiki.txt
  doc.stored = true
  doc.term.vector = true
  doc.add.log.step=2000
 
  directory=FSDirectory
  autocommit=false
  compound=false
 
  ram.flush.mb=64
 
  { "Rounds"
    ResetSystemErase
    { "BuildIndex"
      CreateIndex
      { "AddDocs" AddDoc > : 200000
      - CloseIndex
    }
    NewRound
  } : 5
 
  RepSumByPrefRound BuildIndex

I ran it on a quad-core Intel Mac Pro, with 4 drive RAID 0 array,
running OS 10.4.11, java 1.5, run with these command-line args:

  -server -Xbatch -Xms1024m -Xmx1024m

Best of 5 with current trunk is 921.2 docs/sec and with patch it's
888.7 = 3.5% slowdown.



> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579708#action_12579708 ]

Hiroaki Kawai commented on LUCENE-510:
--------------------------------------

I'm wondering why the patch doesn't utilize java.nio.charset.CharsetEncoder, CharsetDecoder....?

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580592#action_12580592 ]

Michael McCandless commented on LUCENE-510:
-------------------------------------------

{quote}
I'm wondering why the patch doesn't utilize java.nio.charset.CharsetEncoder, CharsetDecoder....?
{quote}

I think there are two reasons for rolling our own instead of using
CharsetEncoder/Decoder.  First is performance.  If I use
CharsetEncoder, like this:

  CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
  CharBuffer cb = CharBuffer.allocate(5000);
  ByteBuffer bb = ByteBuffer.allocate(5000);
  byte[] bbArray = bb.array();
  UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();

  t0 = System.currentTimeMillis();
  for(int i=0;i<count;i++) {
    cb.clear();
    cb.put(strings[i]);
    cb.flip();
    bb.clear();
    encoder.reset();
    encoder.encode(cb, bb, true);
  }

Then it takes 676 msec to convert ~3.3 million strings from the terms
from indexing first 200K Wikipedia docs.  If I replace for loop with:

  UnicodeUtil.UTF16toUTF8(strings[i], 0, strings[i].length(), utf8Result);

It's 441 msec.

Second reason is some API mismatch.  EG we need to convert char[] that
end in the 0xffff character.  Also, we need to do incremental
conversion (only convert changed bytes), which is used by TermEnum.
CharsetEncoder/Decoder doesn't do this.



> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580739#action_12580739 ]

Hiroaki Kawai commented on LUCENE-510:
--------------------------------------

To the first reason of performance issue,

{quote}
CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
CharBuffer cb = CharBuffer.allocate(5000);
ByteBuffer bb = ByteBuffer.allocate(5000);
byte[] bbArray = bb.array();
UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();

t0 = System.currentTimeMillis();
for(int i=0;i<count;i++) { cb.clear(); cb.put(strings[i]); cb.flip(); bb.clear(); encoder.reset(); encoder.encode(cb, bb, true); }
{quote}

How about this code?

CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
CharBuffer cb;
ByteBuffer bb = ByteBuffer.allocate(5000);
byte[] bbArray = bb.array();
UnicodeUtil.UTF8Result utf8Result = new UnicodeUtil.UTF8Result();

t0 = System.currentTimeMillis();
for(int i=0;i<count;i++) {
  bb.clear();
  encoder.reset();
  encoder.encode(CharBuffer.wrap(strings[i]), bb, true);
}



> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580768#action_12580768 ]

Michael McCandless commented on LUCENE-510:
-------------------------------------------

That version takes 1067 msec (best of 3).

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580797#action_12580797 ]

Hiroaki Kawai commented on LUCENE-510:
--------------------------------------

Ah interesting !

I'm not tried yet, but...

CharBuffer cb = ByteBuffer.allocateDirect(5000).asCharBuffer();
ByteBuffer bb = ByteBuffer.allocateDirect(5000);

will improve the performance.

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580820#action_12580820 ]

Hiroaki Kawai commented on LUCENE-510:
--------------------------------------

{quote}
Second reason is some API mismatch. EG we need to convert char[] that
end in the 0xffff character. Also, we need to do incremental
conversion (only convert changed bytes), which is used by TermEnum.
CharsetEncoder/Decoder doesn't do this.
{quote}

IMHO, the char 0xffff is not telling that the string is the end. The marker is not to be used to dermine the termination, but is for assertion, I think (in DocumentsWriterThreadState.java). And we should rewrite DocumentsWriterFieldMergeState.java to provide "private int postingUpto" outside via public method, to use it in DocumentsWriter.java and DocumentsWriterField.java.

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580828#action_12580828 ]

Michael McCandless commented on LUCENE-510:
-------------------------------------------

Using allocateDirect is surprisingly slower: 1050 msec.

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580831#action_12580831 ]

Michael McCandless commented on LUCENE-510:
-------------------------------------------

{quote}
IMHO, the char 0xffff is not telling that the string is the end. The marker is not to be used to dermine the termination, but is for assertion, I think (in DocumentsWriterThreadState.java).
{quote}

The 0xffff is used for more than assertion.  It marks the end of the text for each term.

{quote}
And we should rewrite DocumentsWriterFieldMergeState.java to provide "private int postingUpto" outside via public method, to use it in DocumentsWriter.java and DocumentsWriterField.java.
{quote}
The "private int postingUpto" in DocumentsWriterFieldMergeState is very different from the usage of 0xffff marker; I'm not sure what you're suggesting here?

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580846#action_12580846 ]

Hiroaki Kawai commented on LUCENE-510:
--------------------------------------

I'm sorry I mistook something in DocumentsWriterFieldMergeState.java.

But, my suggestion here, is use 0xffff only for assertion.

To achive this, we should add textLength property to DocumentsWriterFieldMergeState and use it in DocumentsWriter.java, modify DocumentsWriter#compareText.

I'll submit a patch.

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580860#action_12580860 ]

Michael McCandless commented on LUCENE-510:
-------------------------------------------

OK.  Can you open a separate issue for that?

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580868#action_12580868 ]

Hiroaki Kawai commented on LUCENE-510:
--------------------------------------

OK. I'll open another ticket. :-)

I'll try measuring the performance, too. I still believe that we should use java.nio.charset for our code maintainance.

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580869#action_12580869 ]

Michael McCandless commented on LUCENE-510:
-------------------------------------------

I think using java.nio.charset is too much of a performance hit.  I think it will especially further slow down enumerating terms since we can't convert the characters incrementally if we use java.nio.charset.

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580985#action_12580985 ]

Hiroaki Kawai commented on LUCENE-510:
--------------------------------------

I think performance issue is yet another issue, and I believe we can find the way to speed up.

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-510.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.4

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>             Fix For: 2.4
>
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters.  This issue has been discussed at:
> http://www.mail-archive.com/java-dev@.../msg01970.html
> We must increment the file format number to indicate this change.  At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Marvin Humphrey

> Michael McCandless resolved LUCENE-510.

Congratulations.  :)

When I wrote my initial patch, I saw a performance degradation of c.  
30% in my indexing benchmarks.  Repeated reallocation was presumably  
one culprit: when length in Java chars is stored in the index, you  
only need to allocate once, whereas when reading in UTF-8, you can't  
know just how much memory you need until the read completes.  
Furthermore, at write-time, you can't look at something composed of 16-
bit chars and know what the byte-length of its UTF-8 representation  
will be without pre-scanning.

How did you solve those problems?  Are the string diffs and  
comparisons now performed against raw bytes, so that fewer conversions  
are needed?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Michael McCandless-2

Marvin Humphrey wrote:
>
>> Michael McCandless resolved LUCENE-510.
>
> Congratulations.  :)

Thanks.  I didn't quite realize what I was getting myself into when I  
said "yes" on that issue!

> When I wrote my initial patch, I saw a performance degradation of  
> c. 30% in my indexing benchmarks.

I think it was 20%.

> Repeated reallocation was presumably one culprit: when length in  
> Java chars is stored in the index, you only need to allocate once,  
> whereas when reading in UTF-8, you can't know just how much memory  
> you need until the read completes.  Furthermore, at write-time, you  
> can't look at something composed of 16-bit chars and know what the  
> byte-length of its UTF-8 representation will be without pre-scanning.

Right, not doing allocations was pretty much it (the getBytes method  
of String was most of the slowdown I think).  I was also able to  
eliminate another per-term scan we were doing in DocumentsWriter and  
fold it into the conversion.

I ended up creating custom conversion methods (UTF8toUTF16 & vice-
versa) to do this conversion into a re-used byte[] or char[], which  
grow as needed, then I just bulk-write the bytes.  I think this is  
not much slower than before (modified UTF8) since it also had to go  
character by character w/ ifs inside that inner loop.

I'm less happy with the 11% slowdown on TermEnum, and that's even  
with the optimization to incrementally decode only the "new" UTF-8  
bytes as we are reading the changed suffix of each term, reusing the  
already-decoded UTF16 chars from the previous term.  This will  
slowdown populating a FieldCache, which is already slow.  But  
LUCENE-831 and LUCENE-1231 should fix that.

> Are the string diffs and comparisons now performed against raw  
> bytes, so that fewer conversions are needed?

Alas, not yet: Lucene still uses UTF16 java chars internally.  The  
conversion to UTF-8 happens "at the last minute" when writing, and  
"immediately" when reading.

I started exploring keeping UTF-8 bytes further in, but it quickly  
got messy because it would require changing how the term infos are  
sorted to be unicode code point order.  Comparing bytes in UTF-8 is  
the same as comparing unicode code points, which is nice.  But  
comparing UTF-16 values is almost but not quite the same.   So  
suddenly everywhere where a string comparison takes place I had to  
assess whether that comparison should be by unicode code point, and  
call our own method for doing so.  It quickly became a "big" project  
so I ran back to sorting by UTF-16 value.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Yonik Seeley-2
On Wed, Mar 26, 2008 at 5:22 PM, Michael McCandless
<[hidden email]> wrote:

>  > Are the string diffs and comparisons now performed against raw
>  > bytes, so that fewer conversions are needed?
>
>  Alas, not yet: Lucene still uses UTF16 java chars internally.  The
>  conversion to UTF-8 happens "at the last minute" when writing, and
>  "immediately" when reading.
>
>  I started exploring keeping UTF-8 bytes further in, but it quickly
>  got messy because it would require changing how the term infos are
>  sorted to be unicode code point order.  Comparing bytes in UTF-8 is
>  the same as comparing unicode code points, which is nice.  But
>  comparing UTF-16 values is almost but not quite the same.   So
>  suddenly everywhere where a string comparison takes place I had to
>  assess whether that comparison should be by unicode code point, and
>  call our own method for doing so.  It quickly became a "big" project
>  so I ran back to sorting by UTF-16 value.

Hmmm, can't we always do it by unicode code point?
When do we need UTF-16 order?

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [jira] Resolved: (LUCENE-510) IndexOutput.writeString() should write length in bytes

Michael McCandless-2
Yonik Seeley <[hidden email]> wrote:

>  Hmmm, can't we always do it by unicode code point?
>  When do we need UTF-16 order?

In theory, we can.  I think the sort order doesn't matter much, as
long as everyone (writers & readers) agree what it is.  I think
unicode code point order is more "standards compliant" too.

A big benefit is then we could leave things (eg TermBuffer and maybe
eventually Term, FieldCache) as UTF8 bytes and save on the conversion
cost when reading.

But I don't think Java provides a way to do this comparison?  However
it's not hard to implement your own:

  http://www.icu-project.org/docs/papers/utf16_code_point_order.html

But then I worried about how much slower that code is than
String.compareTo, and, I found alot of places where innocent compareTo
or < or > needed to be changed to this method call.  Field name
comparisons would have to be fixed too.  Then for backwards
compatibility all of these places that do comparisons would have to
fallback to the Java way when interacting with an older segment.

I think we can still explore this?  It just seemed way too big to
glomm into the already-big changes in LUCENE-510.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

123