[jira] Created: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
0xffff char is not a string terminator
--------------------------------------

                 Key: LUCENE-1241
                 URL: https://issues.apache.org/jira/browse/LUCENE-1241
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Hiroaki Kawai


The current trunk's index.DocumentsWriter uses "\uffff" as a string terminator, but it should not, for several reasons. \uffff is not a terminator char itself and we can't handle a string that really contains \uffff. Also, we can calculate the end position in a character sequence from the string length, which we already know.

However, I agree with using it for assertions, i.e. placing "\uffff" just after the end of a string in a char sequence.
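
To make the two approaches concrete, here is a minimal sketch contrasting a sentinel-terminated scan of a char buffer with a length-based read. It is not the DocumentsWriter code: the class and method names are hypothetical, and the only thing taken from the description is that a term sits in a char buffer followed by \uffff.

```java
// Hypothetical illustration only; not the DocumentsWriter implementation.
final class TermBufferSketch {
    static final char SENTINEL = '\uffff';

    // Sentinel style: scan from start until the terminator is found.
    static int sentinelLength(char[] buf, int start) {
        int pos = start;
        while (buf[pos] != SENTINEL) {
            pos++;
        }
        return pos - start;
    }

    // Length style: the caller already knows the length, so any char,
    // including the sentinel value, may appear inside the term.
    static String lengthBased(char[] buf, int start, int length) {
        return new String(buf, start, length);
    }

    public static void main(String[] args) {
        // The intended term is 4 chars long and contains the sentinel value.
        char[] buf = {'a', 'b', SENTINEL, 'c', SENTINEL};
        System.out.println(sentinelLength(buf, 0));          // prints 2
        System.out.println(lengthBased(buf, 0, 4).length()); // prints 4
    }
}
```

The sentinel scan stops at the embedded \uffff and reports a length of 2, while the length-based read returns all 4 chars.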


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

[jira] Updated: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hiroaki Kawai updated LUCENE-1241:
----------------------------------

    Attachment: LUCENE-1241.patch

Created a patch that is aware of string length.

[jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581049#action_12581049 ]

Michael McCandless commented on LUCENE-1241:
--------------------------------------------

{quote}
we can't handle a string that really contains \uffff
{quote}
This is an invalid UTF16 string for interchange.  The standard explicitly allows for certain characters (including this one) to be used for internal purposes.

{quote}
However, I agree with using it for assertions, i.e. placing "\uffff" just after the end of a string in a char sequence.
{quote}
I don't think this is necessary for assertion.  The memory cost for this is sizable.  Right now tracking a string's length consumes 2 bytes (0xffff char) per posting.  By adding length we're consuming an additional 4 bytes.  While indexing, there are a large number of postings (one per unique term) so this added RAM usage is not negligible.

I think we should do one or the other, but not both.

Really the tradeoff we are exploring here is whether using up 2 more bytes per term, which causes us to flush sooner & merge more often for a given RAM buffer size, is offset by the speedup of not having to check for 0xffff and compute length in certain places.

One problem with the patch is that you forgot to add another int (4 bytes) to POSTING_NUM_BYTE in DocumentsWriter.  This is important because the tradeoff we are exploring here is whether increasing the RAM usage of a Posting, which causes more frequent flushing, while saving the cost of comparing to 0xffff in certain places, is net/net a performance "win".  Can you fix this?  Thanks.

Have you run any performance tests to assess the impact of this change?  I think that's critical here since if this is net/net a performance loss we shouldn't make the change.

[jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581110#action_12581110 ]

Steven Rowe commented on LUCENE-1241:
-------------------------------------

{quote}
bq. we can't handle a string that really contains \uffff

This is an invalid UTF16 string for interchange. The standard explicitly allows for certain characters (including this one) to be used for internal purposes.
{quote}

I strongly suspect, however, that "internal purposes" is meant to be taken as application-internal, not leaf-library-internal.

[jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581768#action_12581768 ]

Hiroaki Kawai commented on LUCENE-1241:
---------------------------------------

I think we should not use \uffff as a terminator in the Lucene library, regardless of the fact that the Unicode standard allows it, because it is unnecessary.

Reading the commit log in the svn repository and the code base at revision 553235, I suspect that termination with "\uffff" was introduced in revision 553236, modeled on java.text.CharacterIterator. Is that right? (java.text.CharacterIterator.DONE is a static constant whose value is '\uffff'. The java.text.CharacterIterator interface supports bidirectional iteration over text as part of the internationalization API, and we can tell that we have reached the end of a string by comparing the returned char with java.text.CharacterIterator.DONE.)
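
For reference, the CharacterIterator convention mentioned above looks roughly like this in use. The sketch relies only on the standard java.text API; the class and method names are arbitrary.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

final class CharacterIteratorDoneDemo {
    // Counts chars by iterating until DONE ('\uFFFF') is returned.
    static int countUntilDone(String s) {
        CharacterIterator it = new StringCharacterIterator(s);
        int n = 0;
        for (char c = it.first(); c != CharacterIterator.DONE; c = it.next()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countUntilDone("lucene"));     // prints 6
        // An embedded U+FFFF is indistinguishable from DONE in this loop.
        System.out.println(countUntilDone("ab\uffffcd")); // prints 2
    }
}
```

The second call shows the same ambiguity raised in this issue: a string that really contains \uffff ends the loop early.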

I came up with the idea of introducing a new class that implements CharSequence and Comparable, has a good hashCode(), and uses the buffer of the original memory allocation (String, StringBuffer, char[], CharBuffer, etc.).
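
A minimal sketch of what such a class could look like follows. It is a hypothetical illustration, not the attached ComparableCharSequence.java, which may differ. The name CharSlice is invented here, and a char[] is assumed to be wrapped with CharBuffer.wrap so that it can serve as a CharSequence without copying.

```java
import java.nio.CharBuffer;

// Hypothetical sketch; the attached ComparableCharSequence.java may differ.
final class CharSlice implements CharSequence, Comparable<CharSlice> {
    private final CharSequence source; // backing storage, not copied
    private final int start;
    private final int length;

    CharSlice(CharSequence source, int start, int length) {
        if (start < 0 || length < 0 || start + length > source.length()) {
            throw new IndexOutOfBoundsException();
        }
        this.source = source;
        this.start = start;
        this.length = length;
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length) throw new IndexOutOfBoundsException();
        return source.charAt(start + index);
    }

    @Override public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length || from > to) throw new IndexOutOfBoundsException();
        return new CharSlice(source, start + from, to - from);
    }

    // Lexicographic comparison by char value; the explicit length means
    // no terminator char is needed.
    @Override public int compareTo(CharSlice other) {
        int n = Math.min(length, other.length);
        for (int i = 0; i < n; i++) {
            int diff = charAt(i) - other.charAt(i);
            if (diff != 0) return diff;
        }
        return length - other.length;
    }

    @Override public boolean equals(Object o) {
        return o instanceof CharSlice && compareTo((CharSlice) o) == 0;
    }

    // Same formula as String.hashCode(), computed over the slice only.
    @Override public int hashCode() {
        int h = 0;
        for (int i = 0; i < length; i++) h = 31 * h + charAt(i);
        return h;
    }

    @Override public String toString() {
        return new StringBuilder(length).append(source, start, start + length).toString();
    }

    public static void main(String[] args) {
        char[] raw = {'f', 'o', 'o', '\uffff', 'b', 'a', 'r'};
        CharSlice a = new CharSlice(CharBuffer.wrap(raw), 0, 4);
        CharSlice b = new CharSlice("xfoo\uffffy", 1, 4);
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // prints true
    }
}
```

Because the length is stored, a slice may contain \uffff, and two slices over different backing buffers still compare and hash equal.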

[jira] Updated: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hiroaki Kawai updated LUCENE-1241:
----------------------------------

    Attachment: ComparableCharSequence.java

ComparableCharSequence illustrates the idea. I wanted a shorter name, but I can't think of one right now.

[jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581843#action_12581843 ]

Michael McCandless commented on LUCENE-1241:
--------------------------------------------

{quote}
I think we should not use \uffff as a terminator in the Lucene library, regardless of the fact that the Unicode standard allows it, because it is unnecessary.
{quote}

I'm not yet convinced it's unnecessary.  We need to run performance
tests to understand the time/space tradeoff here.  If this change
speeds up indexing we should do it.  RAM is cheap.

By far, the Posting instances consume the most RAM in DocumentsWriter.
Right now each Posting is 66 bytes; this patch, once finished,
increases that to 68 bytes.

I don't like increasing the byte usage of Posting unless there's a
good counterbalance, which I think this change *may* have if we see
that it improves indexing speed.

I just checked: when indexing Wikipedia with a 64 MB buffer, each
segment flushed has ~430,000 Posting instances.  So the Posting
instances alone account for 27 MB of the buffer.

That means the added 2 bytes from this change will consume ~840 KB
of additional RAM, which is a not insignificant loss of RAM efficiency.
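
The figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch with a hypothetical class name; the inputs (about 430,000 Posting instances, 66 and 68 bytes each) are the ones quoted in this comment, and units are taken as binary (MiB, KiB).

```java
// Back-of-the-envelope check of the figures quoted above; not Lucene code.
import java.util.Locale;

final class PostingRamEstimate {
    static long totalBytes(long postings, long bytesPerPosting) {
        return postings * bytesPerPosting;
    }

    public static void main(String[] args) {
        long postings = 430_000L; // approximate Posting count per flushed segment
        long currentSize = 66;    // bytes per Posting today
        long patchedSize = 68;    // bytes per Posting with the patch

        double currentMiB = totalBytes(postings, currentSize) / (1024.0 * 1024.0);
        double extraKiB = totalBytes(postings, patchedSize - currentSize) / 1024.0;

        // Roughly 27 MiB for the Posting instances and about 840 KiB extra.
        System.out.println(String.format(Locale.ROOT,
                "current: %.1f MiB, extra: %.0f KiB", currentMiB, extraKiB));
    }
}
```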

[Aside: by Zipf's law, the vast majority of these terms should occur
rarely.  E.g., roughly half will occur only once.  If we could find some
way to represent these rare terms with a much more compact structure
(Posting has a lot of "overhead" to efficiently manage a long posting
list) then we would greatly increase DW's RAM efficiency.]




[jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581844#action_12581844 ]

Michael McCandless commented on LUCENE-1241:
--------------------------------------------

{quote}
Reading the commit log in the svn repository and the code base at revision 553235, I suspect that termination with "\uffff" was introduced in revision 553236, modeled on java.text.CharacterIterator. Is that right? (java.text.CharacterIterator.DONE is a static constant whose value is '\uffff'. The java.text.CharacterIterator interface supports bidirectional iteration over text as part of the internationalization API, and we can tell that we have reached the end of a string by comparing the returned char with java.text.CharacterIterator.DONE.)
{quote}

Indeed, CharacterIterator.DONE also uses U+FFFF as the termination
character, though I hadn't realized that until now.

{quote}
I came up with the idea of introducing a new class that implements CharSequence and Comparable, has a good hashCode(), and uses the buffer of the original memory allocation (String, StringBuffer, char[], CharBuffer, etc.).
{quote}

This looks neat, but, can you pull this all together into a single
workable patch that we can run some performance tests on?



[jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581860#action_12581860 ]

Hiroaki Kawai commented on LUCENE-1241:
---------------------------------------

{quote}
This looks neat, but, can you pull this all together into a single
workable patch that we can run some performance tests on?
{quote}

OK, I can. But do you really want a single huge patch? Also, this is a separate issue. I want to do the right thing, and performance is a separate issue as well.

[jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581884#action_12581884 ]

Michael McCandless commented on LUCENE-1241:
--------------------------------------------

OK how about a separate issue for ComparableCharSequence?

But it'd be great to first bring closure to this issue, ie, fixing the issues I found (above) so we can assess performance impact of this change.

[jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12581887#action_12581887 ]

Hiroaki Kawai commented on LUCENE-1241:
---------------------------------------

{quote}
OK how about a separate issue for ComparableCharSequence?
{quote}

I'm working on it now :)

I'll open it later.

[jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582584#action_12582584 ]

Hiroaki Kawai commented on LUCENE-1241:
---------------------------------------

Your commit in rev 641303 was so large that it completely broke my current working copy. I have no choice but to give up for now.

[jira] Commented: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582587#action_12582587 ]

Michael McCandless commented on LUCENE-1241:
--------------------------------------------

Woops, sorry.  I can take over?  I'll start from your patch, update to the current trunk, and fold in my feedback above, then test performance.

[jira] Assigned: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1241:
------------------------------------------

    Assignee: Michael McCandless

[jira] Updated: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1241:
---------------------------------------

    Attachment: LUCENE-1241.take2.patch

Attached take2 patch.  I fixed it to apply to trunk, and I removed
0xffff entirely.  All tests pass, but...

Unfortunately, this change causes a significant net slowdown (5.9%) in
indexing throughput.  I ran this alg:

  analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  docs.file=/Volumes/External/lucene/wiki.txt
  doc.stored = true
  doc.term.vector = true
  doc.add.log.step=2000
  directory=FSDirectory
  autocommit=false
  compound=false
  ram.flush.mb=64
  { "Rounds"
    ResetSystemErase
    { "BuildIndex"
      - CreateIndex
      { "AddDocs" AddDoc > : 200000
      - CloseIndex
    }
    NewRound
  } : 5
  RepSumByPrefRound BuildIndex

I ran the test on an Intel quad core Mac Pro with 4-drive RAID 0.  JVM
is 1.5 and I run with "-Xms1024M -Xmx1024M -Xbatch -server".

Trunk gets 897.3 rec/s and the patch gets 844.3 rec/s, best of 5 =
5.9% slower.
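
As a quick check of the percentage, the slowdown is the throughput drop relative to trunk. A minimal sketch with a hypothetical helper name, using the two rec/s figures above:

```java
// Arithmetic check only; the two throughput numbers come from the runs above.
final class SlowdownCheck {
    static double slowdownPercent(double baselineRecPerSec, double patchedRecPerSec) {
        return 100.0 * (baselineRecPerSec - patchedRecPerSec) / baselineRecPerSec;
    }

    public static void main(String[] args) {
        // (897.3 - 844.3) / 897.3 is about 0.059, i.e. roughly 5.9%.
        System.out.println(slowdownPercent(897.3, 844.3));
    }
}
```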

I don't think we should commit this.

[jira] Resolved: (LUCENE-1241) 0xffff char is not a string terminator

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-1241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-1241.
----------------------------------------

       Resolution: Won't Fix
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
