[jira] [Created] (LUCENE-3896) CharTokenizer has bugs for large documents.

JIRA jira@apache.org
CharTokenizer has bugs for large documents.
-------------------------------------------

                 Key: LUCENE-3896
                 URL: https://issues.apache.org/jira/browse/LUCENE-3896
             Project: Lucene - Java
          Issue Type: Bug
          Components: modules/analysis
            Reporter: Robert Muir
            Priority: Blocker


Initially found by hudson from additional testing added in LUCENE-3894, but
currently not reproducible (see LUCENE-3895).

But it's easy to reproduce for a simple single-threaded case in TestDuelingAnalyzers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] [Resolved] (LUCENE-3896) CharTokenizer has bugs for large documents.

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3896.
---------------------------------

    Resolution: Not A Problem

Dammit, sorry guys: this was a bug in the test.

Still chasing down this hudson bug... but I've fixed this test and will commit it.
               


[jira] [Commented] (LUCENE-3896) CharTokenizer has bugs for large documents.

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234030#comment-13234030 ]

Robert Muir commented on LUCENE-3896:
-------------------------------------

Hmm might not be over yet... I have a seed!

{noformat}
ant test -Dtestcase=TestAnalyzers -Dtestmethod=testRandomStrings -Dtests.seed=-7f6a719106dd5c8:-63eef3f2749f16d4:706a2a70bcc9d7ac -Dtests.multiplier=3 -Dargs="-Dfile.encoding=UTF-8"
{noformat}

This is the random test for WhitespaceAnalyzer.

{noformat}
    [junit] Testcase: testRandomStrings(org.apache.lucene.analysis.core.TestAnalyzers): Caused an ERROR
    [junit] term 0 expected:<𑂛[]𑂑𑂶𑃁𑃌𑂪𑂲𑂮𑃋𑃍𑂓> but was:<𑂛[?]𑂑𑂶𑃁𑃌𑂪𑂲𑂮𑃋𑃍𑂓>
    [junit] org.junit.ComparisonFailure: term 0 expected:<𑂛[]𑂑𑂶𑃁𑃌𑂪𑂲𑂮𑃋𑃍𑂓> but was:<𑂛[?]𑂑𑂶𑃁𑃌𑂪𑂲𑂮𑃋𑃍𑂓>
    [junit] at org.junit.Assert.assertEquals(Assert.java:125)
    [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.assertTokenStreamContents(BaseTokenStreamTestCase.java:144)
    [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:546)
    [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:337)
    [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:306)
    [junit] at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:298)
    [junit] at org.apache.lucene.analysis.core.TestAnalyzers.testRandomStrings(TestAnalyzers.java:213)
    [junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    [junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    [junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    [junit] at java.lang.reflect.Method.invoke(Method.java:597)
    [junit] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
    [junit] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
    [junit] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
    [junit] at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
    [junit] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
    [junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
    [junit] at org.apache.lucene.util.LuceneTestCase$SubclassSetupTeardownRule$1.evaluate(LuceneTestCase.java:729)
    [junit] at org.apache.lucene.util.LuceneTestCase$InternalSetupTeardownRule$1.evaluate(LuceneTestCase.java:645)
    [junit] at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:22)
    [junit] at org.apache.lucene.util.LuceneTestCase$TestResultInterceptorRule$1.evaluate(LuceneTestCase.java:556)
    [junit] at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:51)
    [junit] at org.apache.lucene.util.LuceneTestCase$RememberThreadRule$1.evaluate(LuceneTestCase.java:618)
    [junit] at org.junit.rules.RunRules.evaluate(RunRules.java:18)
    [junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
    [junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
    [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:164)
    [junit] at org.apache.lucene.util.LuceneTestCaseRunner.runChild(LuceneTestCaseRunner.java:57)
    [junit] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
    [junit] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
    [junit] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
    [junit] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
    [junit] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
    [junit] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
    [junit] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
    [junit] at org.apache.lucene.util.UncaughtExceptionsRule$1.evaluate(UncaughtExceptionsRule.java:51)
    [junit] at org.apache.lucene.util.StoreClassNameRule$1.evaluate(StoreClassNameRule.java:21)
    [junit] at org.apache.lucene.util.SystemPropertiesInvariantRule$1.evaluate(SystemPropertiesInvariantRule.java:22)
    [junit] at org.junit.rules.RunRules.evaluate(RunRules.java:18)
    [junit] at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
    [junit] at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39)
    [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:420)
    [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:911)
    [junit] at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:768)
    [junit]
    [junit]
    [junit] Test org.apache.lucene.analysis.core.TestAnalyzers FAILED
{noformat}
               


[jira] [Reopened] (LUCENE-3896) CharTokenizer has bugs for large documents.

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reopened LUCENE-3896:
---------------------------------

   


[jira] [Updated] (LUCENE-3896) CharTokenizer has bugs for large documents.

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3896:
--------------------------------

    Attachment: LUCENE-3896.patch

Here's a patch, but we should not commit it.

I think this patch only sweeps the bug under the rug: by reading fully, it just makes the failure less likely to happen. I suspect there is some off-by-one here.
               


[jira] [Updated] (LUCENE-3896) CharTokenizer has bugs for large documents.

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3896:
---------------------------------------

    Attachment: LUCENE-3896.patch

OK, I think the problem here was that the spoon-feeder would feed 1 character, and that 1 character was a high surrogate.  In that case, CharacterUtils.fill was returning 0, but I think it should re-attempt the read to pull more chars?

Attached patch seems to fix it... not sure we can do it cleaner though.
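
For readers following along, here is a minimal sketch of the situation described above. It is not the attached patch and not the real CharacterUtils code: the class SurrogateFillSketch, the SpoonFeedReader helper, and this fill method are hypothetical names made up for illustration. It assumes a buffer of at least 2 chars and ignores the case where the buffer itself ends on a high surrogate.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical illustration only; not Lucene's CharacterUtils.
class SurrogateFillSketch {

    /** Hypothetical "spoon-feeder": hands back at most one char per read call. */
    static final class SpoonFeedReader extends Reader {
        private final Reader in;

        SpoonFeedReader(Reader in) {
            this.in = in;
        }

        @Override
        public int read(char[] buf, int off, int len) throws IOException {
            return in.read(buf, off, Math.min(len, 1));
        }

        @Override
        public void close() throws IOException {
            in.close();
        }
    }

    /**
     * Reads into buf and returns the number of chars read (0 at end of input).
     * If a read stops right after a high surrogate, the read is re-attempted
     * so the matching low surrogate can be pulled in, instead of handing the
     * caller half of a code point.
     */
    static int fill(char[] buf, Reader reader) throws IOException {
        int n = reader.read(buf, 0, buf.length);
        if (n <= 0) {
            return 0; // end of input
        }
        while (n < buf.length && Character.isHighSurrogate(buf[n - 1])) {
            int more = reader.read(buf, n, buf.length - n);
            if (more <= 0) {
                break; // input ended on an unpaired high surrogate
            }
            n += more;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        // One supplementary code point, which Java stores as two chars.
        String text = new String(Character.toChars(0x1109B));
        char[] buf = new char[8];
        int n = fill(buf, new SpoonFeedReader(new StringReader(text)));
        System.out.println(n + " chars read, first code point U+"
            + Integer.toHexString(Character.codePointAt(buf, 0, n)).toUpperCase());
    }
}
```

Without the retry loop, the first read here would deliver only the high surrogate, which is the case where a fill that counts only complete code points has nothing usable to report.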
               


[jira] [Commented] (LUCENE-3896) CharTokenizer has bugs for large documents.

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234317#comment-13234317 ]

Robert Muir commented on LUCENE-3896:
-------------------------------------

+1
               


[jira] [Updated] (LUCENE-3896) CharTokenizer has bugs for large documents.

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3896:
---------------------------------------

    Attachment: LUCENE-3896.patch

Improved the patch a bit; "ant test" in modules/analysis passes (at least once!).

I think it's ready...
               


[jira] [Updated] (LUCENE-3896) CharTokenizer has bugs for large documents.

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3896:
--------------------------------

    Fix Version/s: 4.0
                   3.6
   


[jira] [Resolved] (LUCENE-3896) CharTokenizer has bugs for large documents.

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3896.
---------------------------------

    Resolution: Fixed

Mike: I backported your commit to 3.x to start making progress on these hudson fails.
               


[jira] [Commented] (LUCENE-3896) CharTokenizer has bugs for large documents.

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13234397#comment-13234397 ]

Michael McCandless commented on LUCENE-3896:
--------------------------------------------

Thanks Rob!
               
