[jira] [Created] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil


JIRA jira@apache.org
Make BaseTokenStreamTestCase a bit more evil
--------------------------------------------

                 Key: LUCENE-3894
                 URL: https://issues.apache.org/jira/browse/LUCENE-3894
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: 3.6, 4.0


Throw an exception from the Reader while tokenizing, stop before consuming all tokens, sometimes spoon-feed chars from the Reader (return fewer chars than requested per read)...
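
A minimal sketch of what such an "evil" Reader could look like, using only java.io. The class name SpoonFeedingReader, the failAfter parameter, and the drain helper are hypothetical and chosen for illustration; this is not the code from the attached patch.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Random;

/**
 * Hypothetical "evil" Reader wrapper (illustration only, not Lucene's test code).
 * It sometimes returns fewer chars than requested and can throw an
 * IOException once a configured number of chars has been delivered.
 */
class SpoonFeedingReader extends FilterReader {
  private final Random random;
  private final int failAfter; // throw once this many chars were delivered; -1 = never
  private int delivered;

  SpoonFeedingReader(Reader in, Random random, int failAfter) {
    super(in);
    this.random = random;
    this.failAfter = failAfter;
  }

  @Override
  public int read() throws IOException {
    // Route single-char reads through the array version so they are counted too.
    char[] one = new char[1];
    int n = read(one, 0, 1);
    return n == -1 ? -1 : one[0];
  }

  @Override
  public int read(char[] cbuf, int off, int len) throws IOException {
    if (failAfter >= 0 && delivered >= failAfter) {
      throw new IOException("simulated failure after " + delivered + " chars");
    }
    if (len == 0) {
      return 0;
    }
    int want = len;
    if (random.nextBoolean()) {
      // Spoon-feed: hand back only 1 to 3 chars even if more were requested.
      want = 1 + random.nextInt(Math.min(len, 3));
    }
    if (failAfter >= 0) {
      // Never deliver past the failure point.
      want = Math.min(want, failAfter - delivered);
    }
    int n = in.read(cbuf, off, want);
    if (n > 0) {
      delivered += n;
    }
    return n;
  }

  /** Drains a reader with a loop that tolerates short reads. */
  static String drain(Reader r) throws IOException {
    StringBuilder sb = new StringBuilder();
    char[] buf = new char[8];
    int n;
    while ((n = r.read(buf, 0, buf.length)) != -1) {
      sb.append(buf, 0, n);
    }
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    String text = "the quick brown fox";
    Reader evil = new SpoonFeedingReader(new StringReader(text), new Random(42), -1);
    // A consumer that loops correctly sees the full text despite short reads.
    System.out.println(drain(evil).equals(text)); // prints "true"
  }
}
```

A consumer that assumes one read call fills its buffer would see truncated input with this wrapper, which is the kind of bug spoon-feeding is meant to expose.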

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] [Updated] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3894:
---------------------------------------

    Attachment: LUCENE-3894.patch

Patch; tests pass.

I had to fix up EdgeNGramTokenizer and NGramTokenizer to work with spoon-feeding, but otherwise no analyzers seem to be failing, at least on one run...

I had to do some sneaky things with MockTokenizer to work around its state machine...
               


[jira] [Updated] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3894:
---------------------------------------

    Attachment: LUCENE-3894.patch

Fixed a few things...
               


[jira] [Updated] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3894:
--------------------------------

    Attachment: LUCENE-3894.patch

+1 Mike, here's an updated patch... the random test for ICUTokenizer now passes (spoon-feeding caught a bug).

But now testHugeDoc fails... (and it is not a random test).
               


[jira] [Commented] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233886#comment-13233886 ]

Michael McCandless commented on LUCENE-3894:
--------------------------------------------

I think that new read method needs to use the incoming offset (i.e., pass location + offset, not location, as the second argument to input.read)?  Does testHugeDoc then pass?
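
The fix suggested above can be sketched as a read loop that adds the caller's offset to the running count. This is a minimal illustration under assumptions: the class and method names (ReadLoop, readFully) are hypothetical, and the code is not copied from the patch or from commons-io.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReadLoop {
  /**
   * Reads up to length chars into buffer starting at offset, looping
   * because a single Reader.read call may return fewer chars than asked.
   * Returns the number of chars read (less than length only at end of input).
   */
  static int readFully(Reader input, char[] buffer, int offset, int length)
      throws IOException {
    int remaining = length;
    while (remaining > 0) {
      int location = length - remaining; // chars read so far
      // The destination index must be offset + location. Passing only
      // location would write at the start of the buffer whenever offset > 0.
      int count = input.read(buffer, offset + location, remaining);
      if (count == -1) {
        break; // end of input
      }
      remaining -= count;
    }
    return length - remaining;
  }

  public static void main(String[] args) throws IOException {
    char[] buffer = "....______".toCharArray();
    int n = readFully(new StringReader("abcdef"), buffer, 4, 6);
    System.out.println(n + " " + new String(buffer)); // prints "6 ....abcdef"
  }
}
```

With offset 0 both variants behave the same, which is why a bug of this kind can stay hidden until a caller reads into the middle of a buffer.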
               


[jira] [Commented] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233893#comment-13233893 ]

Robert Muir commented on LUCENE-3894:
-------------------------------------

That's it! But this 'new read method' is not really new; it's from commons-io! We should open a bug over there...
               


[jira] [Resolved] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-3894.
----------------------------------------

    Resolution: Fixed
   


[jira] [Commented] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233945#comment-13233945 ]

Michael McCandless commented on LUCENE-3894:
--------------------------------------------

Thanks Rob!
               


[jira] [Commented] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233996#comment-13233996 ]

Robert Muir commented on LUCENE-3894:
-------------------------------------

I think we have bugs in some tokenizers. They are currently very hard to reproduce, and we get no random seed :(

I think the issue is maxWordLength=20. That is not long enough to catch bugs in tokenizers;
we should exceed whatever buffer size they use, for example.

So I think we need to refactor this logic so that the multithreaded tests take maxWordLength, and ensure
this parameter is always respected.

This way, tests for things like tokenizers can bump this up to things like CharTokenizer.IO_BUFFER_SIZE*2
or whatever makes sense to them, to ensure we really test them well.
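
A rough sketch of why the limit matters, under stated assumptions: BUFFER_SIZE is a stand-in value for a tokenizer's internal buffer size, and randomWord and longestWord are hypothetical helpers, not the methods in BaseTokenStreamTestCase. With a cap of 20, no generated word can ever span a buffer refill; with the cap raised to twice the buffer size, such words become possible.

```java
import java.util.Random;

class LongWordSketch {
  // Stand-in for a tokenizer's internal buffer size (assumed value for illustration).
  static final int BUFFER_SIZE = 4096;

  /** Random lowercase word of length 1..maxWordLength (hypothetical helper). */
  static String randomWord(Random random, int maxWordLength) {
    int length = 1 + random.nextInt(maxWordLength);
    StringBuilder sb = new StringBuilder(length);
    for (int i = 0; i < length; i++) {
      sb.append((char) ('a' + random.nextInt(26)));
    }
    return sb.toString();
  }

  /** Length of the longest word seen over a number of random draws. */
  static int longestWord(Random random, int maxWordLength, int draws) {
    int longest = 0;
    for (int i = 0; i < draws; i++) {
      longest = Math.max(longest, randomWord(random, maxWordLength).length());
    }
    return longest;
  }

  public static void main(String[] args) {
    Random random = new Random(0);
    // Old limit: every word fits inside a single buffer many times over.
    System.out.println("max length with limit 20: " + longestWord(random, 20, 1000));
    // Raised limit: some words are longer than the buffer itself.
    System.out.println("max length with limit " + (BUFFER_SIZE * 2) + ": "
        + longestWord(random, BUFFER_SIZE * 2, 1000));
  }
}
```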

I don't like the fact that only my stupid trivial test (testHugeDoc) found the IO-311 bug. What if we
didn't have that silly test?

I'll add a patch.
               


[jira] [Updated] (LUCENE-3894) Make BaseTokenStreamTestCase a bit more evil

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3894:
--------------------------------

    Attachment: LUCENE-3894_maxWordLength.patch

Patch for the maxWordLength issue.

This also makes the single-threaded version that the multi-threaded versions call private, so that it's not accidentally used (losing test coverage).

Now we can beef up tokenizer tests to test longer strings; for stemmers and filters I think 20 is probably fine, though.
               
