[jira] [Comment Edited] (LUCENE-8651) Tokenizer implementations can't be reset



JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-8651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16748798#comment-16748798 ]

Dan Meehl edited comment on LUCENE-8651 at 1/22/19 5:30 PM:
------------------------------------------------------------

This is basically the same solution I came to in LUCENE-8650 (patch 2). I ended up calling mine KeywordTokenStream to keep the naming in line with KeywordTokenizer, since it does essentially the same thing.

Honestly, though, the lifecycle of a Tokenizer still feels wrong to me. All other TokenStreams have a reset(), incrementToken(), end(), close() lifecycle, but Tokenizer has an extra setReader() step, and the consumer must know that it's a Tokenizer and call that extra step (assuming it even has access to the Reader). It feels to me like Tokenizer should conform to the same lifecycle as every other TokenStream. Or, at least, if that can't be true, Tokenizer implementations should be able to set their reader by overriding reset(). That currently can't be done because inputPending, setReader() and ILLEGAL_STATE_READER are final. If it could be done, one could write a Tokenizer implementation that conformed to the TokenStream lifecycle, and the consumer wouldn't have to know anything about Tokenizer. After all, that is the point of an abstraction like this: if a consumer takes a TokenStream, it knows what the lifecycle is.
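To make the lifecycle argument concrete, here is a simplified model in plain Java. These are illustrative stand-ins, not the real org.apache.lucene.analysis classes (the names mirror the Lucene API, but the bodies are mine); the point is that a consumer coded against the plain TokenStream contract can reuse a value-holding stream without any setReader()-style extra step.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the TokenStream contract:
// reset() -> incrementToken()* -> end() -> close().
abstract class SimpleTokenStream {
    abstract void reset();               // return to a clean, reusable state
    abstract boolean incrementToken();   // advance; false when exhausted
    void end() {}                        // finalize (offsets etc. in real Lucene)
    void close() {}                      // release resources
    abstract String term();              // current token text (stand-in for CharTermAttribute)
}

// A value-holding stream in the spirit of KeywordTokenizer: it emits its
// whole input as a single token, and reset() alone makes it reusable.
class KeywordTokenStream extends SimpleTokenStream {
    private final String value;
    private boolean emitted;

    KeywordTokenStream(String value) { this.value = value; }

    @Override void reset() { emitted = false; }
    @Override boolean incrementToken() {
        if (emitted) return false;
        emitted = true;
        return true;
    }
    @Override String term() { return value; }
}

public class LifecycleDemo {
    // A consumer that only knows the TokenStream contract; it needs no
    // knowledge of any Tokenizer-specific extra step.
    static List<String> consume(SimpleTokenStream ts) {
        List<String> tokens = new ArrayList<>();
        ts.reset();
        while (ts.incrementToken()) tokens.add(ts.term());
        ts.end();
        ts.close();
        return tokens;
    }

    public static void main(String[] args) {
        SimpleTokenStream ts = new KeywordTokenStream("hello");
        // The same instance is consumed twice: reset() restores a clean
        // state, just as the TokenStream.reset() javadoc asks.
        System.out.println(consume(ts)); // [hello]
        System.out.println(consume(ts)); // [hello]
    }
}
```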

If the lifecycle of Tokenizer is to stay the same, I'd like to propose a documentation update on TokenStream and Tokenizer. I can take a swing at that and post a patch if you'd like.



> Tokenizer implementations can't be reset
> ----------------------------------------
>
>                 Key: LUCENE-8651
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8651
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Dan Meehl
>            Priority: Major
>         Attachments: LUCENE-8650-2.patch, LUCENE-8651.patch, LUCENE-8651.patch
>
>
> The fine print here is that they can't be reset without calling setReader() every time before reset() is called. The reason for this is that Tokenizer violates the contract put forth by TokenStream.reset() which is the following:
> "Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh."
> Tokenizer implementations' reset() functions can't reset in that manner because Tokenizer.close() removes the reference to the underlying Reader (see LUCENE-2387). The catch-22 here is that we don't want to unnecessarily keep a Reader around (memory leak), but we would like to be able to reset() when necessary.
> The patches include an integration test that attempts to use a ConcatenatingTokenStream to join an input TokenStream with a KeywordTokenizer TokenStream. This test fails with an IllegalStateException thrown by Tokenizer.ILLEGAL_STATE_READER.
>  
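The failure mode in the quoted description can be sketched with another simplified model. Again, this is not the real Tokenizer code, just a mock that mirrors its shape (inputPending, setReader(), and an ILLEGAL_STATE_READER stand-in are my simplifications): close() drops the Reader reference, so a later reset() without a fresh setReader() leaves the stream reading from the illegal-state reader.

```java
import java.io.Reader;
import java.io.StringReader;

class SimpleTokenizer {
    // Stand-in for Tokenizer.ILLEGAL_STATE_READER: any read throws.
    private static final Reader ILLEGAL_STATE_READER = new Reader() {
        @Override public int read(char[] cbuf, int off, int len) {
            throw new IllegalStateException(
                "TokenStream contract violation: setReader() call missing");
        }
        @Override public void close() {}
    };

    private Reader inputPending = ILLEGAL_STATE_READER;
    private Reader input = ILLEGAL_STATE_READER;

    final void setReader(Reader reader) { inputPending = reader; }

    void reset() {
        // reset() promotes the pending reader; without a fresh
        // setReader() there is nothing to promote.
        input = inputPending;
        inputPending = ILLEGAL_STATE_READER;
    }

    void close() {
        // Mirrors the LUCENE-2387 behavior: drop the Reader reference so
        // it isn't retained, which is exactly what breaks reuse.
        input = ILLEGAL_STATE_READER;
        inputPending = ILLEGAL_STATE_READER;
    }

    int read() throws java.io.IOException { return input.read(); }
}

public class ResetDemo {
    public static void main(String[] args) throws Exception {
        SimpleTokenizer t = new SimpleTokenizer();
        t.setReader(new StringReader("abc"));
        t.reset();
        System.out.println((char) t.read()); // a
        t.close();

        t.reset(); // no setReader() this time...
        try {
            t.read(); // ...so the illegal-state reader throws
        } catch (IllegalStateException e) {
            System.out.println("IllegalStateException, as in the failing test");
        }
    }
}
```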



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
