[jira] [Comment Edited] (LUCENE-8450) Enable TokenFilters to assign offsets when splitting tokens


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576676#comment-16576676 ]

Uwe Schindler edited comment on LUCENE-8450 at 8/10/18 6:22 PM:
----------------------------------------------------------------

bq. Separately I don't like the correctOffset() method that we already have on tokenizer today. maybe it could be in the offsetattributeimpl or similar instead.

I think correctOffset should indeed be part of the OffsetAttribute (we need to extend the interface). But we have to make sure that it does not contain any hidden state. Attributes are only "beans" with getters and setters and no hidden state, and must be symmetric (if you set something with a setter, the getter must return it unmodified). They can be used as state on their own (like FlagsAttribute) to control later filters, but they should not have any hidden state that affects how the setters work.
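A minimal sketch of what that could look like (all names and the corrector hook are illustrative assumptions, not Lucene's actual API) — the attribute stays a plain bean, and the correction is an explicit operation that never changes what the setters stored:

```java
/**
 * Hypothetical sketch, not Lucene's actual API: an OffsetAttribute extended
 * with correctOffset(), kept as a plain "bean" whose getters return exactly
 * what the setters stored. The correction function is installed explicitly
 * and never modifies what setOffset() wrote.
 */
public class OffsetAttributeSketch {
    /** Maps an offset in the current stream back to the original input. */
    public interface OffsetCorrector {
        int correctOffset(int currentOff);
    }

    private int startOffset;
    private int endOffset;
    private OffsetCorrector corrector = off -> off; // identity by default

    public void setOffset(int start, int end) {
        this.startOffset = start;
        this.endOffset = end;
    }

    // Symmetric: the getters return exactly what the setter stored.
    public int startOffset() { return startOffset; }
    public int endOffset() { return endOffset; }

    public void setCorrector(OffsetCorrector corrector) {
        this.corrector = corrector;
    }

    /** Correction is a separate, explicit operation, not hidden in a setter. */
    public int correctOffset(int currentOff) {
        return corrector.correctOffset(currentOff);
    }
}
```

Calling correctOffset here leaves startOffset/endOffset untouched, which is the symmetry contract described above.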

bq. Maybe it makes sense for something like standardtokenizer to offer a "decompound hook" or something that is very limited (e.g., not a chain, just one thing) so that european language decompounders don't need to duplicate a lot of the logic around punctuation and unicode

Actually that's the real solution for decompounding and the WordDelimiterFilter. In fact, all tokenizers should support it. Maybe that can be done in the base class, with incrementToken() made final. Instead, the parsing code could push tokens that are passed to the decompounder, and incrementToken then returns them. So incrementToken is final, calls some "next" method on the tokenizer, and passes the result to the decompounder, which is a no-op by default.
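A sketch of that base-class idea (all names are made up for illustration, not actual Lucene API): incrementToken() is final, pulls raw tokens from a subclass hook, and routes each one through a pluggable decompounder that defaults to a no-op:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/**
 * Hypothetical sketch of the "decompound hook": the base tokenizer makes
 * incrementToken() final, pulls raw tokens from a subclass method, and
 * routes each one through a decompounder that is a no-op by default.
 * Names are illustrative, not actual Lucene API.
 */
abstract class HookedTokenizer {
    /** Splits one token into parts; identity (no-op) by default. */
    interface Decompounder {
        List<String> decompound(String token);
    }

    private final Deque<String> pending = new ArrayDeque<>();
    private final Decompounder decompounder;
    String currentToken;

    HookedTokenizer(Decompounder decompounder) {
        // List::of wraps the token unchanged -> the no-op default
        this.decompounder = decompounder != null ? decompounder : List::of;
    }

    /** Subclasses produce the next raw token, or null at end of input. */
    protected abstract String nextRawToken();

    /** Final: subclasses cannot bypass the decompounding step. */
    public final boolean incrementToken() {
        while (pending.isEmpty()) {
            String raw = nextRawToken();
            if (raw == null) return false;
            pending.addAll(decompounder.decompound(raw));
        }
        currentToken = pending.removeFirst();
        return true;
    }
}
```

With this shape, a European-language decompounder only supplies the Decompounder, and all the punctuation/Unicode logic stays in the tokenizer subclass.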

Another way would be to have a special type of TokenFilter whose input is not a TokenStream but a Tokenizer (the constructor takes "Tokenizer" instead of "TokenStream", and the "input" field is also a Tokenizer). In general decompounders should always come directly after the tokenizer (some of them, like dictionary-based decompounders, currently need to lowercase the token to process it, but that's a bug, IMHO). Those special TokenFilters "know" and can rely on the Tokenizer and call correctOffset on it if they split tokens.
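A toy sketch of that second variant (names and the stand-in Tokenizer are made up; this is not Lucene API): the filter's "input" field is typed as a Tokenizer, so calling correctOffset() on split points is always possible:

```java
/**
 * Hypothetical sketch with made-up names: a filter whose "input" field is
 * typed as a Tokenizer (here a minimal stand-in) rather than a TokenStream,
 * so it can call correctOffset() when it splits a token.
 */
class TokenizerChainedFilterSketch {
    /** Stand-in for a Tokenizer that exposes offset correction. */
    static class SimpleTokenizer {
        private final int removedPrefixLength;

        /** Pretend a CharFilter stripped this many leading characters. */
        SimpleTokenizer(int removedPrefixLength) {
            this.removedPrefixLength = removedPrefixLength;
        }

        int correctOffset(int currentOff) {
            return currentOff + removedPrefixLength;
        }
    }

    static class DecompoundingFilter {
        private final SimpleTokenizer input; // a Tokenizer, not a TokenStream

        DecompoundingFilter(SimpleTokenizer input) {
            this.input = input;
        }

        /** Corrected start/split/end offsets for a token split in two. */
        int[] splitOffsets(int start, int splitPoint, int end) {
            return new int[] {
                input.correctOffset(start),
                input.correctOffset(splitPoint),
                input.correctOffset(end)
            };
        }
    }
}
```

Because the constructor only accepts a Tokenizer, the type system itself enforces "decompounder directly after the tokenizer".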



> Enable TokenFilters to assign offsets when splitting tokens
> -----------------------------------------------------------
>
>                 Key: LUCENE-8450
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8450
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Major
>         Attachments: offsets.patch
>
>
> CharFilters and TokenFilters may alter token lengths, meaning that subsequent filters cannot perform simple arithmetic to calculate the original ("correct") offset of a character in the interior of the token. A similar situation exists for Tokenizers, but these can call CharFilter.correctOffset() to map offsets back to their original location in the input stream. There is no such API for TokenFilters.
> This issue calls for adding an API to support use cases like highlighting the correct portion of a compound token. For example, the German word "außerstand" (meaning, afaict, "unable to do something") will be decompounded and match "stand" and "ausser", but as things are today, offsets are always set using the start and end of the tokens produced by the Tokenizer, meaning that highlighters will match the entire compound.
> I'm proposing to add this method to `TokenStream`:
> {{     public CharOffsetMap getCharOffsetMap();}}
> referencing a CharOffsetMap with these methods:
> {{     int correctOffset(int currentOff);}}
> {{     int uncorrectOffset(int originalOff);}}
>
> The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from original offset forward to the current "offset space".
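An illustrative sketch of the proposed interface (the interface shape follows the issue description; the concrete shift-by-3 mapping below is a made-up example, as if a CharFilter had deleted the first 3 characters of the input):

```java
/**
 * Sketch of the CharOffsetMap proposed in this issue. The example mapping
 * is hypothetical: it models a CharFilter that deleted the first 3
 * characters, so current offsets sit 3 positions before the original ones.
 */
class CharOffsetMapSketch {
    interface CharOffsetMap {
        int correctOffset(int currentOff);     // current stream -> original
        int uncorrectOffset(int originalOff);  // original -> current stream
    }

    static CharOffsetMap shiftedBy3() {
        return new CharOffsetMap() {
            @Override public int correctOffset(int currentOff) {
                return currentOff + 3;
            }
            @Override public int uncorrectOffset(int originalOff) {
                // Pseudo-inverse: positions inside the deleted prefix
                // cannot be mapped exactly, so clamp at 0.
                return Math.max(0, originalOff - 3);
            }
        };
    }
}
```

For any offset that survives the mapping, uncorrectOffset(correctOffset(x)) == x, which is the pseudo-inverse relationship described above.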



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
