[jira] [Created] (LUCENE-7795) WordDelimiterFilter produces invalid offsets in single word case

JIRA jira@apache.org
Michael Braun created LUCENE-7795:

             Summary: WordDelimiterFilter produces invalid offsets in single word case
                 Key: LUCENE-7795
                 URL: https://issues.apache.org/jira/browse/LUCENE-7795
             Project: Lucene - Core
          Issue Type: Bug
    Affects Versions: 6.5, master (7.0)
            Reporter: Michael Braun

This problem occurs when WordDelimiterFilter is used downstream of HTMLStripCharFilter. It is not present in WordDelimiterGraphFilter.

Test code:

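import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class TestTokenizationIssue2 {
    public static void main(String... args) throws IOException {
        HTMLStripCharFilter charFilter = new HTMLStripCharFilter(getText());
        WhitespaceTokenizer whitespaceTokenizer = new WhitespaceTokenizer();
        whitespaceTokenizer.setReader(charFilter);
        // WordDelimiterGraphFilter wdgf = new WordDelimiterGraphFilter(whitespaceTokenizer,
        //        WordDelimiterGraphFilter.GENERATE_WORD_PARTS, CharArraySet.EMPTY_SET);

        WordDelimiterFilter wdgf = new WordDelimiterFilter(whitespaceTokenizer,
                WordDelimiterFilter.GENERATE_WORD_PARTS, CharArraySet.EMPTY_SET);
        CharTermAttribute charTermAttribute = wdgf.getAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = wdgf.getAttribute(OffsetAttribute.class);

        wdgf.reset(); // required before the first incrementToken()
        while (wdgf.incrementToken()) {
            System.out.println(charTermAttribute.toString() + " - "
                    + offsetAttribute.startOffset() + ',' + offsetAttribute.endOffset());
        }
        wdgf.end();
        wdgf.close();
    }

    private static Reader getText() {
        // Numeric character reference followed by a word: 10 characters in total.
        return new StringReader("&#147;Risk");
    }
}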

The offsets produced by WordDelimiterFilter are 1,10. With WordDelimiterGraphFilter the offsets produced are 0,10. The result should be 0,10, because the original text is &#147;Risk (10 characters), and offset 1 falls between the ampersand and the hash.

Inside WordDelimiterFilter, I believe the conditional branch "if (isSingleWord && startOffset <= savedEndOffset)" is invalid. The filter should always use the saved start and end offsets in that case, because it cannot assume that the iterator's current and end positions are reliable markers once a char filter has changed the length of the text.
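
To make the suspected branch concrete, here is a small standalone model of the offset selection. It is a sketch, not the Lucene source: only the quoted condition comes from WordDelimiterFilter, while the hasIllegalOffsets check, the method names, and the partStart/partEnd parameters (standing in for the iterator's current and end) are assumptions made for illustration. It also assumes the input is the 10-character text &#147;Risk and that the token reaching the filter is 5 characters long (one decoded character plus "Risk").

```java
class OffsetBranchSketch {

    // Models the behavior described in this report (assumed structure).
    static int[] currentBehavior(int savedStartOffset, int savedEndOffset, int termLength,
                                 int partStart, int partEnd, boolean isSingleWord) {
        int startOffset = savedStartOffset + partStart;
        int endOffset = savedStartOffset + partEnd;
        // Assumption: offsets count as "illegal" when the original span and the
        // token length disagree, e.g. after a char filter decoded an entity.
        boolean hasIllegalOffsets = (savedEndOffset - savedStartOffset) != termLength;
        if (hasIllegalOffsets) {
            if (isSingleWord && startOffset <= savedEndOffset) {
                // The branch questioned above: mixes a token-relative start
                // with an end offset measured in the original text.
                return new int[] {startOffset, savedEndOffset};
            }
            return new int[] {savedStartOffset, savedEndOffset};
        }
        return new int[] {startOffset, endOffset};
    }

    // Models the proposed change: always fall back to the saved offsets.
    static int[] proposedBehavior(int savedStartOffset, int savedEndOffset, int termLength,
                                  int partStart, int partEnd) {
        boolean hasIllegalOffsets = (savedEndOffset - savedStartOffset) != termLength;
        if (hasIllegalOffsets) {
            return new int[] {savedStartOffset, savedEndOffset};
        }
        return new int[] {savedStartOffset + partStart, savedStartOffset + partEnd};
    }

    public static void main(String[] args) {
        // Token spans 0..10 in the original text but is 5 chars long;
        // the word part "Risk" sits at 1..5 inside the token.
        int[] cur = currentBehavior(0, 10, 5, 1, 5, true);
        int[] fix = proposedBehavior(0, 10, 5, 1, 5);
        System.out.println("current:  " + cur[0] + "," + cur[1]);  // current:  1,10
        System.out.println("proposed: " + fix[0] + "," + fix[1]);  // proposed: 0,10
    }
}
```

Under these assumptions the model reproduces the reported 1,10, and the always-use-saved-offsets variant yields 0,10, matching WordDelimiterGraphFilter.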
