[jira] [Created] (LUCENE-7795) WordDelimiterFilter produces invalid offsets in single word case


JIRA jira@apache.org
Michael Braun created LUCENE-7795:
-------------------------------------

             Summary: WordDelimiterFilter produces invalid offsets in single word case
                 Key: LUCENE-7795
                 URL: https://issues.apache.org/jira/browse/LUCENE-7795
             Project: Lucene - Core
          Issue Type: Bug
    Affects Versions: 6.5, master (7.0)
            Reporter: Michael Braun


This problem is not present in WordDelimiterGraphFilter, but it is present in WordDelimiterFilter's interaction with HTMLStripCharFilter.

Test code:

{code}
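import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class TestTokenizationIssue2 {
    public static void main(String... args) throws IOException {
        HTMLStripCharFilter charFilter = new HTMLStripCharFilter(getText());
        WhitespaceTokenizer whitespaceTokenizer = new WhitespaceTokenizer();
        whitespaceTokenizer.setReader(charFilter);

        // Swap in the graph filter to compare; it reports the expected offsets.
        // WordDelimiterGraphFilter wdgf = new WordDelimiterGraphFilter(whitespaceTokenizer,
        //         WordDelimiterGraphFilter.GENERATE_WORD_PARTS, CharArraySet.EMPTY_SET);

        WordDelimiterFilter wdgf = new WordDelimiterFilter(whitespaceTokenizer,
                WordDelimiterFilter.GENERATE_WORD_PARTS, CharArraySet.EMPTY_SET);
        CharTermAttribute charTermAttribute = wdgf.getAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttribute = wdgf.getAttribute(OffsetAttribute.class);
        wdgf.reset();

        // Print each token with its start and end offsets into the original text.
        while (wdgf.incrementToken()) {
            System.out.println(charTermAttribute.toString() + " - " + offsetAttribute.startOffset() + ',' + offsetAttribute.endOffset());
        }
        wdgf.end();
        wdgf.close();
    }

    private static Reader getText() {
        // The archive shows the decoded character here; per the description below,
        // the original input begins with a numeric character reference ("&#...;").
        return new StringReader("“Risk");
    }
}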

{code}

WordDelimiterFilter produces the offsets 1,10, while WordDelimiterGraphFilter produces 0,10. The correct result is 0,10: the original text is a numeric character reference followed by "Risk" (rendered here as “Risk), and offset 1 falls between the ampersand and the hash of that reference.

Inside WordDelimiterFilter, I believe the conditional branch "if (isSingleWord && startOffset <= savedEndOffset)" is invalid. The filter should always use the saved start and end offsets, because it cannot assume that the iterator's current and end positions are reliable markers.
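
To make the arithmetic concrete, here is a minimal sketch of the offset choice as I read it. It is a simplified model, not the Lucene source: only the quoted condition comes from the filter, while the class name, method names, and branch bodies are assumptions for illustration. The input values are also assumed (token spanning 0 to 10 in the original text, word part starting at index 1 of the filtered token), chosen to match the offsets reported above.

```java
// Simplified model of the offset choice discussed above.
// NOT the Lucene implementation: names and branch bodies are illustrative assumptions.
final class OffsetBranchSketch {

    // Models the quoted branch: for a single word, trust the iterator position
    // as the start offset whenever it does not pass the saved end offset.
    static int[] withSingleWordBranch(int savedStartOffset, int savedEndOffset,
                                      int iteratorCurrent, boolean isSingleWord) {
        int startOffset = savedStartOffset + iteratorCurrent;
        if (isSingleWord && startOffset <= savedEndOffset) {
            return new int[] {startOffset, savedEndOffset};
        }
        return new int[] {savedStartOffset, savedEndOffset};
    }

    // Models the proposed behavior: always report the saved offsets.
    static int[] alwaysSavedOffsets(int savedStartOffset, int savedEndOffset) {
        return new int[] {savedStartOffset, savedEndOffset};
    }

    public static void main(String[] args) {
        // Assumed values for the reported case.
        int[] branch = withSingleWordBranch(0, 10, 1, true);
        int[] proposed = alwaysSavedOffsets(0, 10);
        System.out.println("branch:   " + branch[0] + "," + branch[1]);     // prints "branch:   1,10"
        System.out.println("proposed: " + proposed[0] + "," + proposed[1]); // prints "proposed: 0,10"
    }
}
```

In this model the iterator position is an index into the filtered token, so adding it to a saved offset that refers to the original text can land inside the character reference.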




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
