[jira] [Commented] (LUCENE-3849) position increments should be implemented by TokenStream.end()

Hudson (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13744879#comment-13744879 ]

Michael McCandless commented on LUCENE-3849:

I think many of the problematic TokenStreams disappeared because the facet module cut over from payloads to doc values (DV). But there was still one, inside DirectoryTaxonomyWriter, that I fixed in the patch: it now calls clearAttributes() and sets each attribute in incrementToken().

That's a good idea on end(); I'll do that and check all impls.

I don't see a better way than setting posInc to 0 in end() ... and I agree this bug is bad. It can also affect suggesters, e.g. if one uses ShingleFilter after StopFilter and the user's last word was a stop word.
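
To illustrate why resetting posInc in end() matters, here is a minimal plain-Java model. It does not use Lucene; `PosInc`, `baseEnd` and `filterEnd` are hypothetical stand-ins for the attribute, the base end() and a filter's end() override. It only shows that a leftover ("dirty") increment from the last token would be added on top of the skipped positions unless end() clears it first.

```java
public class EndResetDemo {
    /** Stand-in for a position increment attribute; not the Lucene class. */
    static class PosInc {
        int value = 1;
    }

    /** Models the base end(): optionally resets the increment to 0. */
    static void baseEnd(PosInc att, boolean resetInEnd) {
        if (resetInEnd) {
            att.value = 0;
        }
    }

    /**
     * Models a filter's end(): calls the base end(), then adds the positions
     * that were skipped after the last emitted token.
     */
    static int filterEnd(int lastTokenPosInc, int skippedPositions, boolean resetInEnd) {
        PosInc att = new PosInc();
        att.value = lastTokenPosInc; // dirty value left over from the last token
        baseEnd(att, resetInEnd);
        att.value += skippedPositions;
        return att.value;
    }

    public static void main(String[] args) {
        // Last token had posInc 1, then two stop words were dropped.
        System.out.println(filterEnd(1, 2, false)); // prints 3: stale increment leaks in
        System.out.println(filterEnd(1, 2, true));  // prints 2: only the trailing holes
    }
}
```

With the reset, the value reported at end() is exactly the number of trailing holes, which is what a consumer such as the indexer or a downstream filter would need.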

> position increments should be implemented by TokenStream.end()
> --------------------------------------------------------------
>                 Key: LUCENE-3849
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3849
>             Project: Lucene - Core
>          Issue Type: Bug
>    Affects Versions: 3.6, 4.0-ALPHA
>            Reporter: Robert Muir
>         Attachments: LUCENE-3849.patch, LUCENE-3849.patch, LUCENE-3849.patch, LUCENE-3849.patch
> If you have pages of a book as multivalued fields, with the default position increment gap
> of Analyzer.java (0), phrase queries won't work across pages if one ends with stopword(s).
> This is because the 'trailing holes' are not taken into account in end(). So I think in
> TokenStream.end(), subclasses of FilteringTokenFilter (e.g. StopFilter) should do:
> {code}
> super.end();
> posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + skippedPositions);
> {code}
> One problem is that these filters need to 'add' to the posInc, but currently nothing clears
> the attributes for end() [they are dirty, except offset which is set by the tokenizer].
> Also the indexer should be changed to pull posIncAtt from end().
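
The effect described in the issue can be sketched with a small plain-Java simulation. This is not Lucene code: the stop set, the sample pages and the `index` helper are assumptions made up for the illustration, and the position increment gap is taken to be 0 as in the description. The sketch assigns positions to the surviving tokens of each page, with and without counting the stop words dropped at the end of a page.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TrailingHolesDemo {
    // Hypothetical stop set for this illustration.
    static final Set<String> STOP_WORDS = Set.of("to", "the");

    /**
     * Returns the position of each non-stop word across all pages of one
     * multivalued field, assuming a position increment gap of 0.
     * If countTrailingHoles is true, stop words dropped at the end of a page
     * still advance the position, as the issue proposes for end().
     */
    static Map<String, Integer> index(List<String> pages, boolean countTrailingHoles) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        int position = -1;
        for (String page : pages) {
            int skipped = 0;
            for (String word : page.split(" ")) {
                if (STOP_WORDS.contains(word)) {
                    skipped++; // a hole left by a removed stop word
                    continue;
                }
                position += 1 + skipped; // posInc of the emitted token
                skipped = 0;
                positions.put(word, position);
            }
            if (countTrailingHoles) {
                position += skipped; // what end() would report
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> pages = List.of("he went to the", "store today");
        System.out.println(index(pages, false)); // {he=0, went=1, store=2, today=3}
        System.out.println(index(pages, true));  // {he=0, went=1, store=4, today=5}
    }
}
```

A phrase query for "went to the store" analyzes to "went", two holes, then "store", so it expects "store" three positions after "went". Only the second run, which counts the trailing holes, places the two terms that far apart.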
