[jira] [Commented] (LUCENE-8403) Support 'filtered' term vectors - don't require all terms to be present


Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16916316#comment-16916316 ]

David Smiley commented on LUCENE-8403:

Atri, I appreciate that you put some effort into this, but your patch wouldn't work for the use case that inspired this feature request.  The terms to be omitted from the term vector are identified by a pattern, not by a fixed, predetermined list.  For example, imagine filtering out all terms that start or end with a special character.
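
To make the pattern-based case concrete, here is a minimal sketch of such a filter in plain Java. The class name, the KEEP predicate, and the definition of "special character" (anything that is not a letter or digit) are all assumptions made for illustration; none of this is Lucene API or part of any patch on this issue.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Lucene: decides which terms a
// filtering term-vector writer would keep.
class TermVectorFilterSketch {

    // Assumption: a "special character" is anything other than a letter or digit.
    // The pattern finds such a character at the start or at the end of the term.
    static final Pattern SPECIAL_EDGE =
        Pattern.compile("^[^\\p{L}\\p{N}]|[^\\p{L}\\p{N}]$");

    // Keep a term only if neither edge is a special character.
    static final Predicate<String> KEEP = term -> !SPECIAL_EDGE.matcher(term).find();

    static List<String> filter(List<String> terms) {
        return terms.stream().filter(KEEP).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [lucene, mid-dle, 42]
        System.out.println(filter(List.of("lucene", "#tag", "end$", "mid-dle", "42")));
    }
}
```

Because the decision is a predicate over the term text, the set of omitted terms cannot be enumerated up front, which is why a fixed list does not cover this use case.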

But this issue is stuck until the concern Robert raised, CheckIndex, is addressed.  I don't recall exactly where in CheckIndex.java it complains, but try it out on your patch to see.  Given that tests run CheckIndex automatically on a randomized basis, I suspect your patch will eventually fail given enough iterations.
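
As a rough illustration of the kind of consistency check at stake, the sketch below follows the issue description's account (terms present in the postings are expected to be present in the term vectors too). It is not CheckIndex code and does not claim to mirror its actual logic; the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of the invariant described in the issue;
// this is not how CheckIndex.java is implemented.
class TermVectorCrossCheckSketch {

    // Returns the terms indexed for a document's field that its term vector lacks.
    // A non-empty result is what a strict cross-check would report as a failure.
    static Set<String> missingFromTermVector(Set<String> postingsTerms, Set<String> vectorTerms) {
        Set<String> missing = new TreeSet<>(postingsTerms);
        missing.removeAll(vectorTerms);
        return missing;
    }

    public static void main(String[] args) {
        Set<String> postings = Set.of("lucene", "#tag", "end$");
        Set<String> filteredVector = Set.of("lucene"); // special-edge terms omitted
        // prints [#tag, end$]
        System.out.println(missingFromTermVector(postings, filteredVector));
    }
}
```

Under this simplified model, any filtered term vector produces a non-empty result, so a filtering format would need the check to know which terms were omitted on purpose.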

> Support 'filtered' term vectors - don't require all terms to be present
> -----------------------------------------------------------------------
>                 Key: LUCENE-8403
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8403
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Braun
>            Priority: Minor
>         Attachments: LUCENE-8403.patch
> The genesis of this was a conversation and idea from [~dsmiley] several years ago.
> In order to optimize term vector storage, we may not actually need all tokens to be present in the term vectors - and if so, ideally our codec could just opt not to store them.
> I attempted to fork the standard codec and override the TermVectorsFormat and TermVectorsWriter to skip storing certain terms within a field. This worked; however, CheckIndex checks that the terms present in the standard postings are also present in the TVs, if TVs are enabled. So the resulting index is not 'valid' according to CheckIndex.
> Can the TermVectorsFormat be designed to support configuring which tokens should not be stored (benefits: less storage, faster retrieval per doc)? Is this valuable to the wider community? Is there a way to design this so that it does not break CheckIndex's contract while still reducing storage for unneeded tokens?

This message was sent by Atlassian Jira
