[jira] [Created] (LUCENE-3395) FreqFilteringScorerWrapper and min/max freq options on TermQuery

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (LUCENE-3395) FreqFilteringScorerWrapper and min/max freq options on TermQuery

JIRA jira@apache.org
FreqFilteringScorerWrapper and min/max freq options on TermQuery
----------------------------------------------------------------

                 Key: LUCENE-3395
                 URL: https://issues.apache.org/jira/browse/LUCENE-3395
             Project: Lucene - Java
          Issue Type: New Feature
            Reporter: Hoss Man


A Solr User was asking about how specify a minimum tf when searching for a term (ie: documents matching "dog" at least 3 times).

Based on a conversation with rmuir on IRC, that led me to realize that we now explicitly expose a general "freq()" method on Scorer, and that min/max freq constraints could be implemented as a general Scorer Wrapper.

I propose that we add such a wrapper, and add setMinFreq(float)/setMaxFreq(float) methods to TermQuery (similar to the minNumShouldMatches and disableCoord type setters in BooleanQuery) that cause it to be used automatically.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (LUCENE-3395) FreqFilteringScorerWrapper and min/max freq options on TermQuery

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/LUCENE-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated LUCENE-3395:
-----------------------------

    Attachment: LUCENE-3395.patch

patch containing FreqFilteringScorerWrapper and a test.  I haven't yet done the work on TermQuery to add options for this -- wanted to see what people thought of it first and get some code review ... been a while since i touched code this deep in the stack.

a few things to note:

* entire class is marked experimental since it's whole existence depends on an experimental method of the Scorer API.  that said: even if we rip out Scorer.freq, i think we can still support this as a TermQuery feature since freq info will always be available from TermScorer.
* test currently has some nocommit's related to an NPE when trying to check the edge case of wrapping a Scorer that matches nothing.  i think the problem relates to some code i cut/paste from TestTermScorer for getting a Scorer from a Query+Searcher to use in the test, but it seems to optimize the Scorer to null when it matches nothing  (even if i didn't have this NPE, that getScorer method would be marked nocommit until someone verified it was in fact a "valid" way for a test to get direct access to a  Scorer)

> FreqFilteringScorerWrapper and min/max freq options on TermQuery
> ----------------------------------------------------------------
>
>                 Key: LUCENE-3395
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3395
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Hoss Man
>         Attachments: LUCENE-3395.patch
>
>
> A Solr User was asking about how specify a minimum tf when searching for a term (ie: documents matching "dog" at least 3 times).
> Based on a conversation with rmuir on IRC, that led me to realize that we now explicitly expose a general "freq()" method on Scorer, and that min/max freq constraints could be implemented as a general Scorer Wrapper.
> I propose that we add such a wrapper, and add setMinFreq(float)/setMaxFreq(float) methods to TermQuery (similar to the minNumShouldMatches and disableCoord type setters in BooleanQuery) that cause it to be used automatically.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-3395) FreqFilteringScorerWrapper and min/max freq options on TermQuery

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088873#comment-13088873 ]

Robert Muir commented on LUCENE-3395:
-------------------------------------

should we add a doNext() used by both nextDoc() and advance()

> FreqFilteringScorerWrapper and min/max freq options on TermQuery
> ----------------------------------------------------------------
>
>                 Key: LUCENE-3395
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3395
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Hoss Man
>         Attachments: LUCENE-3395.patch
>
>
> A Solr User was asking about how specify a minimum tf when searching for a term (ie: documents matching "dog" at least 3 times).
> Based on a conversation with rmuir on IRC, that led me to realize that we now explicitly expose a general "freq()" method on Scorer, and that min/max freq constraints could be implemented as a general Scorer Wrapper.
> I propose that we add such a wrapper, and add setMinFreq(float)/setMaxFreq(float) methods to TermQuery (similar to the minNumShouldMatches and disableCoord type setters in BooleanQuery) that cause it to be used automatically.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-3395) FreqFilteringScorerWrapper and min/max freq options on TermQuery

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088874#comment-13088874 ]

Robert Muir commented on LUCENE-3395:
-------------------------------------

Also, I don't think we should hook this into TermQuery? There's no reason we cant just make it a general wrapper to any scorer that exposes freq(), e.g. phrase too.

> FreqFilteringScorerWrapper and min/max freq options on TermQuery
> ----------------------------------------------------------------
>
>                 Key: LUCENE-3395
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3395
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Hoss Man
>         Attachments: LUCENE-3395.patch
>
>
> A Solr User was asking about how specify a minimum tf when searching for a term (ie: documents matching "dog" at least 3 times).
> Based on a conversation with rmuir on IRC, that led me to realize that we now explicitly expose a general "freq()" method on Scorer, and that min/max freq constraints could be implemented as a general Scorer Wrapper.
> I propose that we add such a wrapper, and add setMinFreq(float)/setMaxFreq(float) methods to TermQuery (similar to the minNumShouldMatches and disableCoord type setters in BooleanQuery) that cause it to be used automatically.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-3395) FreqFilteringScorerWrapper and min/max freq options on TermQuery

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088897#comment-13088897 ]

Hoss Man commented on LUCENE-3395:
----------------------------------

bq. should we add a doNext() used by both nextDoc() and advance()

Ah .. yeah, good call.  it took me a few minutes to figure out the "right" way to do advance, and i didn't notice they would up being essentially a cut/paste

bq. Also, I don't think we should hook this into TermQuery? There's no reason we cant just make it a general wrapper to any scorer that exposes freq(), e.g. phrase too.

Yeah, i wasn't sure how people would feel about that (hence i didn't work on it yet)

My suggestion was to do both -- that this be a public Scorer that advanced users could use to hook into anonymous subclasses of other queries; but also hook it directly into TermQuery so that it was easier to use in (what seems to me like) the simple/common case. i mainly don't see any harm in adding these options to TermQuery, we can do it so that TermScorer is only wrapped if these options are used, so theres no performance downside for existing users who don't care.

Perhaps i'm misisng something though:

1) Is there a straight forward / easy template for non-expert java users to take an arbitrary Query and override it's Scorer with a wrapper like this?
2) Is there a simple Query Wrapper we could write in conjunction with this Scorer Wrapper to make that trivial for users?

...if so, then i'm sold.  let's not fuck with TermQuery.

(In general, the vagueness of what "freq" means for anything other then Terms (ie: with sloppy phrases the value of 1.0 can mean 1 exact match, or 2 matches with different slops, etc...) is the reason i figured this would primarily be useful when dealing with TermQueries and should be particularly easy to use there.  Particularly since Scorer.freq is marked experimental and might go away -- which would mean this public FreqFilteringScorerWrapper would have to go away;  if that happens, i still think we should support something like this explicitly for TermQuery)


> FreqFilteringScorerWrapper and min/max freq options on TermQuery
> ----------------------------------------------------------------
>
>                 Key: LUCENE-3395
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3395
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Hoss Man
>         Attachments: LUCENE-3395.patch
>
>
> A Solr User was asking about how specify a minimum tf when searching for a term (ie: documents matching "dog" at least 3 times).
> Based on a conversation with rmuir on IRC, that led me to realize that we now explicitly expose a general "freq()" method on Scorer, and that min/max freq constraints could be implemented as a general Scorer Wrapper.
> I propose that we add such a wrapper, and add setMinFreq(float)/setMaxFreq(float) methods to TermQuery (similar to the minNumShouldMatches and disableCoord type setters in BooleanQuery) that cause it to be used automatically.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-3395) FreqFilteringScorerWrapper and min/max freq options on TermQuery

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088920#comment-13088920 ]

Robert Muir commented on LUCENE-3395:
-------------------------------------

{quote}
i mainly don't see any harm in adding these options to TermQuery, we can do it so that TermScorer is only wrapped if these options are used, so theres no performance downside for existing users who don't care.
{quote}

Well I think there is a downside? Because if we hook it straight into this query, then someone will want it in SpanTermQuery, and PayloadTermQuery, and PhraseQuery, and their own queries???...

{quote}
1) Is there a straight forward / easy template for non-expert java users to take an arbitrary Query and override it's Scorer with a wrapper like this?
{quote}

Yes! I think we should make a query that wraps another query to do this? Similar to ConstantScoreQuery?


> FreqFilteringScorerWrapper and min/max freq options on TermQuery
> ----------------------------------------------------------------
>
>                 Key: LUCENE-3395
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3395
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Hoss Man
>         Attachments: LUCENE-3395.patch
>
>
> A Solr User was asking about how specify a minimum tf when searching for a term (ie: documents matching "dog" at least 3 times).
> Based on a conversation with rmuir on IRC, that led me to realize that we now explicitly expose a general "freq()" method on Scorer, and that min/max freq constraints could be implemented as a general Scorer Wrapper.
> I propose that we add such a wrapper, and add setMinFreq(float)/setMaxFreq(float) methods to TermQuery (similar to the minNumShouldMatches and disableCoord type setters in BooleanQuery) that cause it to be used automatically.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-3395) FreqFilteringScorerWrapper and min/max freq options on TermQuery

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088936#comment-13088936 ]

Hoss Man commented on LUCENE-3395:
----------------------------------

bq. Because if we hook it straight into this query, then someone will want it in SpanTermQuery, and PayloadTermQuery, and PhraseQuery, and their own queries???...

They might -- but like i said, i was focused on what seemed like the common/simple case. "make the common case easy, and the uncommon case 'possible'"

bq. Yes! I think we should make a query that wraps another query to do this? Similar to ConstantScoreQuery?

I'm totally on board with the idea, i just don't really understand what the implementation should look like -- not because i don't think it's possible, i just don't personally understand how to write it.  Would the weight of the wrapper query just proxy directly to the inner query for all of the methods except scorer(..) ?

> FreqFilteringScorerWrapper and min/max freq options on TermQuery
> ----------------------------------------------------------------
>
>                 Key: LUCENE-3395
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3395
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Hoss Man
>         Attachments: LUCENE-3395.patch
>
>
> A Solr User was asking about how specify a minimum tf when searching for a term (ie: documents matching "dog" at least 3 times).
> Based on a conversation with rmuir on IRC, that led me to realize that we now explicitly expose a general "freq()" method on Scorer, and that min/max freq constraints could be implemented as a general Scorer Wrapper.
> I propose that we add such a wrapper, and add setMinFreq(float)/setMaxFreq(float) methods to TermQuery (similar to the minNumShouldMatches and disableCoord type setters in BooleanQuery) that cause it to be used automatically.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (LUCENE-3395) FreqFilteringScorerWrapper and min/max freq options on TermQuery

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/LUCENE-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088943#comment-13088943 ]

Robert Muir commented on LUCENE-3395:
-------------------------------------

{quote}
I'm totally on board with the idea, i just don't really understand what the implementation should look like – not because i don't think it's possible, i just don't personally understand how to write it. Would the weight of the wrapper query just proxy directly to the inner query for all of the methods except scorer(..) ?
{quote}

I'm just looking and typing and not testing, but...
# getValueForNormalization/normalize: I think in general these delegate similar to constant score query, because conceptually you could give this wrapping query a boost (e.g. someone says wrap("foo",min=3)^5 OR "foo"^1 <-- dumb syntax to indicate if it has more than 3 matches it gets a special boost)
# explain: i think this doesn't need to be particularly efficient, more important that its correct. So in this case the Weight would implement this method, first calling sub.scorer and checking the freq is in bounds (else no match). If the freq is in bounds, then it just delegates (yeah i know this means the sub 'explains again' but its easy?)

and the rest seem obvious to me?

> FreqFilteringScorerWrapper and min/max freq options on TermQuery
> ----------------------------------------------------------------
>
>                 Key: LUCENE-3395
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3395
>             Project: Lucene - Java
>          Issue Type: New Feature
>            Reporter: Hoss Man
>         Attachments: LUCENE-3395.patch
>
>
> A Solr User was asking about how specify a minimum tf when searching for a term (ie: documents matching "dog" at least 3 times).
> Based on a conversation with rmuir on IRC, that led me to realize that we now explicitly expose a general "freq()" method on Scorer, and that min/max freq constraints could be implemented as a general Scorer Wrapper.
> I propose that we add such a wrapper, and add setMinFreq(float)/setMaxFreq(float) methods to TermQuery (similar to the minNumShouldMatches and disableCoord type setters in BooleanQuery) that cause it to be used automatically.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]