Number Proximity Query

KEGan
Hi,

Is there a way to query for all numbers that are close to a particular number
(the query), and score them by how close they are to that number?

To illustrate further, assume documents with a single field "num", whose
value can only be an integer. Now, let's say there are 3 documents
(doc1, doc2, doc3) with the following "num" values: 1, 10, and 100. If we
query for 12, the results should be ordered doc2, doc1, doc3, since 10 is
closest to 12, followed by 1, and finally 100.
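
The desired ordering can be sketched in plain Java, with no Lucene classes involved. The class and method names below are made up for this illustration; it only shows the ranking rule (sort by absolute distance from the query value), not how such a rule would be plugged into a Lucene Query.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; hypothetical names, no Lucene API is used.
class ProximityRanking {

    // Returns the document names ordered by |num - query|, closest first.
    static List<String> rankByProximity(Map<String, Integer> numByDoc, int query) {
        List<String> docs = new ArrayList<>(numByDoc.keySet());
        docs.sort(Comparator.comparingInt((String d) -> Math.abs(numByDoc.get(d) - query)));
        return docs;
    }

    public static void main(String[] args) {
        Map<String, Integer> numByDoc = new LinkedHashMap<>();
        numByDoc.put("doc1", 1);
        numByDoc.put("doc2", 10);
        numByDoc.put("doc3", 100);
        // Distances from 12 are 11, 2 and 88, so this prints [doc2, doc1, doc3].
        System.out.println(rankByProximity(numByDoc, 12));
    }
}
```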

Is there a way to do this easily in Lucene?

From my searches, there seems to be a FunctionQuery in Solr that can do this
type of query. But I am using pure Lucene, and trying to port the Solr code
over (to create my own version of FunctionQuery) looks too complicated because
of its dependencies on other Solr code such as ValueSource, etc.

I have also searched for how to write my own Query implementation, but there
is a lack of documentation on doing so. The formula to calculate the number
proximity is quite trivial, but how to stitch together Query, Weight, and
Scorer is the problem :(

Any suggestions are greatly appreciated :)

~KEGan

Re: Number Proximity Query

Chris Hostetter-3

: From my searches, there seems to be a FunctionQuery in Solr that can do this
: type of query. But I am using pure Lucene, and trying to port Solr code over
: (to create my own version of FunctionQuery) looks too complicated because of
: code dependency on other Solr code such as ValueSource, etc.

ValueSource isn't really "other Solr code" .. it's an inherent part of
FunctionQuery (hence it's in the same package).

You should be able to use everything in the
org.apache.solr.search.function package as is without any other Solr code.

: I have also search on how to write my own query instance, but there is lack
: of documentation on doing so. The formula to calculate the number proximity
: is quite trivial. But how to stitch together Query, Weight, Scorer is the
: problem :(

Check out the package documentation for org.apache.lucene.search,
particularly section #3 "Changing the Scoring" ...

http://lucene.apache.org/java/docs/api/org/apache/lucene/search/package-summary.html#scoring




-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Number Proximity Query

KEGan
Thanks Chris.

After spending half a day "really" looking into FunctionQuery (and related
classes), and re-reading about Weight and Scorer, I think I am beginning to
understand a bit. But I have more questions.

(1) Must the values returned by DocValues (returned from ValueSource)
always be between 1.0 and 0.0? How does this value affect the overall document
score, assuming there are other Query clauses as well that are performed on
the document (on other fields)?

(2) The documentation on the following methods is extremely lacking (no
matter where I looked). Can any expert here help out?

-- Weight.getValue() : what value should be returned for
NumberProximityQuery?
-- Weight.sumOfSquaredWeights() : no idea what this is for???
-- Weight.normalize() : still no idea
-- Scorer.score() : should this value always be between 1.0 and 0.0?



Thanks.
~KEGan



Re: Number Proximity Query

Erick Erickson
Sorry if this is a re-post, but I got an "undeliverable" error last time I
tried to post it, something about SPAM. The nerve of that filter!

----------------
I don't have my book handy, but you might want to check out "Lucene in
Action". There's an example of how to create an index of restaurants and
execute a query that orders the responses by the distance to the closest
restaurant.

I think a similar technique (although probably an easier implementation)
could apply to your problem. Unfortunately, I don't remember the details
well enough to say much more....

Could you accomplish this by implementing your own sort? I have no real idea
whether that's applicable, but it did occur to me......

Not much help, but a start <G>.

Erick


Re: Number Proximity Query

KEGan
Erick, thanks for your reply.

I have LIA, but sorting is not the solution I am looking for. If I sort, I
will lose the relevancy from searches on other fields. I want the number
proximity to be one of many fields that are searched, so that the "num"
field contributes to the overall document score.

~KEGan



Re: Number Proximity Query

Chris Hostetter-3
In reply to this post by KEGan

: (1) Must the values returned by DocValues (returned from ValueSource)
: always be between 1.0 and 0.0? How does this value affect the overall document
: score, assuming there are other Query clauses as well that are performed on
: the document (on other fields)?

The "values" returned by the various methods in DocValues can be anything
you want, as long as they conform to the primitive datatype in the method
signature -- ideally whatever float floatVal returns should be roughly
equivalent to whatever double you return from doubleVal, so that your
ValueSource appears to behave the same way regardless of whatever other
ValueSources you may compose it in.

How the values affect the overall Document score really depends on how
big the values you return are, and how you compose your FunctionQuery with
the other parts of your query (presumably in a BooleanQuery).


Take a look at LinearFloatFunction and the DocValues it produces.  It's a
good example of what a "function" you want to be able to compose in a
function query should do.

If I remember your problem statement correctly, all you really need is an
AbsoluteValueFunction that you could compose with some linear functions,
a la:  linear(abs(linear(field(x), -1, N)), -1, M)

...where N is the magic number you want your doc vals to be close to, M is
the biggest value you ever want your FunctionQuery to score a document
with, and field(x) is where you put a FieldCacheSource on the field you
care about.
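
To make the arithmetic of that composition concrete, here is a plain-Java sketch that does not use the Solr classes at all. It assumes that linear(v, slope, intercept) means slope * v + intercept, in which case the whole expression reduces to M - |N - x|. The class name, method names, and the value chosen for M are hypothetical.

```java
// Plain-Java sketch of the arithmetic only; hypothetical names, no Solr API is used.
class ProximityFunction {

    // Assumed meaning of linear(v, slope, intercept): slope * v + intercept.
    static float linear(float v, float slope, float intercept) {
        return slope * v + intercept;
    }

    // linear(abs(linear(x, -1, N)), -1, M) reduces to M - |N - x|.
    static float proximity(float x, float n, float m) {
        return linear(Math.abs(linear(x, -1f, n)), -1f, m);
    }

    public static void main(String[] args) {
        float n = 12f;  // target value from the original question
        float m = 100f; // hypothetical maximum score, chosen for illustration
        for (float x : new float[] {1f, 10f, 100f}) {
            // Prints 89.0, 98.0 and 12.0 for the three "num" values.
            System.out.println(x + " -> " + proximity(x, n, m));
        }
    }
}
```

With these numbers the document holding 10 gets the highest value, then 1, then 100, which is the ordering asked for. Values far enough from N fall below zero, so M needs to be picked with the expected range of the field in mind.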

Your AbsoluteValueFunction would look a lot like LinearFloatFunction, without
the slope or intercept, and with a getValues method that looks something like...

  public DocValues getValues(IndexReader reader) throws IOException {
    // Wrap the DocValues of the underlying source
    final DocValues vals =  source.getValues(reader);
    return new DocValues() {
      public float floatVal(int doc) {
        return Math.abs(vals.floatVal(doc));
      }
      ...

: (2) The documentation on the following methods is extremely lacking (no
: matter where I looked). Can any expert here help out?
:
: -- Weight.getValue() : what value should be returned for
: NumberProximityQuery?
: -- Weight.sumOfSquaredWeights() : no idea what this is for???
: -- Weight.normalize() : still no idea
: -- Scorer.score() : should this value always be between 1.0 and 0.0?

I honestly don't remember off the top of my head what those methods are
for or how they come into play -- the scoring.html doc that's in progress
should help clear up some of that.  As I recall, the last time I needed to
understand what those methods did, I looked at some of the primitive query
types (like TermQuery and BooleanQuery) to see what they did.

The one thing I can tell you with certainty is that there is nothing
magical about scores between 0.0 and 1.0 -- the notion that Lucene scores
are between 0 and 1 is a myth perpetuated by the Hits interface, which does
a mock-normalization of the scores if the highest is greater than 1.
When you get down into the bowels of scoring, the scores can be any float
-- even negative numbers are legal scores.
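
The kind of mock-normalization described here can be sketched in plain Java. This is an illustration under an assumption, not the Hits source code: it assumes the scaling is a division by the top score, applied only when that top score is greater than 1. The class and method names are hypothetical.

```java
// Illustrative sketch only; assumes normalization means dividing by the top score.
class ScoreNormalization {

    // Scales all scores by 1/max when max > 1.0, otherwise leaves them unchanged.
    static float[] mockNormalize(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        float norm = (rawScores.length > 0 && max > 1.0f) ? 1.0f / max : 1.0f;
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            out[i] = rawScores[i] * norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Raw scores 4, 2, 1 come out as 1.0, 0.5, 0.25.
        for (float s : mockNormalize(new float[] {4f, 2f, 1f})) {
            System.out.println(s);
        }
    }
}
```

Under this assumption a result list whose best raw score is already at or below 1 passes through untouched, which is why the 0-to-1 range can look like a rule when it is only a display convention.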


-Hoss



Re: Number Proximity Query

KEGan
Chris, thanks again for your reply. Really appreciate your help.

Another quick question on the score. If my custom Query returns a score
that can be any value, and this custom Query is used together with other
standard Queries in a BooleanQuery, how do I ensure the value returned by
the custom Query doesn't 'overshadow' the values returned by the other
standard Queries??

For example, if the custom Query returns values in the range 0-1,000, whereas
the other standard Queries return values in the range 0-5, then the score from
the custom Query virtually dominates the overall document score.

I am not sure if I am asking the correct question :) It's 3am now. I will
write more tomorrow. Good night :)

~KEGan



Re: Number Proximity Query

Chris Hostetter-3

: Another quick question on the score. If my custom Query returns a score
: that can be any value, and this custom Query is used together with other
: standard Queries in a BooleanQuery, how do I ensure the value returned by
: the custom Query doesn't 'overshadow' the values returned by the other
: standard Queries??
        ...
: I am not sure if I am asking the correct question :) It's 3am now. I will
: write more tomorrow. Good night :)

Your question is not only 'correct' but very astute - unfortunately I
don't have a good answer for you -- there is no one solution to deal with
this problem, for many of the same reasons why trying to make value
comparisons about the scores from different queries, or trying to "filter
by score" doesn't work -- there is no "upper bound" on the score that any
one query can produce, so there is no 100% safe way to ensure that you
fairly weight the score contributions of two arbitrary clauses of a
boolean query.

What you can do is try to mitigate the effects, based on what you know
about the various queries ... if you have 3 major clauses: one parsed
from your user input, one built automatically based on some criteria, and
one that's a fixed function query, you can look at the typical queries
produced by your users, and the general structure of the automatically
generated clause, and the range of values produced by your function, and
come up with boosts for each that work "well enough" in the common case.

The Explanation class is your friend while working out the boosts you
want.

The FunctionQuery package also has a few little gems that help you
mitigate the potential range of values you produce... MaxFloatFunction
can help you ensure that your values are above a certain "hard" threshold,
wrapping that in a LinearFloatFunction with a negative slope can help you
ensure that the values are *below* a hard threshold ... the OrdFieldSource
and ReverseOrdFieldSource are also extremely useful when you care about
the ordering of Documents by a field value, but not the relative
differences between those values.
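
The max-plus-negative-slope trick can be worked through in plain Java without the Solr classes. This sketch assumes that a max function returns the larger of its input and a constant, and that linear(v, slope, intercept) means slope * v + intercept; under those assumptions, linear(max(linear(v, -1, cap), 0), -1, cap) equals min(v, cap). The class name, method names, and the cap value are hypothetical.

```java
// Plain-Java sketch of the bounding arithmetic; hypothetical names, no Solr API is used.
class ScoreBounding {

    // Assumed behaviour of a max function: never return less than the floor.
    static float atLeast(float v, float floor) {
        return Math.max(v, floor);
    }

    // Assumed meaning of linear(v, slope, intercept): slope * v + intercept.
    static float linear(float v, float slope, float intercept) {
        return slope * v + intercept;
    }

    // linear(max(linear(v, -1, cap), 0), -1, cap) works out to min(v, cap).
    static float atMost(float v, float cap) {
        return linear(atLeast(linear(v, -1f, cap), 0f), -1f, cap);
    }

    public static void main(String[] args) {
        float cap = 5f; // hypothetical ceiling, matching the 0-5 range from the question
        System.out.println(atMost(1000f, cap)); // prints 5.0
        System.out.println(atMost(3f, cap));    // prints 3.0
    }
}
```

A hard cap like this keeps a function clause from dominating the other clauses, at the cost of treating every value above the cap as equally good; boosts would still need to be tuned on top of it.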




-Hoss

