Numerical Range Query

Numerical Range Query

Dan Hardiker
Hi,

I've got an application which stores ratings for content in a Lucene
index. It works a treat for the most part, apart from one use case:
filtering out content that has fewer than a given number of rates. It
kinda works, but it seems to range alphabetically rather than
numerically.

Here is the Java code I am using:

luceneQuery.add( new RangeQuery( new Term(RateUtils.SF_FILTERED_CNT,
minRatesString), null, true), BooleanClause.Occur.MUST );

For context:

* luceneQuery is an org.apache.lucene.search.BooleanQuery
* RateUtils.SF_FILTERED_CNT is the String containing the appropriate
field name "rating-filtered-count"
* minRatesString is an integer as a String

Here is where the field is added into the index:

document.add( new Field(RateUtils.SF_FILTERED_CNT, String.valueOf(
filteredCount ), Field.Store.YES, Field.Index.UN_TOKENIZED) );

For context:

* document is an org.apache.lucene.document.Document
* filteredCount is an int (counting the number of rates that have occurred)

Unfortunately it doesn't work quite as I expected. Suppose I have 5
documents in the index:

# 5 ratings
# 9 ratings
# 1 rating
# 0 ratings
# 11 ratings

If minRatesString is "5" then only the first document is returned; if
it's "1" then the 3rd and 5th are returned; if it's "6" then none are
returned. It appears to be filtering alphabetically (starting with the
first digit/character and matching on that) rather than numerically.

Oddly enough, if I sort on that field ... it works as I expect.

Am I missing something?


--
Dan Hardiker

PS: I've been googling for well over an hour. If I'm not searching with
the right terms, please advise me! I tried to find a way to search the
archives specifically, but I could only browse them month by month.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Numerical Range Query

Erick Erickson
Yep, Lucene works with strings, not numbers, so the fact that you're
not getting what you expect is expected <G>.
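
To make the string-versus-number point concrete, here is a minimal plain-Java sketch with no Lucene involved. The class and method names are hypothetical. It only shows how unpadded digit strings compare character by character; it does not reproduce the exact results reported in the original post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class StringOrderDemo {
    /** Returns the terms that compare >= lower as strings, in string order. */
    static List<String> atLeast(List<String> terms, String lower) {
        List<String> sorted = new ArrayList<>(terms);
        Collections.sort(sorted);               // lexicographic, character by character
        List<String> result = new ArrayList<>();
        for (String t : sorted) {
            if (t.compareTo(lower) >= 0) {
                result.add(t);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The five rating counts from the original post, stored as plain strings.
        List<String> counts = Arrays.asList("5", "9", "1", "0", "11");
        System.out.println(atLeast(counts, "5"));      // [5, 9]
        System.out.println("11".compareTo("5") < 0);   // true
    }
}
```

As strings, "11" sorts before "5" because '1' comes before '5', so a lower bound of "5" leaves out 11 even though 11 is numerically larger.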

Although I'm a bit puzzled by what you're actually getting back.
You might try using Luke to look at your index to see what's
there.

See the NumberTools class for some help here.

BTW, at least in Lucene 2.1, the preferred way to go about this
would be ConstantScoreRangeQuery...

Best
Erick


Re: Numerical Range Query

Dan Hardiker
Erick Erickson wrote:
> Although I'm a bit puzzled by what you're actually getting back.
> You might try using Luke to look at your index to see what's
> there.

I've looked through with Luke and it doesn't look like much has changed
between using NumberTools and not. NumberTools definitely does some
padding which makes sense, however even though I'm using that, Lucene or
Luke seems to be boiling it down to just the number. I'm not sure which.

> See the NumberTools class for some help here.......
>
> BTW, at least in Lucene 2.1, the preferred way to go about this
> would be ConstantScoreRangeQuery...

Taking your advice I'm now indexing using:

document.add( new Field(RateUtils.SF_FILTERED_CNT,
NumberTools.longToString( filteredCount ), Field.Store.YES,
Field.Index.UN_TOKENIZED) );

and searching using:

int minRates = Long.valueOf( minRatesString ).intValue();
luceneQuery.add( new ConstantScoreRangeQuery( RateUtils.SF_FILTERED_CNT,
NumberTools.longToString(minRates), "", true, false ),
BooleanClause.Occur.MUST );

I get very odd results back now, but they seem to work similarly. The
documentation for ConstantScoreRangeQuery is rather thin; however, I did
find this example, which suggests I'm doing the right thing:

http://github.com/we4tech/semantic-repository/tree/master/development/idea-repository-core/src/main/java/com/ideabase/repository/core/index/ExtendedQueryParser.java

The code _looks_ like it should work and it makes sense logically, but it
still doesn't do what I'm expecting.

I've tried changing the indexing over to Field.Index.NO_NORMS and it
makes the field value "0000000000000b" instead of "11", and
"00000000000002" instead of "2" ... but that meant that the searching
didn't pick up on that field _at all_.
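
The two padded values quoted above look like a fixed-width base-36 encoding with a leading marker character. The sketch below reproduces those two strings for non-negative values and shows that such strings compare in numeric order. The helper `encode` is hypothetical and written for illustration; it is not the NumberTools source, and negative numbers are deliberately left out.

```java
final class PaddedLongDemo {
    private static final int RADIX = 36;
    private static final int DIGITS = 13;   // Long.MAX_VALUE needs 13 digits in base 36

    /** Hypothetical helper: fixed-width base-36 form of a non-negative long. */
    static String encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("non-negative values only in this sketch");
        }
        String digits = Long.toString(value, RADIX);
        StringBuilder sb = new StringBuilder("0");   // leading marker for non-negative values
        for (int i = digits.length(); i < DIGITS; i++) {
            sb.append('0');                          // left-pad to a fixed width
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(11));   // 0000000000000b
        System.out.println(encode(2));    // 00000000000002
        // With equal widths, string order and numeric order agree.
        System.out.println(encode(5).compareTo(encode(11)) < 0);   // true
    }
}
```

Because every encoded value has the same length, comparing character by character gives the same answer as comparing the numbers, which is what a string-based range query needs.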

Surely "find me results where numeric field x is higher than y" can't be
an uncommon request? I can think of many areas where you want to do that
(age filtering for example).

Any other suggestions of what I should be looking for, or where I can
look to find out the next step to take?


--
Dan Hardiker


Re: Numerical Range Query

Erick Erickson
Are you using NumberTools both at index and query time? Because
this works exactly as I expect....

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumberTools;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.ConstantScoreRangeQuery;

import java.io.IOException;

/**
 * Created by: eoericks
 * Date: May 12, 2008
 * History: $Log$
 */
public class Test {
    public static void main(String args[]) {
        try {
            Test test = new Test();
            test.doIndex();
            test.doSearch();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    private void doIndex() throws IOException {
        IndexWriter w = new IndexWriter(FSDirectory.getDirectory("C:/lucidx"),
                new StandardAnalyzer(), true);

        Document doc = new Document();
        doc.add(new Field("num", NumberTools.longToString(1),
                Field.Store.NO, Field.Index.UN_TOKENIZED));
        doc.add(new Field("name", "doc 1",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        w.addDocument(doc);

        doc = new Document();
        doc.add(new Field("num", NumberTools.longToString(11),
                Field.Store.NO, Field.Index.UN_TOKENIZED));
        doc.add(new Field("name", "doc 11",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        w.addDocument(doc);

        doc = new Document();
        doc.add(new Field("num", NumberTools.longToString(5),
                Field.Store.NO, Field.Index.UN_TOKENIZED));
        doc.add(new Field("name", "doc 5",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        w.addDocument(doc);

        doc = new Document();
        doc.add(new Field("num", NumberTools.longToString(9),
                Field.Store.NO, Field.Index.UN_TOKENIZED));
        doc.add(new Field("name", "doc 9",
                Field.Store.YES, Field.Index.UN_TOKENIZED));
        w.addDocument(doc);

        w.close();
    }

    private void doSearch() throws IOException {
        IndexSearcher r = new IndexSearcher(FSDirectory.getDirectory("c:/lucidx"));
        oneSearch(r, 1L);
        oneSearch(r, 2L);
        oneSearch(r, 5L);
        oneSearch(r, 9L);
        oneSearch(r, 0L);
    }

    private void oneSearch(IndexSearcher r, Long lower) throws IOException {
        System.out.println("\n\nSearching for greater than " + Long.toString(lower));
        Hits hits = r.search(new ConstantScoreRangeQuery("num",
                NumberTools.longToString(lower), null, false, true));
        for (int idx = 0; idx < hits.length(); ++idx) {
            System.out.println(hits.doc(idx).get("name"));
        }
    }
}


***output***

Searching for greater than 1
doc 11
doc 5
doc 9


Searching for greater than 2
doc 11
doc 5
doc 9


Searching for greater than 5
doc 11
doc 9


Searching for greater than 9
doc 11


Searching for greater than 0
doc 1
doc 11
doc 5
doc 9



Re: Numerical Range Query

adb
In reply to this post by Dan Hardiker
An alternative to Lucene's NumberTools is Solr's NumberUtils, which is more
space efficient for indexing numbers, but not as pretty to look at:

http://lucene.apache.org/solr/api/org/apache/solr/util/NumberUtils.html




Re: Numerical Range Query

Dan Hardiker
In reply to this post by Erick Erickson
Erick Erickson wrote:
> Are you using NumberTools both at index and query time? Because
> this works exactly as I expect....

Yes, the code I posted showed the usage of NumberTools -- here it is
from my 2nd reply:

>> Taking your advice I'm now indexing using:
>>
>> document.add( new Field(RateUtils.SF_FILTERED_CNT,
>> NumberTools.longToString( filteredCount ), Field.Store.YES,
>> Field.Index.UN_TOKENIZED) );
>>
>> and searching using:
>>
>> int minRates = Long.valueOf( minRatesString ).intValue();
>> luceneQuery.add( new ConstantScoreRangeQuery( RateUtils.SF_FILTERED_CNT,
>> NumberTools.longToString(minRates), "", true, false ),
>> BooleanClause.Occur.MUST );

I'll take your code and use it to create a comparative index which I can
use Luke to see where the differences are.
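
One visible difference between the two snippets in this thread is the upper bound: "" in the query quoted above, null in Erick's test. Whether ConstantScoreRangeQuery treats an empty string as "no upper limit" is not established in this thread. The plain-Java sketch below, with a hypothetical `inRange` helper, only shows what happens if an empty string is taken literally as a bound.

```java
final class EmptyBoundDemo {
    /** True if term lies in [lower, upper); a null upper bound means "no upper limit". */
    static boolean inRange(String term, String lower, String upper) {
        boolean aboveLower = term.compareTo(lower) >= 0;
        boolean belowUpper = (upper == null) || term.compareTo(upper) < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        String eleven = "0000000000000b";   // padded form of 11 quoted earlier in the thread
        String five = "00000000000005";     // same width, value 5
        System.out.println(inRange(eleven, five, null));   // true
        System.out.println(inRange(eleven, five, ""));     // false
    }
}
```

Every non-empty string compares greater than "", so a literal empty upper bound would exclude all terms. Trying null for the upper bound, as in Erick's test, is a cheap way to rule this out.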
