range vs. filter queries

range vs. filter queries

Ryan McKinley
Hello-

I'm working on a SearchComponent that should limit results to entries
within a geographic range.  I would love some feedback to make sure I'm
not building silly queries and/or can change them to be better.  I have
four fields:

   <field name="north" type="sfloat" ... />
   <field name="south" type="sfloat" ... />
   <field name="east"  type="sfloat" ... />
   <field name="west"  type="sfloat" ... />

The component looks for a "bounds" argument and parses out the NSEW
corners.  Currently, I'm building a boolean query and adding that to the
filter list:

       FieldType ft = req.getSchema().getFieldTypes().get( "sfloat" );

       BooleanQuery range = new BooleanQuery( true );
       range.add( new ConstantScoreRangeQuery( "north", null, ft.toInternal(n), true, true ),
                  BooleanClause.Occur.MUST );
       range.add( new ConstantScoreRangeQuery( "south", ft.toInternal(s), null, true, true ),
                  BooleanClause.Occur.MUST );
       range.add( new ConstantScoreRangeQuery( "east", null, ft.toInternal(e), true, true ),
                  BooleanClause.Occur.MUST );
       range.add( new ConstantScoreRangeQuery( "west", ft.toInternal(w), null, true, true ),
                  BooleanClause.Occur.MUST );

essentially, this is:
  +north:[* TO nnn] +south:[sss TO *] +east:[* TO eee] +west:[www TO *]
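
As a plain-Java illustration of what those four clauses test (a minimal sketch, not Solr or Lucene code): a document's region passes only if all four of its edges lie inside the requested bounds. The class name BoundsCheck and the Box type are hypothetical, and the sketch assumes no box crosses the antimeridian.

```java
public class BoundsCheck {
    /** A region described by its four edges, in degrees (hypothetical helper type). */
    static final class Box {
        final float north, south, east, west;

        Box(float north, float south, float east, float west) {
            this.north = north;
            this.south = south;
            this.east = east;
            this.west = west;
        }
    }

    /**
     * Mirrors +north:[* TO n] +south:[s TO *] +east:[* TO e] +west:[w TO *]:
     * every edge of the document's region must lie inside the query bounds.
     */
    static boolean within(Box doc, Box bounds) {
        return doc.north <= bounds.north
            && doc.south >= bounds.south
            && doc.east <= bounds.east
            && doc.west >= bounds.west;
    }

    public static void main(String[] args) {
        Box bounds = new Box(45f, 15f, -60f, -75f);
        Box inside = new Box(40f, 20f, -65f, -70f);
        Box straddling = new Box(50f, 20f, -65f, -70f); // pokes out to the north

        System.out.println(within(inside, bounds));     // true
        System.out.println(within(straddling, bounds)); // false
    }
}
```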


Would this be better as four individual filters?

Additionally, I could chunk the world into a grid and index which
squares each entry falls in.  This could potentially cut out many
results with a simple term query, but I don't know if it is worthwhile
since I will need to run the points through a range query at the end anyway.

Any thoughts or feedback would be great.
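
One way the grid idea could look, as a rough sketch under assumptions: each region is tagged with a token for every grid square it overlaps, and a query computes the same tokens for its bounds to use as cheap term constraints before the exact range check. The class name GridCells, the "r<row>c<col>" token format, and the 15-degree cell size are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class GridCells {
    /** Hypothetical token format: row and column of the cell. */
    static String cellToken(int row, int col) {
        return "r" + row + "c" + col;
    }

    /**
     * Returns a token for every cell that the box [south, north] x [west, east]
     * overlaps. Latitude is shifted by 90 and longitude by 180 so that the
     * indexes are non-negative. Assumes the box does not cross the antimeridian.
     */
    static List<String> cellsFor(double south, double north,
                                 double west, double east, double cellDeg) {
        int rowMin = (int) Math.floor((south + 90.0) / cellDeg);
        int rowMax = (int) Math.floor((north + 90.0) / cellDeg);
        int colMin = (int) Math.floor((west + 180.0) / cellDeg);
        int colMax = (int) Math.floor((east + 180.0) / cellDeg);

        List<String> cells = new ArrayList<>();
        for (int row = rowMin; row <= rowMax; row++) {
            for (int col = colMin; col <= colMax; col++) {
                cells.add(cellToken(row, col));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // A box spanning several 15-degree cells.
        System.out.println(cellsFor(13.7, 42.5, -78.2, -62.4, 15.0));
        // A single point is a degenerate box and lands in exactly one cell.
        System.out.println(cellsFor(40.0, 40.0, -70.0, -70.0, 15.0));
    }
}
```

The cell tokens only narrow the candidates: a region that shares a cell with the query bounds can still extend outside them, so the exact range test remains necessary, as noted above.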

thanks
ryan



Re: range vs. filter queries

Yonik Seeley-2
On Feb 11, 2008 8:51 PM, Ryan McKinley <[hidden email]> wrote:

> Hello-
>
> I'm working on a SearchComponent that should limit results to entries
> within a geographic range.  I would love some feedback to make sure I'm
> not building silly queries and/or can change them to be better.  I have
> four fields:
>
>    <field name="north" type="sfloat" ... />
>    <field name="south" type="sfloat" ... />
>    <field name="east"  type="sfloat" ... />
>    <field name="west"  type="sfloat" ... />
>
> The component looks for a "bounds" argument and parses out the NSEW
> corners.  Currently, I'm building a boolean query and adding that to the
> filter list:
>
>        FieldType ft = req.getSchema().getFieldTypes().get( "sfloat" );
>
>        BooleanQuery range = new BooleanQuery( true );
>        range.add( new ConstantScoreRangeQuery( "north", null,
> ft.toInternal(n), true, true ), BooleanClause.Occur.MUST );
>        range.add( new ConstantScoreRangeQuery( "south",
> ft.toInternal(s), null, true, true ), BooleanClause.Occur.MUST );
>        range.add( new ConstantScoreRangeQuery( "east", null,
> ft.toInternal(e), true, true ), BooleanClause.Occur.MUST );
>        range.add( new ConstantScoreRangeQuery( "west", ft.toInternal(w),
> null, true, true ), BooleanClause.Occur.MUST );
>
> essentially, this is:
>   +north:[* TO nnn] +south:[sss TO *] +east:[* TO eee] +west:[www TO *]
>
>
> Would this be better as four individual filters?

Only if they were likely to occur again in combination with different
constraints.
My guess would be no.

Perhaps you want 2 fields (lat and long) instead of 4?

One issue here is range queries that include many terms are currently slow.
That's something we need to address sometime (there has been some work
on this in Lucene, but nothing yet committed AFAIK).

-Yonik
Re: range vs. filter queries

Ryan McKinley
>>
>> Would this be better as four individual filters?
>
> Only if they were likely to occur again in combination with different
> constraints.
> My guess would be no.

Is this because the filter could not be cached?

Since I know it should not be cached, is there any way to make sure it
does not purge useful stuff from the cache?

>
> Perhaps you want 2 fields (lat and long) instead of 4?
>

2 is fine if I were dealing with points, but this is a region, so I need
to deal with all four edges (N, S, E, and W).


> One issue here is range queries that include many terms are currently slow.
> That's something we need to address sometime (there has been some work
> on this in Lucene, but nothing yet committed AFAIK).
>

do range queries operate on the whole index, or can they be limited
first?  That is, if i can throw out half the docs with a simple
TermQuery, does the range still have to go through everything?

thanks
ryan

Re: range vs. filter queries

Yonik Seeley-2
On Feb 11, 2008 9:13 PM, Ryan McKinley <[hidden email]> wrote:
> >>
> >> Would this be better as four individual filters?
> >
> > Only if they were likely to occur again in combination with different
> > constraints.
> > My guess would be no.
>
> this is because the filter could not be cached?

right.  It's probably minor though... the bigger cost will be
generation of those range queries.

> Since I know it should not be cached, is there any way to make sure it does
> not purge useful stuff from the cache?
>
> >
> > Perhaps you want 2 fields (lat and long) instead of 4?
> >
>
> 2 is fine if I was dealing with points, but this is a region, so i need
> to deal with a whole region (N,S,E,and W).

If it's a bounding box, it can be defined by 2 range queries, right?

> > One issue here is range queries that include many terms are currently slow.
> > That's something we need to address sometime (there has been some work
> > on this in Lucene, but nothing yet committed AFAIK).
> >
>
> do range queries operate on the whole index, or can they be limited
> first?  That is, if i can throw out half the docs with a simple
> TermQuery, does the range still have to go through everything?

Needs to go through everything.  No easy way to avoid that right now.

-Yonik
RE: range vs. filter queries

Lance Norskog-2
In reply to this post by Ryan McKinley
Is it not possible to make a grid of your boxes? It seems like this would be
a more efficient query:

        grid:N100_S50_E250_W412

This is how GIS systems work, right?

Lance



Re: range vs. filter queries

Ryan McKinley
Lance Norskog wrote:
> Is it not possible to make a grid of your boxes? It seems like this would be
> a more efficient query:
>
> grid:N100_S50_E250_W412
>
> This is how GIS systems work, right?
>

something like that...  I was just checking if I could get away with
range queries for now...  I'll also check if local lucene is possible:
http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene.htm

ryan

Re: range vs. filter queries

hossman
In reply to this post by Ryan McKinley

: essentially, this is:
:  +north:[* TO nnn] +south:[sss TO *] +east:[* TO eee] +west:[www TO *]

: Would this be better as four individual filters?

it depends on the granularity you expect clients to query with ... if
clients can get really granular, then the odds of reuse are lower, so the
advantages of individual filters are gone.

if however you know that the granularity of your input will always be
something coarse -- like a multiple of 15 degrees, or even 5 degrees --
then it's probably practical to break it out.

Something else to consider: use cached range filters for the coarse
aspects, but use uncached range filters for the precision stuff.  i know
you're searching for areas, but for simplicity assume the docs exist at
a point and the query is a box... if the input box is "N+42.5 S+13.7 W-78.2
E-62.4" you can cache filters for lat:[15 TO 45] and lon:[-75 TO -60] and
then intersect and union those results with uncached queries for lat:[13.7
TO 15], lat:[42.5 TO 45], lon:[-78.2 TO -75], lon:[-62.4 TO -60]

...I've never tried this so i'm not sure if the cost/benefit trade off
actually makes sense ... but the principle seems sound.
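
A rough sketch of the coarse/precise split for a single axis, in plain Java with no Solr calls. Note that this is a variant of the example above, not the same decomposition: it snaps the coarse range inward to multiples of the step, so the precise slivers are only ever unioned with it and never subtracted. The class name CoarseFineSplit, its helper types, and the 5-degree step are assumptions for illustration.

```java
public class CoarseFineSplit {
    /** An inclusive numeric range, printed in query-like form. */
    static final class Range {
        final double lo, hi;

        Range(double lo, double hi) {
            this.lo = lo;
            this.hi = hi;
        }

        @Override
        public String toString() {
            return "[" + lo + " TO " + hi + "]";
        }
    }

    /** Coarse (cacheable) part plus the uncached slivers; any field may be null. */
    static final class Split {
        final Range coarse, lowSliver, highSliver;

        Split(Range coarse, Range lowSliver, Range highSliver) {
            this.coarse = coarse;
            this.lowSliver = lowSliver;
            this.highSliver = highSliver;
        }
    }

    /** Splits [lo, hi] so that the coarse part starts and ends on multiples of step. */
    static Split split(double lo, double hi, double step) {
        double coarseLo = Math.ceil(lo / step) * step;
        double coarseHi = Math.floor(hi / step) * step;
        if (coarseLo >= coarseHi) {
            // No whole step fits inside the range: nothing worth caching.
            return new Split(null, new Range(lo, hi), null);
        }
        Range low = coarseLo > lo ? new Range(lo, coarseLo) : null;
        Range high = coarseHi < hi ? new Range(coarseHi, hi) : null;
        return new Split(new Range(coarseLo, coarseHi), low, high);
    }

    public static void main(String[] args) {
        Split lat = split(13.7, 42.5, 5.0);
        Split lon = split(-78.2, -62.4, 5.0);
        // Per axis: coarse UNION lowSliver UNION highSliver; then intersect lat with lon.
        System.out.println("lat: " + lat.coarse + " " + lat.lowSliver + " " + lat.highSliver);
        System.out.println("lon: " + lon.coarse + " " + lon.lowSliver + " " + lon.highSliver);
    }
}
```

Only the coarse range would be a candidate for the filter cache, since its endpoints repeat across queries; the slivers depend on the exact input and would stay uncached.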




-Hoss