Tuning caching of geofilt queries

Tuning caching of geofilt queries

Thomas Heigl-2
Hey all,

Our production system is heavily optimized for caching and nearly all parts
of queries are satisfied by filter caches. The only filter that varies a
lot from user to user is the location and distance. Currently we use the
default location field type and index lat/long coordinates as we get them
from Geonames and GMaps with varying decimal precision.

My question is: Does it make sense to round these coordinates (a) while
indexing and/or (b) while querying to optimize cache hits? Our maximum
required resolution for geo queries is 1km and we can tolerate minor errors,
so I could round to two decimal places for most of our queries.

E.g. Instead of querying like this

fq=_query_:"{!geofilt sfield=user.location_p pt=48.19815,16.3943 d=50.0}"&sfield=user.location_p&pt=48.1981,16.394


we would round to

fq=_query_:"{!geofilt sfield=user.location_p pt=48.19,16.39 d=50.0}"&sfield=user.location_p&pt=48.19,16.39
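
To get a feel for how much a two-decimal rounding moves the query point, here is a minimal Python sketch. It is not part of the original post: the helper names and the mean Earth radius of 6371 km are assumptions made for illustration. Note that `round()` maps 48.19815 to 48.2, whereas the example above shows 48.19 (truncation); truncating instead of rounding roughly doubles the worst-case shift.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rounding_shift_km(lat, lon, places=2):
    """Distance the query point moves when both coordinates are rounded."""
    return haversine_km(lat, lon, round(lat, places), round(lon, places))


# Shift for the point used in the example query.
shift = rounding_shift_km(48.19815, 16.3943)

# Worst case for two decimals: both coordinates off by 0.005 degrees,
# evaluated at the equator where a degree of longitude is longest.
worst_case = haversine_km(0.0, 0.0, 0.005, 0.005)

print(f"shift for the example point: {shift:.3f} km")
print(f"worst case at the equator:   {worst_case:.3f} km")
```

Both values stay below the 1km resolution mentioned above, which is consistent with two decimal places being enough for this use case.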


Any feedback would be greatly appreciated.

Cheers,

Thomas

Re: Tuning caching of geofilt queries

Erick Erickson
I don't think rounding will affect cache hits in either case _unless_
the input points for different queries can be very close to each other.

Think of the filter cache as being composed of a map where the key
is the (raw) filter query and the value is the set of documents in your
corpus that satisfy it.

So the only time rounding would help is if two users are likely to
enter very similar points at query time, e.g.
89.1234 and 89.1236. If you're giving them a set of pre-defined
choices (city center, say), then the values should already be
identical to all the decimal places, so rounding doesn't do you much
good.
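
The map analogy can be made concrete with a toy model. The sketch below is only an illustration, not how Solr implements its filter cache: `ToyFilterCache`, `geofilt_fq` and `fake_docset` are hypothetical names, and the cache key is simply the raw filter string.

```python
def geofilt_fq(lat, lon, d_km, places=None):
    """Build the raw fq string, optionally rounding the point first."""
    if places is not None:
        lat, lon = round(lat, places), round(lon, places)
    return f"{{!geofilt sfield=user.location_p pt={lat},{lon} d={d_km}}}"


class ToyFilterCache:
    """Toy model of a filter cache: raw filter string -> set of doc ids."""

    def __init__(self):
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def get(self, fq, compute):
        if fq in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[fq] = compute(fq)
        return self.entries[fq]


def fake_docset(fq):
    """Stand-in for actually running the filter against the index."""
    return frozenset({1, 2, 3})


# Two users whose points are very close but not identical.
points = [(48.1981, 16.3943), (48.1984, 16.3941)]

raw = ToyFilterCache()
for lat, lon in points:
    raw.get(geofilt_fq(lat, lon, 50.0), fake_docset)

rounded = ToyFilterCache()
for lat, lon in points:
    rounded.get(geofilt_fq(lat, lon, 50.0, places=2), fake_docset)

print("raw:    ", raw.hits, "hits,", raw.misses, "misses")
print("rounded:", rounded.hits, "hits,", rounded.misses, "misses")
```

With raw points every query produces a new key, so the second user misses; with rounded points both users map to the same key and the second lookup is a hit. If the points come from a fixed list of choices, the raw keys are already identical and rounding changes nothing.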

You say you can tolerate some slop, so using bounding box might
speed up your queries...

Best
Erick

On Fri, Aug 3, 2012 at 4:56 AM, Thomas Heigl <[hidden email]> wrote:

> [...]

Re: Tuning caching of geofilt queries

Chris Hostetter-3
In reply to this post by Thomas Heigl-2

: My question is: Does it make sense to round these coordinates (a) while
: indexing and/or (b) while querying to optimize cache hits? Our maximum
: required resolution for geo queries is 1km and we can tolerate minor errors
: so I could round to two decimal points for most of our queries.

: fq=_query_:"{!geofilt sfield=user.location_p pt=48.19815,16.3943 d=50.0}"&sfield=user.location_p&pt=48.1981,16.394

1) i don't see any reason for the _query_ hack ... this
should be more efficient, and easier on the eyes...

 fq={!geofilt sfield=user.location_p pt=48.19815,16.3943 d=50.0}
&sfield=user.location_p
&pt=48.1981,16.394

2) as Erick mentioned, rounding will only do you good if you expect
lots of queries from different users that, when rounded, result in the same
point

3) you might consider disabling the caching of your geofilt queries
completely using the cache=false param. for {!geofilt} you should also be
able to combine this with the "cost" localparam to take advantage of
post-filtering, so that the distance calculations are only computed for
documents that already match your query and other cached filters...

http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters
http://searchhub.org/dev/2012/02/10/advanced-filter-caching-in-solr/

4) something you also might want to consider (depending on your data and
how much geo surface area you are dealing with) is along the lines of
Erick's bounding box suggestion: use two filters; a "coarse"
bounding box that you cache, and a precise geofilt using the cache & cost
params mentioned in #3.

that way you have a finite number of bounding box filters that will be
cached and help quickly prune the total result set down, and then only for
the results inside that bounding box will the distance calculations for
your {!geofilt} filter be applied.  (just make sure your bounding boxes
overlap by at least as much as the max radius you search on, or you might
miss results when your search point is close to the edge of your grid)
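
A rough Python sketch of the two-filter idea from #3 and #4, under several assumptions that are not from the thread: a 1-degree grid, a conversion of about 111 km per degree of latitude (chosen slightly low so the padding errs on the large side), the `field:[lat,lon TO lat,lon]` range syntax for the coarse box, and `cost=100` as the value that enables post-filtering. The helper names are hypothetical, and the sketch ignores the poles and the date line.

```python
import math

KM_PER_DEG_LAT = 111.0  # slightly low on purpose, so padding is never too small


def grid_bbox(lat, lon, max_radius_km, cell_deg=1.0):
    """Snap the point to a grid cell and pad the cell by the max search radius."""
    lat0 = math.floor(lat / cell_deg) * cell_deg
    lon0 = math.floor(lon / cell_deg) * cell_deg
    pad_lat = max_radius_km / KM_PER_DEG_LAT
    south, north = lat0 - pad_lat, lat0 + cell_deg + pad_lat
    # A degree of longitude shrinks away from the equator, so size the
    # longitude padding at the box edge farthest from the equator.
    worst_lat = max(abs(south), abs(north))
    pad_lon = max_radius_km / (KM_PER_DEG_LAT * math.cos(math.radians(worst_lat)))
    return south, lon0 - pad_lon, north, lon0 + cell_deg + pad_lon


def build_filters(lat, lon, d_km, max_radius_km=50.0, sfield="user.location_p"):
    """Return [cacheable coarse box filter, uncached precise geofilt filter]."""
    if d_km > max_radius_km:
        raise ValueError("d_km must not exceed max_radius_km")
    s, w, n, e = grid_bbox(lat, lon, max_radius_km)
    # Same grid cell -> same string -> same cache entry.
    coarse = f"{sfield}:[{s:.4f},{w:.4f} TO {n:.4f},{e:.4f}]"
    # Exact distance check, not cached, meant to run after the cheaper filters.
    precise = (f"{{!geofilt cache=false cost=100 sfield={sfield} "
               f"pt={lat},{lon} d={d_km}}}")
    return [coarse, precise]


filters_a = build_filters(48.19815, 16.3943, 50.0)
filters_b = build_filters(48.9, 16.01, 25.0)
for fq in filters_a:
    print("fq=" + fq)
```

Both example points fall into the same grid cell, so they share one coarse filter string while each keeps its own precise `{!geofilt}`. Padding every cell by the maximum radius is one way to satisfy the overlap requirement described above.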


-Hoss

Re: Tuning caching of geofilt queries

David Smiley
In reply to this post by Thomas Heigl-2
Chris's response is quite good, and I have a couple things to add:

1. Since you can tolerate 1km slop, try defining the dynamic field *_coordinate as tfloat instead of tdouble.  This will halve your memory requirements, but I'm not sure if it will be any faster -- it's worth a shot since you've already indicated that your requirements don't call for a double.  The information I've read varies on the exact accuracy of float vs. double, but at a kilometer there's no question a double is overkill.

2. Try my Solr 3.x spatial plugin called "SOLR-2155" on GitHub: https://github.com/dsmiley/SOLR-2155   It is very fast at filtering (even for circles), as indicated in this Stack Overflow thread:  http://stackoverflow.com/questions/11636376/solr-performance-on-ec2-for-geospatial-queries   in which it destroys LatLonType in a big data speed test :-D.   You should be happy to know that this technology is on its way into Solr 4, albeit not quite yet.

Cheers,
  ~ David Smiley

Re: Tuning caching of geofilt queries

Yonik Seeley-2-2
On Fri, Aug 10, 2012 at 1:47 PM, David Smiley (@MITRE.org)
<[hidden email]> wrote:
> Information I've read vary on exactly what is the accuracy of float
> vs double but at a kilometer there's no question a double is overkill.

Back of the envelope:

23 mantissa bits + 1 implied bit == 24 effective mantissa bits in a 32
bit float.

40,000 km circumference / (2^24) = .0024 km  (i.e. our resolution at
the equator is 2.4m at best - there will be some lost unused space at
the beginning and end of the +-180 number-line).
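
The same estimate can be reproduced in a few lines of Python. This is only a sketch of the arithmetic above; the round trip through `struct` is an added illustration of what storing one coordinate as a 32-bit float does to it, and `to_float32` is a hypothetical helper name.

```python
import struct

EARTH_CIRCUMFERENCE_KM = 40_000  # round figure used in the estimate above
MANTISSA_BITS = 23 + 1           # stored bits plus the implied leading bit

# 40,000 km spread over 2^24 distinct values, expressed in metres.
resolution_m = EARTH_CIRCUMFERENCE_KM / 2 ** MANTISSA_BITS * 1000
print(f"estimated float resolution: {resolution_m:.2f} m")


def to_float32(x):
    """Round-trip a Python float (a double) through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Error introduced by storing one example longitude as a 32-bit float.
lon = 16.3943
error_deg = abs(to_float32(lon) - lon)
error_m = error_deg * EARTH_CIRCUMFERENCE_KM / 360 * 1000
print(f"round-trip error for {lon}: {error_m:.3f} m")
```

The first figure is the roughly 2.4m bound from the estimate; the round-trip error for a single value is smaller than that bound, since the estimate is a worst case over the whole +-180 range.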

Is that in line with what you've read?

-Yonik
http://lucidworks.com

Re: Tuning caching of geofilt queries

David Smiley
Yeah it is... I rather like this write-up:
https://sites.google.com/site/trescopter/Home/concepts/required-precision-for-gps-calculations#TOC-Precision-of-Float-and-Double
-- which also arrives at 2.37m worst case.

Aside from RAM savings, I wonder if there is any noticeable performance difference for LatLonType.

Re: Tuning caching of geofilt queries

Lance Norskog-2
In other computations I found exactly zero performance difference
between floats & doubles, even with long arrays of numbers, which you
would expect to be sensitive to locality effects.

On Fri, Aug 10, 2012 at 11:20 AM, David Smiley (@MITRE.org)
<[hidden email]> wrote:

> [...]



--
Lance Norskog
[hidden email]