Hacking proximity search: looking for feedback

classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

Hacking proximity search: looking for feedback

jrodenburg
I've been wrestling with a way to index and search data with a
geo-positional aspect.  By a geo-positional search, I want to constrain
search results within a given location range.  Furthermore, I want to allow
the user to set/change the geo-positional boundaries as needed for their
search.  This isn't a foreign concept to anyone who's needed to do the same.

There's been some work done in this area publicly (or at least what I could
find via the list archives and google), but not a lot.  The guys at
eventax.de have done some very impressive work.  Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition.  Their approach is very
interesting and is predicated by having data indexed in a certain format.
I've considered building something based on the FunctionQuery class as
well.  Within this class, I could actually do the mathematical calculations
(Haversine formula, anyone?) for geo-positional filtering.  Range queries on
this data set were an option as well.

I've hit a performance wall with these approaches.  The geo-positional
calculations are proving to be intensive.  With the combination of my data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking.  Instead of getting Lucene to do geo-math, what
if I made geo-math do Lucene?  Lucene is exceptional at string lookups; how
could I do geo-positional search in that framework?  I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

********************************************************************************************************

Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point.  It's not critical to do a radius search with
a given circle.  A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.


How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison.  Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each field
in progression.  Field storage type = Keyword.  The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
...
Long1 = [-]
Long2 = [1]
Long3 = [2]
...

Searching:
1. Start with box coordinates - North/West corner, South/East corner.  For
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query.  Query will contain two required clauses:
the North/South query, and the West/East query.  Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values.  Here's a pseudo-graphical
chart:

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and 47.34827.
With lexicographical ordering in mind, I can safely include this phrase in
my North/South query:

   (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and the
fourth position containing either 4 or 5 or 6 will yield a qualifying match.

What other queries are needed?  Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification.  The
further qualification is based on values from the fifth position.  I can
safely included the following phrases in my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2
Lat5:1 Lat5:0)
    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8 Lat5:9)

The logic here is simply an extension of the first query, extended to next
position in the latitude range.  In the Northern latitude, with the first
four positions matching those values exactly, anything less than 7 in the
fifth position will yield a matching latitude.  Same goes for the South
(bottom) query.

*******************************************************************************************

This approach yields a higher number of boolean query clauses, the more
granular you get.  In this scenario, I've estimated approximately 145
clauses within the final constructed query.  In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).


At last, my question to everyone who cares to respond (and read this far):
feedback?

Thanks,
-- jeff
Reply | Threaded
Open this post in threaded view
|

Re: Hacking proximity search: looking for feedback

Chris Hostetter-3

: Geo definition:
: Boxing around a center point.  It's not critical to do a radius search with
: a given circle.  A boxed approach allows for taller or wider frames of
: reference, which are applicable for our use.

if you are just loking to confine your results to a box then i think
RangeFiltering on both the X and Y axis will be more efficient then the
individual term queries you are producing.

It will have the added bonus of not artificially affecting the scores of
hte documents based on how often a particular digit apears in a particular
position of hte latitue accross your corpus.

Once you've filtered down to a particular bounding box, you might consider
going back to the function query approach to score documents inside that
box based on their actual distance from the center point.  I don't recall
at the moment but i believe FunctionQuery's Scorer supports skipTo in such
a way that it won't bother computing the function for a document that has
been skiped (ie: when containing in a BooleanQuery with another clause
that has already prohibited it, or when executed in the context of a
Filter)



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Hacking proximity search: looking for feedback

Bryzek.Michael
In reply to this post by jrodenburg
Jeff -

This is an interesting approach. On our end, we have experimented with
two variants:

Variant 1: Use Range Query

Rather than precomputing the boolean clauses yourself, index encoded
latitude and longitude values and use a Range Query. We encode by
adding 1000 to each of the values. Note: We only deal with zip codes
in the US for which this encoding works fine, but is worth another
look if you have a broader range of coordinates.

Example: Compute the box as you describe, then encode each value and
use a RangeQuery. Using the box you provided and lucene's query
parser:

       North latitude = 47.77704
       West longitude = -122.41909
       South latitude = 47.34827
       East longitude = -122.22031

gives us this query:

latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO 877.77969]

Lucene then handles expanding this query into the appropriate set of
Boolean Queries.


Variant 2: Combine the above w/ a custom scorer

For us, boxing works well to retrieve relevant documents, but our
users want those documents sorted by distance. We modified the custom
scoring class that Hatcher provides in Lucene in Action to compute the
distance between two points for only those documents which actually
match. Thus, we do our search using a Range Query, then for all
matching documents we compute an actual score that incorporates the
distance from the actual location from which the user is searching.

On our data set, we can still end up with 1000s of matching documents
after boxing. Thus, we still see a bottleneck computing the score for
even this smaller set of documents which we are still working through.

-Mike


-----Original Message-----
From: Jeff Rodenburg [mailto:[hidden email]]
Sent: Tue 2/28/06 3:10 PM
To: [hidden email]
Cc:
Subject: Hacking proximity search: looking for feedback

I've been wrestling with a way to index and search data with a
geo-positional aspect.  By a geo-positional search, I want to constrain
search results within a given location range.  Furthermore, I want to allow
the user to set/change the geo-positional boundaries as needed for their
search.  This isn't a foreign concept to anyone who's needed to do the same.

There's been some work done in this area publicly (or at least what I could
find via the list archives and google), but not a lot.  The guys at
eventax.de have done some very impressive work.  Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition.  Their approach is very
interesting and is predicated by having data indexed in a certain format.
I've considered building something based on the FunctionQuery class as
well.  Within this class, I could actually do the mathematical calculations
(Haversine formula, anyone?) for geo-positional filtering.  Range queries on
this data set were an option as well.

I've hit a performance wall with these approaches.  The geo-positional
calculations are proving to be intensive.  With the combination of my data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking.  Instead of getting Lucene to do geo-math, what
if I made geo-math do Lucene?  Lucene is exceptional at string lookups; how
could I do geo-positional search in that framework?  I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

********************************************************************************************************

Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point.  It's not critical to do a radius search with
a given circle.  A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.


How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison.  Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each field
in progression.  Field storage type = Keyword.  The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
...
Long1 = [-]
Long2 = [1]
Long3 = [2]
...

Searching:
1. Start with box coordinates - North/West corner, South/East corner.  For
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query.  Query will contain two required clauses:
the North/South query, and the West/East query.  Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values.  Here's a pseudo-graphical
chart:

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and 47.34827.
With lexicographical ordering in mind, I can safely include this phrase in
my North/South query:

   (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and the
fourth position containing either 4 or 5 or 6 will yield a qualifying match.

What other queries are needed?  Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification.  The
further qualification is based on values from the fifth position.  I can
safely included the following phrases in my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2
Lat5:1 Lat5:0)
    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8 Lat5:9)

The logic here is simply an extension of the first query, extended to next
position in the latitude range.  In the Northern latitude, with the first
four positions matching those values exactly, anything less than 7 in the
fifth position will yield a matching latitude.  Same goes for the South
(bottom) query.

*******************************************************************************************

This approach yields a higher number of boolean query clauses, the more
granular you get.  In this scenario, I've estimated approximately 145
clauses within the final constructed query.  In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).


At last, my question to everyone who cares to respond (and read this far):
feedback?

Thanks,
-- jeff




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

RE: Hacking proximity search: looking for feedback

jpowers
In reply to this post by jrodenburg
I don't know if this matters, but we do all of our geolocating in sql
with decent speed.   All the trig is in the query itself and then we can
limit top 5, top 10 etc for what we show.    Is the data such that you
need lucene?    Can I ask what causes it to be beyond a databases
ability?

-----Original Message-----
From: Bryzek.Michael [mailto:[hidden email]]
Sent: Tuesday, February 28, 2006 2:49 PM
To: [hidden email]
Subject: RE: Hacking proximity search: looking for feedback

Jeff -

This is an interesting approach. On our end, we have experimented with
two variants:

Variant 1: Use Range Query

Rather than precomputing the boolean clauses yourself, index encoded
latitude and longitude values and use a Range Query. We encode by
adding 1000 to each of the values. Note: We only deal with zip codes
in the US for which this encoding works fine, but is worth another
look if you have a broader range of coordinates.

Example: Compute the box as you describe, then encode each value and
use a RangeQuery. Using the box you provided and lucene's query
parser:

       North latitude = 47.77704
       West longitude = -122.41909
       South latitude = 47.34827
       East longitude = -122.22031

gives us this query:

latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO
877.77969]

Lucene then handles expanding this query into the appropriate set of
Boolean Queries.


Variant 2: Combine the above w/ a custom scorer

For us, boxing works well to retrieve relevant documents, but our
users want those documents sorted by distance. We modified the custom
scoring class that Hatcher provides in Lucene in Action to compute the
distance between two points for only those documents which actually
match. Thus, we do our search using a Range Query, then for all
matching documents we compute an actual score that incorporates the
distance from the actual location from which the user is searching.

On our data set, we can still end up with 1000s of matching documents
after boxing. Thus, we still see a bottleneck computing the score for
even this smaller set of documents which we are still working through.

-Mike


-----Original Message-----
From: Jeff Rodenburg [mailto:[hidden email]]
Sent: Tue 2/28/06 3:10 PM
To: [hidden email]
Cc:
Subject: Hacking proximity search: looking for feedback

I've been wrestling with a way to index and search data with a
geo-positional aspect.  By a geo-positional search, I want to constrain
search results within a given location range.  Furthermore, I want to
allow
the user to set/change the geo-positional boundaries as needed for their
search.  This isn't a foreign concept to anyone who's needed to do the
same.

There's been some work done in this area publicly (or at least what I
could
find via the list archives and google), but not a lot.  The guys at
eventax.de have done some very impressive work.  Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition.  Their approach is very
interesting and is predicated by having data indexed in a certain
format.
I've considered building something based on the FunctionQuery class as
well.  Within this class, I could actually do the mathematical
calculations
(Haversine formula, anyone?) for geo-positional filtering.  Range
queries on
this data set were an option as well.

I've hit a performance wall with these approaches.  The geo-positional
calculations are proving to be intensive.  With the combination of my
data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking.  Instead of getting Lucene to do geo-math,
what
if I made geo-math do Lucene?  Lucene is exceptional at string lookups;
how
could I do geo-positional search in that framework?  I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

************************************************************************
********************************

Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point.  It's not critical to do a radius search
with
a given circle.  A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.


How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison.  Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each
field
in progression.  Field storage type = Keyword.  The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
...
Long1 = [-]
Long2 = [1]
Long3 = [2]
...

Searching:
1. Start with box coordinates - North/West corner, South/East corner.
For
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query.  Query will contain two required
clauses:
the North/South query, and the West/East query.  Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values.  Here's a pseudo-graphical
chart:

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and
47.34827.
With lexicographical ordering in mind, I can safely include this phrase
in
my North/South query:

   (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and
the
fourth position containing either 4 or 5 or 6 will yield a qualifying
match.

What other queries are needed?  Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification.  The
further qualification is based on values from the fifth position.  I can
safely included the following phrases in my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3
Lat5:2
Lat5:1 Lat5:0)
    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
Lat5:9)

The logic here is simply an extension of the first query, extended to
next
position in the latitude range.  In the Northern latitude, with the
first
four positions matching those values exactly, anything less than 7 in
the
fifth position will yield a matching latitude.  Same goes for the South
(bottom) query.

************************************************************************
*******************

This approach yields a higher number of boolean query clauses, the more
granular you get.  In this scenario, I've estimated approximately 145
clauses within the final constructed query.  In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).


At last, my question to everyone who cares to respond (and read this
far):
feedback?

Thanks,
-- jeff




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

RE: Hacking proximity search: looking for feedback

Bryzek.Michael
Our geo searches are combined with keyword searches. We previously performed all of our queries in the database (Oracle 10g w/ interMedia for the unstructured portion) but found that it was easier to scale search outside the database than within.


-----Original Message-----
From: John Powers [mailto:[hidden email]]
Sent: Tue 2/28/06 3:53 PM
To: [hidden email]
Cc:
Subject: RE: Hacking proximity search: looking for feedback

I don't know if this matters, but we do all of our geolocating in sql
with decent speed.   All the trig is in the query itself and then we can
limit top 5, top 10 etc for what we show.    Is the data such that you
need lucene?    Can I ask what causes it to be beyond a databases
ability?

-----Original Message-----
From: Bryzek.Michael [mailto:[hidden email]]
Sent: Tuesday, February 28, 2006 2:49 PM
To: [hidden email]
Subject: RE: Hacking proximity search: looking for feedback

Jeff -

This is an interesting approach. On our end, we have experimented with
two variants:

Variant 1: Use Range Query

Rather than precomputing the boolean clauses yourself, index encoded
latitude and longitude values and use a Range Query. We encode by
adding 1000 to each of the values. Note: We only deal with zip codes
in the US for which this encoding works fine, but is worth another
look if you have a broader range of coordinates.

Example: Compute the box as you describe, then encode each value and
use a RangeQuery. Using the box you provided and lucene's query
parser:

       North latitude = 47.77704
       West longitude = -122.41909
       South latitude = 47.34827
       East longitude = -122.22031

gives us this query:

latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO
877.77969]

Lucene then handles expanding this query into the appropriate set of
Boolean Queries.


Variant 2: Combine the above w/ a custom scorer

For us, boxing works well to retrieve relevant documents, but our
users want those documents sorted by distance. We modified the custom
scoring class that Hatcher provides in Lucene in Action to compute the
distance between two points for only those documents which actually
match. Thus, we do our search using a Range Query, then for all
matching documents we compute an actual score that incorporates the
distance from the actual location from which the user is searching.

On our data set, we can still end up with 1000s of matching documents
after boxing. Thus, we still see a bottleneck computing the score for
even this smaller set of documents which we are still working through.

-Mike


-----Original Message-----
From: Jeff Rodenburg [mailto:[hidden email]]
Sent: Tue 2/28/06 3:10 PM
To: [hidden email]
Cc:
Subject: Hacking proximity search: looking for feedback

I've been wrestling with a way to index and search data with a
geo-positional aspect.  By a geo-positional search, I want to constrain
search results within a given location range.  Furthermore, I want to
allow
the user to set/change the geo-positional boundaries as needed for their
search.  This isn't a foreign concept to anyone who's needed to do the
same.

There's been some work done in this area publicly (or at least what I
could
find via the list archives and google), but not a lot.  The guys at
eventax.de have done some very impressive work.  Their implementation is
within the constructs of nutch; there's more here at
http://wiki.apache.org/nutch/GeoPosition.  Their approach is very
interesting and is predicated by having data indexed in a certain
format.
I've considered building something based on the FunctionQuery class as
well.  Within this class, I could actually do the mathematical
calculations
(Haversine formula, anyone?) for geo-positional filtering.  Range
queries on
this data set were an option as well.

I've hit a performance wall with these approaches.  The geo-positional
calculations are proving to be intensive.  With the combination of my
data
set, hardware, and OS, this just wasn't getting it done.

So, I reversed my thinking.  Instead of getting Lucene to do geo-math,
what
if I made geo-math do Lucene?  Lucene is exceptional at string lookups;
how
could I do geo-positional search in that framework?  I seized upon an
approach that I've validated in testing, but wanted to get more feedback
from the community.

************************************************************************
********************************

Data structure:
Latitude & longitude values in decimal format, i.e. latitude=47.480227,
longitude=-122.333670 (btw, a tire shop I recently visited).

Geo definition:
Boxing around a center point.  It's not critical to do a radius search
with
a given circle.  A boxed approach allows for taller or wider frames of
reference, which are applicable for our use.


How I'm solving:
Treat the data as strings and formulate boolean query lookups based on
positional comparison.  Here are the steps:

Indexing:
1. Break up lat/long values to individual characters, and store each
field
in progression.  Field storage type = Keyword.  The tire shop example:
Lat1 = [4]
Lat2 = [7]
Lat3 = [.]
Lat4 = [4]
Lat5 = [8]
...
Long1 = [-]
Long2 = [1]
Long3 = [2]
...

Searching:
1. Start with box coordinates - North/West corner, South/East corner.
For
example, a sample box around Seattle, WA:
North latitude = 47.77704
West longitude = -122.41909
South latitude = 47.34827
East longitude = -122.22031
2. Break up lat/long values in a manner similar to indexing.
3. Begin to build boolean query.  Query will contain two required
clauses:
the North/South query, and the West/East query.  Both queries are built
using the same progressional evaluation of characters by position.
4. Compare North/South (top/bottom) values.  Here's a pseudo-graphical
chart:

North = [4][7][.][7][7][7][0][4]
South = [4][7][.][3][4][8][2][7]

Qualifying records will have latitude values between 47.77704 and
47.34827.
With lexicographical ordering in mind, I can safely include this phrase
in
my North/South query:

   (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)

The logic is that any latitude with the first three values matching, and
the
fourth position containing either 4 or 5 or 6 will yield a qualifying
match.

What other queries are needed?  Ones that match 7 or 3 in the fourth
position should be considered, but they need further qualification.  The
further qualification is based on values from the fifth position.  I can
safely included the following phrases in my North/South query:

    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3
Lat5:2
Lat5:1 Lat5:0)
    (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
Lat5:9)

The logic here is simply an extension of the first query, extended to
next
position in the latitude range.  In the Northern latitude, with the
first
four positions matching those values exactly, anything less than 7 in
the
fifth position will yield a matching latitude.  Same goes for the South
(bottom) query.

************************************************************************
*******************

This approach yields a higher number of boolean query clauses, the more
granular you get.  In this scenario, I've estimated approximately 145
clauses within the final constructed query.  In validation testing, this
approach has proven to be:

1) Accurate.
2) Performant (thus far).


At last, my question to everyone who cares to respond (and read this
far):
feedback?

Thanks,
-- jeff




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Hacking proximity search: looking for feedback

jrodenburg
I'm in the same boat as Michael on this one.  It's not a matter of finding
the right technology to do geo-locational calculations, but rather being
able to accomplish that task in conjunction with keyword search.

-- j

On 2/28/06, Bryzek.Michael <[hidden email]> wrote:

>
> Our geo searches are combined with keyword searches. We previously
> performed all of our queries in the database (Oracle 10g w/ interMedia for
> the unstructured portion) but found that it was easier to scale search
> outside the database than within.
>
>
> -----Original Message-----
> From:   John Powers [mailto:[hidden email]]
> Sent:   Tue 2/28/06 3:53 PM
> To:     [hidden email]
> Cc:
> Subject:        RE: Hacking proximity search: looking for feedback
>
> I don't know if this matters, but we do all of our geolocating in sql
> with decent speed.   All the trig is in the query itself and then we can
> limit top 5, top 10 etc for what we show.    Is the data such that you
> need lucene?    Can I ask what causes it to be beyond a databases
> ability?
>
> -----Original Message-----
> From: Bryzek.Michael [mailto:[hidden email]]
> Sent: Tuesday, February 28, 2006 2:49 PM
> To: [hidden email]
> Subject: RE: Hacking proximity search: looking for feedback
>
> Jeff -
>
> This is an interesting approach. On our end, we have experimented with
> two variants:
>
> Variant 1: Use Range Query
>
> Rather than precomputing the boolean clauses yourself, index encoded
> latitude and longitude values and use a Range Query. We encode by
> adding 1000 to each of the values. Note: We only deal with zip codes
> in the US for which this encoding works fine, but is worth another
> look if you have a broader range of coordinates.
>
> Example: Compute the box as you describe, then encode each value and
> use a RangeQuery. Using the box you provided and lucene's query
> parser:
>
>        North latitude = 47.77704
>        West longitude = -122.41909
>        South latitude = 47.34827
>        East longitude = -122.22031
>
> gives us this query:
>
> latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO
> 877.77969]
>
> Lucene then handles expanding this query into the appropriate set of
> Boolean Queries.
>
>
> Variant 2: Combine the above w/ a custom scorer
>
> For us, boxing works well to retrieve relevant documents, but our
> users want those documents sorted by distance. We modified the custom
> scoring class that Hatcher provides in Lucene in Action to compute the
> distance between two points for only those documents which actually
> match. Thus, we do our search using a Range Query, then for all
> matching documents we compute an actual score that incorporates the
> distance from the actual location from which the user is searching.
>
> On our data set, we can still end up with 1000s of matching documents
> after boxing. Thus, we still see a bottleneck computing the score for
> even this smaller set of documents which we are still working through.
>
> -Mike
>
>
> -----Original Message-----
> From:   Jeff Rodenburg [mailto:[hidden email]]
> Sent:   Tue 2/28/06 3:10 PM
> To:     [hidden email]
> Cc:
> Subject:        Hacking proximity search: looking for feedback
>
> I've been wrestling with a way to index and search data with a
> geo-positional aspect.  By a geo-positional search, I want to constrain
> search results within a given location range.  Furthermore, I want to
> allow
> the user to set/change the geo-positional boundaries as needed for their
> search.  This isn't a foreign concept to anyone who's needed to do the
> same.
>
> There's been some work done in this area publicly (or at least what I
> could
> find via the list archives and google), but not a lot.  The guys at
> eventax.de have done some very impressive work.  Their implementation is
> within the constructs of nutch; there's more here at
> http://wiki.apache.org/nutch/GeoPosition.  Their approach is very
> interesting and is predicated by having data indexed in a certain
> format.
> I've considered building something based on the FunctionQuery class as
> well.  Within this class, I could actually do the mathematical
> calculations
> (Haversine formula, anyone?) for geo-positional filtering.  Range
> queries on
> this data set were an option as well.
>
> I've hit a performance wall with these approaches.  The geo-positional
> calculations are proving to be intensive.  With the combination of my
> data
> set, hardware, and OS, this just wasn't getting it done.
>
> So, I reversed my thinking.  Instead of getting Lucene to do geo-math,
> what
> if I made geo-math do Lucene?  Lucene is exceptional at string lookups;
> how
> could I do geo-positional search in that framework?  I seized upon an
> approach that I've validated in testing, but wanted to get more feedback
> from the community.
>
> ************************************************************************
> ********************************
>
> Data structure:
> Latitude & longitude values in decimal format, i.e. latitude=47.480227,
> longitude=-122.333670 (btw, a tire shop I recently visited).
>
> Geo definition:
> Boxing around a center point.  It's not critical to do a radius search
> with
> a given circle.  A boxed approach allows for taller or wider frames of
> reference, which are applicable for our use.
>
>
> How I'm solving:
> Treat the data as strings and formulate boolean query lookups based on
> positional comparison.  Here are the steps:
>
> Indexing:
> 1. Break up lat/long values to individual characters, and store each
> field
> in progression.  Field storage type = Keyword.  The tire shop example:
> Lat1 = [4]
> Lat2 = [7]
> Lat3 = [.]
> Lat4 = [4]
> Lat5 = [8]
> ...
> Long1 = [-]
> Long2 = [1]
> Long3 = [2]
> ...
>
> Searching:
> 1. Start with box coordinates - North/West corner, South/East corner.
> For
> example, a sample box around Seattle, WA:
> North latitude = 47.77704
> West longitude = -122.41909
> South latitude = 47.34827
> East longitude = -122.22031
> 2. Break up lat/long values in a manner similar to indexing.
> 3. Begin to build boolean query.  Query will contain two required
> clauses:
> the North/South query, and the West/East query.  Both queries are built
> using the same progressional evaluation of characters by position.
> 4. Compare North/South (top/bottom) values.  Here's a pseudo-graphical
> chart:
>
> North = [4][7][.][7][7][7][0][4]
> South = [4][7][.][3][4][8][2][7]
>
> Qualifying records will have latitude values between 47.77704 and
> 47.34827.
> With lexicographical ordering in mind, I can safely include this phrase
> in
> my North/South query:
>
>    (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)
>
> The logic is that any latitude with the first three values matching, and
> the
> fourth position containing either 4 or 5 or 6 will yield a qualifying
> match.
>
> What other queries are needed?  Ones that match 7 or 3 in the fourth
> position should be considered, but they need further qualification.  The
> further qualification is based on values from the fifth position.  I can
> safely included the following phrases in my North/South query:
>
>     (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3
> Lat5:2
> Lat5:1 Lat5:0)
>     (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
> Lat5:9)
>
> The logic here is simply an extension of the first query, extended to
> next
> position in the latitude range.  In the Northern latitude, with the
> first
> four positions matching those values exactly, anything less than 7 in
> the
> fifth position will yield a matching latitude.  Same goes for the South
> (bottom) query.
>
> ************************************************************************
> *******************
>
> This approach yields a higher number of boolean query clauses, the more
> granular you get.  In this scenario, I've estimated approximately 145
> clauses within the final constructed query.  In validation testing, this
> approach has proven to be:
>
> 1) Accurate.
> 2) Performant (thus far).
>
>
> At last, my question to everyone who cares to respond (and read this
> far):
> feedback?
>
> Thanks,
> -- jeff
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Hacking proximity search: looking for feedback

jrodenburg
In reply to this post by Bryzek.Michael
Michael -

Great thoughts, and thanks for the feedback.

Following on the Range Query approach, how is performance?  I found the
range approach (albeit with the exact values) to be slower than the
parsed-string approach I posited.

On the custom scoring, is the distance element for sorting or as a component
of relevance?  We have a need for distance sorting, but I'm trying to slay
that beast at a later stage.

-- jeff

On 2/28/06, Bryzek.Michael <[hidden email]> wrote:

>
> Jeff -
>
> This is an interesting approach. On our end, we have experimented with
> two variants:
>
> Variant 1: Use Range Query
>
> Rather than precomputing the boolean clauses yourself, index encoded
> latitude and longitude values and use a Range Query. We encode by
> adding 1000 to each of the values. Note: We only deal with zip codes
> in the US for which this encoding works fine, but is worth another
> look if you have a broader range of coordinates.
>
> Example: Compute the box as you describe, then encode each value and
> use a RangeQuery. Using the box you provided and lucene's query
> parser:
>
>        North latitude = 47.77704
>        West longitude = -122.41909
>        South latitude = 47.34827
>        East longitude = -122.22031
>
> gives us this query:
>
> latitude:[1047.34827 TO 1047.77704] AND longitude:[877.58091 TO 877.77969]
>
> Lucene then handles expanding this query into the appropriate set of
> Boolean Queries.
>
>
> Variant 2: Combine the above w/ a custom scorer
>
> For us, boxing works well to retrieve relevant documents, but our
> users want those documents sorted by distance. We modified the custom
> scoring class that Hatcher provides in Lucene in Action to compute the
> distance between two points for only those documents which actually
> match. Thus, we do our search using a Range Query, then for all
> matching documents we compute an actual score that incorporates the
> distance from the actual location from which the user is searching.
>
> On our data set, we can still end up with 1000s of matching documents
> after boxing. Thus, we still see a bottleneck computing the score for
> even this smaller set of documents which we are still working through.
>
> -Mike
>
>
> -----Original Message-----
> From:   Jeff Rodenburg [mailto:[hidden email]]
> Sent:   Tue 2/28/06 3:10 PM
> To:     [hidden email]
> Cc:
> Subject:        Hacking proximity search: looking for feedback
>
> I've been wrestling with a way to index and search data with a
> geo-positional aspect.  By a geo-positional search, I want to constrain
> search results within a given location range.  Furthermore, I want to
> allow
> the user to set/change the geo-positional boundaries as needed for their
> search.  This isn't a foreign concept to anyone who's needed to do the
> same.
>
> There's been some work done in this area publicly (or at least what I
> could
> find via the list archives and google), but not a lot.  The guys at
> eventax.de have done some very impressive work.  Their implementation is
> within the constructs of nutch; there's more here at
> http://wiki.apache.org/nutch/GeoPosition.  Their approach is very
> interesting and is predicated by having data indexed in a certain format.
> I've considered building something based on the FunctionQuery class as
> well.  Within this class, I could actually do the mathematical
> calculations
> (Haversine formula, anyone?) for geo-positional filtering.  Range queries
> on
> this data set were an option as well.
>
> I've hit a performance wall with these approaches.  The geo-positional
> calculations are proving to be intensive.  With the combination of my data
> set, hardware, and OS, this just wasn't getting it done.
>
> So, I reversed my thinking.  Instead of getting Lucene to do geo-math,
> what
> if I made geo-math do Lucene?  Lucene is exceptional at string lookups;
> how
> could I do geo-positional search in that framework?  I seized upon an
> approach that I've validated in testing, but wanted to get more feedback
> from the community.
>
>
> ********************************************************************************************************
>
> Data structure:
> Latitude & longitude values in decimal format, i.e. latitude=47.480227,
> longitude=-122.333670 (btw, a tire shop I recently visited).
>
> Geo definition:
> Boxing around a center point.  It's not critical to do a radius search
> with
> a given circle.  A boxed approach allows for taller or wider frames of
> reference, which are applicable for our use.
>
>
> How I'm solving:
> Treat the data as strings and formulate boolean query lookups based on
> positional comparison.  Here are the steps:
>
> Indexing:
> 1. Break up lat/long values to individual characters, and store each field
> in progression.  Field storage type = Keyword.  The tire shop example:
> Lat1 = [4]
> Lat2 = [7]
> Lat3 = [.]
> Lat4 = [4]
> Lat5 = [8]
> ...
> Long1 = [-]
> Long2 = [1]
> Long3 = [2]
> ...
>
> Searching:
> 1. Start with box coordinates - North/West corner, South/East corner.  For
> example, a sample box around Seattle, WA:
> North latitude = 47.77704
> West longitude = -122.41909
> South latitude = 47.34827
> East longitude = -122.22031
> 2. Break up lat/long values in a manner similar to indexing.
> 3. Begin to build boolean query.  Query will contain two required clauses:
> the North/South query, and the West/East query.  Both queries are built
> using the same progressional evaluation of characters by position.
> 4. Compare North/South (top/bottom) values.  Here's a pseudo-graphical
> chart:
>
> North = [4][7][.][7][7][7][0][4]
> South = [4][7][.][3][4][8][2][7]
>
> Qualifying records will have latitude values between 47.77704 and 47.34827
> .
> With lexicographical ordering in mind, I can safely include this phrase in
> my North/South query:
>
>    (+Lat1:4 +Lat2:7 +Lat3:. +(Lat4:4 Lat4:5 Lat4:6)
>
> The logic is that any latitude with the first three values matching, and
> the
> fourth position containing either 4 or 5 or 6 will yield a qualifying
> match.
>
> What other queries are needed?  Ones that match 7 or 3 in the fourth
> position should be considered, but they need further qualification.  The
> further qualification is based on values from the fifth position.  I can
> safely included the following phrases in my North/South query:
>
>     (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:7) +(Lat5:6 Lat5:5 Lat5:4 Lat5:3 Lat5:2
> Lat5:1 Lat5:0)
>     (+Lat1:4 +Lat2:7 +Lat3:. +Lat4:3) +(Lat5:5 Lat5:6 Lat5:7 Lat5:8
> Lat5:9)
>
> The logic here is simply an extension of the first query, extended to next
> position in the latitude range.  In the Northern latitude, with the first
> four positions matching those values exactly, anything less than 7 in the
> fifth position will yield a matching latitude.  Same goes for the South
> (bottom) query.
>
>
> *******************************************************************************************
>
> This approach yields a higher number of boolean query clauses, the more
> granular you get.  In this scenario, I've estimated approximately 145
> clauses within the final constructed query.  In validation testing, this
> approach has proven to be:
>
> 1) Accurate.
> 2) Performant (thus far).
>
>
> At last, my question to everyone who cares to respond (and read this far):
> feedback?
>
> Thanks,
> -- jeff
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Hacking proximity search: looking for feedback

jrodenburg
In reply to this post by Chris Hostetter-3
Very good points, I hadn't considered the term frequency of the digits
affecting scoring.  As an aside, can that aspect of the score be ignored for
these fields?

I need to spend more time with FunctionQuery, I haven't given it the
attention it deserves.

Great feedback, thanks for the notes.

-- jeff

On 2/28/06, Chris Hostetter <[hidden email]> wrote:

>
>
> : Geo definition:
> : Boxing around a center point.  It's not critical to do a radius search
> with
> : a given circle.  A boxed approach allows for taller or wider frames of
> : reference, which are applicable for our use.
>
> if you are just loking to confine your results to a box then i think
> RangeFiltering on both the X and Y axis will be more efficient then the
> individual term queries you are producing.
>
> It will have the added bonus of not artificially affecting the scores of
> hte documents based on how often a particular digit apears in a particular
> position of hte latitue accross your corpus.
>
> Once you've filtered down to a particular bounding box, you might consider
> going back to the function query approach to score documents inside that
> box based on their actual distance from the center point.  I don't recall
> at the moment but i believe FunctionQuery's Scorer supports skipTo in such
> a way that it won't bother computing the function for a document that has
> been skiped (ie: when containing in a BooleanQuery with another clause
> that has already prohibited it, or when executed in the context of a
> Filter)
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Hacking proximity search: looking for feedback

Chris Hostetter-3

: Very good points, I hadn't considered the term frequency of the digits
: affecting scoring.  As an aside, can that aspect of the score be ignored for
: these fields?

The easiest way is to use a boost that is so low it's insignificant, or
you could subclass TermQuery and override getSimilarity to return a
DelegateSimilarity which wraps the real instance and returns constant
values for things like tf() and idf() ... but i'm 95% sure that using a
RangeFilter (or a ConstantScoreRangeQuery) is going to be faster then all
of those TermQueries no matter what.

: I need to spend more time with FunctionQuery, I haven't given it the
: attention it deserves.

i would start by trying out an apples to apples comparison of your current
approach with one where your index only has one indexed field each for
long/lat that uses ConstantScoreRangeQuery to do the boxing.  Compare both
the size of the resulting indexes, the memory footprint while open, and
the time spent executing comparable queries.  You should probably compare
queries that involve both large boxes and small boxes, and depending on
hte usage pattern you expect consider caching your Filters if you expect
many boxes to be reused frequently.

once you've found the "best" way to do your boxing ... then look into
using FunctionQueries to influence your scores based on distance fro mthe
center of hte box.

:
: Great feedback, thanks for the notes.
:
: -- jeff
:
: On 2/28/06, Chris Hostetter <[hidden email]> wrote:
: >
: >
: > : Geo definition:
: > : Boxing around a center point.  It's not critical to do a radius search
: > with
: > : a given circle.  A boxed approach allows for taller or wider frames of
: > : reference, which are applicable for our use.
: >
: > if you are just loking to confine your results to a box then i think
: > RangeFiltering on both the X and Y axis will be more efficient then the
: > individual term queries you are producing.
: >
: > It will have the added bonus of not artificially affecting the scores of
: > hte documents based on how often a particular digit apears in a particular
: > position of hte latitue accross your corpus.
: >
: > Once you've filtered down to a particular bounding box, you might consider
: > going back to the function query approach to score documents inside that
: > box based on their actual distance from the center point.  I don't recall
: > at the moment but i believe FunctionQuery's Scorer supports skipTo in such
: > a way that it won't bother computing the function for a document that has
: > been skiped (ie: when containing in a BooleanQuery with another clause
: > that has already prohibited it, or when executed in the context of a
: > Filter)
: >
: >
: >
: > -Hoss
: >
: >
: > ---------------------------------------------------------------------
: > To unsubscribe, e-mail: [hidden email]
: > For additional commands, e-mail: [hidden email]
: >
: >
:



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Hacking proximity search: looking for feedback

jrodenburg
Thanks to everyone on the replies.  I'm going to try several of these
approaches and with equivalent data sets and run some side-by-side tests.

No timeframes guarantees here, but I'll report back with the different
approaches and the test results.

cheers,
-- j


On 2/28/06, Chris Hostetter <[hidden email]> wrote:

>
>
> : Very good points, I hadn't considered the term frequency of the digits
> : affecting scoring.  As an aside, can that aspect of the score be ignored
> for
> : these fields?
>
> The easiest way is to use a boost that is so low it's insignificant, or
> you could subclass TermQuery and override getSimilarity to return a
> DelegateSimilarity which wraps the real instance and returns constant
> values for things like tf() and idf() ... but i'm 95% sure that using a
> RangeFilter (or a ConstantScoreRangeQuery) is going to be faster then all
> of those TermQueries no matter what.
>
> : I need to spend more time with FunctionQuery, I haven't given it the
> : attention it deserves.
>
> i would start by trying out an apples to apples comparison of your current
> approach with one where your index only has one indexed field each for
> long/lat that uses ConstantScoreRangeQuery to do the boxing.  Compare both
> the size of the resulting indexes, the memory footprint while open, and
> the time spent executing comparable queries.  You should probably compare
> queries that involve both large boxes and small boxes, and depending on
> hte usage pattern you expect consider caching your Filters if you expect
> many boxes to be reused frequently.
>
> once you've found the "best" way to do your boxing ... then look into
> using FunctionQueries to influence your scores based on distance fro mthe
> center of hte box.
>
> :
> : Great feedback, thanks for the notes.
> :
> : -- jeff
> :
> : On 2/28/06, Chris Hostetter <[hidden email]> wrote:
> : >
> : >
> : > : Geo definition:
> : > : Boxing around a center point.  It's not critical to do a radius
> search
> : > with
> : > : a given circle.  A boxed approach allows for taller or wider frames
> of
> : > : reference, which are applicable for our use.
> : >
> : > if you are just loking to confine your results to a box then i think
> : > RangeFiltering on both the X and Y axis will be more efficient then
> the
> : > individual term queries you are producing.
> : >
> : > It will have the added bonus of not artificially affecting the scores
> of
> : > hte documents based on how often a particular digit apears in a
> particular
> : > position of hte latitue accross your corpus.
> : >
> : > Once you've filtered down to a particular bounding box, you might
> consider
> : > going back to the function query approach to score documents inside
> that
> : > box based on their actual distance from the center point.  I don't
> recall
> : > at the moment but i believe FunctionQuery's Scorer supports skipTo in
> such
> : > a way that it won't bother computing the function for a document that
> has
> : > been skiped (ie: when containing in a BooleanQuery with another clause
> : > that has already prohibited it, or when executed in the context of a
> : > Filter)
> : >
> : >
> : >
> : > -Hoss
> : >
> : >
> : > ---------------------------------------------------------------------
> : > To unsubscribe, e-mail: [hidden email]
> : > For additional commands, e-mail: [hidden email]
> : >
> : >
> :
>
>
>
> -Hoss
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Hacking proximity search: looking for feedback

Doug Cutting
In reply to this post by jrodenburg
Jeff Rodenburg wrote:
> Following on the Range Query approach, how is performance?  I found the
> range approach (albeit with the exact values) to be slower than the
> parsed-string approach I posited.

Note that Hoss suggested RangeFilter, not RangeQuery.  Or perhaps
ConstantScoreRangeQuery, which is implemented in terms of RangeFilter.
But RangeQuery will be slower than either RangeFilter or
ConstantScoreRangeQuery.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Hacking proximity search: looking for feedback

jrodenburg
Very good note, I missed that.  I need the development environment in front
of me to remember all the different class names correctly.  ;-)

-- j


On 3/1/06, Doug Cutting <[hidden email]> wrote:

>
> Jeff Rodenburg wrote:
> > Following on the Range Query approach, how is performance?  I found the
> > range approach (albeit with the exact values) to be slower than the
> > parsed-string approach I posited.
>
> Note that Hoss suggested RangeFilter, not RangeQuery.  Or perhaps
> ConstantScoreRangeQuery, which is implemented in terms of RangeFilter.
> But RangeQuery will be slower than either RangeFilter or
> ConstantScoreRangeQuery.
>
> Doug
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>