Need suggestions on implementing a custom query (offload R-tree filter to fully in-memory) on Lucene-8.3

小鱼儿-2
Background: I need to implement document indexing and search for
POIs (points of interest) in an LBS (location-based services) scenario. A POI
has a name, an address, and a location (LatLonPoint), and I want to combine a
text query with a 2D geospatial range filter.

The problem: I first built a native in-memory index that uses a simple
BitSet as the DocIDSet type and the STRtree class from the well-known JTS
library, and I got 20ms/1000qps with 18,000 POIs on my laptop (Windows 7 x64,
using the mmap codec). But when I implemented the same functionality with
Lucene 8.3 (using LatLonPoint.newDistanceQuery, which seems to use the default
BKD tree index), I got only 150ms/130qps, which is a severe degradation.

So my idea is: can I write a custom filter query that builds a fully
in-memory R-tree index to speed up the 2D spatial range filter? I need
access to Lucene's internal DocIdSet class so that I can do a fast merge
with no scoring. I hope this will improve query performance.

Any suggestions?
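
For readers following the thread, the "fast merge with no scoring" idea can be sketched with plain java.util.BitSet. This is only an illustration under assumptions, not the poster's code: the class and method names are hypothetical, and doc IDs are assumed to be dense integers as in a single index segment.

```java
import java.util.BitSet;

// Hypothetical sketch: each bit position is a doc ID; a set bit means "matches".
class BitSetMergeSketch {

    // Intersects text matches with geo matches. No scores are computed.
    static BitSet intersect(BitSet textMatches, BitSet geoMatches) {
        BitSet result = (BitSet) textMatches.clone(); // leave the inputs untouched
        result.and(geoMatches);                       // word-wise AND
        return result;
    }

    public static void main(String[] args) {
        BitSet text = new BitSet(8);
        text.set(1); text.set(3); text.set(5);
        BitSet geo = new BitSet(8);
        geo.set(3); geo.set(4); geo.set(5);
        System.out.println(intersect(text, geo)); // prints {3, 5}
    }
}
```

The AND runs over whole 64-bit words, which is why this kind of merge is cheap when both sides are already materialized as bit sets.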

Re: Need suggestions on implementing a custom query (offload R-tree filter to fully in-memory) on Lucene-8.3

Adrien Grand
Are you sure you are comparing apples to apples? The first paragraph
mentions a range filter, which would be LatLonPoint#newBoxQuery, but
then you mentioned LatLonPoint#newDistanceQuery, which is
significantly more costly due to the need to compute distances.

If you plan to combine text queries with your geo queries, I'd also
advise indexing the location as both a LatLonPoint and a
LatLonDocValuesField, and then using IndexOrDocValuesQuery at query time.
Typically something like this:

```
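// Assumes "poi" was indexed as both a LatLonPoint and a LatLonDocValuesField.
Query textQuery = ...;
// Arguments are (field, minLatitude, maxLatitude, minLongitude, maxLongitude).
Query latLonPointQuery =
    LatLonPoint.newBoxQuery("poi", minLat, maxLat, minLon, maxLon);
Query latLonDocValuesQuery =
    LatLonDocValuesField.newSlowBoxQuery("poi", minLat, maxLat, minLon, maxLon);
// Lets Lucene choose, per segment, whichever of the two executions is cheaper.
Query poiQuery =
    new IndexOrDocValuesQuery(latLonPointQuery, latLonDocValuesQuery);
Query query = new BooleanQuery.Builder()
    .add(textQuery, Occur.MUST)   // scored text clause
    .add(poiQuery, Occur.FILTER)  // geo clause filters without scoring
    .build();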
```

In reply to 小鱼儿's message of Wed, Dec 4, 2019, 5:31 AM.

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: Need suggestions on implementing a custom query (offload R-tree filter to fully in-memory) on Lucene-8.3

小鱼儿-2
Hi Adrien,

In my native implementation, which combines an inverted index with an R-tree
distance query (the index data is fully loaded into memory), I first filter
with a bounding box and then apply a precise "contains" check, so both
implementations are "distance queries" (or, as I call it, a "nearby point
query").
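
The two-step filter described above (a cheap bounding-box test, then a precise distance check) might look roughly like the following. It is a sketch under assumptions: a linear scan stands in for the R-tree traversal, the class and method names are hypothetical, the haversine formula on a spherical Earth is assumed for the precise check, and the box math assumes a query that stays away from the poles and the date line.

```java
import java.util.BitSet;

// Hypothetical sketch of "bounding box first, exact distance second".
class NearbyFilterSketch {
    static final double EARTH_RADIUS_M = 6_371_008.8; // mean Earth radius (assumed sphere)

    // Great-circle distance in meters between two lat/lon points given in degrees.
    static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
    }

    // lats[i] and lons[i] hold the location of doc ID i; returns the docs within radiusM.
    static BitSet within(double[] lats, double[] lons,
                         double lat, double lon, double radiusM) {
        double angular = radiusM / EARTH_RADIUS_M;       // radius as an angle in radians
        double boxLat = Math.toDegrees(angular);         // half-height of the box
        double boxLon = Math.toDegrees(                  // half-width of the box
                Math.asin(Math.sin(angular) / Math.cos(Math.toRadians(lat))));
        BitSet hits = new BitSet(lats.length);
        for (int doc = 0; doc < lats.length; doc++) {
            // Step 1: cheap rejection by bounding box.
            if (Math.abs(lats[doc] - lat) > boxLat || Math.abs(lons[doc] - lon) > boxLon) {
                continue;
            }
            // Step 2: precise check only for the survivors.
            if (haversineMeters(lat, lon, lats[doc], lons[doc]) <= radiusM) {
                hits.set(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        double[] lats = {39.90, 39.95, 40.10, 39.98};
        double[] lons = {116.40, 116.40, 116.40, 116.50};
        // Doc 2 fails the box test; doc 3 passes the box but is about 12 km away.
        System.out.println(within(lats, lons, 39.90, 116.40, 10_000)); // prints {0, 1}
    }
}
```

The resulting BitSet has the same shape as the one the poster later wraps in a BitSetIterator, so the expensive trigonometry runs only for points that survive the box test.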

I have now implemented this as a custom Lucene Query, which filters the POIs
within a 10 km radius and converts them to a Lucene BitSetIterator. Testing
its performance: it is back to 20ms/1000QPS, and on a retest it improved to
15ms/1400QPS (I don't know why). The initial performance with Lucene's BKD
index was only 150ms/130QPS, so this is a big win!

NOTE: I first subclassed IndexSearcher and overrode the so-called
"low-level" *search(Query query, Collector results)* method, thinking
Lucene would pass my custom Query object in. Well, I was wrong. But the
custom Query subclass approach finally works!

But the question remains: why is the performance of the BKD-backed
LatLonPoint.newDistanceQuery so bad? The index data for my 18,000 POIs is only
~7MB on disk, so presumably it sits in a single Lucene "segment"? With all of
it loaded into memory via the mmap codec, is the BKD index naively scanning
all POI locations? This is only a guess.

BTW, the text-only query averages 10ms/2000QPS, at about the same level in
both my native in-memory inverted index and Lucene's index.


In reply to Adrien Grand's message of Wed, Dec 4, 2019, 4:14 PM.


Re: Need suggestions on implementing a custom query (offload R-tree filter to fully in-memory) on Lucene-8.3

Adrien Grand
I wonder whether it might be due to the JTS tree having a smaller leaf
size than Lucene's BKD tree, which would force Lucene to do many more
distance computations, especially given that you don't seem to have much data.
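
To make the leaf-size hypothesis concrete, here is a deliberately simplified one-dimensional sketch that counts per-point checks for a range query. Everything in it is an assumption for illustration: the class name is hypothetical, the leaf sizes 10 and 1024 are example values rather than confirmed defaults of JTS or Lucene, and a real distance query is two-dimensional, where many more leaves cross the query boundary.

```java
// Hypothetical sketch: sorted values are stored in fixed-size leaves, as in a block tree.
class LeafSizeSketch {

    // Counts how many individual points must be tested for the range [lo, hi].
    static int perPointChecks(double[] sorted, int leafSize, double lo, double hi) {
        int checks = 0;
        for (int start = 0; start < sorted.length; start += leafSize) {
            int end = Math.min(start + leafSize, sorted.length);
            double leafMin = sorted[start];
            double leafMax = sorted[end - 1];
            if (leafMax < lo || leafMin > hi) {
                continue; // leaf is disjoint from the query: skipped entirely
            }
            if (leafMin >= lo && leafMax <= hi) {
                continue; // leaf is fully inside: accepted without per-point work
            }
            checks += end - start; // leaf crosses the boundary: test every point in it
        }
        return checks;
    }

    public static void main(String[] args) {
        double[] values = new double[18_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i; // evenly spaced, already sorted
        }
        System.out.println(perPointChecks(values, 10, 5_005, 9_995));   // prints 20
        System.out.println(perPointChecks(values, 1024, 5_005, 9_995)); // prints 2048
    }
}
```

Only the leaves that straddle the query boundary cost per-point work, so larger leaves mean more points checked per boundary leaf. With a small data set, those boundary leaves can make up a large share of all points, which is consistent with the guess above but does not confirm it.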

In reply to 小鱼儿's message of Wed, Dec 4, 2019, 10:21 AM.

--
Adrien
