Making "routing" strategies for writing segments explicit ?

Tommaso Teofili
Hi all,

Having faced this kind of challenge myself and having seen a few similar enquiries on the dev list, I was wondering whether it may be time to think about an explicit (customizable and therefore pluggable) policy that lets people control where documents and / or segments get written (on write or on merge).
Recently someone asked about having segments sorted by a field using SortingMergePolicy, but as Uwe noted, that is currently an implementation detail. Personally, I have tried (and failed, because it was too costly) to make sure that docs belonging to certain clusters (identified by a field) are written within the same segments (for data locality / memory footprint concerns when "loading" docs from a certain cluster).

As of today that'd be *really* hard, but I just wanted to share my feeling that this topic might be something to keep an eye on.

My 2 cents,
Tommaso

Re: Making "routing" strategies for writing segments explicit ?

david.w.smiley@gmail.com
Hi Tommaso,

It's definitely something I've pondered on occasion, but I'm left wondering (a) whether it is worth it (experimentation will tell), and (b) whether Lucene needs anything new here at all: see MultiReader. Arguably this can be handled at the search server layer by constructing multiple IndexWriters and then a MultiReader over their collective indexes. Perhaps a special IndexSearcher QueryCache could be developed to partition itself on the separate underlying readers. Of course, it would probably take a lot of work to retrofit, say, Solr to do this, but I'm dubious that Lucene should be saddled with unneeded complexity for this.

On Thu, Oct 12, 2017 at 9:55 AM Tommaso Teofili <[hidden email]> wrote:
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker

Re: Making "routing" strategies for writing segments explicit ?

Tommaso Teofili
Hi David,

I see your point. I am not saying that such big low-level changes are badly needed today for most production scenarios; I am just observing that this might become a useful extension. For example, word / document embeddings are being used more and more (mostly in research), so retrieving / scoring docs that belong to the same cluster (or that are near / similar embedding-wise, regardless of the metric) is a significant part of query processing (retrieval / ranking).

However, I think your suggestion to look at easier solutions first, such as MultiReader, is a good one; e.g. in "my" use case, if each doc belongs to a single cluster, it might be good to create an index per cluster.
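
To make the index-per-cluster idea concrete, here is a rough, library-free sketch. It is a toy model under stated assumptions and not Lucene API: `ClusterIndex` and `RoutedIndexes` are hypothetical names, each "index" is just an in-memory term-to-docs map, and the routing field is assumed to be present on every doc. It only illustrates routing on write and then searching either a single cluster or all clusters together (the role a MultiReader would play).

```python
from collections import defaultdict


class ClusterIndex:
    """Toy stand-in for one per-cluster index (hypothetical, not a Lucene class)."""

    def __init__(self):
        self.docs = []                    # stored docs; list position is the local doc id
        self.postings = defaultdict(set)  # term -> set of local doc ids

    def add(self, doc):
        doc_id = len(self.docs)
        self.docs.append(doc)
        for term in doc["text"].lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        # Return matching docs in insertion order.
        return [self.docs[i] for i in sorted(self.postings.get(term.lower(), ()))]


class RoutedIndexes:
    """Routes each doc to the index of its cluster and can search across all of them."""

    def __init__(self, routing_field):
        self.routing_field = routing_field
        self.indexes = {}                 # cluster key -> ClusterIndex

    def add(self, doc):
        key = doc[self.routing_field]     # assumption: every doc carries the routing field
        self.indexes.setdefault(key, ClusterIndex()).add(doc)

    def search(self, term, cluster=None):
        if cluster is not None:
            # Only the one cluster's index is touched (the data-locality case).
            index = self.indexes.get(cluster)
            return index.search(term) if index else []
        # Combined view over every per-cluster index, in cluster-key order.
        hits = []
        for key in sorted(self.indexes):
            hits.extend(self.indexes[key].search(term))
        return hits


routed = RoutedIndexes("cluster")
routed.add({"id": 1, "cluster": "a", "text": "lucene segments"})
routed.add({"id": 2, "cluster": "b", "text": "lucene merge policy"})
routed.add({"id": 3, "cluster": "a", "text": "merge segments"})

one_cluster = routed.search("segments", cluster="a")  # touches only cluster "a"
all_clusters = routed.search("lucene")                # spans every cluster
```

The trade-off in this sketch is the same one discussed above: per-cluster queries stay cheap because only one index is read, while the cost and complexity move to whatever layer has to manage many indexes and combine them.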

Thanks and regards,
Tommaso

On Mon, Oct 16, 2017 at 9:28 PM David Smiley <[hidden email]> wrote: