SolrCloud collection design considerations / best practice

shamik
Hi,

    I'm looking for some input on design considerations for defining
collections in a SolrCloud cluster. Right now, our cluster consists of two
collections in a 2 shard / 2 replica mode. Each collection has a dedicated
set of sources that don't overlap, which made it an easy decision.
Recently, we got a requirement to index a bunch of new sources that are
region based. The search results for each region need to come
from its specific source as well as the sources from one of our existing
collections. Here's an example of our existing collections and their
corresponding source(s).

Existing Collection:
--------------------------
Collection A --> Source_A, Source_B
Collection B --> Source_C, Source_D, Source_E

Proposed Collection:
----------------------------
Collection_Asia --> Source_Asia, Source_C, Source_D, Source_E
Collection_Europe --> Source_Europe, Source_C, Source_D, Source_E
Collection_Australia --> Source_Australia, Source_C, Source_D, Source_E

The proposed collections show that each geo has its dedicated source
as well as the source(s) from existing Collection B.

Just wondering if creating a dedicated collection for each geo is the right
approach here. The main motivation is to support a geo-specific relevancy
model which can easily be customized without stepping into each other. On
the downside, I'm not sure if it's a good idea to replicate data from the
same source across various collections. Moreover, the data within the
sources is not relational, so joining across collections might not be
an easy proposition.
The other consideration is the hardware design. Right now, both shards and
their replicas run on their own dedicated instances. With two collections, we
sometimes run into OOM scenarios, so I'm a little bit worried about adding
more collections. Does the best practice (I know it's subjective) in
scenarios like this call for a dedicated Solr cluster per collection? From
an index size perspective, Source_C, Source_D and Source_E combined come close
to 10 million documents with a 60 GB volume size. Each geo-based source is
small and won't exceed 500k documents.

Any pointers will be appreciated.

Thanks,
Shamik

Re: SolrCloud collection design considerations / best practice

Erick Erickson
Have you considered collection aliasing? You can create an alias that
points to multiple collections. So you could keep specific collections
and have aliases that encompass your regions....

The one caveat here is that sorting the final result set by score will
require that the collections be roughly similar in terms of TF/IDF.
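
As a rough sketch of the aliasing idea, the snippet below only builds the Collections API `CREATEALIAS` request URLs; it does not contact a cluster. The base URL, the alias names (`Search_Asia` etc.) and the assumption that each region collection holds only its own regional source (with Source_C/D/E staying in Collection_B) are all hypothetical, not something stated in the thread.

```python
from urllib.parse import urlencode

# Assumed address of any node in the cluster (hypothetical).
SOLR_BASE = "http://localhost:8983/solr"


def create_alias_url(alias, collections):
    """Build a Collections API CREATEALIAS URL for the given alias."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "collections": ",".join(collections),
    }
    # Keep commas readable instead of percent-encoding them.
    return f"{SOLR_BASE}/admin/collections?{urlencode(params, safe=',')}"


# Hypothetical layout: each region collection holds only its regional
# source, and the shared sources stay in Collection_B, indexed once.
REGION_ALIASES = {
    "Search_Asia": ["Collection_Asia", "Collection_B"],
    "Search_Europe": ["Collection_Europe", "Collection_B"],
    "Search_Australia": ["Collection_Australia", "Collection_B"],
}

for alias, members in REGION_ALIASES.items():
    print(create_alias_url(alias, members))
```

With a layout like this the shared sources would not need to be copied into every region collection, which was one of the concerns in the original post.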

Best,
Erick


Re: SolrCloud collection design considerations / best practice

Shawn Heisey-2
In reply to this post by shamik
On 11/13/2017 12:33 PM, Shamik Bandopadhyay wrote:
>     I'm looking for some input on design considerations for defining
> collections in a SolrCloud cluster. Right now, our cluster consists of two
> collections in a 2 shard / 2 replica mode. Each collection has a dedicated
> set of sources that don't overlap, which made it an easy decision.
> Recently, we got a requirement to index a bunch of new sources that are
> region based. The search results for each region need to come
> from its specific source as well as the sources from one of our existing
> collections. Here's an example of our existing collections and their
> corresponding source(s).

You haven't defined in *ANY* way exactly what a "source" is or how that
data actually gets into Solr.  Without that information, it'll be
difficult to even understand your requirements.

If I make one assumption that for all of the data sources, the config
and schema are going to be identical, then I can give you this information:

If you set up each source as a collection in your SolrCloud, you can
create collection aliases that let you query multiple collections with
one query.  Whether or not this will work correctly will depend on a few
factors, but most of all whether or not all the data is using the same
(or extremely similar) Solr config/schema.
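
To illustrate "one query", here is a minimal sketch that builds a `/select` URL addressed to an alias exactly as if it were a collection name. The host, the alias name `Search_Asia` and the field names in the example query are assumptions made for illustration only.

```python
from urllib.parse import urlencode


def select_url(base, target, query, rows=10):
    """Build a /select URL; `target` may be a collection or an alias."""
    params = {"q": query, "rows": rows, "fl": "id,score"}
    # Leave ':' and ',' unescaped so the URL stays readable.
    return f"{base}/{target}/select?{urlencode(params, safe=',:')}"


# Hypothetical alias spanning a region collection and the shared one.
print(select_url("http://localhost:8983/solr", "Search_Asia", "title:solr"))
```

The client does not need to know which collections sit behind the alias, so the membership can be changed later without touching the query code.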

> The other consideration is the hardware design. Right now, both shards and
> their replicas run on their own dedicated instances. With two collections, we
> sometimes run into OOM scenarios, so I'm a little bit worried about adding
> more collections. Does the best practice (I know it's subjective) in
> scenarios like this call for a dedicated Solr cluster per collection? From
> an index size perspective, Source_C, Source_D and Source_E combined come close
> to 10 million documents with a 60 GB volume size. Each geo-based source is
> small and won't exceed 500k documents.

10 million documents producing 60GB of index data means that the
documents are relatively large, but aren't super huge -- or that the
data in them is duplicated several times.  For contrast, I have an index
where each shard has about 30 million docs, and each of those shards is
36GB in size.  The entire index has six of these large shards and one
tiny hot shard.
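
The comparison above can be made concrete with a little arithmetic. This is only a back-of-the-envelope sketch: it treats "close to 10 million" and "about 30 million" as exact counts and assumes decimal gigabytes (1 GB = 10^9 bytes).

```python
def avg_bytes_per_doc(index_gb, doc_count):
    """Average on-disk index bytes per document, using 1 GB = 10**9 bytes."""
    return index_gb * 10**9 / doc_count


# Figures quoted in the thread, rounded to the stated approximations.
poster = avg_bytes_per_doc(60, 10_000_000)  # Source_C + Source_D + Source_E
shawn = avg_bytes_per_doc(36, 30_000_000)   # one large shard from the reply

print(f"original poster: {poster:.0f} bytes/doc")
print(f"comparison shard: {shawn:.0f} bytes/doc")
print(f"ratio: {poster / shawn:.1f}x")
```

That works out to roughly 6 KB of index per document against roughly 1.2 KB, about five times as much.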

I always get a little anxious when somebody wants best practice
information about Solr configurations and hardware.  Any recommendation
that we make will be COMPLETELY wrong for some use cases, indexes,
and/or query patterns.  Solr configurations and hardware must be
tailored specifically for the use case, index data, and query patterns
that actually exist.  Typically, this means that you have to actually
set up a full system and try it to make any determinations about how
much hardware you need.

https://lucidworks.com/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Regarding your hardware sizing, the only general advice I can give you
is this:  Good performance usually ends up requiring significantly more
RAM than users plan on.

Thanks,
Shawn


Re: SolrCloud collection design considerations / best practice

alessandro.benedetti
"The main motivation is to support a geo-specific relevancy
model which can easily be customized without stepping into each other"

Is your relevancy tuning heavily index-time based?
i.e. will you create massively different index content based on the geo
location?

If it is just query-time based or only lightly index-time based (a few fields
of difference across regions), you don't need different collections at all to
have a customized relevancy model per use case.

In Solr you can define different request handlers with different query
parsers and search component specifications.
If you go deep into relevancy tuning and, for example, experiment with
Learning To Rank, it supports passing the model name at query time, which
means you can use a different relevancy model just by passing it as a request
parameter.
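
A minimal sketch of the query-time approach, assuming the Learning To Rank plugin is enabled and one model per region has already been uploaded. The model names and the region keys are hypothetical; the `rq={!ltr ...}` rerank syntax is the one used by Solr's LTR query parser.

```python
from urllib.parse import urlencode

# Hypothetical mapping from region to an already uploaded LTR model.
REGION_MODELS = {
    "asia": "asia_model",
    "europe": "europe_model",
    "australia": "australia_model",
}


def ltr_params(user_query, region, rerank_docs=100):
    """Request parameters that rerank the top hits with a regional model."""
    model = REGION_MODELS[region]
    return {
        "q": user_query,
        # Doubled braces produce literal '{' and '}' in the f-string.
        "rq": f"{{!ltr model={model} reRankDocs={rerank_docs}}}",
        "fl": "id,score",
    }


# Same collection, different relevancy model per request.
for region in REGION_MODELS:
    print(urlencode(ltr_params("solr cloud", region)))
```

Only the `rq` parameter changes between regions, so a single collection could serve all of them as long as the indexed content itself does not need to differ much.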

Regards



-----
---------------
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io