clarification regarding shard splitting and composite IDs

Ian Rose
Howdy -

We are using composite IDs of the form <user>!<event>.  This ensures that
all events for a user are stored in the same shard.

I'm assuming from the description of how composite ID routing works, that
if you split a shard the "split point" of the hash range for that shard is
chosen to maintain the invariant that all documents that share a routing
prefix (before the "!") will still map to the same (new) shard.  Is that
accurate?

A naive shard-split implementation (e.g. that chose the hash range split
point arbitrarily) could end up with "child" shards that split a routing
prefix.

Thanks,
Ian

Re: clarification regarding shard splitting and composite IDs

Gili Nachum-2
Hi, I'm also interested. When using composite IDs, the _route_
information is not kept on the document itself, so to me it looks like it's
not possible, as the split API
<https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api3>
doesn't have a relevant parameter to split correctly.
I can report back once I try it in practice.

On Mon, Nov 10, 2014 at 7:27 PM, Ian Rose <[hidden email]> wrote:

> [...]

Re: clarification regarding shard splitting and composite IDs

Anshum Gupta
In one line: shard splitting doesn't depend on the routing mechanism, only
on the hash range, so you could have documents for the same prefix split
up.

Here's an overview of routing in SolrCloud:
* Routing happens based on a hash value.
* The hash is calculated using the multiple parts of the routing key. In
the case of A!B, the most significant 16 bits of the hash are taken from
murmurhash(A) and the least significant 16 bits from murmurhash(B). This
sends the docs to the right shard.
* When querying using A!, all shards whose ranges contain hashes between
<upper 16 bits of murmurhash(A)>0000 and <upper 16 bits of
murmurhash(A)>ffff are used.
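
The hash composition in the bullets above can be sketched roughly as
follows. This is only an illustration, not Solr's code: zlib.crc32 is a
stand-in for MurmurHash3 (so the actual values will differ from Solr's),
hashes are treated as unsigned, and the names hash32, composite_hash and
prefix_range are made up for this example.

```python
import zlib


def hash32(text):
    """Stand-in 32-bit hash (CRC32). Solr uses MurmurHash3, so values differ."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def composite_hash(doc_id):
    """Upper 16 bits from the prefix hash, lower 16 bits from the suffix hash."""
    prefix, sep, suffix = doc_id.partition("!")
    if not sep:
        return hash32(doc_id)  # no routing prefix: hash the whole ID
    return (hash32(prefix) & 0xFFFF0000) | (hash32(suffix) & 0x0000FFFF)


def prefix_range(prefix):
    """Inclusive hash range that every ID starting with 'prefix!' falls into."""
    upper = hash32(prefix) & 0xFFFF0000
    return upper, upper | 0x0000FFFF


lo, hi = prefix_range("user1")
for doc_id in ("user1!event1", "user1!event2", "user1!event3"):
    h = composite_hash(doc_id)
    assert lo <= h <= hi  # every doc of this user lands in the same 64K block
    print(f"{doc_id}: {h:08x} (prefix range {lo:08x}-{hi:08x})")
```

A query for A! then only needs the shards whose ranges overlap the block
returned by prefix_range.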

When you split a shard, say one with the range 00000000 - ffffffff, it is
split down the middle (by default), and over multiple splits, docs for the
same A! prefix might end up on different shards, but the request routing
should take care of that.
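
To make "over multiple splits" concrete, here is a small sketch that counts
how many midpoint splits it takes before one prefix's 64K hash block
straddles two child shards. It assumes a plain midpoint split of an
inclusive unsigned range; the function names are hypothetical, and Solr's
own range partitioning may differ in detail.

```python
def split_in_middle(lo, hi):
    """Split the inclusive range [lo, hi] into two halves at its midpoint."""
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid + 1, hi)


def splits_until_prefix_divided(prefix_upper16, lo=0x00000000, hi=0xFFFFFFFF):
    """Count midpoint splits until the prefix's hash block spans two children.

    The block for the prefix must lie inside [lo, hi] to begin with.
    """
    block_lo = prefix_upper16 << 16
    block_hi = block_lo | 0xFFFF
    splits = 0
    while True:
        left, right = split_in_middle(lo, hi)
        splits += 1
        if block_lo <= left[1] < block_hi:
            return splits  # the block now straddles both children
        # Otherwise the whole block sits in one child; keep splitting that one.
        lo, hi = left if block_hi <= left[1] else right


# Starting from the full, 16-bit-aligned range, the first 16 splits keep the
# block intact and the 17th one divides it.
print(splits_until_prefix_divided(0x1234))  # prints 17

# A range that is not aligned (e.g. the first of three equal shards) can
# divide a prefix on the very first split.
print(splits_until_prefix_divided(0x2AAA, 0x00000000, 0x55555554))  # prints 1
```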

You can read more about routing here:
https://lucidworks.com/blog/solr-cloud-document-routing/
http://lucidworks.com/blog/multi-level-composite-id-routing-solrcloud/

and shard splitting here:
http://lucidworks.com/blog/shard-splitting-in-solrcloud/


On Wed, Feb 4, 2015 at 12:59 AM, Gili Nachum <[hidden email]> wrote:

> [...]



--
Anshum Gupta
http://about.me/anshumgupta

Re: clarification regarding shard splitting and composite IDs

Gili Nachum-2
Alright. So shard splitting and composite routing play nicely together.
Thank you Anshum.

On Wed, Feb 4, 2015 at 11:24 AM, Anshum Gupta <[hidden email]>
wrote:

> [...]

Re: clarification regarding shard splitting and composite IDs

Dan Davis-2
Doesn't relevancy for that assume that the IDF and TF for user1 and user2
are not too different? SolrCloud still doesn't use a distributed IDF,
correct?

On Wed, Feb 4, 2015 at 7:05 PM, Gili Nachum <[hidden email]> wrote:

> [...]

Re: clarification regarding shard splitting and composite IDs

Anshum Gupta
Solr 5.0 has support for distributed IDF. Also, users having the same IDF
is orthogonal to the original question.

In general, document frequency is computed per shard. If, for some reason,
a single user's documents are split across shards, the IDF used would
differ for docs on different shards.
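
As a rough illustration of the per-shard effect, the sketch below computes
IDF for the same term on two child shards. It assumes the classic Lucene
TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)); the shard names and
counts are invented for the example, and a different similarity would give
different numbers.

```python
import math


def classic_idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical per-shard statistics for one term: (docFreq, numDocs).
shard_stats = {
    "shard1_0": (50, 1000),
    "shard1_1": (5, 1000),
}

for shard, (doc_freq, num_docs) in shard_stats.items():
    print(f"{shard}: idf = {classic_idf(doc_freq, num_docs):.3f}")

# The term is rarer on shard1_1, so it gets a higher IDF there and matching
# docs from that shard score higher for the same term frequency.
assert classic_idf(5, 1000) > classic_idf(50, 1000)
```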

On Wed, Feb 4, 2015 at 9:06 PM, Dan Davis <[hidden email]> wrote:

> [...]




Re: clarification regarding shard splitting and composite IDs

Dan Davis-3
Thanks, Anshum - I should never have posted so late. It is true that
different users will have different word frequencies, but an application
exploiting that for better relevancy would be going to great lengths for
the relevancy of individual users' results.

On Thu, Feb 5, 2015 at 12:41 AM, Anshum Gupta <[hidden email]>
wrote:

> [...]