SolrCloud logical shards

Yonik Seeley-2-2
The shards parameter currently references physical shards.
There's also a concept of a logical shard (i.e. all physical shards
with identical index content share the same logical shard...
sometimes what I've also called a shard replica).
Should we use logical shard for this, or does anyone have any better ideas?

Related: it seems like we would want to enable querying of specific
logical shards (say if a user partitioned their shards by time or by
geographic region), so the terminology above could affect the
parameter we use for this.  Suggestions?  logicalshards=shard1,shard2?
lshards=shard1,shard2?  slice=shard1,shard2? It doesn't seem like it
would be easy to reuse the "shards" parameter for this since it refers
to physical shard addresses.

-Yonik
http://www.lucidimagination.com

Re: SolrCloud logical shards

Ted Dunning
"Logical shard" sounds good as "the collection of all identical physical
shards".

Another concept from Katta that is AFAIK missing from the Solr lexicon is
the distinction between node and shard.  In Katta, a node is a server worker
instance that contains and queries physical shards.  There is usually one
node per physical server, but not always.  In Katta an important performance
and reliability optimization is that nodes do not contain identical shard
sets.  That is, shards are assigned randomly even when replicated.  This
improves robustness, code simplicity and load balancing.
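
As a rough illustration of random placement with replication, here is a minimal Python sketch. It is not Katta's actual code: the assign_shards helper and the node and shard names are assumptions made up for this example.

```python
import random

def assign_shards(shards, nodes, replicas, seed=None):
    """Place each shard on `replicas` distinct, randomly chosen nodes.

    Hypothetical helper for illustration; not Katta's or Solr's API.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    rng = random.Random(seed)
    placement = {node: set() for node in nodes}
    for shard in shards:
        # random.sample never repeats a node, so the copies of one shard
        # always land on different nodes.
        for node in rng.sample(nodes, replicas):
            placement[node].add(shard)
    return placement

placement = assign_shards(
    shards=[f"shard{i}" for i in range(12)],
    nodes=["node1", "node2", "node3", "node4"],
    replicas=2,
    seed=42,
)
for node, held in sorted(placement.items()):
    print(node, sorted(held))
```

Because the nodes for each shard are drawn independently, two nodes are unlikely to hold exactly the same shard set, so the load of a failed node tends to spread over several survivors instead of falling on a single mirror.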

On Thu, Jan 14, 2010 at 9:08 AM, Yonik Seeley <[hidden email]> wrote:

> Should we use logical shard for this, or does anyone have any better ideas?




--
Ted Dunning, CTO
DeepDyve

Re: SolrCloud logical shards

Yonik Seeley-2-2
In reply to this post by Yonik Seeley-2-2
I'm actually starting to lean toward "slice" instead of "logical shard".
In the future we'll want to enable overlapping shards I think (due to
an Amazon Dynamo type of replication, or due to merging shards, etc),
and a separate word for a logical slice of the index seems desirable.

For instance, one could specify slice=1000-1999 (defined by the ids or
hashcodes of the ids) and that could end up querying multiple servers.
 For this first iteration, slices would just be opaque identifiers
though (and that functionality would always remain, allowing for user
partitioning by time or by geo region).

So "slice" would be logical, "shard" would be physical.
To get a full result, one needs to query all of the slices of an
index, but not necessarily all of the shards.
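
To make the slice/shard distinction concrete, here is a minimal Python sketch. The slice names, shard addresses and the pick_shards helper are hypothetical and do not correspond to any existing Solr API.

```python
import random

# Hypothetical cluster layout: each logical slice maps to the physical
# shards (replicas) that hold identical index content for that slice.
SLICE_TO_SHARDS = {
    "slice1": ["host1:8983/solr", "host2:8983/solr"],
    "slice2": ["host3:8983/solr", "host4:8983/solr"],
}

def pick_shards(slice_to_shards, rng=random):
    """Choose one physical shard per slice; together they cover the index."""
    return {name: rng.choice(shards) for name, shards in slice_to_shards.items()}

chosen = pick_shards(SLICE_TO_SHARDS)
# Every slice is represented, but only two of the four shards are queried.
print(chosen)
```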

-Yonik
http://www.lucidimagination.com



On Thu, Jan 14, 2010 at 12:08 PM, Yonik Seeley
<[hidden email]> wrote:

> The shards parameter currently references physical shards.
> There's also a concept of a logical shard (i.e. all physical shards
> with identical index content share the same logical shard...
> sometimes what I've also called a shard replica).
> Should we use logical shard for this, or does anyone have any better ideas?
>
> Related: it seems like we would want to enable querying of specific
> logical shards (say if a user partitioned their shards by time or by
> geographic region), so the terminology above could affect the
> parameter we use for this.  Suggestions?  logicalshards=shard1,shard2?
> lshards=shard1,shard2?  slice=shard1,shard2? It doesn't seem like it
> would be easy to reuse the "shards" parameter for this since it refers
> to physical shard addresses.
>
> -Yonik
> http://www.lucidimagination.com
>

Re: SolrCloud logical shards

Yonik Seeley-2-2
In reply to this post by Ted Dunning
On Thu, Jan 14, 2010 at 12:30 PM, Ted Dunning <[hidden email]> wrote:
> Another concept from Katta that is AFAIK missing from the Solr lexicon is
> the distinction between node and shard.  In Katta, a node is a server worker
> instance that contains and queries physical shards.

I think it's sort of missing because a single Solr core can only
support a single lucene index at this point, and we're starting with
low hanging fruit.

So it's still a bit up in the air if we're modeling a "node" as a
single JVM webapp, or as a single solr core.  I'd really like to not
model the core at all and go with node and shards... but I'm not sure
how well that abstraction will hold up with the reality of solr cores
that's here today.

The first iteration won't have automatic shard assignment at all I think.
It will just be centralized configuration and automatic load
balancing.  Just a start, but it will still make people's lives easier.
Baby steps...

-Yonik
http://www.lucidimagination.com

Re: SolrCloud logical shards

Yonik Seeley-2-2
In reply to this post by Yonik Seeley-2-2
On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley
<[hidden email]> wrote:
> I'm actually starting to lean toward "slice" instead of "logical shard".

I've gone with this for now and updated http://wiki.apache.org/solr/SolrCloud
but it's certainly not written in stone if people want to try and come
up with better naming...

-Yonik
http://www.lucidimagination.com

Re: SolrCloud logical shards

Ted Dunning
In reply to this post by Yonik Seeley-2-2
I think that most of these complications go away to a remarkable degree if
you use Katta-style random assignment of small shards.

The major simplifications there include:

- no need to move individual documents, nor to split or merge shards, no
need for search-server to search-server communications

- search servers do search and nothing else

- placement, balance, replication and query balancing policy are factored out
of all real-time paths

- real-time updates can be accommodated in the same framework with minimal
changes to the shard management layer

- the shard management is completely agnostic to the actual search
semantics.

On Thu, Jan 14, 2010 at 9:46 AM, Yonik Seeley <[hidden email]> wrote:

> I'm actually starting to lean toward "slice" instead of "logical shard".
> In the future we'll want to enable overlapping shards I think (due to
> an Amazon Dynamo type of replication, or due to merging shards, etc),
> and a separate word for a logical slice of the index seems desirable.
>



--
Ted Dunning, CTO
DeepDyve

Re: SolrCloud logical shards

hossman
In reply to this post by Yonik Seeley-2-2

: parameter we use for this.  Suggestions?  logicalshards=shard1,shard2?
: lshards=shard1,shard2?  slice=shard1,shard2? It doesn't seem like it
: would be easy to reuse the "shards" parameter for this since it refers
: to physical shard addresses.

I haven't been following the SolrCloud stuff much, but from a client
perspective is there really any difference between asking for a physical
shard, vs asking for a logical shard (or slice name)? ... shouldn't the
latter case just result in a resolution from logical->physical w/o
requiring the client code to know/care whether the String they have is a
physical shard URL, or a slice name?

This seems completely analogous to hostnames:
- I'm an application.
- via some means, I've got a (String) $host
- I ask my networking library to open a connection to $host
- the networking library worries about whether $host is a name or an IP
- if $host is an alias, the DNS server resolves it to a hostname
- if $host is a hostname, the DNS server resolves it to an IP (possibly
round robin)

Likewise in Solr:
- I'm an application.
- via some means, I've got a (Set<String>) $shards
- I ask Solr to search across $shards
- Solr looks at each item in $shards
  - if it's the name of a slice, it picks a physical shard
  - if it's a physical shard, it uses that shard


...there's got to be a mapping from slice_name=>Set(physical_shards)
anyway right? why should the client have to know the difference?
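
The lookup described above might look like the following Python sketch. The SLICES mapping and the resolve function are assumptions for illustration only; nothing here is an existing Solr API.

```python
import random

# Hypothetical mapping from slice name to its physical shard addresses.
SLICES = {
    "slice1": ["host1:8983/solr", "host2:8983/solr"],
    "slice2": ["host3:8983/solr"],
}

def resolve(shards, slices=SLICES, rng=random):
    """Turn a mixed list of slice names and physical addresses into addresses."""
    resolved = []
    for item in shards:
        if item in slices:
            # A slice name: pick any one of its physical shards.
            resolved.append(rng.choice(slices[item]))
        else:
            # Anything else is assumed to already be a physical shard.
            resolved.append(item)
    return resolved

print(resolve(["slice1", "host9:8983/solr"]))
```

The client passes the same kind of string in both cases; only the server-side table knows which names are slices.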

-Hoss


Re: SolrCloud logical shards

Yonik Seeley-2-2
On Thu, Jan 14, 2010 at 1:58 PM, Chris Hostetter
<[hidden email]> wrote:

> : parameter we use for this.  Suggestions?  logicalshards=shard1,shard2?
> : lshards=shard1,shard2?  slice=shard1,shard2? It doesn't seem like it
> : would be easy to reuse the "shards" parameter for this since it refers
> : to physical shard addresses.
>
> I haven't been following the SolrCloud stuff much, but from a client
> perspective is there really any difference between asking for a physical
> shard, vs asking for a logical shard (or slice name)? ... shouldn't the
> latter case just result in a resolution from logical->physical w/o
> requiring the client code to know/care whether the String they have is a
> physical shard URL, or a slice name.

That might be doable... but we would need to be able to tell the difference.
Perhaps we could always require a slash in a physical address
(localhost/context) and prohibit it in slice names?
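
A minimal Python sketch of that convention, assuming the slash rule were adopted as stated; the function names are hypothetical:

```python
def is_physical_shard(name):
    """Apply the proposed convention: a slash marks a physical address."""
    return "/" in name

def split_shards_param(value):
    """Split a comma-separated shards value into (physical, slices)."""
    items = [s.strip() for s in value.split(",") if s.strip()]
    physical = [s for s in items if is_physical_shard(s)]
    slices = [s for s in items if not is_physical_shard(s)]
    return physical, slices

print(split_shards_param("localhost:8983/solr,slice1,slice2"))
```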

But... I think there's still a potentially bigger difference: today,
if shards is set, it means it's a distributed search (and shards is
removed for sub-requests).  But the slice of the index being requested
may not have a one-to-one mapping with a full request on a solr core.
And shards may be able to move around, and so it seems important to be
able to declare what part of the index you're looking for when you're
querying a shard.

-Yonik
http://www.lucidimagination.com

Re: SolrCloud logical shards

Yonik Seeley-2-2
In reply to this post by Yonik Seeley-2-2
On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley
<[hidden email]> wrote:
> On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley
> <[hidden email]> wrote:
>> I'm actually starting to lean toward "slice" instead of "logical shard".

Alternate terminology could be "index" for the actual physical lucene
index (and also enough of the URL that unambiguously identifies it),
and then "shard" could be the logical entity.

But I've kind of gotten used to thinking of shards as the actual
physical queryable things...

-Yonik
http://www.lucidimagination.com

Re: SolrCloud logical shards

Ted Dunning
I have found that users of the system like to use index as the composite of
all nodes/shards/slices that is searched in response to a query.  It is the
ultimate logical entity.   Really, this is the same abstraction that users
of Lucene have.  They really don't want to care that a Lucene index is made
up of several files and even possibly several indexes in various states of
merging.  The same should really be true of a parallel system, but more so.


On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley
<[hidden email]> wrote:

> On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley
> <[hidden email]> wrote:
> > On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley
> > <[hidden email]> wrote:
> >> I'm actually starting to lean toward "slice" instead of "logical shard".
>
> Alternate terminology could be "index" for the actual physical lucene
> index (and also enough of the URL that unambiguously identifies it),
> and then "shard" could be the logical entity.
>
> But I've kind of gotten used to thinking of shards as the actual
> physical queryable things...
>
> -Yonik
> http://www.lucidimagination.com
>



--
Ted Dunning, CTO
DeepDyve

Re: SolrCloud logical shards

Jason Rutherglen
In reply to this post by Yonik Seeley-2-2
> But I've kind of gotten used to thinking of shards as the
> actual physical queryable things...

I think a mistake was made referring to Solr cores as shards.
It's the same thing with 2 different names. Slices adds yet
another name which seems to imply the same thing yet again. I'd
rather see disambiguation here, and call them cores (partially
because that's what's in the code and on the wiki), and cores
only. It's a Solr specific term, it's going to be confused with
microprocessor cores, but at least there's only one name, which
as search people, we know creates fewer posting lists :).

Logical groupings of cores can occur, which can be aptly named
core groups. This way I can submit a query to a core group, and
it's reasonable to assume I'm hitting N cores. Further, cores
could point to a logical or physical entity via a URL. (As a
side note, I've always found it odd that the shards param to
RequestHandler lacks the protocol, what if I want to use HTTPS
for example?).

So there could be http://host/solr/core1 (physical),
core://megacorename (logical),
coregroup://supergreatcoregroupname (a group of cores) in the
"shards" parameter (whose name can perhaps be changed for
clarity in a future release). Then people can mix and match and
we won't have many different XML elements floating around. We'd
have a simple list of URLs that are transposed into a real
physical network request.
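
A small Python sketch of how such a mixed list could be told apart by scheme. The core:// and coregroup:// schemes exist only as a proposal in this thread, and the classify function and its labels are made up for the example.

```python
from urllib.parse import urlsplit

def classify(url):
    """Classify one entry using the URL schemes proposed above.

    The core:// and coregroup:// schemes are only a proposal in this
    thread; the returned labels are made up for this example.
    """
    parts = urlsplit(url)
    if parts.scheme in ("http", "https"):
        return ("physical", url)
    if parts.scheme == "core":
        return ("logical core", parts.netloc)
    if parts.scheme == "coregroup":
        return ("core group", parts.netloc)
    raise ValueError(f"unrecognized scheme in {url!r}")

shards = "http://host/solr/core1,core://megacorename,coregroup://supergreatcoregroupname"
for entry in shards.split(","):
    print(classify(entry))
```

Carrying the scheme in each entry also answers the HTTPS question: an https:// entry is simply another physical address.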


On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley
<[hidden email]> wrote:

> On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley
> <[hidden email]> wrote:
>> On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley
>> <[hidden email]> wrote:
>>> I'm actually starting to lean toward "slice" instead of "logical shard".
>
> Alternate terminology could be "index" for the actual physical lucene
> index (and also enough of the URL that unambiguously identifies it),
> and then "shard" could be the logical entity.
>
> But I've kind of gotten used to thinking of shards as the actual
> physical queryable things...
>
> -Yonik
> http://www.lucidimagination.com
>

Re: SolrCloud logical shards

Uri Boness
Although Jason has some valid points here, I'm with Yonik here. I do
believe that we've gotten used to the terms "core" to represent a single
index and "shard" to be represented by a single core. A "node" seems to
indicate a machine or a JVM. Changing any of these (informal perhaps)
definitions will only cause confusion. That's why I think a "slice" is a
good solution now... first it's a new term to a new view of the index
(logical shards AFAIK don't really exist yet) so people won't need to
get used to it, but it's also descriptive and intuitive. I do like
Jason's idea about having a protocol attached to the URLs.

Cheers,
Uri


Re: SolrCloud logical shards

Jason Rutherglen
Uri,

> "core" to represent a single index and "shard" to be
> represented by a single core

Can you elaborate on what you mean, isn't a core a single index
too? It seems like shard was used to represent a remote index
(perhaps?). Though here I'd prefer "remote core", because to the
uninitiated Solr outsider it's immediately obvious (i.e. they
need only know what a core is, in the Solr glossary or term
dictionary).

In Google vernacular, which is where the name shard came from, a
"shard" is basically a local sub-index
http://research.google.com/archive/googlecluster.html where
there would be many "shards" per server. However that's a
digression at this point.

I personally prefer relatively straightforward names, that are
self-evident, rather than inventing new language for fairly
simple concepts. Slice, even though it comes from our buddy
Yonik, probably doesn't make any immediate sense to external
users when compared with the word shard. Of course software
projects have a tendency to create their own words to somewhat
mystify users into believing in some sort of magic occurring
underneath. If that's what we're after, it's cool, I mean that
makes sense. And I don't mean to be derogatory here; however, this
is an open source project created in part to educate users on
search and to be made as easily accessible as possible, to the
greatest number of users possible. I think Doug did a great job
of this when Lucene started with amazingly succinct code for
fairly complex concepts (e.g., anti-mystification of search).

Jason

On Thu, Jan 14, 2010 at 2:58 PM, Uri Boness <[hidden email]> wrote:

> Although Jason has some valid points here, I'm with Yonik here. I do believe
> that we've gotten used to the terms "core" to represent a single index and
> "shard" to be represented by a single core. A "node" seems to indicate a
> machine or a JVM. Changing any of these (informal perhaps) definitions will
> only cause confusion. That's why I think a "slice" is a good solution now...
> first it's a new term to a new view of the index (logical shards AFAIK don't
> really exist yet) so people won't need to get used to it, but it's also
> descriptive and intuitive. I do like Jason's idea about having a protocol
> attached to the URLs.
>
> Cheers,
> Uri

Re: SolrCloud logical shards

Ted Dunning
Shard has the interesting additional implication that it is part of a
composite index made up of many sub-indexes.

A lucene index could be a complete index or a shard.  I would presume the
same of what might be called a core.

On Thu, Jan 14, 2010 at 3:21 PM, Jason Rutherglen <
[hidden email]> wrote:

> Uri,
>
> > "core" to represent a single index and "shard" to be
> > represented by a single core
>
> Can you elaborate on what you mean, isn't a core a single index
> too? It seems like shard was used to represent a remote index
> (perhaps?). Though here I'd prefer "remote core", because to the
> uninitiated Solr outsider it's immediately obvious (i.e. they
> need only know what a core is, in the Solr glossary or term
> dictionary).
>
> In Google vernacular, which is where the name shard came from, a
> "shard" is basically a local sub-index
> http://research.google.com/archive/googlecluster.html where
> there would be many "shards" per server. However that's a
> digression at this point.
>
> I personally prefer relatively straightforward names, that are
> self-evident, rather than inventing new language for fairly
> simple concepts. Slice, even though it comes from our buddy
> Yonik, probably doesn't make any immediate sense to external
> users when compared with the word shard. Of course software
> projects have a tendency to create their own words to somewhat
> mystify users into believing in some sort of magic occurring
> underneath. If that's what we're after, it's cool, I mean that
> makes sense. And I don't mean to be derogatory here; however, this
> is an open source project created in part to educate users on
> search and to be made as easily accessible as possible, to the
> greatest number of users possible. I think Doug did a great job
> of this when Lucene started with amazingly succinct code for
> fairly complex concepts (e.g., anti-mystification of search).
>
> Jason



--
Ted Dunning, CTO
DeepDyve

Re: SolrCloud logical shards

Lance Norskog-2
Yonik spake-
    I'm actually starting to lean toward "slice" instead of "logical shard".
    In the future we'll want to enable overlapping shards I think (due to
    an Amazon Dynamo type of replication, or due to merging shards, etc),
    and a separate word for a logical slice of the index seems desirable.

    For instance, one could specify slice=1000-1999 (defined by the ids or
    hashcodes of the ids) and that could end up querying multiple servers.
    For this first iteration, slices would just be opaque identifiers
    though (and that functionality would always remain, allowing for user
    partitioning by time or by geo region).

+1

Logical-to-physical mapping should not assume that the logical has an
integral number of the physical. Overlapping and partial physical
shards should be addressable as a logical shard. If you're going to do
something this major, do it right.
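
One way to picture a mapping that does not assume whole shards per slice is a range lookup, sketched below in Python. The shard addresses, the hash ranges and the shards_overlapping helper are all hypothetical.

```python
# Hypothetical physical shards, each covering an inclusive range of id hashes.
# The ranges overlap on purpose and do not line up with the requested slice.
SHARD_RANGES = {
    "host1:8983/solr": (0, 1499),
    "host2:8983/solr": (1200, 2499),
    "host3:8983/solr": (2500, 3999),
}

def shards_overlapping(lo, hi, shard_ranges=SHARD_RANGES):
    """Return the shards whose range intersects the inclusive range [lo, hi]."""
    return sorted(
        shard
        for shard, (s_lo, s_hi) in shard_ranges.items()
        if s_lo <= hi and lo <= s_hi
    )

# slice=1000-1999 touches parts of two shards.
print(shards_overlapping(1000, 1999))
```

This only finds candidate shards. Restricting each shard's results to the requested range and de-duplicating documents that live in the overlap are left out of the sketch.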

On Thu, Jan 14, 2010 at 3:29 PM, Ted Dunning <[hidden email]> wrote:

> Shard has the interesting additional implication that it is part of a
> composite index made up of many sub-indexes.
>
> A lucene index could be a complete index or a shard.  I would presume the
> same of what might be called a core.
>
> On Thu, Jan 14, 2010 at 3:21 PM, Jason Rutherglen <
> [hidden email]> wrote:
>
>> Uri,
>>
>> > "core" to represent a single index and "shard" to be
>> > represented by a single core
>>
>> Can you elaborate on what you mean, isn't a core a single index
>> too? It seems like shard was used to represent a remote index
>> (perhaps?). Though here I'd prefer "remote core", because to the
>> uninitiated Solr outsider it's immediately obvious (i.e. they
>> need only know what a core is, in the Solr glossary or term
>> dictionary).
>>
>> In Google vernacular, which is where the name shard came from, a
>> "shard" is basically a local sub-index
>> http://research.google.com/archive/googlecluster.html where
>> there would be many "shards" per server. However that's a
>> digression at this point.
>>
>> I personally prefer relatively straightforward names, that are
>> self-evident, rather than inventing new language for fairly
>> simple concepts. Slice, even though it comes from our buddy
>> Yonik, probably doesn't make any immediate sense to external
>> users when compared with the word shard. Of course software
>> projects have a tendency to create their own words to somewhat
>> mystify users into believing in some sort of magic occurring
>> underneath. If that's what we're after, it's cool, I mean that
>> makes sense. And I don't mean to be derogatory here however this
>> is an open source project created in part to educate users on
>> search and be made easily accessible as possible, to the
>> greatest number of users possible. I think Doug did a create job
>> of this when Lucene started with amazingly succinct code for
>> fairly complex concepts (eg, anti-mystification of search).
>>
>> Jason
>>
>> On Thu, Jan 14, 2010 at 2:58 PM, Uri Boness <[hidden email]> wrote:
>> > Although Jason has some valid points here, I'm with Yonik here. I do
>> believe
>> > that we've gotten used to the terms "core" to represent a single index
>> and
>> > "shard" to be represented by a single core. A "node" seems to indicate a
>> > machine or a JVM. Changing any of these (informal perhaps) definitions
>> will
>> > only cause confusion. That's why I think a "slice" is a good solution
>> now...
>> > first it's a new term to a new view of the index (logical shard AFAIK
>> don't
>> > really exists yet) so people won't need to get used to it, but it's also
>> > descriptive and intuitive. I do like Jason's idea about having a protocol
>> > attached to the URL's.
>> >
>> > Cheers,
>> > Uri
>> >
>> > Jason Rutherglen wrote:
>> >>>
>> >>> But I've kind of gotten used to thinking of shards as the
>> >>> actual physical queryable things...
>> >>>
>> >>
>> >> I think a mistake was made referring to Solr cores as shards.
>> >> It's the same thing with 2 different names. Slices adds yet
>> >> another name which seems to imply the same thing yet again. I'd
>> >> rather see disambiguation here, and call them cores (partially
>> >> because that's what's in the code and on the wiki), and cores
>> >> only. It's a Solr specific term, it's going to be confused with
>> >> microprocessor cores, but at least there's only one name, which
>> >> as search people, we know creates fewer posting lists :).
>> >>
>> >> Logical groupings of cores can occur, which can be aptly named
>> >> core groups. This way I can submit a query to a core group, and
>> >> it's reasonable to assume I'm hitting N cores. Further, cores
>> >> could point to a logical or physical entity via a URL. (As a
>> >> side note, I've always found it odd that the shards param to
>> >> RequestHandler lacks the protocol, what if I want to use HTTPS
>> >> for example?).
>> >>
>> >> So there could be http://host/solr/core1 (physical),
>> >> core://megacorename (logical),
>> >> coregroup://supergreatcoregroupname (a group of cores) in the
>> >> "shards" parameter (whose name can perhaps be changed for
>> >> clarity in a future release). Then people can mix and match and
>> >> we won't have many different XML elements floating around. We'd
>> >> have a simple list of URLs that are transposed into a real
>> >> physical network request.
>> >>
>> >>
>> >> On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley
>> >> <[hidden email]> wrote:
>> >>
>> >>>
>> >>> On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley
>> >>> <[hidden email]> wrote:
>> >>>
>> >>>>
>> >>>> On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley
>> >>>> <[hidden email]> wrote:
>> >>>>
>> >>>>>
>> >>>>> I'm actually starting to lean toward "slice" instead of "logical
>> >>>>> shard".
>> >>>>>
>> >>>
>> >>> Alternate terminology could be "index" for the actual physical lucene
>> >>> index (and also enough of the URL that unambiguously identifies it),
>> >>> and then "shard" could be the logical entity.
>> >>>
>> >>> But I've kind of gotten used to thinking of shards as the actual
>> >>> physical queryable things...
>> >>>
>> >>> -Yonik
>> >>> http://www.lucidimagination.com
>> >>>
>> >>>
>> >>
>> >>
>> >
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>



--
Lance Norskog
[hidden email]

Re: SolrCloud logical shards

Ted Dunning
My definition of "right" is simple and modular, with minimal conceptual
upheaval.

As such, simply giving Solr a good shard manager that broadcasts queries
without respect to content is preferable to something very
fancy.

On Thu, Jan 14, 2010 at 4:31 PM, Lance Norskog <[hidden email]> wrote:

> Logical-to-physical mapping should not assume that the logical has an
> integral number of the physical. Overlapping and partial physical
> shards should be addressable as a logical shard. If you're going to do
> something this major, do it right.
>



--
Ted Dunning, CTO
DeepDyve

Re: SolrCloud logical shards

Uri Boness
In reply to this post by Jason Rutherglen
>
> Can you elaborate on what you mean, isn't a core a single index
> too? It seems like shard was used to represent a remote index
> (perhaps?).
Yes, a core is a single index, and a shard is a conceptual idea which at
the moment concretely refers to a remote core (but not a specific one, as
the same shard can be represented by multiple core replicas). The point
I was trying to make is that I believe that if you start changing
terminologies now people will be very confused. And I thought of
sticking to Yonik's suggestion of a "slice" just to prevent this
confusion. On the other hand, one can argue that the terminology as it is
today is already confusing... and if you really want to get it right and
be aligned with the "rest of the world" (if there is such a thing...
from what I've seen so far, sharding is used differently in different
contexts), then perhaps a good time for making such terminology
changes is a major release (Solr 2.0?), as with such a release people
tend to be more open to new or changed concepts.

Cheers,
Uri

Jason Rutherglen wrote:

> Uri,
>
>  
>> "core" to represent a single index and "shard" to be
>> represented by a single core
>>    
>
> Can you elaborate on what you mean, isn't a core a single index
> too? It seems like shard was used to represent a remote index
> (perhaps?). Though here I'd prefer "remote core", because to the
> uninitiated Solr outsider it's immediately obvious (i.e. they
> need only know what a core is, in the Solr glossary or term
> dictionary).
>
> In Google vernacular, which is where the name shard came from, a
> "shard" is basically a local sub-index
> http://research.google.com/archive/googlecluster.html where
> there would be many "shards" per server. However that's a
> digression at this point.
>
> I personally prefer relatively straightforward names, that are
> self-evident, rather than inventing new language for fairly
> simple concepts. Slice, even though it comes from our buddy
> Yonik, probably doesn't make any immediate sense to external
> users when compared with the word shard. Of course software
> projects have a tendency to create their own words to somewhat
> mystify users into believing in some sort of magic occurring
> underneath. If that's what we're after, it's cool, I mean that
> makes sense. I don't mean to be derogatory here; however, this
> is an open source project created in part to educate users on
> search and to be as easily accessible as possible to the
> greatest number of users possible. I think Doug did a great job
> of this when Lucene started with amazingly succinct code for
> fairly complex concepts (e.g., anti-mystification of search).
>
> Jason
>
> On Thu, Jan 14, 2010 at 2:58 PM, Uri Boness <[hidden email]> wrote:
>  
>> Although Jason has some valid points here, I'm with Yonik here. I do believe
>> that we've gotten used to the terms "core" to represent a single index and
>> "shard" to be represented by a single core. A "node" seems to indicate a
>> machine or a JVM. Changing any of these (informal perhaps) definitions will
>> only cause confusion. That's why I think a "slice" is a good solution now...
>> first it's a new term for a new view of the index (logical shards AFAIK don't
>> really exist yet) so people won't need to get used to it, but it's also
>> descriptive and intuitive. I do like Jason's idea about having a protocol
>> attached to the URL's.
>>
>> Cheers,
>> Uri
>>
>> Jason Rutherglen wrote:
>>    
>>>> But I've kind of gotten used to thinking of shards as the
>>>> actual physical queryable things...
>>>>
>>>>        
>>> I think a mistake was made referring to Solr cores as shards.
>>> It's the same thing with 2 different names. Slices adds yet
>>> another name which seems to imply the same thing yet again. I'd
>>> rather see disambiguation here, and call them cores (partially
>>> because that's what's in the code and on the wiki), and cores
>>> only. It's a Solr specific term, it's going to be confused with
>>> microprocessor cores, but at least there's only one name, which
>>> as search people, we know creates fewer posting lists :).
>>>
>>> Logical groupings of cores can occur, which can be aptly named
>>> core groups. This way I can submit a query to a core group, and
>>> it's reasonable to assume I'm hitting N cores. Further, cores
>>> could point to a logical or physical entity via a URL. (As a
>>> side note, I've always found it odd that the shards param to
>>> RequestHandler lacks the protocol, what if I want to use HTTPS
>>> for example?).
>>>
>>> So there could be http://host/solr/core1 (physical),
>>> core://megacorename (logical),
>>> coregroup://supergreatcoregroupname (a group of cores) in the
>>> "shards" parameter (whose name can perhaps be changed for
>>> clarity in a future release). Then people can mix and match and
>>> we won't have many different XML elements floating around. We'd
>>> have a simple list of URLs that are transposed into a real
>>> physical network request.
>>>
>>>
>>> On Thu, Jan 14, 2010 at 12:56 PM, Yonik Seeley
>>> <[hidden email]> wrote:
>>>
>>>      
>>>> On Thu, Jan 14, 2010 at 1:38 PM, Yonik Seeley
>>>> <[hidden email]> wrote:
>>>>
>>>>        
>>>>> On Thu, Jan 14, 2010 at 12:46 PM, Yonik Seeley
>>>>> <[hidden email]> wrote:
>>>>>
>>>>>          
>>>>>> I'm actually starting to lean toward "slice" instead of "logical
>>>>>> shard".
>>>>>>
>>>>>>            
>>>> Alternate terminology could be "index" for the actual physical lucene
>>>> index (and also enough of the URL that unambiguously identifies it),
>>>> and then "shard" could be the logical entity.
>>>>
>>>> But I've kind of gotten used to thinking of shards as the actual
>>>> physical queryable things...
>>>>
>>>> -Yonik
>>>> http://www.lucidimagination.com
>>>>
>>>>
>>>>        
>>>      
>
>  

Re: SolrCloud logical shards

Yonik Seeley-2-2
In reply to this post by Ted Dunning
On Thu, Jan 14, 2010 at 1:38 PM, Ted Dunning <[hidden email]> wrote:
> I think that most of these complications go away to a remarkable degree if
> you combine katta style random assignment of small shards.
>
> The major simplifications there include:
>
> - no need to move individual documents, nor to split or merge shards, no
> need for search-server to search-server communications

Yeah, keeping shards smaller allows cluster growth (to some degree)
w/o getting into shard splitting.
Until a single core can handle multiple shards though, this isn't too practical.

While I think we should eventually support this model, I don't think
we want to limit ourselves to it.
The idea is to also support the type of cluster architectures that
people have today.  And yes, I think that does cause complications :-)

-Yonik
http://www.lucidimagination.com

Re: SolrCloud logical shards

Jason Rutherglen
In reply to this post by Uri Boness
> The point I was trying to make is that I believe that if you start changing terminologies now people will be very confused

So shard -> remote core... Slice -> core group.  Though semantically
they're synonyms.  In any case, I need to spend some time looking at
the cloud branch, and less time jibber-jabberin' about it.

On Fri, Jan 15, 2010 at 1:24 AM, Uri Boness <[hidden email]> wrote:

>>
>> Can you elaborate on what you mean, isn't a core a single index
>> too? It seems like shard was used to represent a remote index
>> (perhaps?).
>
> Yes, a core is a single index, and a shard is a conceptual idea which at the
> moment concretely refers to a remote core (but not a specific one, as the
> same shard can be represented by multiple core replicas). The point I was
> trying to make is that I believe that if you start changing terminologies
> now people will be very confused. And I thought of sticking to Yonik's
> suggestion of a "slice" just to prevent this confusion. On the other hand,
> one can argue that the terminology as it is today is already confusing...
> and if you really want to get it right and be aligned with the "rest of the
> world" (if there is such a thing... from what I've seen so far, sharding is
> used differently in different contexts), then perhaps a good time for
> making such terminology changes is a major release (Solr 2.0?), as with
> such a release people tend to be more open to new or changed concepts.
>
> Cheers,
> Uri

Re: SolrCloud logical shards

Yonik Seeley-2-2
In reply to this post by Yonik Seeley-2-2
On Thu, Jan 14, 2010 at 2:43 PM, Yonik Seeley
<[hidden email]> wrote:

> On Thu, Jan 14, 2010 at 1:58 PM, Chris Hostetter
> <[hidden email]> wrote:
>> : parameter we use for this.  Suggestions?  logicalshards=shard1,shard2?
>> : lshards=shard1,shard2?  slice=shard1,shard2? It doesn't seem like it
>> : would be easy to reuse the "shards" parameter for this since it refers
>> : to physical shard addresses.
>>
>> I haven't been following the SolrCloud stuff much, but from a client
>> perspective is there really any difference between asking for a physical
>> shard, vs asking for a logical shard (or slice name)? ... shouldn't the
>> latter case just result in a resolution from logical->physical w/o
>> requiring the client code to know/care whether the String they have is a
>> physical shard URL, or a slice name?
>
> That might be doable... but we would need to be able to tell the difference.
> Perhaps we could always require a slash in a physical address
> (localhost/context) and prohibit it in slice names?
>
> But... I think there's still a potentially bigger difference: today,
> if shards is set, it means it's a distributed search (and shards is
> removed for sub-requests).  But the slice of the index being requested
> may not have a one-to-one mapping with a full request on a solr core.
> And shards may be able to move around, and so it seems important to be
> able to declare what part of the index you're looking for when you're
> querying a shard.
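
To make the slash convention above concrete, here is a rough sketch (not
anything in the cloud branch) of how entries in the shards param could be
told apart if physical addresses always contain a slash and slice names
never do. The function names are hypothetical and exist only for this
illustration.

```python
def classify_shard_entry(entry: str) -> str:
    """Classify one entry of the shards param.

    Assumed convention (from the discussion above, not existing Solr
    behavior): a physical address always contains a slash, for example
    localhost:8983/solr, and a slice name never does.
    """
    entry = entry.strip()
    if not entry:
        raise ValueError("empty entry in shards param")
    return "physical" if "/" in entry else "slice"


def split_shards_param(shards: str) -> dict:
    """Group the comma-separated shards value by entry kind."""
    grouped = {"physical": [], "slice": []}
    for entry in shards.split(","):
        grouped[classify_shard_entry(entry)].append(entry.strip())
    return grouped
```

With this convention a mixed value such as
localhost:8983/solr,shard_200911 would yield one physical address and one
slice name, so a client would not need to say which kind it is passing.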

If we want to go this route for parameters (allowing use of both
physical or logical shards in the shards param), I've updated the wiki
with one way to do it:

"""
The presence of "shards" is what currently signals that a request is
distributed, and distrib search removes this param for sub-requests.
But with future micro-sharding or having a single core support
multiple shards, the request will need to contain what shards are
being requested. Reusing "shards" for this (per Hoss' suggestion) by
allowing either physical urls or logical shards (slices) would require
that either

    * a) The search component detect when it has all of the shards
requested, and turn it into a non-distributed request (any error here
could easily result in an infinite request loop until deadlock). It
seems better to return a specific error if this node no longer
contains the shard being queried in a non-distrib search.
    * b) Use a different distrib=true flag to indicate if this is a
distributed search. This isn't back compatible though? Unless we also
consider any request where shards contains a url to be distributed.

http://localhost:8983/solr/collection1/select?shards=shard_200911,shard_200912,shard_201001&distrib=true

If we adopt "distrib=true" then it should replace "shards=auto" in the
other example URLs
"""

So the top-level distributed request shown above would resolve to
potentially multiple sub-requests of the form
http://localhost:1234/solr/collection1/select?shards=shard_200911
(note, distrib=true has been removed)
http://localhost:1235/solr/collection1/select?shards=shard_200912
http://localhost:1236/solr/collection1/select?shards=shard_201001
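
A small sketch of that fan-out, only to illustrate the parameter handling:
the slice-to-node table and the function name are made up for the example,
and in a real cluster the mapping would presumably come from the cluster
state rather than a hard-coded dict.

```python
from urllib.parse import urlencode

# Hypothetical slice -> node table, matching the example URLs above.
SLICE_LOCATIONS = {
    "shard_200911": "localhost:1234",
    "shard_200912": "localhost:1235",
    "shard_201001": "localhost:1236",
}


def build_sub_requests(params, collection="collection1", locations=SLICE_LOCATIONS):
    """Turn a top-level distrib=true request into one URL per slice.

    Each sub-request keeps the other params, names exactly one slice in
    "shards", and drops "distrib" so the receiving node does not fan out
    again.
    """
    if params.get("distrib") != "true":
        raise ValueError("not a distributed request")
    slices = [s.strip() for s in params.get("shards", "").split(",") if s.strip()]
    passthrough = {k: v for k, v in params.items() if k not in ("shards", "distrib")}
    urls = []
    for name in slices:
        if name not in locations:
            # Option (a) in the wiki text: fail with a specific error
            # instead of guessing where the slice lives.
            raise KeyError("no known location for slice " + name)
        query = urlencode({**passthrough, "shards": name})
        urls.append("http://%s/solr/%s/select?%s" % (locations[name], collection, query))
    return urls
```

Calling it with shards=shard_200911,shard_200912,shard_201001 and
distrib=true produces the three sub-request URLs listed above.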

-Yonik
http://www.lucidimagination.com