My Plan to Scale Solr

Bing Li
Dear all,

I started to learn how to use Solr three months ago. My experience is
still limited.

Now I crawl Web pages with my crawler and send the data to a single Solr
server. It runs fine.

Since the number of potential users is large, I have decided to scale Solr.
After configuring replication, a single index can be replicated to multiple
servers.

I think sharding is also required. I plan to split the index
according to data categories and priorities. After that, I will apply the
replication technique above to each shard to get high performance. The
remaining work does not look too difficult.

I noticed some new terms, such as SolrCloud, Katta and ZooKeeper. According
to my current understanding, it seems that I can ignore them. Am I right?
What benefits would I get from using them?

Thanks so much!
LB
Re: My Plan to Scale Solr

Markus Jelsma-2
Hi Bing Li,

On Thursday 17 February 2011 10:32:11 Bing Li wrote:

> Dear all,
>
> I started to learn how to use Solr three months ago. My experiences are
> still limited.
>
> Now I crawl Web pages with my crawler and send the data to a single Solr
> server. It runs fine.
>
> Since the potential users are large, I decide to scale Solr. After
> configuring replication, a single index can be replicated to multiple
> servers.
>
> For shards, I think it is also required. I attempt to split the index
> according to the data categories and priorities. After that, I will use the
> above replication techniques and get high performance. The following work
> is not so difficult.

If you want good relevancy, it's better to use a hashing scheme that
consistently decides which server takes which document, rather than splitting
by category (an even spread keeps term statistics similar across shards).
Taking a hash of the unique key modulo the number of servers gives the shard
for each document. If your unique keys are integers, a plain modulo will
suffice.
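
A minimal sketch of the modulo approach described above, in Python. The function name shard_for and the use of CRC32 for non-integer keys are assumptions for illustration; neither is something Solr provides, and the indexing client would have to call this itself before sending each document to a server.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Return the 0-based index of the shard that should hold doc_id."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if isinstance(doc_id, int):
        # Integer unique keys: a plain modulo is enough.
        return doc_id % num_shards
    # Other keys (e.g. URLs): hash to an integer first. CRC32 is stable
    # across processes, unlike Python's built-in hash() for strings.
    return zlib.crc32(str(doc_id).encode("utf-8")) % num_shards


if __name__ == "__main__":
    for key in (10, 11, "http://example.com/page.html"):
        print(key, "->", shard_for(key, 4))
```

Note that with a plain modulo, changing the number of servers moves most documents to a different shard, so a reindex is likely needed when shards are added.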

>
> I noticed some new terms, such as SolrCloud, Katta and ZooKeeper.
> According to my current understandings, it seems that I can ignore them.
> Am I right? What benefits can I get if using them?

SolrCloud comes with ZooKeeper. It's designed to provide a fail-over cluster
and other useful features. I haven't tried Katta.

>
> Thanks so much!
> LB

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
Re: My Plan to Scale Solr

Stijn Vanhoorelbeke
In reply to this post by Bing Li
Hi,

I'm currently looking at SolrCloud. I've managed to set up a scalable
cluster with ZooKeeper.
(See the examples at http://wiki.apache.org/solr/SolrCloud for a quick
overview.)
This way, all the shards and replicas are recorded in one centralised
configuration.

Moreover, the ZooKeeper setup provides load balancing out of the box.
So, let's say you have 2 different shards and each is replicated 2 times.
Your ZooKeeper config will look like this:

\config
 ...
     /live_nodes (v=6 children=4)
          lP_Port:7500_solr (ephemeral v=0)
          lP_Port:7574_solr (ephemeral v=0)
          lP_Port:8900_solr (ephemeral v=0)
          lP_Port:8983_solr (ephemeral v=0)
     /collections (v=20 children=1)
          collection1 (v=0 children=1) "configName=myconf"
               shards (v=0 children=2)
                    shard1 (v=0 children=3)
                         lP_Port:8983_solr_ (v=4) "node_name=lP_Port:8983_solr url=http://lP_Port:8983/solr/"
                         lP_Port:7574_solr_ (v=1) "node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/"
                         lP_Port:8900_solr_ (v=1) "node_name=lP_Port:8900_solr url=http://lP_Port:8900/solr/"
                    shard2 (v=0 children=2)
                         lP_Port:7500_solr_ (v=0) "node_name=lP_Port:7500_solr url=http://lP_Port:7500/solr/"
                         lP_Port:7574_solr_ (v=1) "node_name=lP_Port:7574_solr url=http://lP_Port:7574/solr/"

--> This setup needs only 1 ZooKeeper instance - the other Solr
machines just need to know the IP_Port where ZooKeeper is running & that's
it.
--> So no further configuration / installation is needed to quickly set up a
scalable / load-balanced cluster.
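
To illustrate the load-balancing idea: a client that knows which replicas serve each shard can pick one live replica per shard for every query. The sketch below is plain Python with a hard-coded, hypothetical cluster layout (the host names host-a to host-d and the function pick_shards_param are made up for illustration). It does not talk to ZooKeeper; it only builds a comma-separated value of the kind Solr's shards request parameter takes for distributed search.

```python
import random

# Hypothetical layout mirroring the tree above: shard name -> replica addresses.
CLUSTER = {
    "shard1": ["host-a:8983/solr", "host-b:7574/solr", "host-c:8900/solr"],
    "shard2": ["host-d:7500/solr", "host-b:7574/solr"],
}


def pick_shards_param(cluster, live_nodes, rng=random):
    """Pick one live replica per shard and join the picks with commas."""
    chosen = []
    for shard in sorted(cluster):
        # Only replicas currently listed as live are candidates.
        candidates = [r for r in cluster[shard] if r in live_nodes]
        if not candidates:
            raise RuntimeError("no live replica for " + shard)
        chosen.append(rng.choice(candidates))
    return ",".join(chosen)


if __name__ == "__main__":
    live = {"host-a:8983/solr", "host-b:7574/solr", "host-d:7500/solr"}
    print(pick_shards_param(CLUSTER, live))
```

With SolrCloud this bookkeeping should not be needed in the client, since the nodes read the layout from ZooKeeper themselves; the sketch only shows what is being automated.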

Disclaimer:
ZooKeeper is a relatively new feature - I'm not sure if it will work out in a
real production environment, which has a tight SLA pending.
But - definitely keep your eyes on this stuff - this will mature quickly!

Stijn Vanhoorelbeke
Re: My Plan to Scale Solr

gearond
What's an 'LSA'

 Dennis Gearon


Signature Warning
----------------
It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.




________________________________
From: Stijn Vanhoorelbeke <[hidden email]>
To: [hidden email]; [hidden email]
Sent: Thu, February 17, 2011 4:28:13 AM
Subject: Re: My Plan to Scale Solr

Re: My Plan to Scale Solr

Walter Underwood
http://lmgtfy.com/?q=SLA

wunder

On Feb 17, 2011, at 11:04 AM, Dennis Gearon wrote:

> What's an 'LSA'
>
> Dennis Gearon

--
Walter Underwood
Venture ASM, Troop 14, Palo Alto



Re: My Plan to Scale Solr

Lance Norskog-2
Or even better, search with 'LSA'.

On Thu, Feb 17, 2011 at 9:22 AM, Walter Underwood <[hidden email]> wrote:

> http://lmgtfy.com/?q=SLA
>
> wunder
>
> On Feb 17, 2011, at 11:04 AM, Dennis Gearon wrote:
>
>> What's an 'LSA'
>>
>> Dennis Gearon



--
Lance Norskog
[hidden email]
Re: My Plan to Scale Solr

Grijesh
It's just a joke?
Thanx:
Grijesh
www.gettinhahead.co.in
Re: My Plan to Scale Solr

Walter Underwood
In reply to this post by Lance Norskog-2
He misspelled it as "LSA". The original post says "I'm not sure if it will work out in a real production environment, which has a tight SLA pending." Clearly a Service Level Agreement, not Latent Semantic Analysis.

Since we're working on search engines, let's all try to figure stuff out for ourselves at least once, before we interrupt a few hundred people with questions.

wunder

On Feb 17, 2011, at 11:47 PM, Lance Norskog wrote:

> Or even better, search with 'LSA'.