High-Availability deployment


High-Availability deployment

dma_bamboo
Hi

I'm about to deploy SOLR in a production environment and so far I'm a bit
concerned about availability.

I have a system that is responsible for fetching data from a database and
then pushing it to SOLR using its XML/HTTP interface.
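
Just to illustrate what I mean by that (this is a simplified sketch, not my
actual code; host, port and field names are made up):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Simplified sketch: POST one add document, then a commit, to Solr's
// XML update handler. Host, port and field names are placeholders.
public class SolrPushSketch {
    static void post(String solrUpdateUrl, String xml) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(solrUpdateUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = conn.getOutputStream();
        out.write(xml.getBytes("UTF-8"));
        out.close();
        System.out.println(solrUpdateUrl + " -> " + conn.getResponseCode());
    }

    public static void main(String[] args) throws Exception {
        String url = "http://solr-master:8983/solr/update";
        post(url, "<add><doc>"
                + "<field name=\"id\">doc-1</field>"
                + "<field name=\"title\">Example title</field>"
                + "</doc></add>");
        post(url, "<commit/>");
    }
}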

So I'm going to deploy N instances of my application so it's going to be
redundant enough.

And I'm deploying SOLR in a Master / Slaves structure, so I'm using the
slave nodes as a way to keep my index replicated and to be able to use them
to serve my queries. But my problem lies on the indexing side of things. Is
there a good alternative, like a Master/Master structure, that I could use
so that if my current master dies I can automatically switch to my secondary
master while keeping my index integrity? Or would a manual index merge be
needed after this switchover so I can redefine my primary master server?

Thanks,
Daniel  



Re: High-Availability deployment

Yonik Seeley-2
On 10/8/07, Daniel Alheiros <[hidden email]> wrote:
> I'm about to deploy SOLR in a production environment

Cool, can you share exactly what it will be used for?

> and so far I'm a bit
> concerned about availability.
>
> I have a system that is responsible for fetching data from a database and
> then pushing it to SOLR using its XML/HTTP interface.
>
> So I'm going to deploy N instances of my application so it's going to be
> redundant enough.
>
> And I'm deploying SOLR in a Master / Slaves structure, so I'm using the
> slaves nodes as a way to keep my index replicated and to be able to use them
> to serve my queries. But my problem lies on the indexing side of things. Is
> there a good alternative like a Master/Master structure that I could use so
> if my current master dies I can automatically switch to my secondary master
> keeping my index integrity?

In all the setups I've dealt with, master redundancy wasn't an issue.
If something bad happens to corrupt the index, shut off replication to
the slaves and do a complete rebuild on the master.  If the master
hardware dies, reconfigure one of the slaves to be the new master.
These are manual steps, and they assume that it's not the end of the world
if your search is "stale" for a couple of hours.  A schema change that
requires reindexing would also cause this window of staleness.

If your index build takes a long time, you could set up a secondary
master to pull from the primary (just like another slave).  But
there's no support for automatically switching the slaves over, and the
secondary wouldn't have anything added between the last commit it pulled
and the primary's crash... so something would need to catch it up (query
for the latest doc and resume indexing from there).
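
A sketch of what that catch-up could look like, assuming each document
carries an indexed "timestamp" field (that's an assumption, not something
every schema has) and that the exact query syntax may vary by Solr version:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Sketch: ask the secondary master for its newest document, so the
// indexing app knows where to resume. Assumes an indexed "timestamp"
// field on every document; the host name is a placeholder.
public class FindLatestDoc {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://master-standby:8983/solr/select"
                + "?q=*:*&sort=timestamp+desc&rows=1&fl=id,timestamp");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            // The response is XML; parse out id/timestamp and restart
            // the feed from that point.
            System.out.println(line);
        }
        in.close();
    }
}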

You could also have two search tiers... another copy of the master and
multiple slaves.  If one was down, being upgraded, or being rebuilt,
you could direct search traffic to the other set of servers.

-Yonik

Re: High-Availability deployment

Walter Underwood, Netflix
In reply to this post by dma_bamboo
We run multiple, identical, independent copies. No master/slave
dependencies. Yes, we run indexing N times for N servers, but
that's what CPU is for and I sleep better at night. It makes
testing and deployment trivial, too.
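
Roughly, the indexing code just does the same POST to each copy; a sketch
(host names are placeholders, and real error handling is the interesting
part):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: send every update to each independent Solr copy.
// A copy that fails is only logged here; in practice you'd queue and retry.
public class FanOutIndexer {
    static final String[] COPIES = {
        "http://solr1:8983/solr/update",
        "http://solr2:8983/solr/update",
        "http://solr3:8983/solr/update"
    };

    static void postToAll(String xml) {
        for (String copy : COPIES) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(copy).openConnection();
                conn.setRequestMethod("POST");
                conn.setDoOutput(true);
                conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
                OutputStream out = conn.getOutputStream();
                out.write(xml.getBytes("UTF-8"));
                out.close();
                System.out.println(copy + " -> " + conn.getResponseCode());
            } catch (Exception e) {
                System.err.println(copy + " failed: " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        postToAll("<add><doc><field name=\"id\">doc-1</field></doc></add>");
        postToAll("<commit/>");
    }
}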

wunder
==
Walter Underwood
Search Guy, Netflix



Re: High-Availability deployment

dma_bamboo
In reply to this post by Yonik Seeley-2
Hi Yonik.

I'll check whether I can comment on it at this level, and if it's OK I'll
bring other details. Sorry I can't do it right now, but I don't want to break
my company's policies.

Well, I believe I can live with some staleness at certain moments, but it's
not ideal as users are expected to need it 24x7. So the common practice is to
make one of the slaves the new master, switch things over to it, and after
the outage put them back in sync and do the proper switch back? OK, I'll
follow this, but I'm still concerned about the number of manual steps
involved... It would be really great to have a more automated way of
handling these situations... Maybe I can think a bit more about it and come
up with a suggestion to discuss here about how to make this failover
transparent, or close to it...
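
Just thinking out loud: even something as simple as a watchdog on the
master's ping handler would at least tell us when to start the switch. A
rough sketch (host name made up; the actual promotion would still be the
manual part):

import java.net.HttpURLConnection;
import java.net.URL;

// Rough sketch: poll the master's ping handler and complain when it has
// been unreachable for several consecutive checks. Promoting a standby
// is still a separate, manual step.
public class MasterWatchdog {
    public static void main(String[] args) throws Exception {
        String pingUrl = "http://master-primary:8983/solr/admin/ping";
        int consecutiveFailures = 0;
        while (true) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(pingUrl).openConnection();
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                if (conn.getResponseCode() == 200) {
                    consecutiveFailures = 0;
                } else {
                    consecutiveFailures++;
                }
            } catch (Exception e) {
                consecutiveFailures++;
            }
            if (consecutiveFailures >= 3) {
                System.err.println("Master looks dead - start the failover procedure");
            }
            Thread.sleep(10000); // check every 10 seconds
        }
    }
}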

I'm setting up a backup task to keep a copy of my master index, just to
avoid having to re-build my index from scratch. Another important issue is
how frequently you have seen indexes getting corrupted. If I try to run a
commit or optimize on a Solr master instance whose index got corrupted,
will it run the command? And more importantly, will it run the
postOptimize/postCommit scripts, generating snapshots and then possibly
propagating the bad index?

Thanks again,
Daniel  



Re: High-Availability deployment

Yonik Seeley-2
On 10/8/07, Daniel Alheiros <[hidden email]> wrote:
> Well I believe I can live with some staleness at certain moments, but it's
> not good as users are supposed to need it 24x7. So the common practice is to
> make one of the slaves as the new master and switch things over to it and
> after the outage put them in sync again and do the proper switch back? OK,
> I'll follow this, but I'm still concerned about the amount of manual steps
> to be done...

That was the plan - never needed it though... (never had a master
completely die that I know of).  Having the collection not be updated
for an hour or so while the ops folks fixed things always worked fine.

> And other important issue is
> how frequently have you seen indexes getting corrupted?

Just once I think - no idea of the cause (and I think it was quite an
old version of lucene).

> If I try to run a
> commit or optimize on a Solr master instance and it's index got corrupted
> will it run the command?

Almost all of the cases I've seen of a master failing were OOM
errors, often during segment merging (again, older versions of Lucene,
and someone forgot to change the JVM heap size from the default).
This could cause a situation where you added a document but the old
one was not deleted (overwritten).  Not "corrupted" at the Lucene
level, but if the JVM died at the wrong spot, search results could
possibly return two documents for the same unique key.  We normally
just rebuilt after a crash.
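
If you want to check for that after a crash, faceting on the unique key
field will show any key that appears in more than one document. A sketch,
assuming the unique key field is called "id" and faceting is enabled (it's
a heavy query on a big index, so treat it as a diagnostic):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Sketch: facet on the unique key field; any facet count > 1 means a
// key that appears in more than one document. Assumes the unique key
// field is "id"; the host name is a placeholder.
public class DuplicateKeyCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://solr-master:8983/solr/select"
                + "?q=*:*&rows=0&facet=true&facet.field=id"
                + "&facet.mincount=2&facet.limit=100");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // any keys listed here are duplicated
        }
        in.close();
    }
}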

> And more importantly, will it run the
> postOptimize/postCommit scripts generating snapshots and then possibly
> propagating the bad index?

Normally not, I think... the JVM crash/restart left the Lucene write
lock acquired on the index and further attempts to modify it failed.
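
If you're poking at a crashed master by hand, the leftover lock is just a
file; a sketch, with the caveat that where the lock file lives depends on
the Lucene version and lock factory in use (the path here is a placeholder):

import java.io.File;
import java.util.Date;

// Sketch only: after a JVM crash, see whether a Lucene write lock was
// left behind. The index directory path is a placeholder, and the lock
// file's location depends on the Lucene version and lock factory.
public class StaleLockCheck {
    public static void main(String[] args) {
        File lock = new File("/var/solr/data/index", "write.lock");
        if (lock.exists()) {
            System.out.println("Leftover " + lock + " from "
                    + new Date(lock.lastModified())
                    + " - updates will keep failing until it's dealt with");
        } else {
            System.out.println("No leftover write lock found");
        }
    }
}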

-Yonik

Re: High-Availability deployment

dma_bamboo
Hi Yonik.

It looks pretty good.

I hope I'm not the one who will post a very odd crash after a while. :)
OK, so it's very unlikely that an OOM is going to happen, as I've set my JVM
heap size to 1.5G.

Hmm, is there any exception thrown in case the index gets corrupted (if it's
not caused by an OOM and the JVM crashes)? The document uniqueness SOLR offers
is one of the many reasons I'm using it, and it would be excellent to know
when it's gone. :)
Does it mean that after recovering from a JVM crash it's recommended to
rebuild my indexes instead of just re-starting them?

Thanks again,
Daniel



Re: High-Availability deployment

Yonik Seeley-2
On 10/8/07, Daniel Alheiros <[hidden email]> wrote:
> Hmm, is there any exception thrown in case the index get corrupted (if it's
> not caused by OOM and the JVM crashes)? The document uniqueness SOLR offers
> is one of the many reasons I'm using it and should be excellent to know when
> it's gone. :)
> Does it mean that after recovering from a JVM crash should be recommended to
> rebuild my indexes instead of just re-starting it?

Yes, it's safer to do so.
I think in a future release we will be able to guarantee document
uniqueness even in the face of a crash.

-Yonik

Re: High-Availability deployment

dma_bamboo
OK, I'll define it as a procedure in my disaster recovery plan.

That would be great. I'm looking forward to it.

Thanks,
Daniel


Re: High-Availability deployment

hossman
In reply to this post by dma_bamboo
: I'm setting up a backup task to keep a copy of my master index, just to
: avoid having to re-build my index from scratch. And other important issue is

every slave is a backup of the master, so you don't usually need a
separate backup mechanism.

re-building the index is more about peace of mind when asking "why did it
crash?  what did/didn't get written to the index before it crashed?"




-Hoss


Re: High-Availability deployment

dma_bamboo
Hi Hoss,

Yes, I know that, but I want to have a proper offline backup (something that
can be kept in a very controlled environment). I thought about using this
approach (a slave just for this purpose), but if I'm using it just as a
backup node there is no reason not to use a proper backup structure instead
(as I have all the needed infrastructure in place for that). It's just an
extra redundancy level, as I'm going to have a Master/Slaves structure and
the index is replicated amongst them anyway.

Yes, I got it. I have implemented ways to re-index content incrementally, so
I can re-index just a slice of my content (based on dates or IDs), which
should be enough to bring my index up to date quickly after a possible
disaster.
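
Roughly what I mean, as a sketch (table, column and connection details are
made up for illustration): select the slice from the database by
last-modified date and push those rows back through the same XML/HTTP feed.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Sketch of an incremental re-index: pick up only the rows changed since
// a given date and turn them into a Solr add message. Table, column and
// connection details are made up for illustration.
public class IncrementalReindex {
    public static void main(String[] args) throws Exception {
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://dbhost/content", "user", "password");
        PreparedStatement stmt = db.prepareStatement(
                "SELECT id, title, body FROM articles WHERE last_modified >= ?");
        stmt.setString(1, args[0]); // e.g. "2007-10-01"
        ResultSet rs = stmt.executeQuery();
        StringBuilder xml = new StringBuilder("<add>");
        while (rs.next()) {
            xml.append("<doc>")
               .append("<field name=\"id\">").append(rs.getString("id")).append("</field>")
               .append("<field name=\"title\">").append(rs.getString("title")).append("</field>")
               .append("<field name=\"body\">").append(rs.getString("body")).append("</field>")
               .append("</doc>");
        }
        xml.append("</add>");
        rs.close();
        stmt.close();
        db.close();
        // POST xml.toString() to the master's /update handler and follow with
        // <commit/>, exactly as in the normal feed (XML-escaping of field
        // values is omitted in this sketch).
    }
}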

Thank you for your considerations,
Daniel

