replication should include the schema also

Noble Paul നോബിള്‍  नोब्ळ्
Solr replication currently copies only the data directory. So if the
schema changes and I re-index, it will blissfully copy the new index,
and the slaves will fail because their schema is incompatible.

So the steps we follow are:
 * Stop rsync on the slaves
 * Update the master with the new schema
 * Re-index the data
 * For each slave:
 ** Kill the slave
 ** Clean the data directory
 ** Install the new schema
 ** Restart
 ** Do a manual snappull

The amount of work the admin needs to do is quite significant
(depending on the number of slaves). These are manual steps and very
error-prone.

The solution:
Make the replication mechanism handle schema replication as well, so
that all I need to do is change the master and the slaves sync
automatically.

What is a good way to implement this?

We have an idea along the following lines.

This would involve changes to the snapshooter and snappuller scripts
and the snapinstaller component.

Every time the snapshooter takes a snapshot, it must record the timestamps
of schema.xml and elevate.xml (all the files which might affect
runtime behavior on the slaves).
For subsequent snapshots, if the timestamp of any of them has changed,
it must also copy all of them for replication.
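
The timestamp check described above could be sketched roughly as follows. This is only an illustration in Python, not the actual snapshooter (which is a shell script); the function names, the JSON state file, and the directory arguments are all assumptions made for the example.

```python
import json
import os
import shutil

# Assumption: the tracked files are the two named in the post.
TRACKED_FILES = ["schema.xml", "elevate.xml"]


def read_timestamps(conf_dir):
    """Return {filename: mtime} for the tracked files that exist in conf_dir."""
    stamps = {}
    for name in TRACKED_FILES:
        path = os.path.join(conf_dir, name)
        if os.path.exists(path):
            stamps[name] = os.path.getmtime(path)
    return stamps


def copy_conf_if_changed(conf_dir, snapshot_dir, state_file):
    """Copy all tracked files into snapshot_dir if any timestamp changed.

    state_file is a hypothetical JSON file holding the timestamps seen at
    the previous snapshot. Returns True when the files were copied.
    Note: on the very first run there is no saved state, so the files are
    copied once; that is a choice of this sketch, not part of the proposal.
    """
    current = read_timestamps(conf_dir)
    previous = {}
    if os.path.exists(state_file):
        with open(state_file) as fh:
            previous = json.load(fh)
    if current == previous:
        return False
    # Any change means all tracked files travel with the snapshot.
    for name in current:
        shutil.copy2(os.path.join(conf_dir, name),
                     os.path.join(snapshot_dir, name))
    with open(state_file, "w") as fh:
        json.dump(current, fh)
    return True
```

Calling copy_conf_if_changed once per snapshot would leave ordinary snapshots untouched and add the config files only when one of them has been modified since the last run.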

The snappuller copies the new directory as usual.

The snapinstaller checks whether these config files are present.

If yes:
 * It can create a temporary core
 * Install the changed index and configuration
 * Load it completely and swap it with the original core
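
As a rough sketch of that decision (again in Python purely for illustration; the step descriptions and the plan_install function are hypothetical, and no real Solr API is called), the snapinstaller logic might look like this:

```python
import os

# Assumption: same file names as mentioned in the post.
CONFIG_FILES = ("schema.xml", "elevate.xml")


def plan_install(snapshot_dir):
    """Return the ordered steps a snapinstaller-like tool would take.

    The steps are plain descriptions; actually creating and swapping
    cores is outside the scope of this sketch.
    """
    has_config = any(
        os.path.isfile(os.path.join(snapshot_dir, name))
        for name in CONFIG_FILES
    )
    if not has_config:
        # No config shipped with the snapshot: ordinary index install.
        return ["install index into existing core"]
    return [
        "create temporary core",
        "install changed index and configuration into temporary core",
        "load temporary core completely",
        "swap temporary core with original core",
    ]
```

The point of the split is that the expensive path (a new core) is taken only when config files actually arrive with the snapshot.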

--Noble

Re: replication should include the schema also

Otis Gospodnetic-2
I think that sounds correct.  Why not also include solrconfig.xml?  Also, why bother checking the timestamps of those .xml files? They are small enough that it's not worth complicating the scripts to save a few KB of transfer by copying them only when they change.  But maybe you want the timestamps to detect changes easily?


In any case, this would be a nice improvement!

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----

> From: Noble Paul നോബിള്‍ नोब्ळ् <[hidden email]>
> To: [hidden email]
> Sent: Wednesday, April 23, 2008 4:54:28 AM
> Subject: replication should include the schema also

Re: replication should include the schema also

Noble Paul നോബിള്‍  नोब्ळ्
Synchronizing solrconfig.xml is not really desirable. Typically the
solrconfig.xml files of the master and the slaves differ. For instance,
we may disable the UpdateHandler on the slaves, and the master may be
tuned to optimize indexing, and so on. The index data does not depend
on the config itself.

The schema timestamps are not checked to optimize data transfer.
A new schema means recreating the core, which is expensive, so it
should happen only when the schema has actually changed.

--Noble


On Fri, Apr 25, 2008 at 4:15 AM, Otis Gospodnetic
<[hidden email]> wrote:

> I think that sounds correct.  Why not also include solrconfig.xml?  Also, why bother with checking timestamps of those .xml files - they are small enough that it's not worth complicating scripts to save a few KB of xfer only when those files change.  But maybe you want that to detect their change easily?
>
>
>  In any case, this would be a nice improvement!
>
>  Otis
>  --
>  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Re: replication should include the schema also

Guillaume Smet
On Fri, Apr 25, 2008 at 6:05 AM, Noble Paul നോബിള്‍ नोब्ळ्
<[hidden email]> wrote:
> Synchronizing solrconfig is not a very desired behavior. Typically the
>  solrconfigs of master and slaves tend to differ. For instance we may
>  disable the UpdateHandler in slaves and there may be tuning done in
>  master to optimize indexing etc etc. The index data is not dependent
>  on the config itself.

+1 for not synchronizing the solrconfig.xml itself.

But perhaps we could have a solrconfig.slave.xml which could be
synchronized with slaves' solrconfig.xml if present?
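
A minimal sketch of that idea, assuming the replicated files land in some directory on the slave before installation (the function name and both directory arguments are hypothetical; only the two file names come from the suggestion above):

```python
import os
import shutil


def install_slave_config(pulled_dir, slave_conf_dir):
    """Install solrconfig.slave.xml as the slave's solrconfig.xml, if present.

    Returns True when the file was found and copied, False when the
    master did not ship one and the slave's config is left untouched.
    """
    src = os.path.join(pulled_dir, "solrconfig.slave.xml")
    if not os.path.isfile(src):
        return False
    shutil.copy2(src, os.path.join(slave_conf_dir, "solrconfig.xml"))
    return True
```

This keeps the master's own solrconfig.xml out of replication entirely while still letting the master hold the slave-specific config.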

--
Guillaume

Re: replication should include the schema also

Bill Au
Synchronizing solrconfig.xml is definitely not a good idea.  Typically the
master has a post-commit/optimize hook to execute snapshooter.

Solr's replication is meant for replicating the Lucene index.  One can argue
that the schema is part of the Solr index, so it should be included in the
replication.  But I don't think replication should be used to do software
installation.  I don't think we should use it to distribute configuration
files, just like we shouldn't use it to distribute updates to the Solr
binaries.

Bill

On Fri, Apr 25, 2008 at 4:48 AM, Guillaume Smet <[hidden email]>
wrote:

> +1 for not synchronizing the solrconfig.xml itself.
>
> But perhaps we could have a solrconfig.slave.xml which could be
> synchronized with slaves' solrconfig.xml if present?
>
> --
> Guillaume
>

Re: replication should include the schema also

Noble Paul നോബിള്‍  नोब्ळ्
The idea is to ensure that the replicated index is in sync with
the rest of the system. As I see it, schema.xml is the only
configuration that the index depends on.

If you look at the replication mechanisms of databases, they replicate the
schema (DDL) as well, without which the replicated data would be
inconsistent. But database replication does not replicate the server
configuration. For starters, we can assume that schema replication is
good enough.
--Noble

On Fri, Apr 25, 2008 at 10:17 PM, Bill Au <[hidden email]> wrote:

> Synchronizing solrconfig.xml is definitely not a good idea.  Typically the
>  master has a post commit/optimize hook to execute
>  snapshooter.
>
>  Solr's replication is meant for replicating the Lucene index.  One can argue
>  that the schema is part of the Solr index
>  so it should be included in the replication.  But I don't think it would be
>  use to do software installation.  I don't think
>  we should use it to distribute configuration files, just like we shouldn't
>  use it to distribute updated to the Solr binaries.
>
>  Bill

Re: replication should include the schema also

sunnyShiny06
In reply to this post by Noble Paul നോബിള്‍ नोब्ळ्
Hi,

What do you mean by cleaning the data directory on the slave servers?
Do you have to remove everything from it and then start a new rsyncd,
and turn on the snappuller cron job?
Thanks a lot,

Re: replication should include the schema also

Noble Paul നോബിള്‍  नोब्ळ्
Are you using a trunk version? Did you try the new replication feature
(http://wiki.apache.org/solr/SolrReplication)? It supports solrconfig
replication automatically.

On Fri, Feb 13, 2009 at 6:56 AM, sunnyfr <[hidden email]> wrote:

>
> Hi,
>
> What do you mean by clean the data directory on the slave servers?
> Do you have remove everything from it and then start a new rsyncd ???
> and turn on snappuller cronjob ??
> thanks a lot,
>

--
--Noble Paul