Solr machine freezes up during first replication after optimization

Kyle Lau
Hi all,

We recently started running into a Solr slave server freeze-up problem.
After looking into the logs and the timing of the occurrences, it seems
that the problem always follows the first replication after an
optimization.  Once a server freezes up, we are unable to ssh into it, but
ping still returns fine.  The only way to recover is to reboot the
machine.

In our replication setup, the masters are optimized nightly because we have
a fairly large index (~60GB per master) and are adding millions of documents
every day.  After the optimization, a snapshot is taken automatically.  When
replication kicks in, the corresponding slave server retrieves the
snapshot using rsync.
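
For context, snappuller is essentially a shell script around rsync; the
pull it performs looks conceptually like the sketch below (host, port,
module name, and paths are illustrative placeholders, not our actual
configuration):

# Hypothetical shape of the snappuller pull; host/port/paths are
# placeholders for illustration only.
rsync -Wa --delete \
    rsync://master-host:18983/solr/snapshot.20090521233922/ \
    /mnt/solr/data/snapshot.20090521233922-wip/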

Here is the snappuller.log capturing one of the failed pulls, along with
the successful pulls immediately before and after it:

2009/05/21 22:55:01 started by biz360
2009/05/21 22:55:01 command: /mnt/solr/bin/snappuller ...
2009/05/21 22:55:04 pulling snapshot snapshot.20090521221402
2009/05/21 22:55:11 ended (elapsed time: 10 sec)

##### optimization completes sometime during this gap, and a new snapshot is
created

2009/05/21 23:55:01 started by biz360
2009/05/21 23:55:01 command: /mnt/solr/bin/snappuller ...
2009/05/21 23:55:02 pulling snapshot snapshot.20090521233922

##### slave freezes up, and machine has to be rebooted

2009/05/22 01:55:02 started by biz360
2009/05/22 01:55:02 command: /mnt/solr/bin/snappuller ...
2009/05/22 01:55:03 pulling snapshot snapshot.20090522014528
2009/05/22 02:56:12 ended (elapsed time: 3670 sec)


A more detailed debug log shows that snappuller simply stopped at some point:

started by biz360
command: /mnt/solr/bin/snappuller ...
pulling snapshot snapshot.20090521233922
receiving file list ... done
deleting segments_16a
deleting _cwu.tis
deleting _cwu.tii
deleting _cwu.prx
deleting _cwu.nrm
deleting _cwu.frq
deleting _cwu.fnm
deleting _cwt.tis
deleting _cwt.tii
deleting _cwt.prx
deleting _cwt.nrm
deleting _cwt.frq
deleting _cwt.fnm
deleting _cws.tis
deleting _cws.tii
deleting _cws.prx
deleting _cws.nrm
deleting _cws.frq
deleting _cws.fnm
deleting _cwr_1.del
deleting _cwr.tis
deleting _cwr.tii
deleting _cwr.prx
deleting _cwr.nrm
deleting _cwr.frq
deleting _cwr.fnm
deleting _cwq.tis
deleting _cwq.tii
deleting _cwq.prx
deleting _cwq.nrm
deleting _cwq.frq
deleting _cwq.fnm
deleting _cwq.fdx
deleting _cwq.fdt
deleting _cwp.tis
deleting _cwp.tii
deleting _cwp.prx
deleting _cwp.nrm
deleting _cwp.frq
deleting _cwp.fnm
deleting _cwp.fdx
deleting _cwp.fdt
deleting _cwo_1.del
deleting _cwo.tis
deleting _cwo.tii
deleting _cwo.prx
deleting _cwo.nrm
deleting _cwo.frq
deleting _cwo.fnm
deleting _cwo.fdx
deleting _cwo.fdt
deleting _cwe_1.del
deleting _cwe.tis
deleting _cwe.tii
deleting _cwe.prx
deleting _cwe.nrm
deleting _cwe.frq
deleting _cwe.fnm
deleting _cwe.fdx
deleting _cwe.fdt
deleting _cw2_3.del
deleting _cw2.tis
deleting _cw2.tii
deleting _cw2.prx
deleting _cw2.nrm
deleting _cw2.frq
deleting _cw2.fnm
deleting _cw2.fdx
deleting _cw2.fdt
deleting _cvs_4.del
deleting _cvs.tis
deleting _cvs.tii
deleting _cvs.prx
deleting _cvs.nrm
deleting _cvs.frq
deleting _cvs.fnm
deleting _cvs.fdx
deleting _cvs.fdt
deleting _csp_h.del
deleting _csp.tis
deleting _csp.tii
deleting _csp.prx
deleting _csp.nrm
deleting _csp.frq
deleting _csp.fnm
deleting _csp.fdx
deleting _csp.fdt
deleting _cpn_q.del
deleting _cpn.tis
deleting _cpn.tii
deleting _cpn.prx
deleting _cpn.nrm
deleting _cpn.frq
deleting _cpn.fnm
deleting _cpn.fdx
deleting _cpn.fdt
deleting _cmk_x.del
deleting _cmk.tis
deleting _cmk.tii
deleting _cmk.prx
deleting _cmk.nrm
deleting _cmk.frq
deleting _cmk.fnm
deleting _cmk.fdx
deleting _cmk.fdt
deleting _cjg_14.del
deleting _cjg.tis
deleting _cjg.tii
deleting _cjg.prx
deleting _cjg.nrm
deleting _cjg.frq
deleting _cjg.fnm
deleting _cjg.fdx
deleting _cjg.fdt
deleting _cge_19.del
deleting _cge.tis
deleting _cge.tii
deleting _cge.prx
deleting _cge.nrm
deleting _cge.frq
deleting _cge.fnm
deleting _cge.fdx
deleting _cge.fdt
deleting _cd9_1m.del
deleting _cd9.tis
deleting _cd9.tii
deleting _cd9.prx
deleting _cd9.nrm
deleting _cd9.frq
deleting _cd9.fnm
deleting _cd9.fdx
deleting _cd9.fdt
./
_cww.fdt

We have random Solr slaves failing in the exact same manner almost daily.
Any help is appreciated!

Re: Solr machine freezes up during first replication after optimization

Otis Gospodnetic

Hm, are you sure this is not a network/switch/disk/something like that problem?
Also, precisely because you have such a large index I'd avoid optimizing the index and then replicating it.  My wild guess is that simply rsyncing this much data over the network kills your machines.  Have you tried manually doing the rsync and watching the machine/switches/NICs/disks to see what's going on?  That's what I'd do.
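
For instance, something along these lines, run on the slave while a pull
is in flight (these are standard sysadmin tools, nothing Solr-specific;
host and paths are placeholders):

# Pull the snapshot by hand and watch memory, swap, and disk pressure:
rsync -av --progress rsync://master-host:18983/solr/snapshot.20090521233922/ /tmp/snaptest/ &
vmstat 5       # watch si/so (swap) and wa (IO wait) columns
iostat -x 5    # watch per-device utilization and await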


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Solr machine freezes up during first replication after optimization

Kyle Lau
Thanks for the suggestion, Otis.  At this point, we are not sure what the
real cause is.  We have more than one master-slave group.  Every day the
first replication after the optimization causes a random slave machine to
freeze; the very same slave succeeded at previous replications and succeeds
at future ones (after it is fixed by rebooting).  Within any given group,
all the other slaves survive the same replication task.  Does that sound
like a hardware-related issue?

You brought up a good point that maybe we should avoid replicating an
optimized index, since that most likely causes the entire index to be
rsync'ed over.  I want to give that a shot after I iron out some of the
technical details.
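
For what it's worth, an rsync dry run against the current slave index
should show how much data an optimized snapshot would really transfer
(sketch only; host and paths are placeholders):

# -n = dry run: list what would be sent without transferring anything
rsync -avn --delete rsync://master-host:18983/solr/snapshot.20090522014528/ /mnt/solr/data/index/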

Thanks,
Kyle

Re: Solr machine freezes up during first replication after optimization

Otis Gospodnetic

Not only does the whole index have to be rsynced; everything the FS cached from the old index can now be thrown away, and the new index has to slowly get cached anew, even if you warm things up with good queries.  No machine will like that.  Even if this doesn't kill the machine (and it shouldn't!), it will be a performance hit.
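
If the transfer itself is what hurts the box, one generic rsync knob worth
a try (just a standard rsync option; whether snappuller exposes it I don't
know offhand) is bandwidth throttling:

# Cap the pull at roughly 10 MB/s so it competes less with everything else
# (--bwlimit is in KBytes/sec; the value here is only an example)
rsync -Wa --delete --bwlimit=10000 rsync://master-host:18983/solr/snapshot.20090522014528/ /mnt/solr/data/snapshot-wip/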

It could be a hardware error, except you said multiple/different servers die like this, which makes a hardware error unlikely.  But possible.  Say, bad RAM.
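
Given the symptom (ping answers but ssh does not), it may also be worth
grepping the slave's logs after a reboot for OOM-killer or disk-error
traces; a rough sketch (log file paths vary by distro):

# Look for out-of-memory or I/O error traces around the freeze window
grep -iE 'oom|out of memory|i/o error' /var/log/messages* | tail -50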

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch