Corrupted index in SolrCloud

Matt Pearce
Hi,

We've just been working with a client who had a corruption issue with
their SolrCloud install. They're running Solr 5.3.1, with a collection
spread across 12 shards. Each shard has a single replica.

They were seeing "Index Corruption" errors when running certain queries.
We investigated, and narrowed it down to a single shard. Using the
Lucene CheckIndex utility, we tested both the primary and replica copies
of the data, and found the same issue with both - the first segment,
containing the majority of the documents, was reporting corruption. They
were able to restore from a backup, but it would be good to get some
idea what could have caused the problem in SolrCloud. One of the
machines ran out of disk space last week during indexing, which we guess
could have been the starting point for the corrupted data files.
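For anyone wanting to repeat this kind of check, here is a minimal sketch of driving CheckIndex from a script. The class name org.apache.lucene.index.CheckIndex and its index-directory argument are Lucene's; the jar name and index path are placeholders, not values from this thread. It is safest to run it against a copy of the index, or with Solr stopped on that node.

```python
import subprocess


def checkindex_command(lucene_core_jar, index_dir):
    """Build the java command line that runs Lucene's CheckIndex on index_dir."""
    return [
        "java",
        "-cp", str(lucene_core_jar),
        "org.apache.lucene.index.CheckIndex",
        str(index_dir),
    ]


def run_checkindex(lucene_core_jar, index_dir):
    """Run CheckIndex and return (exit_code, combined stdout/stderr text)."""
    result = subprocess.run(
        checkindex_command(lucene_core_jar, index_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Placeholder paths: point these at your own lucene-core jar and at a
    # shard's data/index directory (once for the leader, once for the replica).
    code, report = run_checkindex("lucene-core.jar", "/path/to/core/data/index")
    print(report)
    print("exit code:", code)
```

Running it once per copy of the shard makes it easy to compare the two reports segment by segment.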

Our question is: why would the corruption have spread to the replica as
well? Could a corrupted document be replicated and cause the replica
index to break as well?

Thanks,

Matt

--
Matt Pearce
Flax - Open Source Enterprise Search
www.flax.co.uk

Re: Corrupted index in SolrCloud

Erick Erickson
Running out of disk space is, of course, a red flag and likely the root cause.

As for how it replicated, let's assume a 2 replica shard (leader +
follower). If the follower ever went into full recovery, it would use
old-style replication to copy down the entire index, corruption and
all, from the leader. The follower can go into "full recovery" for a
number of reasons, from being shut down for a while as indexing
continued on the leader to communications burps.
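One rough way to test this theory on a given shard is to compare the index files of the two copies byte for byte. The sketch below is an illustration only: the function names and directory arguments are made up for the example, and it assumes you can read both index directories (or copies of them). Replicas that indexed independently usually end up with different segment files, so a corrupted segment whose files are identical on both sides is consistent with, though not proof of, a wholesale copy during full recovery.

```python
import hashlib
from pathlib import Path


def file_digests(index_dir):
    """Map each file name directly under index_dir to the SHA-256 of its bytes."""
    digests = {}
    for path in sorted(Path(index_dir).iterdir()):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as f:
                # Read in 1 MiB chunks so large segment files are not loaded whole.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[path.name] = h.hexdigest()
    return digests


def compare_copies(leader_dir, follower_dir):
    """Classify index files as identical, different, or present on one side only."""
    leader = file_digests(leader_dir)
    follower = file_digests(follower_dir)
    shared = set(leader) & set(follower)
    return {
        "identical": sorted(n for n in shared if leader[n] == follower[n]),
        "different": sorted(n for n in shared if leader[n] != follower[n]),
        "leader_only": sorted(set(leader) - set(follower)),
        "follower_only": sorted(set(follower) - set(leader)),
    }
```

If the files of the segment that CheckIndex flags show up under "identical", the follower most likely received them from the leader rather than building them itself.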

There's been a lot of work put in to making fewer full recoveries, but
much of that only came to fruition in recent Solr releases, especially
starting with Solr 7.3. (SOLR-11702)

Best,
Erick
On Fri, Sep 21, 2018 at 7:17 AM Matt Pearce <[hidden email]> wrote:

> [...]
> Our question is: why would the corruption have spread to the replica as
> well? Could a corrupted document be replicated and cause the replica
> index to break as well?

Re: Corrupted index in SolrCloud

Matt Pearce

Thanks for the explanation Erick, that makes sense!

Matt

On 21/09/2018 15:50, Erick Erickson wrote:

> [...]
