Solr on HDFS


Solr on HDFS

Joe Obernberger
Been using Solr on HDFS for a while now, and I'm seeing an issue with
redundancy/reliability.  If a server goes down, when it comes back up,
it will never recover because of the lock files left in HDFS.  The Solr
node needs to be brought down manually, the lock files deleted, and then
brought back up.  At that point, it appears to copy all the data for its
replicas.  If the index is large and new data is being indexed, in some
cases it never recovers; the replication retries over and over.

How can we make a reliable SolrCloud cluster on HDFS, one that can
handle servers coming and going?

Thank you!

-Joe


Re: Solr on HDFS

Angie Rabelero
I don’t think you’re using Cloudera or Ambari, but Ambari has an option to delete the locks. This seems more a configuration/architecture issue than a reliability issue. You may want to spin up an alias while you bring down, clear locks and directories, and recreate and index the affected collection, while you work your other issues.
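
For the alias step, a minimal SolrJ sketch (the ZooKeeper address, alias
name, and collection names here are illustrative, not from this thread):

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class AliasSwap {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
          // Point the serving alias at the rebuilt collection so queries
          // keep working while the damaged one is dropped and re-indexed.
          CollectionAdminRequest.createAlias("search", "collection_v2")
              .process(client);
        }
      }
    }

Queries can then target "search" regardless of which physical collection
currently backs it.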


Re: Solr on HDFS

Joe Obernberger
Thank you.  No, while the cluster is using Cloudera for HDFS, we do not
use Cloudera to manage the Solr cluster.  If it is a
configuration/architecture issue, what can I do to fix it?  I'd like a
system where servers can come and go, but the indexes stay available and
recover automatically.  Is that possible with HDFS?
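
For reference, the HDFS wiring under discussion lives in solrconfig.xml;
a stripped-down sketch based on the stock setup in the Solr Reference
Guide, with the host and paths as placeholders:

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
      <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
    </directoryFactory>

    <lockType>${solr.lock.type:hdfs}</lockType>

The hdfs lock type is what leaves write.lock files behind in HDFS after
an unclean shutdown.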
While adding an alias to other collections would be an option, if that
collection is the only collection, or one that is currently needed in a
live system, we can't bring it down, re-create it, and re-index when
that process may take weeks.

Any ideas?

-Joe


Re: Solr on HDFS

lstusr 5u93n4
Hi Joe,

We fought with Solr on HDFS for quite some time, and faced similar issues
to the ones you're seeing.  (See this thread, for example:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201812.mbox/%3cCABd9LjTeacXpy3FFjFBkzMq6vhgu7Ptyh96+w-KC2p=-rQk4Hg@...%3e
)

The Solr lock files on HDFS get deleted if the Solr server is shut down
gracefully, but we couldn't always guarantee that in our environment, so
we ended up writing a custom startup script that searches for lock files
on HDFS and deletes them before Solr starts.
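
That script isn't shown here; a minimal sketch of the same idea using the
Hadoop FileSystem API (the NameNode address and /solr home are
assumptions; write.lock is the default lock file name used by Solr's
HdfsLockFactory):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class ClearStaleLocks {
      public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(
            URI.create("hdfs://namenode:8020"), new Configuration())) {
          // Recursively walk the Solr home and delete leftover lock files.
          // Run this only while the Solr node is down, or a live lock may
          // be removed out from under an active IndexWriter.
          RemoteIterator<LocatedFileStatus> files =
              fs.listFiles(new Path("/solr"), true);
          while (files.hasNext()) {
            Path p = files.next().getPath();
            if (p.getName().equals("write.lock")) {
              System.out.println("Deleting stale lock: " + p);
              fs.delete(p, false);
            }
          }
        }
      }
    }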

However, the issue you mention of the Solr server rebuilding its whole
index from replicas on startup was enough of a show-stopper for us that
we switched away from HDFS to local disk.  It literally made the
difference between 24+ hours of recovery time after an unexpected outage
and less than a minute.

If you do end up finding a solution to this issue, please post it to
this mailing list, because there are others out there (like us!) who
would most definitely make use of it.

Thanks

Kyle


Re: Solr on HDFS

Joe Obernberger
Hi Kyle - Thank you.

Our current index is split across 3 Solr collections; our largest
collection is 26.8 TBytes (80.5 TBytes when 3x replicated in HDFS) across
100 shards, hosted on 40 machines.  We've found that with large
collections, having no replicas (but lots of shards) ends up being more
reliable, since recovery time is much shorter.  We keep another 30-day
index (1.4 TBytes) that does have replicas (40 shards, 3 replicas each);
if a node goes down, we manually delete the lock files and bring it back
up.  Yes, lots of network IO, but it usually recovers OK.

Having a large collection like this with no replicas seems like a recipe
for disaster.  So we've been experimenting with the latest version (8.2)
and changing our indexing process to split the data into many Solr
collections that do have replicas, building the list of collections to
search at query time.  Our searches are date based, so we can define
which collections to query at query time.  As a test, we ran just two
machines, HDFS, and 500 collections.  One server ran out of memory and
crashed, and we had over 1,600 lock files to delete.
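
Building the collection list at query time works with the standard
collection parameter; a sketch in SolrJ (the ZooKeeper address and the
date-suffixed collection names are invented for illustration):

    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DateScopedSearch {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
            Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
          SolrQuery q = new SolrQuery("body:example");
          // Fan the query out over only the date ranges we care about.
          q.set("collection", "docs_2019_07,docs_2019_08");
          QueryResponse rsp = client.query("docs_2019_08", q);
          System.out.println("hits: " + rsp.getResults().getNumFound());
        }
      }
    }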

If you think about it, having a shard with 3 replicas on top of a file
system that does 3x replication seems a little excessive!  I'd love to
see Solr take more advantage of a shared FS.  Perhaps one idea is to use
HDFS with an NFS gateway, though that seems like it may be slow.
Architecturally, I love having one large file system to manage instead
of lots of individual file systems across many machines.  HDFS makes
this easy.

-Joe


Re: Solr on HDFS

Kevin Risden

> If you think about it, having a shard with 3 replicas on top of a file
> system that does 3x replication seems a little excessive!

https://issues.apache.org/jira/browse/SOLR-6305 should help here. I can
take a look at merging the patch, since it looks like it has been helpful
to others.
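
Until something like that lands, one interim workaround is to lower the
HDFS replication factor on the index files after the fact; a sketch
under assumed paths (note that setReplication only changes existing
files, so new segment files will still be written at the cluster
default):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    public class ThinReplication {
      public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(
            URI.create("hdfs://namenode:8020"), new Configuration())) {
          RemoteIterator<LocatedFileStatus> files =
              fs.listFiles(new Path("/solr"), true);
          while (files.hasNext()) {
            Path p = files.next().getPath();
            // Solr replicas already provide redundancy above HDFS, so
            // 2x block replication may be enough for the index files.
            fs.setReplication(p, (short) 2);
          }
        }
      }
    }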


Kevin Risden

