Recovery Issue - Solr 6.6.1 and HDFS

Recovery Issue - Solr 6.6.1 and HDFS

Joe Obernberger
Hi All - we have a system with 45 physical boxes running solr 6.6.1
using HDFS as the index.  The current index size is about 31TBytes. 
With 3x replication that takes up 93TBytes of disk. Our main collection
is split across 100 shards with 3 replicas each.  The issue that we're
running into is when restarting the solr6 cluster.  The shards go into
recovery and start to utilize nearly all of their network interfaces. 
If we start too many of the nodes at once, the shards will go into a
recovery, fail, and retry loop and never come up.  The errors are
related to HDFS not responding fast enough and warnings from the
DFSClient.  If we stop a node when this is happening, the script will
force a stop (180 second timeout) and upon restart, we have lock files
(write.lock) inside of HDFS.

The process at this point is to start one node, find the lock files,
wait for it to come up completely (hours), stop it, delete the
write.lock files, and restart.  Usually this second restart is faster,
but it still can take 20-60 minutes.

The smaller indexes recover much faster (less than 5 minutes). Should we
have not used so many replicas with HDFS?  Is there a better way we
should have built the solr6 cluster?

Thank you for any insight!

-Joe

Re: Recovery Issue - Solr 6.6.1 and HDFS

Erick Erickson
How are you stopping Solr? Nodes should not go into recovery on
startup unless Solr was killed un-gracefully (i.e. kill -9 or the
like). If you use the bin/solr script to stop Solr and see a message
about "killing XXX forcefully" then you can lengthen out the time the
script waits for shutdown (there's a sysvar you can set, look in the
script).
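
For reference, a minimal sketch of that knob (assuming a recent 6.x bin/solr
where the wait is controlled by SOLR_STOP_WAIT in solr.in.sh; check your own
copy of the script, since the variable may differ by version):

# in solr.in.sh (location varies by install, e.g. /etc/default/solr.in.sh)
SOLR_STOP_WAIT=300   # seconds to wait for graceful shutdown before force-killing (default 180)

# then always stop with the script rather than kill -9
bin/solr stop -all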

Actually I'll correct myself a bit. Shards _do_ go into recovery but
it should be very short in the graceful shutdown case. Basically
shards temporarily go into recovery as part of normal processing just
long enough to see there's no recovery necessary, but that should be
measured in a few seconds.

What it sounds like from this "The shards go into recovery and start
to utilize nearly all of their network" is that your nodes go into
"full recovery" where the entire index is copied down because the
replica thinks it's "too far" out of date. That indicates something
weird about the state when the Solr nodes stopped.

wild-shot-in-the-dark question. How big are your tlogs? If you don't
hard commit very often, the tlogs can replay at startup for a very
long time.

This makes no sense to me, I'm surely missing something:

The process at this point is to start one node, find out the lock
files, wait for it to come up completely (hours), stop it, delete the
write.lock files, and restart.  Usually this second restart is faster,
but it still can take 20-60 minutes.

When you start one node it may take a few minutes for leader election
to kick in (the default is 180 seconds), but absent replication it
should just be there. Taking hours totally violates my expectations.
What does Solr _think_ it's doing? What's in the logs at that point?
And if you stop solr gracefully, there shouldn't be a problem with
write.lock.

You could also try increasing the timeouts, and the HDFS directory
factory has some parameters to tweak that are a mystery to me...

All in all, this is behavior that I find mystifying.

Best,
Erick


Re: Recovery Issue - Solr 6.6.1 and HDFS

Joe Obernberger
Erick - thank you very much for the reply.  I'm still working through
restarting the nodes one by one.

I'm stopping the nodes with the script, but yes - they are being killed
forcefully because they are in this recovery, failed, retry loop.  I
could increase the timeout, but they never seem to recover.

The largest tlog file that I see currently is 222MBytes. autoCommit is
set to 1800000 ms (30 minutes) and autoSoftCommit to 120000 ms (2
minutes). Errors when they are in the long recovery are things like:

2017-11-20 21:41:29.755 ERROR
(recoveryExecutor-3-thread-4-processing-n:frodo:9100_solr
x:UNCLASS_shard37_replica1 s:shard37 c:UNCLASS r:core_node196)
[c:UNCLASS s:shard37 r:core_node196 x:UNCLASS_shard37_replica1]
o.a.s.h.IndexFetcher Error closing file: _8dmn.cfs
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/solr6.6.0/UNCLASS/core_node196/data/index.20171120195705961/_8dmn.cfs
could only be replicated to 0 nodes instead of minReplication (=1). 
There are 39 datanode(s) running and no node(s) are excluded in this
operation.
         at
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1716)
         at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3385)

Complete log is here for one of the shards that was forcefully stopped.
http://lovehorsepower.com/solr.log

As to what is in the logs when it is recovering for several hours, it's
many WARN messages from the DFSClient such as:
Abandoning
BP-1714598269-10.2.100.220-1341346046854:blk_4366207808_1103082741732
and
Excluding datanode
DatanodeInfoWithStorage[172.16.100.229:50010,DS-5985e40d-830a-44e7-a2ea-fc60bebabc30,DISK]

or from the IndexFetcher:

File _a96y.cfe did not match. expected checksum is 3502268220 and actual
is checksum 2563579651. expected length is 405 and actual length is 405

Unfortunately, I'm now getting errors from some of the nodes (still
working through restarting them) about ZooKeeper:

org.apache.solr.common.SolrException: Could not get leader props
     at
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
     at
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)
     at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)
     at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
     at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
     at
org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
     at
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
     at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
     at
org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
     at
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
     at
org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
     at
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
     at
org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
     at
org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)

Any idea what those could be?  Those shards are not coming back up.
Sorry for so many questions!

-Joe


Re: Recovery Issue - Solr 6.6.1 and HDFS

Hendrik Haddorp
In reply to this post by Joe Obernberger
Hi,

the write.lock issue I see as well when Solr has not been stopped
gracefully. The write.lock files are then left behind in HDFS, as they
do not get removed automatically when the client disconnects the way an
ephemeral node in ZooKeeper would be. Unfortunately Solr also does not
realize that it should be owning the lock, even though it is marked as
the owner in the state stored in ZooKeeper, and it is not willing to
retry, which is why you need to restart the whole Solr instance after
the cleanup. I added some logic to my Solr start-up script which scans
the lock files in HDFS, compares them with the state in ZooKeeper, and
then deletes all lock files that belong to the node that I'm starting.
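
For a purely manual cleanup, something like the following works (a sketch
assuming the hdfs CLI; /solr6.6.0 is the index root that appears in the log
snippet above, so adjust the path, and only remove locks while the owning
Solr node is down):

# list any leftover write.lock files under the Solr HDFS root
hdfs dfs -ls -R /solr6.6.0 | awk '/write\.lock$/ {print $NF}'

# review the list, then remove them (drop the "echo" once you are sure)
hdfs dfs -ls -R /solr6.6.0 | awk '/write\.lock$/ {print $NF}' \
  | xargs -r -n1 echo hdfs dfs -rm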

regards,
Hendrik


Re: Recovery Issue - Solr 6.6.1 and HDFS

Joe Obernberger
A clever idea.  Normally what we do when we need to do a restart is to
halt indexing and then wait about 30 minutes.  If we do not wait and
stop the cluster, the default script's 180-second timeout is not enough
and we'll have lock files to clean up.  We use puppet to start and stop
the nodes, but at this point that is not working well since we need to
start one node at a time.  With each one taking hours, this is a lengthy
process!  I'd love to see your script!

This new error is now coming up - see screen shot.  For some reason some
of the shards have no leader assigned:

http://lovehorsepower.com/SolrClusterErrors.jpg

-Joe



Re: Recovery Issue - Solr 6.6.1 and HDFS

Erick Erickson
In reply to this post by Hendrik Haddorp
Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)...

I need to back up a bit. Once nodes are in this state it's not
surprising that they need to be forcefully killed. I was more thinking
about how they got in this situation in the first place. _Before_ you
get into the nasty state how are the Solr nodes shut down? Forcefully?

Your hard commit is far longer than it needs to be, resulting in much
larger tlog files etc. I usually set this at 15-60 seconds with local
disks, not quite sure whether longer intervals are helpful on HDFS.
What this means is that you can spend up to 30 minutes when you
restart solr _replaying the tlogs_! If Solr is killed, it may not have
had a chance to fsync the segments and may have to replay on startup.
If you have openSearcher set to false, the hard commit operation is
not horribly expensive, it just fsync's the current segments and opens
new ones. It won't be a total cure, but I bet reducing this interval
would help a lot.
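
As a sketch of one way to do that (host, port, and the 60-second value are
placeholders; editing <autoCommit><maxTime> in solrconfig.xml and reloading
works just as well):

# lower the hard commit interval on the UNCLASS collection via the Config API
curl 'http://localhost:8983/solr/UNCLASS/config' \
  -H 'Content-Type: application/json' \
  -d '{ "set-property": { "updateHandler.autoCommit.maxTime": 60000 } }'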

Also, if you stop indexing there's no need to wait 30 minutes if you
issue a manual commit, something like
.../collection/update?commit=true. Just reducing the hard commit
interval will make the wait between stopping indexing and restarting
shorter all by itself if you don't want to issue the manual commit.
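
For example (host/port are placeholders; UNCLASS is the collection name from
the logs earlier in the thread):

# explicit hard commit before shutting the cluster down
curl 'http://localhost:8983/solr/UNCLASS/update?commit=true'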

Best,
Erick


Re: Recovery Issue - Solr 6.6.1 and HDFS

Joe Obernberger
We set the hard commit time long because we were having performance
issues with HDFS, and thought that since the block size is 128M, having
a longer hard commit made sense.  That was our hypothesis anyway.  Happy
to switch it back and see what happens.

I don't know what caused the cluster to go into recovery in the first
place.  We had a server die over the weekend, but it's just one out of
~50.  Every shard is 3x replicated (and 3x replicated in HDFS...so 9
copies).  It was at this point that we noticed lots of network
activity, with most of the shards stuck in this recovery, fail, retry
loop.  That is when we decided to shut it down, resulting in zombie lock
files.

I tried using the FORCELEADER call, which completed, but doesn't seem to
have any effect on the shards that have no leader.  Kinda out of ideas
for that problem.  If I can get the cluster back up, I'll try a lower
hard commit time.  Thanks again Erick!
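
For reference, the call looks something like this (host/port are
placeholders; shard21 is one of the shards the ZooKeeper error above showed
without a leader):

# Collections API FORCELEADER for a shard whose replicas are up but leaderless
curl 'http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=UNCLASS&shard=shard21'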

-Joe



Re: Recovery Issue - Solr 6.6.1 and HDFS

Hendrik Haddorp
In reply to this post by Joe Obernberger
Unfortunately I can not upload my cleanup code, but the steps I'm doing
are quite simple. I wrote it in Java using the HDFS API and Curator for
ZooKeeper. The steps are:
     - read the children of /collections in ZK so you know all the
collection names
     - read /collections/<collection name>/state.json to get the state
     - find the replicas in the state and keep those whose "node_name"
matches your local node (the node name is basically a combination of
your host name and the Solr port)
     - if the replica data has "dataDir" set, then you basically only
need to append "index/write.lock" to it and you have the lock location
     - if "dataDir" is not set (not really sure why), then you need to
construct it yourself: <hdfs base path>/<collection name>/<replica
name>/data/index/write.lock
     - if the lock file exists, delete it

I believe there is a small race condition in case you use replica auto
failover, so I try to keep the time between checking the state in
ZooKeeper and deleting the lock file as short as possible: rather than
first determining all lock file locations and only then deleting them, I
delete each lock while checking the state.
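
A rough shell sketch of the same steps (Hendrik's actual tool is Java with
the HDFS API and Curator; this version assumes jq, the hdfs CLI, and Solr's
zkcli.sh are available, and the ZooKeeper hosts, Solr port, and HDFS base
path are placeholders):

#!/usr/bin/env bash
# Delete stale write.lock files in HDFS for the replicas owned by this node.
# Collection names are passed as arguments; Hendrik enumerates them from the
# /collections children in ZooKeeper instead.
set -euo pipefail

ZK_HOST="zk1:2181,zk2:2181,zk3:2181"        # placeholder
NODE_NAME="$(hostname -f):9100_solr"        # must match node_name in state.json
HDFS_BASE="hdfs://nameservice1/solr6.6.0"   # <hdfs base path> used by HdfsDirectoryFactory
ZKCLI="server/scripts/cloud-scripts/zkcli.sh"

for coll in "$@"; do
  state="/tmp/${coll}-state.json"
  "$ZKCLI" -zkhost "$ZK_HOST" -cmd getfile "/collections/$coll/state.json" "$state"

  # one line per replica owned by this node: "<replica name><TAB><dataDir or ->"
  jq -r --arg n "$NODE_NAME" '
      .[] | .shards[] | .replicas | to_entries[]
      | select(.value.node_name == $n)
      | [.key, (.value.dataDir // "-")] | @tsv' "$state" |
  while IFS=$'\t' read -r replica dataDir; do
    if [ "$dataDir" != "-" ]; then
      lock="${dataDir%/}/index/write.lock"
    else
      lock="$HDFS_BASE/$coll/$replica/data/index/write.lock"
    fi
    # only delete while the node is stopped, to avoid pulling the lock out
    # from under a live core
    hdfs dfs -test -e "$lock" && hdfs dfs -rm "$lock" || true
  done
done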

regards,
Hendrik


Re: Recovery Issue - Solr 6.6.1 and HDFS

Hendrik Haddorp
In reply to this post by Joe Obernberger
We actually also have some performance issues with HDFS at the moment.
We are doing lots of soft commits for NRT search, and those seem to be
slower than with local storage. The investigation has not gotten very
far yet, however.

We have a setup with 2000 collections, with one shard each and a
replication factor of 2 or 3. When we restart nodes too fast, that
causes problems with the overseer queue, which can lead to the queue
getting out of control and Solr pretty much dying. We are still on Solr
6.3; 6.6 has some improvements and should handle these actions faster. I
would check what you see for
"/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The critical
part is the "overseer_queue_size" value. If this goes up to about 10000,
it is pretty much game over on our setup. In that case it seems to be
best to stop all nodes, clear the queue in ZK, and then restart the
nodes one by one with a gap of about 5 minutes. That normally recovers
pretty well.
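
For example (a sketch assuming curl and jq; host/port are placeholders):

curl -s 'http://localhost:8983/solr/admin/collections?action=OVERSEERSTATUS&wt=json' \
  | jq '.overseer_queue_size'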

regards,
Hendrik


Re: Recovery Issue - Solr 6.6.1 and HDFS

Joe Obernberger
We've never run an index this size in anything but HDFS, so I have no
comparison.  What we've been doing is keeping two main collections - all
data, and the last 30 days of data.  Then we handle queries based on
date range.  The 30 day index is significantly faster.

My main concern right now is that 6 of the 100 shards are not coming
back because of no leader.  I've never seen this error before.  Any
ideas?  ClusterStatus shows all three replicas with state 'down'.
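
For reference, a quick way to pull the down shards out of CLUSTERSTATUS (a
sketch assuming curl and jq; host/port are placeholders):

# list shards of UNCLASS that have at least one non-active replica
curl -s 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=UNCLASS&wt=json' \
  | jq '.cluster.collections.UNCLASS.shards | to_entries[]
        | {shard: .key,
           down: [.value.replicas | to_entries[]
                  | select(.value.state != "active") | .key]}
        | select((.down | length) > 0)'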

Thanks!

-joe



Re: Recovery Issue - Solr 6.6.1 and HDFS

Hendrik Haddorp
We sometimes also have replicas not recovering. If one replica is still
active, the easiest fix is to delete the failed replica and create a new
one. When all replicas are down, it helps most of the time to restart
one of the nodes that contains a replica in the down state. If that also
doesn't get the replica to recover, I would check the logs of that node
and also those of the overseer node. I have seen the same issue on Solr
using local storage. The main HDFS-related issues we have had so far are
those lock files, and that when you delete and recreate
collections/cores, the data sometimes is not cleaned up in HDFS and then
causes a conflict.
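
A sketch of that delete-and-recreate path via the Collections API (host/port
are placeholders, and the shard/replica names are borrowed from the log
snippet earlier in the thread purely as an example):

# drop the bad replica, then ask Solr to place a fresh one
curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=UNCLASS&shard=shard37&replica=core_node196'
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=UNCLASS&shard=shard37'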

Hendrik


Re: Recovery Issue - Solr 6.6.1 and HDFS

Erick Erickson
bq: We are doing lots of soft commits for NRT search...

It's not surprising that this is slower than local storage, especially
if you have any autowarming going on. Opening new searchers will need
to read data from disk for the new segments, and HDFS may be slower
here.

As far as the commit interval, an under-appreciated event is that when
RAMBufferSizeMB is exceeded (default 100M last I knew) new segments
are written _anyway_; they're just a little invisible. That is, the
segments_n file isn't updated even though the segments are closed, IIUC.
So that very long interval isn't helping with that problem, I don't
think....

Evidence to the contrary trumps my understanding of course.

About starting all these collections up at once and the Overseer
queue: I've seen this in similar situations. There are a _lot_ of
messages flying back and forth for each replica on startup, and the
Overseer processing was historically very inefficient, so the queue
could get into the hundreds of thousands; I've seen some pathological
situations where it's over 1M, and at that point bringing up Solr took a
very long time. SOLR-10524 (which is in Solr 6.6) made this a lot
better. There are still a lot of messages written in a case like yours,
but at least the Overseer has a much better chance to keep up.


Erick

On Tue, Nov 21, 2017 at 12:24 PM, Hendrik Haddorp
<[hidden email]> wrote:

> We sometimes also have replicas not recovering. If one replica is left
> active the easiest is to then to delete the replica and create a new one.
> When all replicas are down it helps most of the time to restart one of the
> nodes that contains a replica in down state. If that also doesn't get the
> replica to recover I would check the logs of the node and also that of the
> overseer node. I have seen the same issue on Solr using local storage. The
> main HDFS related issues we had so far was those lock files and if you
> delete and recreate collections/cores and it sometimes happens that the data
> was not cleaned up in HDFS and then causes a conflict.
>
> Hendrik
>
>
> On 21.11.2017 21:07, Joe Obernberger wrote:
>>
>> We've never run an index this size in anything but HDFS, so I have no
>> comparison.  What we've been doing is keeping two main collections - all
>> data, and the last 30 days of data.  Then we handle queries based on date
>> range.  The 30 day index is significantly faster.
>>
>> My main concern right now is that 6 of the 100 shards are not coming back
>> because of no leader.  I've never seen this error before.  Any ideas?
>> ClusterStatus shows all three replicas with state 'down'.
>>
>> Thanks!
>>
>> -joe
>>
>>
>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>>
>>> We actually also have some performance issue with HDFS at the moment. We
>>> are doing lots of soft commits for NRT search. Those seem to be slower then
>>> with local storage. The investigation is however not really far yet.
>>>
>>> We have a setup with 2000 collections, with one shard each and a
>>> replication factor of 2 or 3. When we restart nodes too fast that causes
>>> problems with the overseer queue, which can lead to the queue getting out of
>>> control and Solr pretty much dying. We are still on Solr 6.3. 6.6 has some
>>> improvements and should handle these actions faster. I would check what you
>>> see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The
>>> critical part is the "overseer_queue_size" value. If this goes up to about
>>> 10000 it is pretty much game over on our setup. In that case it seems to be
>>> best to stop all nodes, clear the queue in ZK and then restart the nodes one
>>> by one with a gap of like 5min. That normally recovers pretty well.
>>>
>>> regards,
>>> Hendrik
>>>
>>> On 21.11.2017 20:12, Joe Obernberger wrote:
>>>>
>>>> We set the hard commit time long because we were having performance
>>>> issues with HDFS, and thought that since the block size is 128M, having a
>>>> longer hard commit made sense.  That was our hypothesis anyway.  Happy to
>>>> switch it back and see what happens.
>>>>
>>>> I don't know what caused the cluster to go into recovery in the first
>>>> place.  We had a server die over the weekend, but it's just one out of ~50.
>>>> Every shard is 3x replicated (and 3x replicated in HDFS...so 9 copies).  It
>>>> was at this point that we noticed lots of network activity, and most of the
>>>> shards in this recovery, fail, retry loop.  That is when we decided to shut
>>>> it down resulting in zombie lock files.
>>>>
>>>> I tried using the FORCELEADER call, which completed, but doesn't seem to
>>>> have any effect on the shards that have no leader. Kinda out of ideas for
>>>> that problem.  If I can get the cluster back up, I'll try a lower hard
>>>> commit time.  Thanks again Erick!
>>>>
>>>> -Joe
>>>>
>>>>
>>>> On 11/21/2017 2:00 PM, Erick Erickson wrote:
>>>>>
>>>>> Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)...
>>>>>
>>>>> I need to back up a bit. Once nodes are in this state it's not
>>>>> surprising that they need to be forcefully killed. I was more thinking
>>>>> about how they got in this situation in the first place. _Before_ you
>>>>> get into the nasty state how are the Solr nodes shut down? Forcefully?
>>>>>
>>>>> Your hard commit is far longer than it needs to be, resulting in much
>>>>> larger tlog files etc. I usually set this at 15-60 seconds with local
>>>>> disks, not quite sure whether longer intervals are helpful on HDFS.
>>>>> What this means is that you can spend up to 30 minutes when you
>>>>> restart solr _replaying the tlogs_! If Solr is killed, it may not have
>>>>> had a chance to fsync the segments and may have to replay on startup.
>>>>> If you have openSearcher set to false, the hard commit operation is
>>>>> not horribly expensive, it just fsync's the current segments and opens
>>>>> new ones. It won't be a total cure, but I bet reducing this interval
>>>>> would help a lot.
>>>>>
>>>>> Also, if you stop indexing there's no need to wait 30 minutes if you
>>>>> issue a manual commit, something like
>>>>> .../collection/update?commit=true. Just reducing the hard commit
>>>>> interval will make the wait between stopping indexing and restarting
>>>>> shorter all by itself if you don't want to issue the manual commit.
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>> On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
>>>>> <[hidden email]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> the write.lock issue I see as well when Solr is not been stopped
>>>>>> gracefully.
>>>>>> The write.lock files are then left in the HDFS as they do not get
>>>>>> removed
>>>>>> automatically when the client disconnects like a ephemeral node in
>>>>>> ZooKeeper. Unfortunately Solr does also not realize that it should be
>>>>>> owning
>>>>>> the lock as it is marked in the state stored in ZooKeeper as the owner
>>>>>> and
>>>>>> is also not willing to retry, which is why you need to restart the
>>>>>> whole
>>>>>> Solr instance after the cleanup. I added some logic to my Solr start
>>>>>> up
>>>>>> script which scans the log files in HDFS and compares that with the
>>>>>> state in
>>>>>> ZooKeeper and then delete all lock files that belong to the node that
>>>>>> I'm
>>>>>> starting.
>>>>>>
>>>>>> regards,
>>>>>> Hendrik
>>>>>>
>>>>>>
>>>>>> On 21.11.2017 14:07, Joe Obernberger wrote:
>>>>>>>
>>>>>>> Hi All - we have a system with 45 physical boxes running solr 6.6.1
>>>>>>> using
>>>>>>> HDFS as the index.  The current index size is about 31TBytes. With 3x
>>>>>>> replication that takes up 93TBytes of disk. Our main collection is
>>>>>>> split
>>>>>>> across 100 shards with 3 replicas each.  The issue that we're running
>>>>>>> into
>>>>>>> is when restarting the solr6 cluster.  The shards go into recovery
>>>>>>> and start
>>>>>>> to utilize nearly all of their network interfaces.  If we start too
>>>>>>> many of
>>>>>>> the nodes at once, the shards will go into a recovery, fail, and
>>>>>>> retry loop
>>>>>>> and never come up.  The errors are related to HDFS not responding
>>>>>>> fast
>>>>>>> enough and warnings from the DFSClient.  If we stop a node when this
>>>>>>> is
>>>>>>> happening, the script will force a stop (180 second timeout) and upon
>>>>>>> restart, we have lock files (write.lock) inside of HDFS.
>>>>>>>
>>>>>>> The process at this point is to start one node, find out the lock
>>>>>>> files,
>>>>>>> wait for it to come up completely (hours), stop it, delete the
>>>>>>> write.lock
>>>>>>> files, and restart.  Usually this second restart is faster, but it
>>>>>>> still can
>>>>>>> take 20-60 minutes.
>>>>>>>
>>>>>>> The smaller indexes recover much faster (less than 5 minutes). Should
>>>>>>> we
>>>>>>> have not used so many replicas with HDFS?  Is there a better way we
>>>>>>> should
>>>>>>> have built the solr6 cluster?
>>>>>>>
>>>>>>> Thank you for any insight!
>>>>>>>
>>>>>>> -Joe
>>>>>>>
>>>>
>>>
>>
>

Re: Recovery Issue - Solr 6.6.1 and HDFS

Joe Obernberger
Thank you, Erick.  I've set the RamBufferSize to 1G; perhaps higher would
be beneficial.  One more data point: if I restart a node, more often than
not it goes into recovery, beats up the network for a while, and then
goes green, sometimes taking longer than 20 minutes.  This happens even
though no new data was added to the index between the restarts.  Is that
expected?
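
One thing I may try before the next restart is forcing a hard commit by
hand so nothing is left sitting in the tlogs.  Something along these
lines (the host, port and collection name are just placeholders for our
setup):

curl "http://solrhost:9100/solr/UNCLASS/update?commit=true&openSearcher=false"

If that returns cleanly and a restart still beats up the network, then
presumably the copying isn't tlog replay.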

-Joe


On 11/21/2017 3:43 PM, Erick Erickson wrote:

> bq: We are doing lots of soft commits for NRT search...
>
> It's not surprising that this is slower than local storage, especially
> if you have any autowarming going on. Opening  new searchers will need
> to read data from disk for the new segments, and HDFS may be slower
> here.
>
> As far as the commit interval, an under-appreciated event is that when
> RAMBufferSizeMB is exceeded (default 100M last I knew) new segments
> are written _anyway_, they're just a little invisible. That is, the
> segments_n file isn't updated even though they're closed IIUC at
> least. So that very long interval isn't helping with that problem I
> don't think....
>
> Evidence to the contrary trumps my understanding of course.
>
> About starting all these collections up at once and the Overseer
> queue. I've seen this in similar situations. There are a _lot_ of
> messages flying back and forth for each replica on startup, and the
> Overseer processing was very inefficient historically so that queue
> could get in the 100s of K, I've seen some pathological situations
> where it's over 1M. SOLR-10524 made this a lot better (that work is in
> the Solr 6.6 timeframe). There are still a lot of messages written in a
> case like yours, but at least the Overseer has a much better chance to
> keep up. Before that, bringing up Solr took a very long time.
>
>
> Erick
>
> On Tue, Nov 21, 2017 at 12:24 PM, Hendrik Haddorp
> <[hidden email]> wrote:
>> We sometimes also have replicas not recovering. If one replica is left
>> active the easiest is to then to delete the replica and create a new one.
>> When all replicas are down it helps most of the time to restart one of the
>> nodes that contains a replica in down state. If that also doesn't get the
>> replica to recover I would check the logs of the node and also that of the
>> overseer node. I have seen the same issue on Solr using local storage. The
>> main HDFS related issues we had so far was those lock files and if you
>> delete and recreate collections/cores and it sometimes happens that the data
>> was not cleaned up in HDFS and then causes a conflict.
>>
>> Hendrik
>>
>>
>> On 21.11.2017 21:07, Joe Obernberger wrote:
>>> We've never run an index this size in anything but HDFS, so I have no
>>> comparison.  What we've been doing is keeping two main collections - all
>>> data, and the last 30 days of data.  Then we handle queries based on date
>>> range.  The 30 day index is significantly faster.
>>>
>>> My main concern right now is that 6 of the 100 shards are not coming back
>>> because of no leader.  I've never seen this error before.  Any ideas?
>>> ClusterStatus shows all three replicas with state 'down'.
>>>
>>> Thanks!
>>>
>>> -joe
>>>
>>>
>>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>>> We actually also have some performance issue with HDFS at the moment. We
>>>> are doing lots of soft commits for NRT search. Those seem to be slower then
>>>> with local storage. The investigation is however not really far yet.
>>>>
>>>> We have a setup with 2000 collections, with one shard each and a
>>>> replication factor of 2 or 3. When we restart nodes too fast that causes
>>>> problems with the overseer queue, which can lead to the queue getting out of
>>>> control and Solr pretty much dying. We are still on Solr 6.3. 6.6 has some
>>>> improvements and should handle these actions faster. I would check what you
>>>> see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The
>>>> critical part is the "overseer_queue_size" value. If this goes up to about
>>>> 10000 it is pretty much game over on our setup. In that case it seems to be
>>>> best to stop all nodes, clear the queue in ZK and then restart the nodes one
>>>> by one with a gap of like 5min. That normally recovers pretty well.
>>>>
>>>> regards,
>>>> Hendrik
>>>>
>>>> On 21.11.2017 20:12, Joe Obernberger wrote:
>>>>> We set the hard commit time long because we were having performance
>>>>> issues with HDFS, and thought that since the block size is 128M, having a
>>>>> longer hard commit made sense.  That was our hypothesis anyway.  Happy to
>>>>> switch it back and see what happens.
>>>>>
>>>>> I don't know what caused the cluster to go into recovery in the first
>>>>> place.  We had a server die over the weekend, but it's just one out of ~50.
>>>>> Every shard is 3x replicated (and 3x replicated in HDFS...so 9 copies).  It
>>>>> was at this point that we noticed lots of network activity, and most of the
>>>>> shards in this recovery, fail, retry loop.  That is when we decided to shut
>>>>> it down resulting in zombie lock files.
>>>>>
>>>>> I tried using the FORCELEADER call, which completed, but doesn't seem to
>>>>> have any effect on the shards that have no leader. Kinda out of ideas for
>>>>> that problem.  If I can get the cluster back up, I'll try a lower hard
>>>>> commit time.  Thanks again Erick!
>>>>>
>>>>> -Joe
>>>>>
>>>>>
>>>>> On 11/21/2017 2:00 PM, Erick Erickson wrote:
>>>>>> Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)...
>>>>>>
>>>>>> I need to back up a bit. Once nodes are in this state it's not
>>>>>> surprising that they need to be forcefully killed. I was more thinking
>>>>>> about how they got in this situation in the first place. _Before_ you
>>>>>> get into the nasty state how are the Solr nodes shut down? Forcefully?
>>>>>>
>>>>>> Your hard commit is far longer than it needs to be, resulting in much
>>>>>> larger tlog files etc. I usually set this at 15-60 seconds with local
>>>>>> disks, not quite sure whether longer intervals are helpful on HDFS.
>>>>>> What this means is that you can spend up to 30 minutes when you
>>>>>> restart solr _replaying the tlogs_! If Solr is killed, it may not have
>>>>>> had a chance to fsync the segments and may have to replay on startup.
>>>>>> If you have openSearcher set to false, the hard commit operation is
>>>>>> not horribly expensive, it just fsync's the current segments and opens
>>>>>> new ones. It won't be a total cure, but I bet reducing this interval
>>>>>> would help a lot.
>>>>>>
>>>>>> Also, if you stop indexing there's no need to wait 30 minutes if you
>>>>>> issue a manual commit, something like
>>>>>> .../collection/update?commit=true. Just reducing the hard commit
>>>>>> interval will make the wait between stopping indexing and restarting
>>>>>> shorter all by itself if you don't want to issue the manual commit.
>>>>>>
>>>>>> Best,
>>>>>> Erick
>>>>>>
>>>>>> On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
>>>>>> <[hidden email]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> the write.lock issue I see as well when Solr is not been stopped
>>>>>>> gracefully.
>>>>>>> The write.lock files are then left in the HDFS as they do not get
>>>>>>> removed
>>>>>>> automatically when the client disconnects like a ephemeral node in
>>>>>>> ZooKeeper. Unfortunately Solr does also not realize that it should be
>>>>>>> owning
>>>>>>> the lock as it is marked in the state stored in ZooKeeper as the owner
>>>>>>> and
>>>>>>> is also not willing to retry, which is why you need to restart the
>>>>>>> whole
>>>>>>> Solr instance after the cleanup. I added some logic to my Solr start
>>>>>>> up
>>>>>>> script which scans the log files in HDFS and compares that with the
>>>>>>> state in
>>>>>>> ZooKeeper and then delete all lock files that belong to the node that
>>>>>>> I'm
>>>>>>> starting.
>>>>>>>
>>>>>>> regards,
>>>>>>> Hendrik
>>>>>>>
>>>>>>>
>>>>>>> On 21.11.2017 14:07, Joe Obernberger wrote:
>>>>>>>> Hi All - we have a system with 45 physical boxes running solr 6.6.1
>>>>>>>> using
>>>>>>>> HDFS as the index.  The current index size is about 31TBytes. With 3x
>>>>>>>> replication that takes up 93TBytes of disk. Our main collection is
>>>>>>>> split
>>>>>>>> across 100 shards with 3 replicas each.  The issue that we're running
>>>>>>>> into
>>>>>>>> is when restarting the solr6 cluster.  The shards go into recovery
>>>>>>>> and start
>>>>>>>> to utilize nearly all of their network interfaces.  If we start too
>>>>>>>> many of
>>>>>>>> the nodes at once, the shards will go into a recovery, fail, and
>>>>>>>> retry loop
>>>>>>>> and never come up.  The errors are related to HDFS not responding
>>>>>>>> fast
>>>>>>>> enough and warnings from the DFSClient.  If we stop a node when this
>>>>>>>> is
>>>>>>>> happening, the script will force a stop (180 second timeout) and upon
>>>>>>>> restart, we have lock files (write.lock) inside of HDFS.
>>>>>>>>
>>>>>>>> The process at this point is to start one node, find out the lock
>>>>>>>> files,
>>>>>>>> wait for it to come up completely (hours), stop it, delete the
>>>>>>>> write.lock
>>>>>>>> files, and restart.  Usually this second restart is faster, but it
>>>>>>>> still can
>>>>>>>> take 20-60 minutes.
>>>>>>>>
>>>>>>>> The smaller indexes recover much faster (less than 5 minutes). Should
>>>>>>>> we
>>>>>>>> have not used so many replicas with HDFS?  Is there a better way we
>>>>>>>> should
>>>>>>>> have built the solr6 cluster?
>>>>>>>>
>>>>>>>> Thank you for any insight!
>>>>>>>>
>>>>>>>> -Joe
>>>>>>>>


Re: Recovery Issue - Solr 6.6.1 and HDFS

Joe Obernberger
In reply to this post by Hendrik Haddorp
Hi Hendrik - the shards in question have three replicas.  I tried
restarting each one (one by one) - no luck.  No leader is found.  I
deleted one of the replicas and added a new one, and the new one also
shows as 'down'.  I also tried the FORCELEADER call, but that had no
effect.  I checked the OVERSEERSTATUS, but there is nothing unusual
there.  I don't see anything useful in the logs except the error:

org.apache.solr.common.SolrException: Error getting leader from zk for shard shard21
     at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:996)
     at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
     at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
     at org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Could not get leader props
     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)
     at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)
     ... 7 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
     at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
     at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
     at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
     at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
     at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)
     ... 9 more
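
For reference, this is how I'm confirming the missing znode, using the
stock zkCli.sh from the ZooKeeper distribution (the ZooKeeper host is a
placeholder, and depending on the state format the collection state may
be in /clusterstate.json instead of state.json):

/opt/zookeeper/bin/zkCli.sh -server zkhost:2181
ls /collections/UNCLASS/leaders/shard21
get /collections/UNCLASS/state.json

The leader znode for shard21 simply isn't there, which matches the
NoNodeException above.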

Can I modify zookeeper to force a leader?  Is there any other way to
recover from this?  Thanks very much!

-Joe


On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:

> We sometimes also have replicas not recovering. If one replica is left
> active the easiest is to then to delete the replica and create a new
> one. When all replicas are down it helps most of the time to restart
> one of the nodes that contains a replica in down state. If that also
> doesn't get the replica to recover I would check the logs of the node
> and also that of the overseer node. I have seen the same issue on Solr
> using local storage. The main HDFS related issues we had so far was
> those lock files and if you delete and recreate collections/cores and
> it sometimes happens that the data was not cleaned up in HDFS and then
> causes a conflict.
>
> Hendrik
>
> On 21.11.2017 21:07, Joe Obernberger wrote:
>> We've never run an index this size in anything but HDFS, so I have no
>> comparison.  What we've been doing is keeping two main collections -
>> all data, and the last 30 days of data.  Then we handle queries based
>> on date range. The 30 day index is significantly faster.
>>
>> My main concern right now is that 6 of the 100 shards are not coming
>> back because of no leader.  I've never seen this error before.  Any
>> ideas?  ClusterStatus shows all three replicas with state 'down'.
>>
>> Thanks!
>>
>> -joe
>>
>>
>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>> We actually also have some performance issue with HDFS at the
>>> moment. We are doing lots of soft commits for NRT search. Those seem
>>> to be slower then with local storage. The investigation is however
>>> not really far yet.
>>>
>>> We have a setup with 2000 collections, with one shard each and a
>>> replication factor of 2 or 3. When we restart nodes too fast that
>>> causes problems with the overseer queue, which can lead to the queue
>>> getting out of control and Solr pretty much dying. We are still on
>>> Solr 6.3. 6.6 has some improvements and should handle these actions
>>> faster. I would check what you see for
>>> "/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The
>>> critical part is the "overseer_queue_size" value. If this goes up to
>>> about 10000 it is pretty much game over on our setup. In that case
>>> it seems to be best to stop all nodes, clear the queue in ZK and
>>> then restart the nodes one by one with a gap of like 5min. That
>>> normally recovers pretty well.
>>>
>>> regards,
>>> Hendrik
>>>
>>> On 21.11.2017 20:12, Joe Obernberger wrote:
>>>> We set the hard commit time long because we were having performance
>>>> issues with HDFS, and thought that since the block size is 128M,
>>>> having a longer hard commit made sense.  That was our hypothesis
>>>> anyway. Happy to switch it back and see what happens.
>>>>
>>>> I don't know what caused the cluster to go into recovery in the
>>>> first place.  We had a server die over the weekend, but it's just
>>>> one out of ~50.  Every shard is 3x replicated (and 3x replicated in
>>>> HDFS...so 9 copies).  It was at this point that we noticed lots of
>>>> network activity, and most of the shards in this recovery, fail,
>>>> retry loop.  That is when we decided to shut it down resulting in
>>>> zombie lock files.
>>>>
>>>> I tried using the FORCELEADER call, which completed, but doesn't
>>>> seem to have any effect on the shards that have no leader. Kinda
>>>> out of ideas for that problem.  If I can get the cluster back up,
>>>> I'll try a lower hard commit time. Thanks again Erick!
>>>>
>>>> -Joe
>>>>
>>>>
>>>> On 11/21/2017 2:00 PM, Erick Erickson wrote:
>>>>> Frankly with HDFS I'm a bit out of my depth so listen to Hendrik
>>>>> ;)...
>>>>>
>>>>> I need to back up a bit. Once nodes are in this state it's not
>>>>> surprising that they need to be forcefully killed. I was more
>>>>> thinking
>>>>> about how they got in this situation in the first place. _Before_ you
>>>>> get into the nasty state how are the Solr nodes shut down?
>>>>> Forcefully?
>>>>>
>>>>> Your hard commit is far longer than it needs to be, resulting in much
>>>>> larger tlog files etc. I usually set this at 15-60 seconds with local
>>>>> disks, not quite sure whether longer intervals are helpful on HDFS.
>>>>> What this means is that you can spend up to 30 minutes when you
>>>>> restart solr _replaying the tlogs_! If Solr is killed, it may not
>>>>> have
>>>>> had a chance to fsync the segments and may have to replay on startup.
>>>>> If you have openSearcher set to false, the hard commit operation is
>>>>> not horribly expensive, it just fsync's the current segments and
>>>>> opens
>>>>> new ones. It won't be a total cure, but I bet reducing this interval
>>>>> would help a lot.
>>>>>
>>>>> Also, if you stop indexing there's no need to wait 30 minutes if you
>>>>> issue a manual commit, something like
>>>>> .../collection/update?commit=true. Just reducing the hard commit
>>>>> interval will make the wait between stopping indexing and restarting
>>>>> shorter all by itself if you don't want to issue the manual commit.
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>> On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
>>>>> <[hidden email]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> the write.lock issue I see as well when Solr is not been stopped
>>>>>> gracefully.
>>>>>> The write.lock files are then left in the HDFS as they do not get
>>>>>> removed
>>>>>> automatically when the client disconnects like a ephemeral node in
>>>>>> ZooKeeper. Unfortunately Solr does also not realize that it
>>>>>> should be owning
>>>>>> the lock as it is marked in the state stored in ZooKeeper as the
>>>>>> owner and
>>>>>> is also not willing to retry, which is why you need to restart
>>>>>> the whole
>>>>>> Solr instance after the cleanup. I added some logic to my Solr
>>>>>> start up
>>>>>> script which scans the log files in HDFS and compares that with
>>>>>> the state in
>>>>>> ZooKeeper and then delete all lock files that belong to the node
>>>>>> that I'm
>>>>>> starting.
>>>>>>
>>>>>> regards,
>>>>>> Hendrik
>>>>>>
>>>>>>
>>>>>> On 21.11.2017 14:07, Joe Obernberger wrote:
>>>>>>> Hi All - we have a system with 45 physical boxes running solr
>>>>>>> 6.6.1 using
>>>>>>> HDFS as the index.  The current index size is about 31TBytes.
>>>>>>> With 3x
>>>>>>> replication that takes up 93TBytes of disk. Our main collection
>>>>>>> is split
>>>>>>> across 100 shards with 3 replicas each.  The issue that we're
>>>>>>> running into
>>>>>>> is when restarting the solr6 cluster.  The shards go into
>>>>>>> recovery and start
>>>>>>> to utilize nearly all of their network interfaces.  If we start
>>>>>>> too many of
>>>>>>> the nodes at once, the shards will go into a recovery, fail, and
>>>>>>> retry loop
>>>>>>> and never come up.  The errors are related to HDFS not
>>>>>>> responding fast
>>>>>>> enough and warnings from the DFSClient.  If we stop a node when
>>>>>>> this is
>>>>>>> happening, the script will force a stop (180 second timeout) and
>>>>>>> upon
>>>>>>> restart, we have lock files (write.lock) inside of HDFS.
>>>>>>>
>>>>>>> The process at this point is to start one node, find out the
>>>>>>> lock files,
>>>>>>> wait for it to come up completely (hours), stop it, delete the
>>>>>>> write.lock
>>>>>>> files, and restart.  Usually this second restart is faster, but
>>>>>>> it still can
>>>>>>> take 20-60 minutes.
>>>>>>>
>>>>>>> The smaller indexes recover much faster (less than 5 minutes).
>>>>>>> Should we
>>>>>>> have not used so many replicas with HDFS?  Is there a better way
>>>>>>> we should
>>>>>>> have built the solr6 cluster?
>>>>>>>
>>>>>>> Thank you for any insight!
>>>>>>>
>>>>>>> -Joe
>>>>>>>
>>>>
>>>
>>
>


Re: Recovery Issue - Solr 6.6.1 and HDFS

Joe Obernberger
In reply to this post by Hendrik Haddorp
One other data point I just saw on one of the nodes.  It has the
following error:
2017-11-21 22:59:48.886 ERROR (coreZkRegister-1-thread-1-processing-n:leda:9100_solr) [c:UNCLASS s:shard14 r:core_node175 x:UNCLASS_shard14_replica3] o.a.s.c.ShardLeaderElectionContext There was a problem trying to register as the leader:org.apache.solr.common.SolrException: Leader Initiated Recovery prevented leadership
         at org.apache.solr.cloud.ShardLeaderElectionContext.checkLIR(ElectionContext.java:521)
         at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:424)
         at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
         at org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:684)
         at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:454)
         at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
         at org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:684)
         at org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:454)
         at org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
         at org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
         at org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)

This stack trace repeats for a long while; looks like a recursive call.
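
If I'm reading that right, leftover "leader initiated recovery" markers
in ZooKeeper are blocking the election.  I'm guessing they live under a
path like the one below (just my guess at the layout), so I'll take a
look there with zkCli:

ls /collections/UNCLASS/leader_initiated_recovery/shard14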

-Joe


On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:

> We sometimes also have replicas not recovering. If one replica is left
> active the easiest is to then to delete the replica and create a new
> one. When all replicas are down it helps most of the time to restart
> one of the nodes that contains a replica in down state. If that also
> doesn't get the replica to recover I would check the logs of the node
> and also that of the overseer node. I have seen the same issue on Solr
> using local storage. The main HDFS related issues we had so far was
> those lock files and if you delete and recreate collections/cores and
> it sometimes happens that the data was not cleaned up in HDFS and then
> causes a conflict.
>
> Hendrik
>
> On 21.11.2017 21:07, Joe Obernberger wrote:
>> We've never run an index this size in anything but HDFS, so I have no
>> comparison.  What we've been doing is keeping two main collections -
>> all data, and the last 30 days of data.  Then we handle queries based
>> on date range. The 30 day index is significantly faster.
>>
>> My main concern right now is that 6 of the 100 shards are not coming
>> back because of no leader.  I've never seen this error before.  Any
>> ideas?  ClusterStatus shows all three replicas with state 'down'.
>>
>> Thanks!
>>
>> -joe
>>
>>
>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>> We actually also have some performance issue with HDFS at the
>>> moment. We are doing lots of soft commits for NRT search. Those seem
>>> to be slower then with local storage. The investigation is however
>>> not really far yet.
>>>
>>> We have a setup with 2000 collections, with one shard each and a
>>> replication factor of 2 or 3. When we restart nodes too fast that
>>> causes problems with the overseer queue, which can lead to the queue
>>> getting out of control and Solr pretty much dying. We are still on
>>> Solr 6.3. 6.6 has some improvements and should handle these actions
>>> faster. I would check what you see for
>>> "/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The
>>> critical part is the "overseer_queue_size" value. If this goes up to
>>> about 10000 it is pretty much game over on our setup. In that case
>>> it seems to be best to stop all nodes, clear the queue in ZK and
>>> then restart the nodes one by one with a gap of like 5min. That
>>> normally recovers pretty well.
>>>
>>> regards,
>>> Hendrik
>>>
>>> On 21.11.2017 20:12, Joe Obernberger wrote:
>>>> We set the hard commit time long because we were having performance
>>>> issues with HDFS, and thought that since the block size is 128M,
>>>> having a longer hard commit made sense.  That was our hypothesis
>>>> anyway. Happy to switch it back and see what happens.
>>>>
>>>> I don't know what caused the cluster to go into recovery in the
>>>> first place.  We had a server die over the weekend, but it's just
>>>> one out of ~50.  Every shard is 3x replicated (and 3x replicated in
>>>> HDFS...so 9 copies).  It was at this point that we noticed lots of
>>>> network activity, and most of the shards in this recovery, fail,
>>>> retry loop.  That is when we decided to shut it down resulting in
>>>> zombie lock files.
>>>>
>>>> I tried using the FORCELEADER call, which completed, but doesn't
>>>> seem to have any effect on the shards that have no leader. Kinda
>>>> out of ideas for that problem.  If I can get the cluster back up,
>>>> I'll try a lower hard commit time. Thanks again Erick!
>>>>
>>>> -Joe
>>>>
>>>>
>>>> On 11/21/2017 2:00 PM, Erick Erickson wrote:
>>>>> Frankly with HDFS I'm a bit out of my depth so listen to Hendrik
>>>>> ;)...
>>>>>
>>>>> I need to back up a bit. Once nodes are in this state it's not
>>>>> surprising that they need to be forcefully killed. I was more
>>>>> thinking
>>>>> about how they got in this situation in the first place. _Before_ you
>>>>> get into the nasty state how are the Solr nodes shut down?
>>>>> Forcefully?
>>>>>
>>>>> Your hard commit is far longer than it needs to be, resulting in much
>>>>> larger tlog files etc. I usually set this at 15-60 seconds with local
>>>>> disks, not quite sure whether longer intervals are helpful on HDFS.
>>>>> What this means is that you can spend up to 30 minutes when you
>>>>> restart solr _replaying the tlogs_! If Solr is killed, it may not
>>>>> have
>>>>> had a chance to fsync the segments and may have to replay on startup.
>>>>> If you have openSearcher set to false, the hard commit operation is
>>>>> not horribly expensive, it just fsync's the current segments and
>>>>> opens
>>>>> new ones. It won't be a total cure, but I bet reducing this interval
>>>>> would help a lot.
>>>>>
>>>>> Also, if you stop indexing there's no need to wait 30 minutes if you
>>>>> issue a manual commit, something like
>>>>> .../collection/update?commit=true. Just reducing the hard commit
>>>>> interval will make the wait between stopping indexing and restarting
>>>>> shorter all by itself if you don't want to issue the manual commit.
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>> On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
>>>>> <[hidden email]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> the write.lock issue I see as well when Solr is not been stopped
>>>>>> gracefully.
>>>>>> The write.lock files are then left in the HDFS as they do not get
>>>>>> removed
>>>>>> automatically when the client disconnects like a ephemeral node in
>>>>>> ZooKeeper. Unfortunately Solr does also not realize that it
>>>>>> should be owning
>>>>>> the lock as it is marked in the state stored in ZooKeeper as the
>>>>>> owner and
>>>>>> is also not willing to retry, which is why you need to restart
>>>>>> the whole
>>>>>> Solr instance after the cleanup. I added some logic to my Solr
>>>>>> start up
>>>>>> script which scans the log files in HDFS and compares that with
>>>>>> the state in
>>>>>> ZooKeeper and then delete all lock files that belong to the node
>>>>>> that I'm
>>>>>> starting.
>>>>>>
>>>>>> regards,
>>>>>> Hendrik
>>>>>>
>>>>>>
>>>>>> On 21.11.2017 14:07, Joe Obernberger wrote:
>>>>>>> Hi All - we have a system with 45 physical boxes running solr
>>>>>>> 6.6.1 using
>>>>>>> HDFS as the index.  The current index size is about 31TBytes.
>>>>>>> With 3x
>>>>>>> replication that takes up 93TBytes of disk. Our main collection
>>>>>>> is split
>>>>>>> across 100 shards with 3 replicas each.  The issue that we're
>>>>>>> running into
>>>>>>> is when restarting the solr6 cluster.  The shards go into
>>>>>>> recovery and start
>>>>>>> to utilize nearly all of their network interfaces.  If we start
>>>>>>> too many of
>>>>>>> the nodes at once, the shards will go into a recovery, fail, and
>>>>>>> retry loop
>>>>>>> and never come up.  The errors are related to HDFS not
>>>>>>> responding fast
>>>>>>> enough and warnings from the DFSClient.  If we stop a node when
>>>>>>> this is
>>>>>>> happening, the script will force a stop (180 second timeout) and
>>>>>>> upon
>>>>>>> restart, we have lock files (write.lock) inside of HDFS.
>>>>>>>
>>>>>>> The process at this point is to start one node, find out the
>>>>>>> lock files,
>>>>>>> wait for it to come up completely (hours), stop it, delete the
>>>>>>> write.lock
>>>>>>> files, and restart.  Usually this second restart is faster, but
>>>>>>> it still can
>>>>>>> take 20-60 minutes.
>>>>>>>
>>>>>>> The smaller indexes recover much faster (less than 5 minutes).
>>>>>>> Should we
>>>>>>> have not used so many replicas with HDFS?  Is there a better way
>>>>>>> we should
>>>>>>> have built the solr6 cluster?
>>>>>>>
>>>>>>> Thank you for any insight!
>>>>>>>
>>>>>>> -Joe
>>>>>>>
>>>>
>>>
>>
>


Re: Recovery Issue - Solr 6.6.1 and HDFS

Erick Erickson
Well, you can always manually change the ZK nodes, but whether just
setting a node's state to "leader" in ZK and then starting the Solr
instance hosting that node would work... I don't know. Do consider
running CheckIndex on one of the replicas in question first though.
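
Since the index lives in HDFS you'd probably have to pull a copy down to
local disk first.  Roughly something like this (the paths and jar
version are placeholders, untested):

hdfs dfs -copyToLocal /solr6.6.0/UNCLASS/core_node175/data/index /tmp/checkidx
java -cp lucene-core-6.6.1.jar org.apache.lucene.index.CheckIndex /tmp/checkidx

Don't pass -exorcise unless you're prepared to lose whatever documents
are in any broken segments.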

Best,
Erick

On Tue, Nov 21, 2017 at 3:06 PM, Joe Obernberger
<[hidden email]> wrote:

> One other data point I just saw on one of the nodes.  It has the following
> error:
> 2017-11-21 22:59:48.886 ERROR
> (coreZkRegister-1-thread-1-processing-n:leda:9100_solr) [c:UNCLASS s:shard14
> r:core_node175 x:UNCLASS_shard14_replica3]
> o.a.s.c.ShardLeaderElectionContext There was a problem trying to register as
> the leader:org.apache.solr.common.SolrException: Leader Initiated Recovery
> prevented leadership
>         at
> org.apache.solr.cloud.ShardLeaderElectionContext.checkLIR(ElectionContext.java:521)
>         at
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:424)
>         at
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
>         at
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
>         at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
>         at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
>         at
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:684)
>         at
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:454)
>         at
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
>         at
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
>         at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
>         at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
>         at
> org.apache.solr.cloud.ShardLeaderElectionContext.rejoinLeaderElection(ElectionContext.java:684)
>         at
> org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:454)
>         at
> org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:170)
>         at
> org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:135)
>         at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:307)
>         at
> org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:216)
>
> This stack trace repeats for a long while; looks like a recursive call.
>
>
> -Joe
>
>
> On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
>>
>> We sometimes also have replicas not recovering. If one replica is left
>> active the easiest is to then to delete the replica and create a new one.
>> When all replicas are down it helps most of the time to restart one of the
>> nodes that contains a replica in down state. If that also doesn't get the
>> replica to recover I would check the logs of the node and also that of the
>> overseer node. I have seen the same issue on Solr using local storage. The
>> main HDFS related issues we had so far was those lock files and if you
>> delete and recreate collections/cores and it sometimes happens that the data
>> was not cleaned up in HDFS and then causes a conflict.
>>
>> Hendrik
>>
>> On 21.11.2017 21:07, Joe Obernberger wrote:
>>>
>>> We've never run an index this size in anything but HDFS, so I have no
>>> comparison.  What we've been doing is keeping two main collections - all
>>> data, and the last 30 days of data.  Then we handle queries based on date
>>> range. The 30 day index is significantly faster.
>>>
>>> My main concern right now is that 6 of the 100 shards are not coming back
>>> because of no leader.  I've never seen this error before.  Any ideas?
>>> ClusterStatus shows all three replicas with state 'down'.
>>>
>>> Thanks!
>>>
>>> -joe
>>>
>>>
>>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>>>
>>>> We actually also have some performance issue with HDFS at the moment. We
>>>> are doing lots of soft commits for NRT search. Those seem to be slower then
>>>> with local storage. The investigation is however not really far yet.
>>>>
>>>> We have a setup with 2000 collections, with one shard each and a
>>>> replication factor of 2 or 3. When we restart nodes too fast that causes
>>>> problems with the overseer queue, which can lead to the queue getting out of
>>>> control and Solr pretty much dying. We are still on Solr 6.3. 6.6 has some
>>>> improvements and should handle these actions faster. I would check what you
>>>> see for "/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The
>>>> critical part is the "overseer_queue_size" value. If this goes up to about
>>>> 10000 it is pretty much game over on our setup. In that case it seems to be
>>>> best to stop all nodes, clear the queue in ZK and then restart the nodes one
>>>> by one with a gap of like 5min. That normally recovers pretty well.
>>>>
>>>> regards,
>>>> Hendrik
>>>>
>>>> On 21.11.2017 20:12, Joe Obernberger wrote:
>>>>>
>>>>> We set the hard commit time long because we were having performance
>>>>> issues with HDFS, and thought that since the block size is 128M, having a
>>>>> longer hard commit made sense.  That was our hypothesis anyway. Happy to
>>>>> switch it back and see what happens.
>>>>>
>>>>> I don't know what caused the cluster to go into recovery in the first
>>>>> place.  We had a server die over the weekend, but it's just one out of ~50.
>>>>> Every shard is 3x replicated (and 3x replicated in HDFS...so 9 copies).  It
>>>>> was at this point that we noticed lots of network activity, and most of the
>>>>> shards in this recovery, fail, retry loop.  That is when we decided to shut
>>>>> it down resulting in zombie lock files.
>>>>>
>>>>> I tried using the FORCELEADER call, which completed, but doesn't seem
>>>>> to have any effect on the shards that have no leader. Kinda out of ideas for
>>>>> that problem.  If I can get the cluster back up, I'll try a lower hard
>>>>> commit time. Thanks again Erick!
>>>>>
>>>>> -Joe
>>>>>
>>>>>
>>>>> On 11/21/2017 2:00 PM, Erick Erickson wrote:
>>>>>>
>>>>>> Frankly with HDFS I'm a bit out of my depth so listen to Hendrik ;)...
>>>>>>
>>>>>> I need to back up a bit. Once nodes are in this state it's not
>>>>>> surprising that they need to be forcefully killed. I was more thinking
>>>>>> about how they got in this situation in the first place. _Before_ you
>>>>>> get into the nasty state how are the Solr nodes shut down? Forcefully?
>>>>>>
>>>>>> Your hard commit is far longer than it needs to be, resulting in much
>>>>>> larger tlog files etc. I usually set this at 15-60 seconds with local
>>>>>> disks, not quite sure whether longer intervals are helpful on HDFS.
>>>>>> What this means is that you can spend up to 30 minutes when you
>>>>>> restart solr _replaying the tlogs_! If Solr is killed, it may not have
>>>>>> had a chance to fsync the segments and may have to replay on startup.
>>>>>> If you have openSearcher set to false, the hard commit operation is
>>>>>> not horribly expensive, it just fsync's the current segments and opens
>>>>>> new ones. It won't be a total cure, but I bet reducing this interval
>>>>>> would help a lot.
>>>>>>
>>>>>> Also, if you stop indexing there's no need to wait 30 minutes if you
>>>>>> issue a manual commit, something like
>>>>>> .../collection/update?commit=true. Just reducing the hard commit
>>>>>> interval will make the wait between stopping indexing and restarting
>>>>>> shorter all by itself if you don't want to issue the manual commit.
>>>>>>
>>>>>> Best,
>>>>>> Erick
>>>>>>
>>>>>> On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
>>>>>> <[hidden email]> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> the write.lock issue I see as well when Solr is not been stopped
>>>>>>> gracefully.
>>>>>>> The write.lock files are then left in the HDFS as they do not get
>>>>>>> removed
>>>>>>> automatically when the client disconnects like a ephemeral node in
>>>>>>> ZooKeeper. Unfortunately Solr does also not realize that it should be
>>>>>>> owning
>>>>>>> the lock as it is marked in the state stored in ZooKeeper as the
>>>>>>> owner and
>>>>>>> is also not willing to retry, which is why you need to restart the
>>>>>>> whole
>>>>>>> Solr instance after the cleanup. I added some logic to my Solr start
>>>>>>> up
>>>>>>> script which scans the log files in HDFS and compares that with the
>>>>>>> state in
>>>>>>> ZooKeeper and then delete all lock files that belong to the node that
>>>>>>> I'm
>>>>>>> starting.
>>>>>>>
>>>>>>> regards,
>>>>>>> Hendrik
>>>>>>>
>>>>>>>
>>>>>>> On 21.11.2017 14:07, Joe Obernberger wrote:
>>>>>>>>
>>>>>>>> Hi All - we have a system with 45 physical boxes running solr 6.6.1
>>>>>>>> using
>>>>>>>> HDFS as the index.  The current index size is about 31TBytes. With
>>>>>>>> 3x
>>>>>>>> replication that takes up 93TBytes of disk. Our main collection is
>>>>>>>> split
>>>>>>>> across 100 shards with 3 replicas each.  The issue that we're
>>>>>>>> running into
>>>>>>>> is when restarting the solr6 cluster.  The shards go into recovery
>>>>>>>> and start
>>>>>>>> to utilize nearly all of their network interfaces.  If we start too
>>>>>>>> many of
>>>>>>>> the nodes at once, the shards will go into a recovery, fail, and
>>>>>>>> retry loop
>>>>>>>> and never come up.  The errors are related to HDFS not responding
>>>>>>>> fast
>>>>>>>> enough and warnings from the DFSClient.  If we stop a node when this
>>>>>>>> is
>>>>>>>> happening, the script will force a stop (180 second timeout) and
>>>>>>>> upon
>>>>>>>> restart, we have lock files (write.lock) inside of HDFS.
>>>>>>>>
>>>>>>>> The process at this point is to start one node, find out the lock
>>>>>>>> files,
>>>>>>>> wait for it to come up completely (hours), stop it, delete the
>>>>>>>> write.lock
>>>>>>>> files, and restart.  Usually this second restart is faster, but it
>>>>>>>> still can
>>>>>>>> take 20-60 minutes.
>>>>>>>>
>>>>>>>> The smaller indexes recover much faster (less than 5 minutes).
>>>>>>>> Should we
>>>>>>>> have not used so many replicas with HDFS?  Is there a better way we
>>>>>>>> should
>>>>>>>> have built the solr6 cluster?
>>>>>>>>
>>>>>>>> Thank you for any insight!
>>>>>>>>
>>>>>>>> -Joe
>>>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Recovery Issue - Solr 6.6.1 and HDFS

Hendrik Haddorp
In reply to this post by Joe Obernberger
Hi Joe,

sorry, I have not seen that problem. I would normally not delete a
replica when the whole shard is down, only when there is still an active
replica left. Without an active leader the replica should not be able to
recover. I also just had a case where all replicas of a shard stayed in
the down state and restarts didn't help. That was however also caused by
lock files. Once I cleaned them up and restarted all Solr instances that
had a replica, they recovered.

For the lock files I discovered that the index is not always in the
"index" folder but can also be in an index.<timestamp> folder. There can
be an "index.properties" file in the "data" directory in HDFS and this
contains the correct index folder name.
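
You can check that directly in HDFS, for example with something like
this (adjust the path to your directory layout):

hdfs dfs -cat /solr/<collection>/<core_node>/data/index.properties

If it contains a line like "index=index.<timestamp>", that is the folder
the write.lock has to be removed from, not the plain "index" folder.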

If you are really desperate you could also delete all but one replica so
that the leader election becomes trivial. But that does of course
increase the risk of losing the data for good quite a bit. So I would
rather look into the code, figure out what the problem is here, and
maybe compare the state in HDFS and ZK with a shard that works.

regards,
Hendrik

On 21.11.2017 23:57, Joe Obernberger wrote:

> Hi Hendrick - the shards in question have three replicas.  I tried
> restarting each one (one by one) - no luck.  No leader is found. I
> deleted one of the replicas and added a new one, and the new one also
> shows as 'down'.  I also tried the FORCELEADER call, but that had no
> effect.  I checked the OVERSEERSTATUS, but there is nothing unusual
> there.  I don't see anything useful in the logs except the error:
>
> org.apache.solr.common.SolrException: Error getting leader from zk for
> shard shard21
>     at
> org.apache.solr.cloud.ZkController.getLeader(ZkController.java:996)
>     at org.apache.solr.cloud.ZkController.register(ZkController.java:902)
>     at org.apache.solr.cloud.ZkController.register(ZkController.java:846)
>     at
> org.apache.solr.core.ZkContainer.lambda$registerInZk$0(ZkContainer.java:181)
>     at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>     at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.solr.common.SolrException: Could not get leader
> props
>     at
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1043)
>     at
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1007)
>     at
> org.apache.solr.cloud.ZkController.getLeader(ZkController.java:963)
>     ... 7 more
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException:
> KeeperErrorCode = NoNode for /collections/UNCLASS/leaders/shard21/leader
>     at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>     at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151)
>     at
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:357)
>     at
> org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:354)
>     at
> org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:60)
>     at
> org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:354)
>     at
> org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:1021)
>     ... 9 more
>
> Can I modify zookeeper to force a leader?  Is there any other way to
> recover from this?  Thanks very much!
>
> -Joe
>
>
> On 11/21/2017 3:24 PM, Hendrik Haddorp wrote:
>> We sometimes also have replicas not recovering. If one replica is
>> left active the easiest is to then to delete the replica and create a
>> new one. When all replicas are down it helps most of the time to
>> restart one of the nodes that contains a replica in down state. If
>> that also doesn't get the replica to recover I would check the logs
>> of the node and also that of the overseer node. I have seen the same
>> issue on Solr using local storage. The main HDFS related issues we
>> had so far was those lock files and if you delete and recreate
>> collections/cores and it sometimes happens that the data was not
>> cleaned up in HDFS and then causes a conflict.
>>
>> Hendrik
>>
>> On 21.11.2017 21:07, Joe Obernberger wrote:
>>> We've never run an index this size in anything but HDFS, so I have
>>> no comparison.  What we've been doing is keeping two main
>>> collections - all data, and the last 30 days of data.  Then we
>>> handle queries based on date range. The 30 day index is
>>> significantly faster.
>>>
>>> My main concern right now is that 6 of the 100 shards are not coming
>>> back because of no leader.  I've never seen this error before.  Any
>>> ideas?  ClusterStatus shows all three replicas with state 'down'.
>>>
>>> Thanks!
>>>
>>> -joe
>>>
>>>
>>> On 11/21/2017 2:35 PM, Hendrik Haddorp wrote:
>>>> We actually also have some performance issue with HDFS at the
>>>> moment. We are doing lots of soft commits for NRT search. Those
>>>> seem to be slower then with local storage. The investigation is
>>>> however not really far yet.
>>>>
>>>> We have a setup with 2000 collections, with one shard each and a
>>>> replication factor of 2 or 3. When we restart nodes too fast that
>>>> causes problems with the overseer queue, which can lead to the
>>>> queue getting out of control and Solr pretty much dying. We are
>>>> still on Solr 6.3. 6.6 has some improvements and should handle
>>>> these actions faster. I would check what you see for
>>>> "/solr/admin/collections?action=OVERSEERSTATUS&wt=json". The
>>>> critical part is the "overseer_queue_size" value. If this goes up
>>>> to about 10000 it is pretty much game over on our setup. In that
>>>> case it seems to be best to stop all nodes, clear the queue in ZK
>>>> and then restart the nodes one by one with a gap of like 5min. That
>>>> normally recovers pretty well.
>>>>
>>>> regards,
>>>> Hendrik
>>>>
>>>> On 21.11.2017 20:12, Joe Obernberger wrote:
>>>>> We set the hard commit time long because we were having
>>>>> performance issues with HDFS, and thought that since the block
>>>>> size is 128M, having a longer hard commit made sense.  That was
>>>>> our hypothesis anyway. Happy to switch it back and see what happens.
>>>>>
>>>>> I don't know what caused the cluster to go into recovery in the
>>>>> first place.  We had a server die over the weekend, but it's just
>>>>> one out of ~50.  Every shard is 3x replicated (and 3x replicated
>>>>> in HDFS...so 9 copies).  It was at this point that we noticed lots
>>>>> of network activity, and most of the shards in this recovery,
>>>>> fail, retry loop.  That is when we decided to shut it down
>>>>> resulting in zombie lock files.
>>>>>
>>>>> I tried using the FORCELEADER call, which completed, but doesn't
>>>>> seem to have any effect on the shards that have no leader. Kinda
>>>>> out of ideas for that problem.  If I can get the cluster back up,
>>>>> I'll try a lower hard commit time. Thanks again Erick!
>>>>>
>>>>> -Joe
>>>>>
>>>>>
>>>>> On 11/21/2017 2:00 PM, Erick Erickson wrote:
>>>>>> Frankly with HDFS I'm a bit out of my depth so listen to Hendrik
>>>>>> ;)...
>>>>>>
>>>>>> I need to back up a bit. Once nodes are in this state it's not
>>>>>> surprising that they need to be forcefully killed. I was more
>>>>>> thinking
>>>>>> about how they got in this situation in the first place. _Before_
>>>>>> you
>>>>>> get into the nasty state how are the Solr nodes shut down?
>>>>>> Forcefully?
>>>>>>
>>>>>> Your hard commit is far longer than it needs to be, resulting in
>>>>>> much
>>>>>> larger tlog files etc. I usually set this at 15-60 seconds with
>>>>>> local
>>>>>> disks, not quite sure whether longer intervals are helpful on HDFS.
>>>>>> What this means is that you can spend up to 30 minutes when you
>>>>>> restart solr _replaying the tlogs_! If Solr is killed, it may not
>>>>>> have
>>>>>> had a chance to fsync the segments and may have to replay on
>>>>>> startup.
>>>>>> If you have openSearcher set to false, the hard commit operation is
>>>>>> not horribly expensive, it just fsync's the current segments and
>>>>>> opens
>>>>>> new ones. It won't be a total cure, but I bet reducing this interval
>>>>>> would help a lot.
>>>>>>
>>>>>> Also, if you stop indexing there's no need to wait 30 minutes if you
>>>>>> issue a manual commit, something like
>>>>>> .../collection/update?commit=true. Just reducing the hard commit
>>>>>> interval will make the wait between stopping indexing and restarting
>>>>>> shorter all by itself if you don't want to issue the manual commit.
>>>>>>
>>>>>> Best,
>>>>>> Erick
>>>>>>
>>>>>> On Tue, Nov 21, 2017 at 10:34 AM, Hendrik Haddorp
>>>>>> <[hidden email]> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> the write.lock issue I see as well when Solr is not been stopped
>>>>>>> gracefully.
>>>>>>> The write.lock files are then left in the HDFS as they do not
>>>>>>> get removed
>>>>>>> automatically when the client disconnects like a ephemeral node in
>>>>>>> ZooKeeper. Unfortunately Solr does also not realize that it
>>>>>>> should be owning
>>>>>>> the lock as it is marked in the state stored in ZooKeeper as the
>>>>>>> owner and
>>>>>>> is also not willing to retry, which is why you need to restart
>>>>>>> the whole
>>>>>>> Solr instance after the cleanup. I added some logic to my Solr
>>>>>>> start up
>>>>>>> script which scans the log files in HDFS and compares that with
>>>>>>> the state in
>>>>>>> ZooKeeper and then delete all lock files that belong to the node
>>>>>>> that I'm
>>>>>>> starting.
>>>>>>>
>>>>>>> regards,
>>>>>>> Hendrik
>>>>>>>
>>>>>>>
>>>>>>> On 21.11.2017 14:07, Joe Obernberger wrote:
>>>>>>>> Hi All - we have a system with 45 physical boxes running solr
>>>>>>>> 6.6.1 using
>>>>>>>> HDFS as the index.  The current index size is about 31TBytes.
>>>>>>>> With 3x
>>>>>>>> replication that takes up 93TBytes of disk. Our main collection
>>>>>>>> is split
>>>>>>>> across 100 shards with 3 replicas each.  The issue that we're
>>>>>>>> running into
>>>>>>>> is when restarting the solr6 cluster.  The shards go into
>>>>>>>> recovery and start
>>>>>>>> to utilize nearly all of their network interfaces. If we start
>>>>>>>> too many of
>>>>>>>> the nodes at once, the shards will go into a recovery, fail,
>>>>>>>> and retry loop
>>>>>>>> and never come up.  The errors are related to HDFS not
>>>>>>>> responding fast
>>>>>>>> enough and warnings from the DFSClient.  If we stop a node when
>>>>>>>> this is
>>>>>>>> happening, the script will force a stop (180 second timeout)
>>>>>>>> and upon
>>>>>>>> restart, we have lock files (write.lock) inside of HDFS.
>>>>>>>>
>>>>>>>> The process at this point is to start one node, find out the
>>>>>>>> lock files,
>>>>>>>> wait for it to come up completely (hours), stop it, delete the
>>>>>>>> write.lock
>>>>>>>> files, and restart.  Usually this second restart is faster, but
>>>>>>>> it still can
>>>>>>>> take 20-60 minutes.
>>>>>>>>
>>>>>>>> The smaller indexes recover much faster (less than 5 minutes).
>>>>>>>> Should we
>>>>>>>> have not used so many replicas with HDFS?  Is there a better
>>>>>>>> way we should
>>>>>>>> have built the solr6 cluster?
>>>>>>>>
>>>>>>>> Thank you for any insight!
>>>>>>>>
>>>>>>>> -Joe
>>>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Recovery Issue - Solr 6.6.1 and HDFS

Joe Obernberger
Hi Hendrik - I was halting a replica, restarting it, waiting, and then
doing the same with another one.  That didn't work, but when I halted
all three and then restarted them one by one, the shard finally elected
a leader and came up.  Phew!  I too noticed the lock files in
index.<timestamp> folders.  Usually what I do is:
hadoop fs -ls -R /solr6.6.0 | grep write.lock > out.txt
then
cat out.txt | cut --bytes 57-
to get a list of files to delete
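
Something along these lines should also work and doesn't depend on the
byte offset (untested sketch):

hadoop fs -ls -R /solr6.6.0 | grep write.lock | awk '{print $NF}' > out.txt
# after reviewing out.txt:
xargs -n 1 hadoop fs -rm < out.txt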

Glad these shards have come up!  Thanks very much.

-Joe


On 11/22/2017 5:20 AM, Hendrik Haddorp wrote:

> Hi Joe,
>
> sorry, I have not seen that problem. I would normally not delete a
> replica if the shard is down but only if there is an active shard.
> Without an active leader the replica should not be able to recover. I
> also just had a case where all replicas of a shard stayed in down
> state and restarts didn't help. This was however also caused by lock
> files. Once I cleaned them up and restarted all Solr instances that
> had a replica they recovered.
>
> For the lock files I discovered that the index is not always in the
> "index" folder but can also be in an index.<timestamp> folder. There
> can be an "index.properties" file in the "data" directory in HDFS and
> this contains the correct index folder name.
>
> If you are really desperate you could also delete all but one replica
> so that the leader election is quite trivial. But this does of course
> increase the risk of finally loosing the data quite a bit. So I would
> try looking into the code and figure out what the problem is here and
> maybe compare the state in HDFS and ZK with a shard that works.
>
> regards,
> Hendrik
>

Reply | Threaded
Open this post in threaded view
|

Re: Recovery Issue - Solr 6.6.1 and HDFS

Kevin Risden-3
In reply to this post by Hendrik Haddorp
Joe,

I have a few questions about your Solr and HDFS setup that could help
improve the recovery performance.

* Is HDFS part of a distribution from Hortonworks, Cloudera, etc?
* Is Solr colocated with HDFS data nodes?
* What is the output of "ps aux | grep solr"? (specifically looking for the
Java arguments that are being set.)

Depending on how Solr on HDFS was set up, there are some potentially simple
settings that can significantly improve performance.

1) Short circuit reads

If Solr is colocated with an HDFS datanode, short-circuit reads can improve
read performance since they skip a network hop when the data is local to that
node. This requires the HDFS native libraries to be added to Solr.
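A rough sanity check (a sketch only; the property names are the standard HDFS
short-circuit settings, and the paths are illustrative, shown here for the
Cloudera parcel layout described later in the thread) is to confirm that the
hdfs-site.xml Solr uses has short-circuit reads enabled and that the native
Hadoop library pointed at by -Djava.library.path actually exists:

grep -A1 -E 'dfs.client.read.shortcircuit|dfs.domain.socket.path' /etc/hadoop/conf/hdfs-site.xml
ls /opt/cloudera/parcels/CDH/lib/hadoop/lib/native/libhadoop.so*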

2) HDFS block cache in Solr

Solr without HDFS uses the OS page cache to handle caching data for
queries. With HDFS, Solr has a special HDFS block cache which allows for
caching HDFS blocks. This significantly helps query performance. There are
a few configuration parameters that can help here.
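For reference, the knobs in question are the solr.hdfs.blockcache.* settings.
A minimal sketch of setting them at startup via SOLR_OPTS in solr.in.sh,
assuming the directoryFactory in solrconfig.xml does not hard-code the values
and so picks them up as system properties (they can also be set directly in
the directoryFactory, as shown later in this thread); the numbers here are
illustrative only:

SOLR_OPTS="$SOLR_OPTS \
  -Dsolr.hdfs.blockcache.enabled=true \
  -Dsolr.hdfs.blockcache.global=true \
  -Dsolr.hdfs.blockcache.direct.memory.allocation=true \
  -Dsolr.hdfs.blockcache.slab.count=40 \
  -Dsolr.hdfs.blockcache.blocksperbank=16384"

Each slab is roughly 128MB, so slab.count largely determines how much direct
memory the cache uses and should be sized together with
-XX:MaxDirectMemorySize.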

Kevin Risden

Reply | Threaded
Open this post in threaded view
|

Re: Recovery Issue - Solr 6.6.1 and HDFS

Joe Obernberger
Hi Kevin -
* HDFS is part of Cloudera 5.12.0.
* Solr is co-located with HDFS data nodes in most cases.  A few Solr nodes
run on servers that are not data nodes, but most are.  Unfortunately, our
nodes are not all the same size: some have 8TBytes of disk, while our
largest have 64TBytes.  This results in a lot of data that needs to go over
the network.

* Command is:
/usr/lib/jvm/jre-1.8.0/bin/java -server -Xms12g -Xmx16g -Xss2m
-XX:+UseG1GC -XX:MaxDirectMemorySize=11g -XX:+PerfDisableSharedMem
-XX:+ParallelRefProcEnabled -XX:G1HeapRegionSize=16m
-XX:MaxGCPauseMillis=300 -XX:InitiatingHeapOccupancyPercent=75
-XX:+UseLargePages -XX:ParallelGCThreads=16 -XX:-ResizePLAB
-XX:+AggressiveOpts -verbose:gc -XX:+PrintHeapAtGC -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
-Xloggc:/opt/solr6/server/logs/solr_gc.log -XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M -DzkClientTimeout=300000
-DzkHost=frodo.querymasters.com:2181,bilbo.querymasters.com:2181,gandalf.querymasters.com:2181,cordelia.querymasters.com:2181,cressida.querymasters.com:2181/solr6.6.0
-Dsolr.log.dir=/opt/solr6/server/logs -Djetty.port=9100 -DSTOP.PORT=8100
-DSTOP.KEY=solrrocks -Dhost=tarvos -Duser.timezone=UTC
-Djetty.home=/opt/solr6/server -Dsolr.solr.home=/opt/solr6/server/solr
-Dsolr.install.dir=/opt/solr6 -Dsolr.clustering.enabled=true
-Dsolr.lock.type=hdfs -Dsolr.autoSoftCommit.maxTime=120000
-Dsolr.autoCommit.maxTime=1800000 -Dsolr.solr.home=/etc/solr6
-Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native
-Xss256k -Dsolr.log.muteconsole
-XX:OnOutOfMemoryError=/opt/solr6/bin/oom_solr.sh 9100
/opt/solr6/server/logs -jar start.jar --module=http

* We have enabled short circuit reads.

Right now, we have a relatively small block cache because the servers also
run other software.  We tried to find the best balance between block cache
size and RAM for other programs, while still leaving enough for the local FS
cache.  This came out to be 84 slabs of 128M each - or about 10G for the
cache per node (45 nodes total).

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <bool name="solr.hdfs.blockcache.enabled">true</bool>
    <bool name="solr.hdfs.blockcache.global">true</bool>
    <int name="solr.hdfs.blockcache.slab.count">84</int>
    <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
    <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
    <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
    <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
    <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">128</int>
    <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">1024</int>
    <str name="solr.hdfs.home">hdfs://nameservice1:8020/solr6.6.0</str>
    <str name="solr.hdfs.confdir">/etc/hadoop/conf.cloudera.hdfs1</str>
</directoryFactory>
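As a quick back-of-the-envelope check on that sizing (assuming the usual
8KB cache block size, so one slab is 16384 x 8KB = 128MB):

echo $(( 84 * 16384 * 8 / 1024 ))   # 10752 MB of direct memory for the block cache

which is the ~10G figure above and needs to stay under the
-XX:MaxDirectMemorySize=11g in the startup command, since
direct.memory.allocation is true.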

Thanks for reviewing!

-Joe



12