Re: HDFS Balancer Stuck after 10 Minz

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Re: HDFS Balancer Stuck after 10 Minz

Rakesh Radhakrishnan-2
Have you taken multiple thread dumps (jstack) and observed the operations which are performing during this period of time. Perhaps there could be high chance of searching for data blocks which it can move around to balance the cluster. 

Could you tell me the used space and available space values. Have you tried changing the threshold to a lower value, may be 10 or 5 and what happens with this value. Also, I think there is no log messages during 15 mins time period, any possibility of enabling debug log priority and try to dig more about the problem.


Rakesh

On Thu, Sep 8, 2016 at 7:44 PM, Rakesh Radhakrishnan <[hidden email]> wrote:
Have you taken multiple thread dumps (jstack) and observed the operations which are performing during this period of time. Perhaps there could be high chance of searching for data blocks which it can move around to balance the cluster. 

Could you tell me the used space and available space values. Have you tried changing the threshold to a lower value, may be 10 or 5 and what happens with this value. Also, I think there is no log messages during 15 mins time period, any possibility of enabling debug log priority and try to dig more about the problem.

Rakesh

On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar <[hidden email]> wrote:
Hi All ,  We are in the situation to balance the cluster data since median
reached 98% .. I started balancer as below

Hadoop Version: Hadoop 2.4.1


/apache/hadoop/sbin/start-balancer.sh   -threshold  30


Once i start balancer it goes will for first 8-10 minutes of time..
Balancer was moving so quickly first 10 minutes.. Not sure whats happening
in the cluster after sometime ( say 10 minz ) , balancer is almost stuck .

Log excerpts :

2016-09-08 04:58:15,653 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_-5830766563502877304_1279767737 with size=134217728 from
10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004

2016-09-08 04:59:14,426 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004 to
10.142.18.27:1004 through 10.103.84.16:1004

2016-09-08 05:01:15,037 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004 to
10.142.21.56:1004 through 10.103.21.42:1004



[05:16]:[hadoop@lvsaishdc3sn0002:~]$ date

Thu Sep  8 05:16:53 GMT+7 2016

[05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps

1003 Balancer

20388 Jps



Last Block Mover Timestamp     : 05:01

Current Timestamp                    : 05:16


Almost 15 minz no blocks moved by balancer ..  What could be the issue here
??  Restart would help us start moving again..



It’s not event passing iteration 1 ..


I found one thread discussing about the same issue:

http://lucene.472066.n3.nabble.com/A-question-about-Balancer-in-HDFS-td4118505.html


Pls suggest here to balance cluster ..


--Senthil


Reply | Threaded
Open this post in threaded view
|

Re: HDFS Balancer Stuck after 10 Minz

Senthil Kumar-2
Thanks  Rakesh .

"Perhaps there could be high chance of searching for data blocks which it can move around to balance the cluster. "

I could see below log statement after enabling DEBUG mode.. 

2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49230
2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49231
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49229
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 1ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49230
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49232
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 1ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49233
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49231
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49234
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49232
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49235
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49233
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49234


Same getBlocks() call repeating !!!

--Senthil

On Thu, Sep 8, 2016 at 7:46 PM, Rakesh Radhakrishnan <[hidden email]> wrote:
Have you taken multiple thread dumps (jstack) and observed the operations which are performing during this period of time. Perhaps there could be high chance of searching for data blocks which it can move around to balance the cluster. 

Could you tell me the used space and available space values. Have you tried changing the threshold to a lower value, may be 10 or 5 and what happens with this value. Also, I think there is no log messages during 15 mins time period, any possibility of enabling debug log priority and try to dig more about the problem.


Rakesh

On Thu, Sep 8, 2016 at 7:44 PM, Rakesh Radhakrishnan <[hidden email]> wrote:
Have you taken multiple thread dumps (jstack) and observed the operations which are performing during this period of time. Perhaps there could be high chance of searching for data blocks which it can move around to balance the cluster. 

Could you tell me the used space and available space values. Have you tried changing the threshold to a lower value, may be 10 or 5 and what happens with this value. Also, I think there is no log messages during 15 mins time period, any possibility of enabling debug log priority and try to dig more about the problem.

Rakesh

On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar <[hidden email]> wrote:
Hi All ,  We are in the situation to balance the cluster data since median
reached 98% .. I started balancer as below

Hadoop Version: Hadoop 2.4.1


/apache/hadoop/sbin/start-balancer.sh   -threshold  30


Once i start balancer it goes will for first 8-10 minutes of time..
Balancer was moving so quickly first 10 minutes.. Not sure whats happening
in the cluster after sometime ( say 10 minz ) , balancer is almost stuck .

Log excerpts :

2016-09-08 04:58:15,653 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_-5830766563502877304_1279767737 with size=134217728 from
10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004

2016-09-08 04:59:14,426 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004 to
10.142.18.27:1004 through 10.103.84.16:1004

2016-09-08 05:01:15,037 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004 to
10.142.21.56:1004 through 10.103.21.42:1004



[05:16]:[hadoop@lvsaishdc3sn0002:~]$ date

Thu Sep  8 05:16:53 GMT+7 2016

[05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps

1003 Balancer

20388 Jps



Last Block Mover Timestamp     : 05:01

Current Timestamp                    : 05:16


Almost 15 minz no blocks moved by balancer ..  What could be the issue here
??  Restart would help us start moving again..



It’s not event passing iteration 1 ..


I found one thread discussing about the same issue:

http://lucene.472066.n3.nabble.com/A-question-about-Balancer-in-HDFS-td4118505.html


Pls suggest here to balance cluster ..


--Senthil



Reply | Threaded
Open this post in threaded view
|

Re: HDFS Balancer Stuck after 10 Minz

Senthil Kumar-2
Here is the cluster space details :

DFS Remaining : 32.50 PB
DFS Used% : 54.24%
DFS Remaining% : 45.76%
Block Pool Used : 38.52 PB
Block Pool Used% : 54.24%
DataNodes usages : Min % Median % Max % stdev %
0.00% 95.59% 99.92% 36.28%


Also i tried with threshold  value 5 & 10 , ended with same result ..


"pool-5-thread-84" prio=10 tid=0x00002ae8780a8800 nid=0x3b22 waiting on condition [0x00002ae89910f000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007062bda50> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

"pool-5-thread-83" prio=10 tid=0x00002ae878ba3000 nid=0x3b21 waiting on condition [0x00002ae89900e000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)

most of the threads are waiting for blocks .. 


--Senthil

On Thu, Sep 8, 2016 at 9:47 PM, Senthil Kumar <[hidden email]> wrote:
Thanks  Rakesh .

"Perhaps there could be high chance of searching for data blocks which it can move around to balance the cluster. "

I could see below log statement after enabling DEBUG mode.. 

2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49230
2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49231
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49229
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 1ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49230
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49232
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 1ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49233
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49231
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49234
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49232
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49235
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49233
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49234


Same getBlocks() call repeating !!!

--Senthil

On Thu, Sep 8, 2016 at 7:46 PM, Rakesh Radhakrishnan <[hidden email]> wrote:
Have you taken multiple thread dumps (jstack) and observed the operations which are performing during this period of time. Perhaps there could be high chance of searching for data blocks which it can move around to balance the cluster. 

Could you tell me the used space and available space values. Have you tried changing the threshold to a lower value, may be 10 or 5 and what happens with this value. Also, I think there is no log messages during 15 mins time period, any possibility of enabling debug log priority and try to dig more about the problem.


Rakesh

On Thu, Sep 8, 2016 at 7:44 PM, Rakesh Radhakrishnan <[hidden email]> wrote:
Have you taken multiple thread dumps (jstack) and observed the operations which are performing during this period of time. Perhaps there could be high chance of searching for data blocks which it can move around to balance the cluster. 

Could you tell me the used space and available space values. Have you tried changing the threshold to a lower value, may be 10 or 5 and what happens with this value. Also, I think there is no log messages during 15 mins time period, any possibility of enabling debug log priority and try to dig more about the problem.

Rakesh

On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar <[hidden email]> wrote:
Hi All ,  We are in the situation to balance the cluster data since median
reached 98% .. I started balancer as below

Hadoop Version: Hadoop 2.4.1


/apache/hadoop/sbin/start-balancer.sh   -threshold  30


Once i start balancer it goes will for first 8-10 minutes of time..
Balancer was moving so quickly first 10 minutes.. Not sure whats happening
in the cluster after sometime ( say 10 minz ) , balancer is almost stuck .

Log excerpts :

2016-09-08 04:58:15,653 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_-5830766563502877304_1279767737 with size=134217728 from
10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004

2016-09-08 04:59:14,426 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004 to
10.142.18.27:1004 through 10.103.84.16:1004

2016-09-08 05:01:15,037 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004 to
10.142.21.56:1004 through 10.103.21.42:1004



[05:16]:[hadoop@lvsaishdc3sn0002:~]$ date

Thu Sep  8 05:16:53 GMT+7 2016

[05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps

1003 Balancer

20388 Jps



Last Block Mover Timestamp     : 05:01

Current Timestamp                    : 05:16


Almost 15 minz no blocks moved by balancer ..  What could be the issue here
??  Restart would help us start moving again..



It’s not event passing iteration 1 ..


I found one thread discussing about the same issue:

http://lucene.472066.n3.nabble.com/A-question-about-Balancer-in-HDFS-td4118505.html


Pls suggest here to balance cluster ..


--Senthil




Reply | Threaded
Open this post in threaded view
|

Re: HDFS Balancer Stuck after 10 Minz

karthik p
Senthil,

           I don't find any delay from the log trace, you could find the destination node which it takes longer time to receive blocks.

Check out whether blocks are moving within the rack or outside of it, also identify if any delay from particular node.

Make sure there is no networks issue on the rack/node. Pls, confirm all the racks has same network bandwidth. 


Thanks,
Karthik

On Thu, Sep 8, 2016 at 10:03 PM, Senthil Kumar <[hidden email]> wrote:
Here is the cluster space details :

DFS Remaining : 32.50 PB
DFS Used% : 54.24%
DFS Remaining% : 45.76%
Block Pool Used : 38.52 PB
Block Pool Used% : 54.24%
DataNodes usages : Min % Median % Max % stdev %
0.00% 95.59% 99.92% 36.28%


Also i tried with threshold  value 5 & 10 , ended with same result ..


"pool-5-thread-84" prio=10 tid=0x00002ae8780a8800 nid=0x3b22 waiting on condition [0x00002ae89910f000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007062bda50> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

"pool-5-thread-83" prio=10 tid=0x00002ae878ba3000 nid=0x3b21 waiting on condition [0x00002ae89900e000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)

most of the threads are waiting for blocks .. 


--Senthil

On Thu, Sep 8, 2016 at 9:47 PM, Senthil Kumar <[hidden email]> wrote:
Thanks  Rakesh .

"Perhaps there could be high chance of searching for data blocks which it can move around to balance the cluster. "

I could see below log statement after enabling DEBUG mode.. 

2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49230
2016-09-08 06:32:06,574 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49231
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49229
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 1ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49230
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49232
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 1ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49233
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49231
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49234
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49232
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN sending #49235
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49233
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine: Call: getBlocks took 0ms
2016-09-08 06:32:06,575 DEBUG org.apache.hadoop.ipc.Client: IPC Client (685788708) connection to nn-host/10.103.108.
201:8020 from hadoop/host@HOST_DOMAIN got value #49234


Same getBlocks() call repeating !!!

--Senthil

On Thu, Sep 8, 2016 at 7:46 PM, Rakesh Radhakrishnan <[hidden email]> wrote:
Have you taken multiple thread dumps (jstack) and observed the operations which are performing during this period of time. Perhaps there could be high chance of searching for data blocks which it can move around to balance the cluster. 

Could you tell me the used space and available space values. Have you tried changing the threshold to a lower value, may be 10 or 5 and what happens with this value. Also, I think there is no log messages during 15 mins time period, any possibility of enabling debug log priority and try to dig more about the problem.


Rakesh

On Thu, Sep 8, 2016 at 7:44 PM, Rakesh Radhakrishnan <[hidden email]> wrote:
Have you taken multiple thread dumps (jstack) and observed the operations which are performing during this period of time. Perhaps there could be high chance of searching for data blocks which it can move around to balance the cluster. 

Could you tell me the used space and available space values. Have you tried changing the threshold to a lower value, may be 10 or 5 and what happens with this value. Also, I think there is no log messages during 15 mins time period, any possibility of enabling debug log priority and try to dig more about the problem.

Rakesh

On Thu, Sep 8, 2016 at 6:15 PM, Senthil Kumar <[hidden email]> wrote:
Hi All ,  We are in the situation to balance the cluster data since median
reached 98% .. I started balancer as below

Hadoop Version: Hadoop 2.4.1


/apache/hadoop/sbin/start-balancer.sh   -threshold  30


Once i start balancer it goes will for first 8-10 minutes of time..
Balancer was moving so quickly first 10 minutes.. Not sure whats happening
in the cluster after sometime ( say 10 minz ) , balancer is almost stuck .

Log excerpts :

2016-09-08 04:58:15,653 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_-5830766563502877304_1279767737 with size=134217728 from
10.103.21.27:1004 to 10.142.21.56:1004 through 10.103.21.27:1004

2016-09-08 04:59:14,426 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_2601479900_1104500421142 with size=268435456 from 10.103.84.51:1004 to
10.142.18.27:1004 through 10.103.84.16:1004

2016-09-08 05:01:15,037 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Successfully moved
blk_3073791211_1104972921837 with size=268435456 from 10.103.21.27:1004 to
10.142.21.56:1004 through 10.103.21.42:1004



[05:16]:[hadoop@lvsaishdc3sn0002:~]$ date

Thu Sep  8 05:16:53 GMT+7 2016

[05:16]:[hadoop@lvsaishdc3sn0002:~]$ jps

1003 Balancer

20388 Jps



Last Block Mover Timestamp     : 05:01

Current Timestamp                    : 05:16


Almost 15 minz no blocks moved by balancer ..  What could be the issue here
??  Restart would help us start moving again..



It’s not event passing iteration 1 ..


I found one thread discussing about the same issue:

http://lucene.472066.n3.nabble.com/A-question-about-Balancer-in-HDFS-td4118505.html


Pls suggest here to balance cluster ..


--Senthil







--

Thanks and Regards

Karthik P

+91 9940089181

SkypeId - karthik.p01