All nodes are not used

6 messages
All nodes are not used

Madhav Sharan
Hi Hadoop users,

I am running an m/r job with an input file of 23 million records. I can see that not all of our nodes are getting used.

What can I change to utilize all nodes?


Containers | Mem Used | Mem Avail | Vcores Used | Vcores Avail
8          | 11.25 GB | 0 B       | 8           | 0
0          | 0 B      | 11.25 GB  | 0           | 8
0          | 0 B      | 11.25 GB  | 0           | 8
8          | 11.25 GB | 0 B       | 8           | 0
8          | 11.25 GB | 0 B       | 8           | 0
7          | 11.25 GB | 0 B       | 7           | 1
5          | 7.03 GB  | 4.22 GB   | 5           | 3
0          | 0 B      | 11.25 GB  | 0           | 8
0          | 0 B      | 11.25 GB  | 0           | 8


My command looks like -

hadoop jar target/pooled-time-series-1.0-SNAPSHOT-jar-with-dependencies.jar \
  gov.nasa.jpl.memex.pooledtimeseries.MeanChiSquareDistanceCalculation \
  /user/pts/output/MeanChiSquareAndSimilarityInput \
  /user/pts/output/MeanChiSquaredCalcOutput

The directory /user/pts/output/MeanChiSquareAndSimilarityInput contains an input file of 23 million records. The file size is ~3 GB.



--
Madhav Sharan



Re: All nodes are not used

Sunil Govind
In reply to this post by Madhav Sharan
Hi Madhav,

Could you share some more information here? When you say a few nodes are not utilized, is it always the same nodes?

Also, how long does each of these containers run on average? Please make sure you have set a large enough split size so that the containers are not short-running.

Thanks
Sunil

On Tue, Aug 9, 2016 at 4:49 AM Madhav Sharan <[hidden email]> wrote:


Re: All nodes are not used

Madhav Sharan
Hi Sunil, thanks a lot for replying.

For a single job run, yes, some nodes take no load at all. But if I rerun, they are not always the same nodes.

One map task takes ~3 seconds to run, and so far I have not been able to run the whole job on a bigger data set, so I can't yet say whether the containers are short-lived.

I was experimenting, and if I split the input file into N files, where N = the number of cores, the job starts running on all cores. So maybe I need to look at split size. Is there a trick to set the split size so that the number of splits equals the number of cores?

Otherwise, I can try adjusting mapred.min.split.size manually.
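The sizing idea could be sketched like this. The core count (8 nodes x 8 vcores, read off the resource table above) and the ~3 GB file size are assumptions taken from the numbers in this thread, not measured values:

```shell
# Sketch: pick a split size that divides the ~3 GB input into roughly one
# split per available vcore, so every core gets a map task.
TOTAL_CORES=$((8 * 8))                                            # assumed: 8 nodes x 8 vcores
FILE_BYTES=$((3 * 1024 * 1024 * 1024))                            # ~3 GB input file
SPLIT_BYTES=$(( (FILE_BYTES + TOTAL_CORES - 1) / TOTAL_CORES ))   # ceiling division

echo "cores=${TOTAL_CORES} split_bytes=${SPLIT_BYTES}"
```

This works out to 48 MB per split, which is below the 128 MB block size, so it is the maximum split size that would need to shrink, not the minimum: something like `-D mapreduce.input.fileinputformat.split.maxsize=50331648` on the `hadoop jar` command line (this assumes the job's driver parses generic options via ToolRunner).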


--
Madhav Sharan





Re: All nodes are not used

Mahesh Balija
In reply to this post by Madhav Sharan
Hi Madhav,

The behaviour sounds normal to me.
If the block size is 128 MB, a ~3 GB input gives roughly 24 mappers (i.e., containers used).
You cannot use the entire cluster, because the input blocks may reside only on the nodes that are being used.

You should not try to use the entire cluster's resources, for the following reason:

The time a container spends initializing versus the time it spends processing data should be balanced to maximize container utilization; that is why the 128 MB block size was chosen. In many cases the InputSplit size is increased beyond the block size to improve container utilization, depending on the workload.
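The arithmetic behind the ~24 mappers is straightforward (the ~3 GB figure comes from the original post):

```shell
# A ~3 GB input at the default 128 MB block size yields about 24 input
# splits, hence about 24 map containers.
FILE_BYTES=$((3 * 1024 * 1024 * 1024))                        # ~3 GB
BLOCK_BYTES=$((128 * 1024 * 1024))                            # 128 MB
MAPPERS=$(( (FILE_BYTES + BLOCK_BYTES - 1) / BLOCK_BYTES ))   # ceiling division
echo "approx mappers: ${MAPPERS}"
```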

Best,
Mahesh.B.
 





Re: All nodes are not used

Madhav Sharan
Thanks Mahesh

So far I have not been able to run the whole job within a reasonable time, so I am looking for optimizations and better resource utilization. Maybe I can try tweaking the input split size to see if it helps.

Thanks for your help; it explains the behaviour.

--
Madhav Sharan

