Spark application Runtime Measurement

Spark application Runtime Measurement

Fei Hu
Dear all,

I have a question about how to measure the runtime of a Spark application. Here is an example:

  • On the Spark UI, the total duration is 2.0 minutes = 120 seconds, as shown in the screenshot below.
[Screenshot: Screen Shot 2016-07-09 at 11.45.44 PM.png]
  • However, when I check the jobs launched by the application, their durations are 13 s + 0.8 s + 4 s = 17.8 seconds, which is much less than 120 seconds. I am not sure which time I should use to measure the performance of the Spark application.
[Screenshot: Screen Shot 2016-07-09 at 11.48.26 PM.png]
  • I also checked the event timeline, shown below. There is a big gap between the second job and the third job, and I do not know what happened during that gap (see the listener sketch after this list).
[Screenshot: Screen Shot 2016-07-09 at 11.53.29 PM.png]
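
For reference, one way to see exactly when each job starts and ends, and therefore where the gaps are, is to register a SparkListener on the driver. This is only a minimal sketch, not part of the original mail: it assumes a Spark 1.x spark-shell session with an existing SparkContext sc, and the listener code itself is my illustration.

// Minimal sketch (assumes spark-shell with SparkContext sc available).
// Logs job boundaries so the gaps between jobs become visible in the driver output.
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started at ${new java.util.Date(jobStart.time)}")
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} ended at ${new java.util.Date(jobEnd.time)}")
})

The wall-clock application time is the span from driver start to driver exit, while the listener output shows how much of that span is actually covered by jobs; the remainder is time the driver spends outside any job.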

Could anyone help explain which time is the right one to use when measuring the performance of a Spark application?

Thanks in advance,
Fei


Re: Spark application Runtime Measurement

Fei Hu
Hi Mich,

Thank you for your detailed response. I have one more question. 

In your case, the total time of the individual jobs (from the earliest job, 47, to the last job, 58) roughly matches the time you printed out in code.

But in my case, the total time of all the individual jobs (17.8 seconds) is much less than the time between the start and the end of the application (120 seconds). After the second job, the Spark application pauses for 25 seconds before starting the final job, and after the final job Spark takes another 50 seconds to end the application. Do you know what happens between the individual jobs?
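
One way to narrow down where that tail time goes is to print timestamps around the last action and around stopping the context. This is only a hedged sketch, not code from the thread: it assumes a driver program with an existing SparkContext sc, and the helper now and the stand-in action are hypothetical.

// Hypothetical helper, not from the thread: timestamp the driver between steps.
def now(label: String): Unit = println(s"$label at ${new java.util.Date()}")

now("before final job")
val rowCount = sc.parallelize(1 to 1000000).map(_ * 2).count()  // stand-in for the real final action
now("after final job")

// Whatever runs here (collecting results on the driver, writing files, cleanup, ...)
// shows up as time after the last job in the event timeline.
now("before sc.stop")
sc.stop()
now("after sc.stop")

If most of the 50 seconds falls between "after final job" and "before sc.stop", it is driver-side work rather than a Spark job; if it falls after sc.stop, it is shutdown time.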

Thanks,
Fei



On Sun, Jul 10, 2016 at 1:58 AM, Mich Talebzadeh <[hidden email]> wrote:
Hi,

Ultimately, regardless of the individual component timings, what matters is the elapsed time from the start of the job to the end of the job. When I do a performance test, I run it three times and average the timings; to me, the time that counts is the time taken between start and end.
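
A tiny sketch of that measure-and-average approach (my illustration, not Mich's code; runWorkload is a hypothetical stand-in for the real application, and sc is assumed to be an existing SparkContext):

// Run the same workload three times and report the average wall-clock time.
def runWorkload(): Unit = {
  sc.parallelize(1 to 1000000).map(_ * 2).count()  // hypothetical stand-in for the real job
}

val runs = 3
val elapsedSeconds = (1 to runs).map { _ =>
  val start = System.nanoTime()
  runWorkload()
  (System.nanoTime() - start) / 1e9
}
println(f"average elapsed time over $runs runs: ${elapsedSeconds.sum / runs}%.2f s")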

Example

println ("\nStarted at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
HiveContext.sql("use oraclehadoop")
val s = HiveContext.table("sales").select("AMOUNT_SOLD","TIME_ID","CHANNEL_ID")
val c = HiveContext.table("channels").select("CHANNEL_ID","CHANNEL_DESC")
val t = HiveContext.table("times").select("TIME_ID","CALENDAR_MONTH_DESC")
println ("\ncreating data set at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
val rs = s.join(t,"time_id").join(c,"channel_id").groupBy("calendar_month_desc","channel_desc").agg(sum("amount_sold").as("TotalSales"))
println ("\nfirst query at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
val rs1 = rs.orderBy("calendar_month_desc","channel_desc").take(5).foreach(println)
println ("\nsecond query at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)
val rs2 =rs.groupBy("channel_desc").agg(max("TotalSales").as("SALES")).orderBy("SALES").sort(desc("SALES")).take(5).foreach(println)
println ("\nFinished at"); sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println)

So here I look at the individual timings as well.
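
A more compact way to capture the same per-step timings is a small timing helper. This is only a sketch, not Mich's original code: the helper name timed is mine, and it assumes the same spark-shell session and the rs DataFrame defined above.

// Hypothetical helper: time a named block on the driver and return its result.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

// Usage against the same data set as above:
val firstQuery = timed("first query") {
  rs.orderBy("calendar_month_desc", "channel_desc").take(5)
}
firstQuery.foreach(println)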

Now, the Spark UI breaks down the timings for each job and its stages.

As far as the measurements are concerned, I have the start time as:


Started at
[10/07/2016 06:05:55.55]
res32: org.apache.spark.sql.DataFrame = [result: string]
s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID: timestamp, CHANNEL_ID: bigint]
c: org.apache.spark.sql.DataFrame = [CHANNEL_ID: double, CHANNEL_DESC: string]
t: org.apache.spark.sql.DataFrame = [TIME_ID: timestamp, CALENDAR_MONTH_DESC: string]
creating data set at
[10/07/2016 06:05:56.56]
rs: org.apache.spark.sql.DataFrame = [calendar_month_desc: string, channel_desc: string, TotalSales: decimal(20,0)]
first query at
[10/07/2016 06:05:56.56]
second query at
[10/07/2016 06:17:18.18]
Finished at
[10/07/2016 06:33:35.35]

So the job took 27 minutes 39 seconds or 1659 seconds to finish

From the Spark UI I have:

[Inline image: Spark UI screenshot]

Starting at job 47 and finishing at job 58, as below:

[Inline image: Spark UI jobs 47 to 58]


The job durations add up to 1623.1 seconds, but what matters to me is the start time and the end time, i.e. 2016/07/10 06:05:55 and 2016/07/10 06:33:35, which is what my measurements showed, including the elapsed time between job start and end.
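
As a worked check on those numbers (my arithmetic, not part of the original mail): the wall-clock span from 06:05:55.55 to 06:33:35.35 is about 1659.8 seconds, while the job durations add up to 1623.1 seconds, so roughly 36 to 37 seconds of the run are not covered by any job (assuming the jobs ran one after another). That is exactly the kind of gap that shows up in the event timeline.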

So from the UI, what matters is the start of the earliest job (47 here) and the end of the last job (58).

I would take that as the main indication, while noting the individual job timings from the UI.

HTH


Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
