Kidwai, Hashir Karim
2014-01-08 12:11:59 UTC
Hello,
I am sure lots of people have asked a similar question, but I couldn't find the exact answer I am looking for.
I ran qacct -j for a particular user on a cluster of 18 compute nodes with 12 cores each, against all jobs from the past 30 days, and compiled the results as follows.
Job              Wall Clock (h)   CPU Time (h)   Slots
Dec-13 (total)   654.334          5725.5522
Job #933         69.1             822.17         96
Job #932         165.99           0.0006         48
Job #935         3.301            39.38          60
Job #936         0.0005           0.0006         60
Job #937         13.18            157            60
Job #939         7.7              0.0002         60
Job #940         0.0005           0.0006         60
Job #934         122              1450           60
Job #944         13.49            159            60
Job #943         52.56            626            60
Job #942         207              2472           60
Job #931         0                0              48
Job #930         0                0              48
Job #929         0.012            0              48
Job #927         0.002            0.0002         48
I am analyzing and comparing the CPU time and wall clock time, in hours (from the qacct command), for jobs submitted and finished in December 2013. These are my findings, so please correct me if I am mistaken.
1. Wall clock time is the time from job submission to job finish.
2. CPU time is the usage time during job execution. Since every node has 12 cores, I expected that dividing the CPU time by 12 would give (except in a few instances) the same or nearly the same value as the wall clock time. But is it in fact the total time across all the cores involved in running the job? If my assumption is right, what exactly is the logic behind it?
3. Slots are basically the total number of cores involved in job execution (slots = cores)?
4. In some instances (not shown in the above table), the wall clock time is quite significant but the CPU usage time is close to 0. What could be the reason? Could it be a problem with the job, or some other factor?
5. While analyzing the jobs, I noticed that there is only one hostname (compute node) associated with each job. Why is that? What about the other nodes that are running the same job; is there a way to trace them?
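To make point 2 concrete, here is a minimal Python sketch of the ratio I am asking about. It assumes the wall clock, CPU time, and slot values pair with the job numbers in the order listed in the table; the function name cpu_to_wall_ratio is only illustrative and is not part of qacct. Only the jobs with non-trivial CPU time are included.

```python
# (job number, wall clock hours, CPU hours, slots), copied from the table.
# Assumption: values pair with job numbers in the order they are listed.
jobs = [
    (933, 69.1, 822.17, 96),
    (935, 3.301, 39.38, 60),
    (937, 13.18, 157, 60),
    (934, 122, 1450, 60),
    (944, 13.49, 159, 60),
    (943, 52.56, 626, 60),
    (942, 207, 2472, 60),
]


def cpu_to_wall_ratio(wall_hours, cpu_hours):
    """Return CPU time divided by wall clock time, or None if wall time is zero."""
    if wall_hours == 0:
        return None
    return cpu_hours / wall_hours


# Ratio per job, keyed by job number.
ratios = {job: cpu_to_wall_ratio(wall, cpu) for job, wall, cpu, _ in jobs}

for job, wall, cpu, slots in jobs:
    print(f"Job #{job}: cpu/wall = {ratios[job]:.2f}, slots = {slots}")
```

For every one of these jobs the ratio comes out just under 12, which matches the cores per node rather than the number of slots (48, 60, or 96). That mismatch is what I am trying to understand.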
I really appreciate somebody's feedback on the above.
Thanks
Hashir
________________________________
The contents of this email, including all related responses, files and attachments transmitted with it (collectively referred to as "this Email"), are intended solely for the use of the individual/entity to whom/which they are addressed, and may contain confidential and/or legally privileged information. This Email may not be disclosed or forwarded to anyone else without authorization from the originator of this Email. If you have received this Email in error, please notify the sender immediately and delete all copies from your system. Please note that the views or opinions presented in this Email are those of the author and may not necessarily represent those of Saudi Aramco. The recipient should check this Email and any attachments for the presence of any viruses. Saudi Aramco accepts no liability for any damage caused by any virus/error transmitted by this Email.