Discussion:
[gridengine users] Interpretation of Qacct output
Kidwai, Hashir Karim
2014-01-08 12:11:59 UTC
Hello,

I am sure lots of people have asked a similar question, but I couldn't find the exact answer I am looking for.


I ran `qacct -j` for a particular user on a cluster of 18 compute nodes with 12 cores each, covering all jobs from the past 30 days, and compiled the results as follows.




Job           Wall Clock (h)   CPU Time (h)   Slots
#933               69.1           822.17        96
#932              165.99            0.0006      48
#935                3.301          39.38        60
#936                0.0005          0.0006      60
#937               13.18          157           60
#939                7.7             0.0002      60
#940                0.0005          0.0006      60
#934              122            1450           60
#944               13.49          159           60
#943               52.56          626           60
#942              207            2472           60
#931                0               0           48
#930                0               0           48
#929                0.012           0           48
#927                0.002           0.0002      48
Dec-13 total      654.334        5725.5522
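
(For anyone who wants to reproduce the table: a rough sketch along these lines pulls the same fields out of the accounting data. `someuser` is a placeholder for the actual account name; qacct reports seconds, so the values are converted to hours here.)

    # per-job wallclock/CPU/slots for one user over the last 30 days;
    # a tightly integrated job may produce several records, so cpu is summed
    qacct -o someuser -d 30 -j '*' | awk '
        /^jobnumber/    { job = $2 }
        /^slots/        { slots[job] = $2 }
        /^ru_wallclock/ { wc[job] = $2 }
        /^cpu /         { cpu[job] += $2 }
        END {
            for (j in wc)
                printf "Job #%s: wallclock %.3f h, cpu %.4f h, %d slots\n",
                       j, wc[j]/3600, cpu[j]/3600, slots[j]
        }'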

I am analyzing and comparing the CPU time and the wall clock time in hours (from the qacct command) for jobs submitted and finished in December 2013. These are my findings; please correct me if I am mistaken.



1. Wall clock time is the time from job submission to job finish.

2. CPU time is the usage time during the job execution. Since every node is equipped with 12 cores, I expected that dividing the CPU time by 12 (except in a few instances) would give roughly the same value as the wall clock. But it is in fact the total time of all the cores involved in running the job (??). If my assumption is right, what exactly is the logic behind it?

3. Slots are basically the total number of cores involved in job execution (slots = cores)??

4. In some instances (not shown in the above table), the wall clock is quite significant but the CPU usage time is close to 0. What could be the reason for this? Could it be a problem with the job, or some other factor?

5. While analyzing the jobs, I noticed that there is only one hostname (compute node) associated with each job. Why is that? What about the other nodes running the same job; is there a way to trace them?



I would really appreciate somebody's feedback on the above.

Thanks
Hashir


Reuti
2014-01-08 13:16:02 UTC
Hi,
> I am sure lots of people have asked a similar question, but I couldn't find the exact answer I am looking for.
> I ran `qacct -j` for a particular user on a cluster of 18 compute nodes with 12 cores each, covering all jobs from the past 30 days, and compiled the results as follows.
> <snip>
> I am analyzing and comparing the CPU time and the wall clock time in hours (from the qacct command) for jobs submitted and finished in December 2013. These are my findings; please correct me if I am mistaken.
> 1. Wall clock time is the time from job submission to job finish.
No. It's the start time of the job to the stop time of the job. When the job was submitted is unrelated to the measured wall clock time.
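
You can see this directly in the raw accounting record (933 is just one job id from the table above):

    # qsub_time is the submission; start_time/end_time bound the
    # measured wall clock (ru_wallclock, in seconds)
    qacct -j 933 | egrep 'qsub_time|start_time|end_time|ru_wallclock'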
> 2. CPU time is the usage time during the job execution.
Yes.
> Since every node is equipped with 12 cores, I expected that dividing the CPU time by 12 (except in a few instances) would give roughly the same value as the wall clock.
(Not by 12, but by the number of requested and used cores for this job. If the division doesn't work out, the slave tasks are probably not tightly integrated into SGE [see below]. Unless you need X11 forwarding inside the cluster, you can even disable ssh/rsh in the cluster [I do this to allow only admin staff to reach the nodes by a direct `ssh`].)

This depends on the implementation of the application. If it scales perfectly linearly with an increasing number of cores: yes. But often the speedup comes at a much smaller rate, e.g. the runtime shrinks to 66% instead of the ideal 50% for each doubling of the number of cores. And due to the algorithm and the communication between the processes, some cores might be idle for some time, and hence the overall CPU time across all cores divided by the number of cores is less than the wall clock time. There is nothing you can do about it, except not using too many cores for this particular application.

E.g. shrinking to only 66% instead of 50% per doubling, using 8 cores (i.e. three doublings) would mean lowering the execution time (from 1.0) to 0.66^3 ≈ 0.287 (the needed wall clock time) instead of 0.5^3 = 0.125: roughly half of the computing time is wasted. The real used CPU time might lie somewhere between 0.287 * 8 and 0.125 * 8.
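
To make the arithmetic explicit (awk used purely as a calculator, with the serial runtime taken as 1.0):

    # 8 cores = 3 doublings of the core count:
    # remaining runtime 0.66^3 (real) vs. 0.5^3 (ideal)
    awk 'BEGIN {
        real = 0.66 ^ 3; ideal = 0.5 ^ 3
        printf "wall clock: %.3f (real) vs %.3f (ideal)\n", real, ideal
        printf "CPU time on 8 cores: between %.2f and %.2f\n",
               8 * ideal, 8 * real
    }'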
> But it is in fact the total time of all the cores involved in running the job (??). If my assumption is right, what exactly is the logic behind it?
> 3. Slots are basically the total number of cores involved in job execution (slots = cores)??
Yes.
> 4. In some instances (not shown in the above table), the wall clock is quite significant but the CPU usage time is close to 0. What could be the reason for this? Could it be a problem with the job, or some other factor?
This might happen if a parallel library is not tightly integrated into SGE. I.e.: the main job script starting the `mpiexec` doesn't consume any computing time at all (only a fraction for the startup), and the child processes are not tracked by SGE. Which library was used for these jobs showing almost no computing time, and what are the settings of the requested PE and of the submission command itself?
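
You can inspect the PE configuration yourself (the PE name `mpi` below is just an assumption; take the one shown in the job's granted_pe field):

    qconf -spl      # list the defined parallel environments
    qconf -sp mpi   # show one PE's settings; for tight integration look
                    # at control_slaves (TRUE) and start_proc_args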
> 5. While analyzing the jobs, I noticed that there is only one hostname (compute node) associated with each job. Why is that? What about the other nodes running the same job; is there a way to trace them?
For a tightly integrated parallel job where all slave tasks are tracked by SGE, you would get an entry for each started remote process in `qacct` (unless the parallel environment (PE) has "accounting_summary TRUE" set; see `man sge_pe`). If you get several entries, all entries must be summed up for this job.
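
E.g., to sum them up for one job (934 is just one of the job ids from the table above) - each record carries its own hostname, which also answers where the other nodes went:

    qacct -j 934 | awk '
        /^hostname/ { n++; print "record on " $2 }
        /^cpu /     { total += $2 }
        END { printf "%d record(s), total cpu %.1f s\n", n, total }'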

-- Reuti
