[gridengine users] Job Died Through Signal Kill (9)
Eric Kaufmann
2014-03-31 16:22:14 UTC
We are using ge 6.2u5 with CentOS 6.4.

I have jobs that are randomly being killed. Here is the log entry. The jobs
that are getting killed are getting an exit status of 127 or 137. I did
check /var/log/messages on the nodes and didn't see anything out of the
ordinary.

03/31/2014 09:55:30|worker|kepler|W|job 33393.1 failed on host research029.cm.cluster assumedly after job because: job 33393.1 died through signal KILL (9)

03/31/2014 09:55:34|worker|kepler|W|job 33394.1 failed on host research026.cm.cluster assumedly after job because: job 33394.1 died through signal KILL (9)
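
For readers wondering how the two exit statuses relate to the log entries: by shell convention, a status above 128 usually means "128 plus the signal number", so 137 matches the KILL (9) reported above. A status of 127 conventionally means the shell could not find the command, though whether that is the cause here is not established in this thread. A minimal sketch of that decoding follows; the function name `describe_exit_status` is made up for illustration.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status (0-255) using common conventions."""
    if status > 128:
        # Convention: 128 + N means the process was terminated by signal N.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name} ({signum})"
    if status == 127:
        # Convention only: the shell reports 127 when a command is not found.
        return "command not found (shell convention)"
    return f"exited with code {status}"


for status in (127, 137):
    print(status, "->", describe_exit_status(status))
```

This only tells you which signal ended the job, not who sent it; the sender still has to be found in the logs.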

qacct -j 33394

qname std
hostname research026.cm.cluster
group justinchem
owner justinchem
project NONE
department defaultdepartment
jobname runCHO-C6H5-Cs_opt.24081
jobnumber 33394
taskid undefined
account sge
priority 0
qsub_time Mon Mar 31 09:54:53 2014
start_time Mon Mar 31 09:55:10 2014
end_time Mon Mar 31 09:55:33 2014
granted_pe gauss
slots 4
failed 100 : assumedly after job
exit_status 137
ru_wallclock 23
ru_utime 0.003
ru_stime 0.008
ru_maxrss 1380
ru_ixrss 0
ru_ismrss 0
ru_idrss 0
ru_isrss 0
ru_minflt 1957
ru_majflt 5
ru_nswap 0
ru_inblock 584
ru_oublock 40
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 58
ru_nivcsw 6
cpu 82.570
mem 452.669
io 0.084
iow 0.000
maxvmem 5.710G
arid undefined
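
When several jobs fail like this, it can help to reduce each accounting record to the few fields that matter. The sketch below assumes the `qacct -j` output has been captured as text in the "name value" layout shown above; `parse_qacct` is a hypothetical helper, not part of Grid Engine, and the sample holds only a few lines copied from the record above.

```python
def parse_qacct(text: str) -> dict:
    """Turn 'name value' lines from qacct -j output into a dictionary."""
    record = {}
    for line in text.splitlines():
        # Split on the first run of whitespace; the value may contain spaces.
        parts = line.split(None, 1)
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


# A few lines copied from the record for job 33394.
SAMPLE = """\
hostname research026.cm.cluster
jobnumber 33394
granted_pe gauss
slots 4
failed 100 : assumedly after job
exit_status 137
ru_wallclock 23
cpu 82.570
maxvmem 5.710G
"""

record = parse_qacct(SAMPLE)
for field in ("hostname", "slots", "failed", "exit_status",
              "ru_wallclock", "cpu", "maxvmem"):
    print(f"{field:13s} {record[field]}")
```

Comparing `maxvmem` and `ru_wallclock` across the killed jobs is one way to see whether they share a common ceiling.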

Thanks,

Eric
--
Eric Kaufmann | Application Support Analyst - Advanced Technology Group |
Saint Louis University | 314-977-2257 | ***@slu.edu
Reuti
2014-03-31 17:31:42 UTC
Hi,
Post by Eric Kaufmann
We are using ge 6.2u5 with CentOS 6.4.
I have jobs that are randomly being killed. Here is the log entry. The jobs that are getting killed are getting an exit status of 127 or 137. I did check /var/log/messages on the nodes and didn't see anything out of the ordinary.
03/31/2014 09:55:30|worker|kepler|W|job 33393.1 failed on host research029.cm.cluster assumedly after job because: job 33393.1 died through signal KILL (9)
03/31/2014 09:55:34|worker|kepler|W|job 33394.1 failed on host research026.cm.cluster assumedly after job because: job 33394.1 died through signal KILL (9)
Did you request any limit during job submission? The lines above are from the messages file of the qmaster. Is there anything in SGE's own messages file on the nodes? (What you checked there was the system log, not the SGE one.)
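
One way to follow up on this is to filter a messages file for entries that mention the job. The sketch below assumes the pipe-separated layout seen in the qmaster lines quoted above (timestamp, component, host, level, message); the field names and the helper `find_job_entries` are illustrative only. Where the node-side messages file lives depends on the site's execd spool directory, so the function takes lines rather than a path.

```python
import re
from datetime import datetime
from typing import Iterable, Iterator, NamedTuple


class LogEntry(NamedTuple):
    when: datetime
    component: str
    host: str
    level: str
    message: str


def find_job_entries(lines: Iterable[str], job_id: int) -> Iterator[LogEntry]:
    """Yield parsed entries whose message mentions the given job number."""
    pattern = re.compile(rf"\b{job_id}\b")
    for line in lines:
        # Assumed layout: timestamp|component|host|level|message
        fields = line.rstrip("\n").split("|", 4)
        if len(fields) != 5:
            continue  # not in the expected format, skip it
        stamp, component, host, level, message = fields
        if pattern.search(message):
            when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
            yield LogEntry(when, component, host, level, message)


# The two qmaster entries from this thread.
SAMPLE_LINES = [
    "03/31/2014 09:55:30|worker|kepler|W|job 33393.1 failed on host "
    "research029.cm.cluster assumedly after job because: job 33393.1 died "
    "through signal KILL (9)",
    "03/31/2014 09:55:34|worker|kepler|W|job 33394.1 failed on host "
    "research026.cm.cluster assumedly after job because: job 33394.1 died "
    "through signal KILL (9)",
]

for entry in find_job_entries(SAMPLE_LINES, 33394):
    print(entry.when.isoformat(), entry.level, entry.message)
```

In practice the lines would come from an open file object for the messages file on the node that ran the job.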

-- Reuti
_______________________________________________
users mailing list
https://gridengine.org/mailman/listinfo/users