Discussion:
[gridengine users] Jobs being killed with exit status 137
lars van der bijl
2011-04-01 10:33:55 UTC
Hey everyone.

We're having some issues with jobs being killed with exit status 137. This
causes the task to be treated as finished and its dependent task to start,
which is causing all kinds of havoc.

Submitting a job with a very small memory limit gives me this as an
example:

$ qacct -j 21141
==============================================================
qname        test.q
hostname     atom12.**
group        **
owner        lars
project      NONE
department   defaultdepartment
jobname      stest__out__geometry2
jobnumber    21141
taskid       101
account      sge
priority     0
qsub_time    Fri Apr 1 11:22:30 2011
start_time   Fri Apr 1 11:22:31 2011
end_time     Fri Apr 1 11:22:39 2011
granted_pe   smp
slots        4
failed       100 : assumedly after job
exit_status  137
ru_wallclock 8
ru_utime     0.281
ru_stime     0.167
ru_maxrss    3744
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    70739
ru_majflt    0
ru_nswap     0
ru_inblock   8
ru_oublock   224
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     1072
ru_nivcsw    439
cpu          2.240
mem          0.573
io           0.145
iow          0.000
maxvmem      405.820M
arid         undefined

Does anyone know of a reason why the task would be killed with this exit
status, or how to catch it?

Lars
Reuti
2011-04-01 10:41:00 UTC
Hi,
Post by lars van der bijl
Hey everyone.
We're having some issues with jobs being killed with exit status 137.
137 = 128 + 9

$ kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL
5) SIGTRAP 6) SIGABRT 7) SIGBUS 8) SIGFPE
9) SIGKILL ...

So, the job was killed. Did you request too small a value for h_vmem or h_rt?

-- Reuti
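
As a minimal illustration of that 128+signal convention, a small wrapper can translate such an exit status back into a signal name; the script and variable names here are made up for the example.

#!/bin/sh
# decode_exit.sh -- report the signal behind an exit status above 128
status=$1                        # e.g. 137, as reported by qacct's exit_status
if [ "$status" -gt 128 ]; then
    signum=$((status - 128))     # 137 - 128 = 9
    echo "killed by signal $signum (SIG$(kill -l $signum))"   # -> SIGKILL
else
    echo "exited normally with status $status"
fi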
<snip>
_______________________________________________
users mailing list
https://gridengine.org/mailman/listinfo/users
lars van der bijl
2011-04-01 10:54:16 UTC
In this case, yes.

However, on the jobs running on our farm we don't put any memory limits yet;
we just request a number of slots.

Is it the usual behaviour that if a job fails with this code, the
subsequent dependencies start regardless?

Lars
<snip>
lars van der bijl
2011-04-01 11:01:59 UTC
Also, is there any way of catching this and raising 100? Once the job is
finished and its dependencies start, it causes major havoc on our system,
looking for files that aren't there.

Are there other things the grid uses SIGKILL for, not just memory
limits?

Lars
<snip>
Reuti
2011-04-01 14:49:15 UTC
The problem is that I don't have any such limits enforced currently on submission. The submissions to qsub are hidden from the user, so I know they're not adding them. The only thing we have is a load/suspend threshold in the grid itself (np_load_avg, 1.75). If I give it the memory limit I get the same result, but the other jobs have been getting this same signal and were submitted without any limits.
atom10b = 215 times
huey = 856 times
atom24 = 356 times
atom23 = 345 times
atom05 = 669 times
atom15 = 796 times
atom12 = 432 times
atom22 = 250 times
centi = 152 times
sage = 186 times
atom08 = 588 times
fluffy = 101 times
atom20 = 561 times
atom10 = 570 times
neon = 129 times
atom17 = 358 times
atom14 = 188 times
atom13 = 414 times
atom21 = 406 times
atom11 = 182 times
dewey = 658 times
atom16 = 423 times
atom06 = 500 times
atom01 = 802 times
atom18 = 567 times
atom09 = 539 times
milly = 113 times
louie = 249 times
atom03 = 793 times
topsy = 69 times
atom02 = 834 times
atom04 = 359 times
atom07 = 791 times
atom19 = 488 times
That seems like a little more than users killing it on their local machines. Could that load average be doing this, or are there any other settings in qmon that I might be overlooking?
No, not SGE. Either the users do it by hand, or the kernel by triggering the oom-killer because it was running out of memory, or you have some limits set somewhere. What does:

ulimit -Ha
echo
ulimit -Sa

give in a job script? Were the execds started at boot or later on by hand?

-- Reuti
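
A minimal sketch of such a test job script; the embedded qsub options are just illustrative.

#!/bin/sh
#$ -S /bin/sh
#$ -cwd -j y
# print the limits as seen by a job on the execution host
echo "== hard limits =="
ulimit -Ha
echo
echo "== soft limits =="
ulimit -Sa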
Lars
<snip>
Trying to raise a 100 error in the epilog didn't work. If the task failed with 137, it seems it will not accept anything else. It works fine for other errors, but not for a kill.
Indeed. Nevertheless it's noted in `qacct` as "failed 100 : assumedly after job". Looks like it's too late to prevent kicking it out of the system.
What about using s_vmem then? The job will get a SIGXCPU signal and can act upon it, e.g. with a trap in the job script. The binary on its own will be terminated unless you change the default behavior there too.
(soft limits in SGE are not like soft ulimits [which introduce only a second, lower limit on user request])
-- Reuti
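
A minimal sketch of such a trap, assuming a bash job script submitted with a soft limit such as -l s_vmem=4G; the payload, marker file, and exit code are illustrative only.

#!/bin/bash
#$ -S /bin/bash
# On exceeding s_vmem (or s_cpu/s_rt) SGE delivers SIGXCPU to the job.
# Exiting 100 here mirrors what this thread is trying to achieve; adapt to your setup.
trap 'echo "soft limit hit, flagging failure" >&2; touch "$SGE_O_WORKDIR/step_failed"; exit 100' XCPU

./heavy_geometry_task      # hypothetical payload; it also receives SIGXCPU and may still die
exit $?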
Either inside of the grid code itself, or at another location like the epilog that gets run regardless of the task being killed?
Also, is it normal that the task being killed will not run its epilog?
No. The epilog will be executed, but not the remainder of the job script. Otherwise necessary cleanups couldn't be executed.
I see that. I'm returning an exit status of 100, but still no bananas.
-- Reuti
Lars
You're a beacon of knowledge, Reuti! Thank you!
I think I'll have enough to have another stab at my problem.
Lars
Also, is there any way of catching this and raising 100? Once the job is finished and its dependencies start, it causes major havoc on our system, looking for files that aren't there.
Are there other things the grid uses SIGKILL for, not just memory limits?
h_rt and h_cpu too: man queue_conf
Or any ulimits in the cluster, which you set by other means.
-- Reuti
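
To see which of those limits a queue actually sets, something like this can help (using the queue name from the example above):

# show the time and memory limits configured for the queue
qconf -sq test.q | egrep 's_rt|h_rt|s_cpu|h_cpu|s_vmem|h_vmem'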
<snip>
Reuti
2011-04-01 14:50:46 UTC
Add on:

You can check the messages file of the execd on the nodes to see whether anything about the reason was recorded there.

-- Reuti
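
For example, on an affected node; the cell name "default" and a local spool layout are assumptions about the installation.

# look for traces of the killed job in the execd's messages file
grep 21141 $SGE_ROOT/default/spool/atom12/messages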
<snip>
lars van der bijl
2011-04-01 14:57:24 UTC
$ ulimit -Ha
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 193056
max locked memory (kbytes, -l) 256
max memory size (kbytes, -m) unlimited
open files (-n) 8192
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 193056
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

$ ulimit -Sa
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 193056
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 193056
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

I think it might be the machine killing them, because we're not putting any
other limits anywhere, unless it's the application we're running. The tasks
usually take up a lot of RAM, and if more than one hits a machine it can be
swapping like crazy.

It would be good to still be able to catch this before it gets the signal.
<snip>
Reuti
2011-04-01 15:19:42 UTC
Post by lars van der bijl
core file size (blocks, -c) 0
<snip>
file locks (-x) unlimited
Fine.
Post by lars van der bijl
I think it might be the machine killing them, because we're not putting any other limits anywhere, unless it's the application we're running. The tasks usually take up a lot of RAM, and if more than one hits a machine it can be swapping like crazy.
This should show up in /var/log/messages by the kernel.
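
For instance (the log file location varies by distribution):

# look for oom-killer activity around the time the job was killed
grep -iE "out of memory|oom-killer" /var/log/messages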
Post by lars van der bijl
It would be good to still be able to catch this before it gets the signal.
Hehe - not with the oom-killer.

- Limit the slot count per machine.
- Limit and request memory in SGE; this way only the jobs that fit will be scheduled to a machine, until its memory is exhausted.
- Install more memory.

-- Reuti
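
A rough sketch of the second point, treating h_vmem as a consumable; the values, host name, and default are illustrative and the exact complex setup depends on the cluster.

# in the complex configuration (qconf -mc), mark h_vmem as consumable
# with a sensible per-slot default, e.g.:
#   h_vmem  h_vmem  MEMORY  <=  YES  YES  2G  0
qconf -mc

# give each execution host its real memory as complex_values
qconf -me atom12          # add: complex_values h_vmem=32G

# then request memory at submission time
# (h_vmem is counted per slot, so -pe smp 4 -l h_vmem=4G reserves 16G here)
qsub -pe smp 4 -l h_vmem=4G job.sh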
<snip>
Reuti
2011-04-01 11:03:31 UTC
Post by lars van der bijl
In this case, yes.
However, on the jobs running on our farm we don't put any memory limits yet; we just request a number of slots.
Is it the usual behaviour that if a job fails with this code, the subsequent dependencies start regardless?
Yes, for SGE, -hold_jid will only check whether the predecessor has left the system. Its state isn't checked or honored. This needs to be done in the follow-up job script, maybe by sending itself into the error state so that it's not lost.

If you have a list of job names you want to handle in advance, you could submit all but the first job with -h, and each finished job would then have to release the follow-up job in its job script or a queue epilog.

A place to specify the names of the follow-up jobs could be the job context; its meta information is just a comment to SGE, but you can access the information and act upon it.

-- Reuti
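
A minimal sketch of that hold-and-release pattern, assuming bash; the job names, step scripts and the "followup" context key are made up for illustration.

# submit the follow-up job in user hold; -terse prints only the job id
next_id=$(qsub -terse -h -N geometry_step2 step2.sh)

# submit the first job and stash the follow-up's id in its job context
qsub -N geometry_step1 -ac followup=$next_id step1.sh

# at the successful end of step1.sh (or in a queue epilog):
# read the context back and release the user hold on the follow-up
followup=$(qstat -j $JOB_ID | sed -n 's/^context:.*followup=\([0-9]*\).*/\1/p')
[ -n "$followup" ] && qrls $followup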
<snip>