core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 193056
max locked memory (kbytes, -l) 256
max memory size (kbytes, -m) unlimited
open files (-n) 8192
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 193056
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 193056
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 193056
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
I think it might be the machine killing them, because we're not putting any
other limits anywhere, unless it's the application we're running. The tasks
usually take up a lot of RAM, and if more than one hits a machine it can be
swapping like crazy.
It would be good to still be able to catch this before it gets the signal.
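If it is the kernel's OOM killer doing the killing, it usually leaves a trace
in the node's kernel log. A quick check on an affected node (just a sketch;
the exact log wording varies by kernel version):

  $ dmesg | grep -i -e 'out of memory' -e 'killed process'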
Post by Reuti
You can check the messages file of the execd on the nodes, whether anything
about the reason was recorded there.
-- Reuti
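Assuming the default cell and the usual spool layout (paths may differ on your
install), grepping the execd messages file on a node for the job number is
enough, e.g. for the job shown further down, which ran on atom12:

  $ grep 21141 $SGE_ROOT/default/spool/atom12/messages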
The problem is that I don't have any such limits enforced currently on
submission. The submissions to qsub are hidden from the user, so I know they're
not adding them. The only thing we have is a load/suspend threshold in the
grid itself (np_load_avg, 1.75), and if I give it the memory limit I get the
same result, but the other jobs have been getting this same signal and were
submitted without any limits.
atom10b = 215 times
huey = 856 times
atom24 = 356 times
atom23 = 345 times
atom05 = 669 times
atom15 = 796 times
atom12 = 432 times
atom22 = 250 times
centi = 152 times
sage = 186 times
atom08 = 588 times
fluffy = 101 times
atom20 = 561 times
atom10 = 570 times
neon = 129 times
atom17 = 358 times
atom14 = 188 times
atom13 = 414 times
atom21 = 406 times
atom11 = 182 times
dewey = 658 times
atom16 = 423 times
atom06 = 500 times
atom01 = 802 times
atom18 = 567 times
atom09 = 539 times
milly = 113 times
louie = 249 times
atom03 = 793 times
topsy = 69 times
atom02 = 834 times
atom04 = 359 times
atom07 = 791 times
atom19 = 488 times
Seems a little more than just users killing it on their local machines.
Could that load average be doing this? Or any other settings in qmon I might
be overlooking?
Lars
<snip>
Trying to raise a 100 error in the epilog didn't work. If the task
failed with 137, it will not accept anything else, it seems. It works fine for
other errors, but not for a kill command.
The failure is recorded as "assumedly after job", so it looks like it's too
late to prevent kicking it out of the system.
What about using s_vmem then? The job will get a SIGXCPU signal and can
act upon it, e.g. by using a trap in the job script. The binary on its own will
be terminated unless you change the default behavior there too.
(Soft limits in SGE are not like soft ulimits [which only introduce a
second, lower limit on user request].)
-- Reuti
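A minimal sketch of that trap idea (the limit values, script name and payload
are invented for illustration, not taken from the thread): request a soft
limit below the hard one, catch SIGXCPU in the shell, and exit 100 so the job
ends up in an error state instead of looking like it finished normally.

  $ cat jobscript.sh
  #!/bin/bash
  #$ -S /bin/bash
  # on an s_vmem overrun SGE sends SIGXCPU; exiting 100 puts the job in error state
  trap 'echo "soft memory limit reached" >&2; exit 100' XCPU
  ./my_real_task        # hypothetical payload; it may still die on the signal itself

  $ qsub -l s_vmem=2G,h_vmem=2.5G jobscript.sh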
Either inside the grid code itself, or at another location
like the epilog that gets run regardless of the task being killed?
Also, is it normal that the task being killed will not run its
epilog?
No. The epilog will be executed, but not the remainder of the job
script. Otherwise necessary cleanups couldn't be executed.
I see this. I'm printing the exit status of 100 but still no bananas.
-- Reuti
Lars
You're a beacon of knowledge, Reuti! Thank you!
I think I'll have enough to have another stab at my problem.
Lars
Post by lars van der bijl
Also, is there any way of catching this and raising 100? Once the job
is finished and its dependencies start, it's causing major havoc on our
system, looking for files that aren't there.
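For reference, this is roughly how such a dependency chain is usually wired up
with qsub (job and script names here are invented); the held job is released
once its predecessor leaves the queue regardless of exit status, which matches
the behaviour described here and is why forcing the failed job into an error
state matters:

  $ qsub -N geometry_pass render_geometry.sh
  $ qsub -N comp_pass -hold_jid geometry_pass composite.sh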
Post by lars van der bijl
Are there other things the grid uses SIGKILL for? Not just
memory limits?
h_rt and h_cpu too: man queue_conf
Or any ulimits in the cluster, which you set by other means.
-- Reuti
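One way to see which ulimits a job actually inherits on an execution host
(presumably where the two ulimit listings at the top of this mail came from)
is to submit a trivial job that just dumps them, then compare against an
interactive shell; the output filename is arbitrary:

  $ echo 'ulimit -a' | qsub -cwd -j y -o job_limits.txt
  $ ulimit -a        # interactive limits, for comparison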
Post by lars van der bijl
Lars
In this case, yes.
However, on the jobs running on our farm we put no memory limits as
of yet; we just request an amount of procs.
Post by lars van der bijl
Is it usual behaviour that if it fails with this code, the
subsequent dependencies start regardless?
Post by lars van der bijl
Lars
Hi,
Post by lars van der bijl
Hey everyone.
We're having some issues with jobs being killed with exit status
137.
137 = 128 + 9
$ kill -l
1) SIGHUP 2) SIGINT 3) SIGQUIT 4) SIGILL
5) SIGTRAP 6) SIGABRT 7) SIGBUS 8) SIGFPE
9) SIGKILL ...
So, the job was killed. Did you request too small a value for
h_vmem or h_rt?
-- Reuti
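The same arithmetic can be used anywhere you see a status above 128, e.g. in a
wrapper or epilog, to turn it back into a signal name (a small shell sketch,
not part of any stock SGE script):

  status=137                      # e.g. the exit_status reported by qacct
  if [ "$status" -gt 128 ]; then
      sig=$((status - 128))
      echo "killed by signal $sig (SIG$(kill -l $sig))"
  fi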
Post by lars van der bijl
This causes the task to finish and start its dependent task, which
is causing all kinds of havoc.
Post by lars van der bijl
Submitting a job with a very small max memory limit gives me this
as an example.
Post by lars van der bijl
$ qacct -j 21141
==============================================================
qname test.q
hostname atom12.**
group **
owner lars
project NONE
department defaultdepartment
jobname stest__out__geometry2
jobnumber 21141
taskid 101
account sge
priority 0
qsub_time Fri Apr 1 11:22:30 2011
start_time Fri Apr 1 11:22:31 2011
end_time Fri Apr 1 11:22:39 2011
granted_pe smp
slots 4
failed 100 : assumedly after job
exit_status 137
ru_wallclock 8
ru_utime 0.281
ru_stime 0.167
ru_maxrss 3744
ru_ixrss 0
ru_ismrss 0
ru_idrss 0
ru_isrss 0
ru_minflt 70739
ru_majflt 0
ru_nswap 0
ru_inblock 8
ru_oublock 224
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 1072
ru_nivcsw 439
cpu 2.240
mem 0.573
io 0.145
iow 0.000
maxvmem 405.820M
arid undefined
Anyone know of a reason why the task would be killed with this
error state? Or how to catch it?
Post by lars van der bijl
Lars
_______________________________________________
users mailing list
https://gridengine.org/mailman/listinfo/users