Discussion:
[gridengine users] Error 137 - trying to figure out what it means.
Jake Carroll
2013-01-13 23:56:47 UTC
Hi all.

We're trying to figure out a problem that has been escaping us. We can usually solve most of these issues ourselves, but this one we're having trouble trapping, and we can't find any solid answers despite a lot of searching through online resources.

One of our quite capable users [read: he rarely needs our help with Grid Engine] has an unusual issue with certain jobs crashing out, seemingly at random, with error 137. The jobs run atop SGE 6.2u5 on the ROCKS cluster platform. What makes this hard for us is that these array-based jobs (no PEs/parallel environments, and no MPI/MPICH explicitly in use) fail only some of the time: some tasks crash and others don't. It seems almost quasi-random.

The code is written in Fortran and compiled with Intel's ifort using standard optimisation (compile flag -O2). However, the code was also compiled with optimisation turned off and with traceback and error reporting turned on, and in both cases the programs failed with no run-time error printed. The same code compiled with gfortran also produced error '137'.

The code has run successfully numerous times, though it does something slightly different each run due to random sampling and different model specifications. There are 20 jobs because the analyses are run across 20 replicates of a simulation. Previously our user had no problems running these 20 replicates across 11 different models (20x11 = 220 runs).

Some specifics:

Array job. Memory allocation is 20GB, and the job uses less than 14GB.

Submitted through a shell script with qsub test.sh, where test.sh looks like:

-------------------------------------------------------
#$ -cwd
#$ -l vf=20G
#$ -N b1_set12_1
#$ -m eas
#$ -M ***@somedomain.com
/path/to/some/stuff/here/bayesRsim <b1_set12_1.par
-------------------------------------------------------

Intel's default is static compilation, from what we understand; in any case, no external libraries are used (although Intel links in its own MKL library).

We can't see any obvious memory starvation issues or resource contention problems. Do you have any suggestions for things we could look at to trap this? What's online about error 137 seems sparse at best, after a little looking around.
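One thing we plan to try is recording the limits each task actually runs under, from inside the job script itself (a sketch based on our submit script above; the paths are placeholders):

-------------------------------------------------------
#$ -cwd
#$ -l vf=20G
#$ -N b1_set12_1
#$ -m eas
#$ -M ***@somedomain.com
# Record the host and the shell limits this task actually sees:
hostname
ulimit -a
/path/to/some/stuff/here/bayesRsim <b1_set12_1.par
# Capture the raw exit status before anything else can clobber it:
echo "bayesRsim exit status: $?"
-------------------------------------------------------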

Any help would be appreciated.

--JC
Ron Chen
2013-01-14 00:34:30 UTC
Exit code 137 means the process was killed, typically because it exceeded a time limit. Google is your best friend if you run into similar issues; the first thing to check is the default time limit of your shell.
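For background, 137 is how the shell reports a process killed by SIGKILL: a signal-terminated process exits with 128 plus the signal number, and SIGKILL is signal 9, so 128 + 9 = 137. A quick demonstration in bash:

-------------------------------------------------------
$ sleep 60 &
$ kill -9 $!
$ wait $!
$ echo $?
137
-------------------------------------------------------

Grid Engine delivers SIGKILL when a job runs past a hard limit, which is why the status surfaces this way.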

 -Ron


************************************************************************

Open Grid Scheduler - the official open source Grid Engine: http://gridscheduler.sourceforge.net/





Jake Carroll
2013-01-14 22:08:28 UTC
Hi.

So we tried hard-setting a different wall time for the specific
user who's experiencing the exit 137 issue. The jobs are still
failing, however.

Here is one job that was killed despite including the wall-time setting.
It clearly did not run for 24h; the input and output are shown below.

--------
- qsub b5_set11_2.sh



- b5_set11_2.sh:

#$ -cwd
#$ -l h_rt=24:00:00

#$ -l vf=20G
#$ -N b5_set11_2
#$ -m eas
#$ -M ***@somewhere
/blah/blah/blah/bayesRsim <b5_set11_2.par


- cat b5_set11_2.e1325823:
/opt/gridengine/default/spool/compute-0-4/job_scripts/1325823: line 7:
8117 Killed /blah/blah/blag/bayesRsim < b5_set11_2.par


-qacct -j 1325823
==============================================================
qname medium.q
hostname compute-0-4.local
group users
owner someguy
project NONE
department defaultdepartment
jobname b5_set11_2
jobnumber 1325823
taskid undefined
account sge
priority 0
qsub_time Mon Jan 14 15:36:49 2013
start_time Mon Jan 14 15:36:55 2013
end_time Mon Jan 14 18:11:56 2013
granted_pe NONE
slots 1
failed 0
exit_status 137
ru_wallclock 9301
ru_utime 9262.906
ru_stime 7.916
ru_maxrss 13820636
ru_ixrss 0
ru_ismrss 0
ru_idrss 0
ru_isrss 0
ru_minflt 46056
ru_majflt 26
ru_nswap 0
ru_inblock 392840
ru_oublock 32
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 536
ru_nivcsw 30791
cpu 9270.822
mem 61688.906
io 0.430
iow 0.000
maxvmem 13.302G
arid undefined

So, you mentioned the "default time limit of your shell". My googling
suggested setting a wall-time limit, or having the user specify the
wall time, but that did not help: ru_wallclock above is 9301 s (about
2 h 35 min), so the 24 h h_rt was never reached. A few searches mention
a global time limit for jobs in general, but make no reference to a
default time limit of the shell. Am I supposed to be looking at limits
such as s_rt and h_rt? If so, how do I manipulate these for a specific
user? The queue_conf man page makes some reference to this, but it
doesn't explain explicitly how to set them globally or per user, or
what the defaults or "shell" refer to.
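For reference, the knobs we have found so far look like this (a sketch, assuming standard qconf syntax and our medium.q queue):

-------------------------------------------------------
# Show the current run-time limits on the queue:
qconf -sq medium.q | egrep 's_rt|h_rt'

# Set a hard wall-clock limit on the queue itself:
qconf -mattr queue h_rt 24:00:00 medium.q

# Per-user defaults can go in that user's ~/.sge_request file,
# one qsub option per line, e.g.:
-l h_rt=24:00:00
-------------------------------------------------------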

Sorry - just stumbling through this and not finding it too intuitive.


--JC
Post by Ron Chen
Exit code 137 = process was killed because it exceeded the time limit,
and Google is your best friend if you have similar issues - and the
solution is to check the default time limit of your shell.
-Ron
************************************************************************
http://gridscheduler.sourceforge.net/
Reuti
2013-01-14 22:24:20 UTC
Hi,
Post by Jake Carroll
So we tested out trying to hard set wall-time different for the specific
user who's experiencing the Exit 137 issue. We noticed the jobs are still
failing, however.
is there any message about the kill signal in the spooling directory's messages file of the node, i.e.:

/opt/gridengine/default/spool/compute-0-4/messages (search for the job id)
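For example, to pull out everything the execd logged for that job:

grep 1325823 /opt/gridengine/default/spool/compute-0-4/messages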

-- Reuti
Jake Carroll
2013-01-16 11:34:06 UTC
Hi.

Interesting.
Dave Love
2013-01-18 16:14:32 UTC
Post by Jake Carroll
We can't see any obvious memory starvation issues or resource
contention problems. Do you have any suggestions in things we could
look at to trap this? The error 137 stuff online, after looking around
a little, seems sparse at best.
That is actually an example in my copy of accounting(5), and I don't
think it's a recent addition. I'll amend it to say that it usually
means a job was killed with qdel, or it exceeded hard limits, and to
look in the messages file for the reason.
--
Community Grid Engine: http://arc.liv.ac.uk/SGE/
Jake Carroll
2013-01-21 03:52:48 UTC
Hi.

We've now shot the head node in the head (heh) and we're exploring killing
off/restarting each execd on the compute nodes.

Do you recommend a kill -HUP on the process, or something more aggressive?
This will in theory "kill" currently executing jobs on each compute host,
we're assuming?

Also, we just caught another one in the act, on one of the nodes that just
threw the 137:

[***@compute-0-6 ~]# tail -f
/opt/gridengine/default/spool/compute-0-6/messages
01/17/2013 08:03:15| main|compute-0-6|W|reaping job "1371379" ptf
complains: Job does not exist
01/17/2013 09:22:33| main|compute-0-6|W|reaping job "1371379" ptf
complains: Job does not exist
01/17/2013 09:24:55| main|compute-0-6|W|reaping job "1371379" ptf
complains: Job does not exist
01/17/2013 09:34:12| main|compute-0-6|W|reaping job "1371379" ptf
complains: Job does not exist
01/17/2013 10:06:45| main|compute-0-6|E|removing unreferenced job
1371379.7545 without job report from ptf
01/17/2013 10:09:25| main|compute-0-6|W|reaping job "1371379" ptf
complains: Job does not exist
01/18/2013 17:10:52| main|compute-0-6|W|can't register at qmaster
"cluster.local": abort qmaster registration due to communication errors
01/18/2013 17:16:42| main|compute-0-6|W|gethostbyname(cluster.local) took
20 seconds and returns TRY_AGAIN

01/18/2013 17:25:37| main|compute-0-6|E|commlib error: got select error
(No route to host)

What's most unusual about this is that these timestamps don't match up with the error 137 we just saw.


This example job was running for two days or so, then just became unhappy
today, then threw the 137:

Job 1307803 (b5_set11_9) Complete
User = someguy
Queue = ***@compute-0-6.local
Host = compute-0-6.local
Start Time = 01/14/2013 14:22:12
End Time = 01/21/2013 12:23:02
User Time = 6:21:24:07
System Time = 00:00:27
Wallclock Time = 6:22:00:50
CPU = 6:21:24:35
Max vmem = 13.302G
Exit Status = 137



[***@cluster run]$ qacct -j 1307803
==============================================================
qname medium.q
hostname compute-0-6.local
group users
owner uqgmoser
project NONE
department defaultdepartment
jobname b5_set11_9
jobnumber 1307803
taskid undefined
account sge
priority 0
qsub_time Mon Jan 14 14:22:04 2013
start_time Mon Jan 14 14:22:12 2013
end_time Mon Jan 21 12:23:02 2013
granted_pe NONE
slots 1
failed 0
exit_status 137
ru_wallclock 597650
ru_utime 595447.475
ru_stime 27.902
ru_maxrss 13814492
ru_ixrss 0
ru_ismrss 0
ru_idrss 0
ru_isrss 0
ru_minflt 46105
ru_majflt 33
ru_nswap 0
ru_inblock 19736
ru_oublock 160
ru_msgsnd 0
ru_msgrcv 0
ru_nsignals 0
ru_nvcsw 786
ru_nivcsw 1520367
cpu 595475.377
mem 3938342.810
io 0.430
iow 0.000
maxvmem 13.302G
arid undefined


Thoughts, at this point? We're really running out of ideas now [apart from the most recent suggestion of restarting the execd and the qmaster].

--JC
Dave Love
2013-01-24 17:59:15 UTC
Post by Jake Carroll
Hi.
We've now shot the head node in the head (heh) and we're exploring killing
off/restarting each execd on the compute nodes.
Do you recommend a kill -HUP on the process, or something more aggressive?
This will in theory "kill" currently executing jobs on each compute host,
we're assuming?
I can't remember what this refers to, but the init scripts for SGE 8
have a "restart" option which does softstop+start.
Post by Jake Carroll
Also, we just caught another one in the act, on one of the nodes that just
/opt/gridengine/default/spool/compute-0-6/messages
01/17/2013 08:03:15| main|compute-0-6|W|reaping job "1371379" ptf
complains: Job does not exist
01/17/2013 09:22:33| main|compute-0-6|W|reaping job "1371379" ptf
complains: Job does not exist
01/17/2013 09:24:55| main|compute-0-6|W|reaping job "1371379" ptf
complains: Job does not exist
01/17/2013 09:34:12| main|compute-0-6|W|reaping job "1371379" ptf
complains: Job does not exist
01/17/2013 10:06:45| main|compute-0-6|E|removing unreferenced job
1371379.7545 without job report from ptf
01/17/2013 10:09:25| main|compute-0-6|W|reaping job "1371379" ptf
complains: Job does not exist
01/18/2013 17:10:52| main|compute-0-6|W|can't register at qmaster
"cluster.local": abort qmaster registration due to communication errors
01/18/2013 17:16:42| main|compute-0-6|W|gethostbyname(cluster.local) took
20 seconds and returns TRY_AGAIN
01/18/2013 17:25:37| main|compute-0-6|E|commlib error: got select error
(No route to host)
You'd better address the network errors before anything else. As in the
tracker, I don't know what causes the PTF errors, though.
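To chase those, I'd start with name resolution and qmaster reachability from the node (a sketch; 6444 is only the usual default qmaster port, so adjust it for your install):

-------------------------------------------------------
# SGE's own resolver check, via the bundled utilbin helper:
$SGE_ROOT/utilbin/$($SGE_ROOT/util/arch)/gethostbyname -all cluster.local

# Probe the qmaster's commlib endpoint:
qping cluster.local 6444 qmaster 1
-------------------------------------------------------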
Post by Jake Carroll
What's most unusual, about this, is that these time stamps don't match up
with the error 137 we just saw.
Look in the messages files for what does.
Post by Jake Carroll
This example job was running for two days or so, then just became unhappy
Job 1307803 (b5_set11_9) Complete
User = someguy
Host = compute-0-6.local
Start Time = 01/14/2013 14:22:12
End Time = 01/21/2013 12:23:02
That's nearly a week, not two days.
--
Community Grid Engine: http://arc.liv.ac.uk/SGE/
Jake Carroll
2013-01-24 20:20:04 UTC
We figured it out!


The specific user's binary was not respecting the vf memory complex and
was using all the RAM on whatever nodes it landed on!

How this generated a 137 threw us off, though, given the explanation of
what a 137 meant that we were given earlier!

Cheers.

--JC
Reuti
2013-01-25 11:10:02 UTC
Post by Jake Carroll
We figured it out!
Specific user binary was not respecting vf memory complex and decided to
use all the RAM on random nodes it landed on!
So it was killed by the oom-killer?

Was this a hard limit h_vmem or only a complex?
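A quick way to check on the node (assuming the kernel log lands in /var/log/messages):

-------------------------------------------------------
dmesg | grep -i oom
grep -i 'out of memory' /var/log/messages
-------------------------------------------------------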

-- Reuti
Dave Love
2013-01-29 00:15:31 UTC
Post by Jake Carroll
We figured it out!
Specific user binary was not respecting vf memory complex and decided to
use all the RAM on random nodes it landed on!
There's nothing to respect. If you use vf, it's only relevant to
scheduling jobs, not memory usage while running. If you want to
restrict usage (as you typically should), make h_vmem consumable with
appropriate values on the hosts, and use that. See various references
in the archives.
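The usual recipe looks like this (a sketch; the 48G host budget is only an example value):

-------------------------------------------------------
# 1. In the complex definition (qconf -mc), flip h_vmem to consumable:
#name    shortcut  type    relop requestable consumable default urgency
h_vmem   h_vmem    MEMORY  <=    YES         YES        0       0

# 2. On each host (qconf -me compute-0-4), set that host's budget:
complex_values        h_vmem=48G

# 3. Jobs then request their share at submit time:
qsub -l h_vmem=20G test.sh
-------------------------------------------------------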
Post by Jake Carroll
How this generated a 137, and the explanation for what we were told a 137
meant really threw us off however!
How come? There was a reference to the documentation with an explicit
answer to the question. (Advice on examining messages and syslog is in
http://arc.liv.ac.uk/SGE/howto/troubleshooting.html.)
--
Community Grid Engine: http://arc.liv.ac.uk/SGE/