Jake Carroll
2013-01-13 23:56:47 UTC
Hi all.
We're trying to figure out a problem that is escaping us. We can usually solve most of these issues ourselves, but we're having trouble trapping this one and can't find any solid answers after a lot of searching online.
One of our quite capable users [read: he rarely needs our help with Grid Engine] has an unusual issue with certain jobs crashing out, seemingly at random, with error 137. The code is predominantly C++ based running atop SGE 6.2u5 on the ROCKS cluster platform. What makes it hard for us is that these array jobs (no PEs/parallel environments and no explicit MPI/MPICH in use) only crash sometimes: some fail and others do not. It seems almost quasi-random.
The code is written in Fortran and compiled with Intel's ifort, using standard code optimisation (compile flag -O2). However, the code was also compiled with optimisation turned off and traceback and error reporting turned on; in both cases the programs failed and no run-time error was printed. The same code compiled with gfortran also produced error '137'.
The code has run successfully numerous times, but does something slightly different each time due to random sampling and different model specifications. There are 20 jobs because analyses are run across 20 replicates of a simulation. Previously our user had no problems running these 20 replicates across 11 different models (20x11=220 runs).
Some specifics:
Array job.
Memory allocation is 20GB, and the job uses less than 14GB.
Submitted through a shell script with qsub test.sh, where test.sh looks like:
-------------------------------------------------------
#$ -cwd
#$ -l vf=20G
#$ -N b1_set12_1
#$ -m eas
#$ -M s***@somedomain.com
/path/to/some/stuff/here/bayesRsim <b1_set12_1.par
-----------------------------------------------------------------------------------------------------------------
Intel's default is 'static compiling' from what we understand; in any case, no external libraries are used (although Intel uses its own MKL library).
We can't see any obvious memory starvation issues or resource contention problems. Do you have any suggestions for things we could look at to trap this? Information online about error 137, after looking around a little, seems sparse at best.
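One thing we could log while trapping this, assuming the "error 137" is the job's exit status as seen by the shell: by shell convention a status above 128 means the process was terminated by signal (status - 128), which would make 137 signal 9 (SIGKILL). Below is a minimal sketch of a helper for the job script. The name describe_exit_status is hypothetical and not part of SGE, and it only decodes the status; it does not say who sent the signal.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): explain a shell exit status.
describe_exit_status() {
    code=$1
    if [ "$code" -gt 128 ]; then
        # By shell convention, 128 + N means "terminated by signal N".
        sig=$((code - 128))
        # kill -l N prints the name of signal N; fall back if it is unknown.
        name=$(kill -l "$sig" 2>/dev/null || echo unknown)
        echo "exit status $code: terminated by signal $sig ($name)"
    elif [ "$code" -eq 0 ]; then
        echo "exit status 0: normal exit"
    else
        echo "exit status $code: program exited on its own with an error"
    fi
}

# Possible use at the end of test.sh, right after the program line:
#   /path/to/some/stuff/here/bayesRsim <b1_set12_1.par
#   describe_exit_status $? >&2
```

Writing the result to stderr would put it in the job's error file next to any run-time messages, so failed and successful array tasks could be compared afterwards.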
Any help would be appreciated.
--JC