Discussion:
[gridengine users] help
jan roels
2012-11-13 12:42:35 UTC
Hi,

I followed this tutorial on how to install SGE:

http://verahill.blogspot.be/2012/06/setting-up-sun-grid-engine-with-three.html

It all went fine on my master node, but I have some trouble on my exec node.

First it gave the following error:

11/13/2012 13:44:43| main|node0|E|communication error for "node0/execd/1" running on port 6445: "can't bind socket"
11/13/2012 13:44:44| main|node0|E|commlib error: can't bind socket (no additional information available)
11/13/2012 13:45:12| main|node0|C|abort qmaster registration due to communication errors
11/13/2012 13:45:14| main|node0|W|daemonize error: child exited before sending daemonize state

But then I killed the process and restarted gridengine-execd, and now I
get the following:

/etc/init.d/gridengine-exec restart
* Restarting Sun Grid Engine Execution Daemon sge_execd
error: can't resolve host name
error: can't get configuration from qmaster -- backgrounding

What can I do to fix this?
Reuti
2012-11-13 16:34:19 UTC
Hi,
Post by jan roels
11/13/2012 13:44:43| main|node0|E|communication error for "node0/execd/1" running on port 6445: "can't bind socket"
Is there already something running on this port - any older version of the execd?
Post by jan roels
/etc/init.d/gridengine-exec restart
* Restarting Sun Grid Engine Execution Daemon sge_execd
error: can't resolve host name
error: can't get configuration from qmaster -- backgrounding
What can I do to fix this?
Is there any firewall on the machines? Ports 6444 and 6445 need to be excluded from it.
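
Both questions above (is something already bound to the port, and is an old execd still around) can be checked from the shell. What follows is only a rough sketch: it assumes the `ss` and `pgrep` tools are installed, and the helper name `port_listening` is made up for this example. It checks local listeners only and says nothing about whether a firewall blocks the ports between hosts.

```shell
#!/bin/sh
# Sketch: is anything already listening on the two ports from this thread?
# 6444 and 6445 are the ports named above; adjust if your setup differs.

# port_listening PORT (hypothetical helper): reads `ss -ltn` style output
# on stdin and succeeds if some local address ends in :PORT.
port_listening() {
    awk -v p="$1" '
        NR > 1 { n = split($4, a, ":"); if (a[n] == p) found = 1 }
        END    { exit !found }'
}

for port in 6444 6445; do
    if command -v ss >/dev/null 2>&1 && ss -ltn | port_listening "$port"; then
        echo "port $port: something is listening"
    else
        echo "port $port: nothing listening (or ss not available)"
    fi
done

# A stale execd keeps port 6445 bound, so list any that are still running.
pgrep -l sge_execd || echo "no sge_execd process found"
```

If an old sge_execd shows up here, stopping it before restarting the service should clear the "can't bind socket" error.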

-- Reuti
_______________________________________________
users mailing list
https://gridengine.org/mailman/listinfo/users
jan roels
2012-11-14 09:08:14 UTC
I got it working again. There was already an execd process running that
needed to be killed before restarting the services.

I'm trying to run a script now:


#!/bin/bash
#$-cwd
#$-N SA
#$-S /bin/sh
#$-t 1-4200:1

/var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"

but it gives the following output:

stdin: is not a tty

and this is the output of my qstat -f:

queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
***@camilla.UGent.be BIP 0/1/1 0.70 lx26-amd64
35 0.50000 SA root r 11/14/2012 09:57:47 1 1
---------------------------------------------------------------------------------
***@node0 BIP 0/24/24 27.71 lx26-amd64
35 0.50000 SA root r 11/14/2012 09:57:47 1 2
35 0.50000 SA root r 11/14/2012 09:57:47 1 3
35 0.50000 SA root r 11/14/2012 09:57:47 1 4
35 0.50000 SA root r 11/14/2012 09:57:47 1 5
35 0.50000 SA root r 11/14/2012 09:57:47 1 6
35 0.50000 SA root r 11/14/2012 09:57:47 1 7
35 0.50000 SA root r 11/14/2012 09:57:47 1 8
35 0.50000 SA root r 11/14/2012 09:57:47 1 9
35 0.50000 SA root r 11/14/2012 09:57:47 1 10
35 0.50000 SA root r 11/14/2012 09:57:47 1 11
35 0.50000 SA root r 11/14/2012 09:57:47 1 12
35 0.50000 SA root r 11/14/2012 09:57:47 1 13
35 0.50000 SA root r 11/14/2012 09:57:47 1 14
35 0.50000 SA root r 11/14/2012 09:57:47 1 15
35 0.50000 SA root r 11/14/2012 09:57:47 1 16
35 0.50000 SA root r 11/14/2012 09:57:47 1 17
35 0.50000 SA root r 11/14/2012 09:57:47 1 18
35 0.50000 SA root r 11/14/2012 09:57:47 1 19
35 0.50000 SA root r 11/14/2012 09:57:47 1 20
35 0.50000 SA root r 11/14/2012 09:57:47 1 21
35 0.50000 SA root r 11/14/2012 09:57:47 1 22
35 0.50000 SA root r 11/14/2012 09:57:47 1 23
35 0.50000 SA root r 11/14/2012 09:57:47 1 24
35 0.50000 SA root r 11/14/2012 09:57:47 1 25

############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
35 0.50000 SA root qw 11/14/2012 09:57:38 1 26-4200:1


***@camilla:/nfs/share/sge# qstat -explain c -j 35
==============================================================
job_number: 35
exec_file: job_scripts/35
submission_time: Wed Nov 14 09:57:38 2012
owner: root
uid: 0
group: root
gid: 0
sge_o_home: /root
sge_o_log_name: root
sge_o_path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
sge_o_shell: /bin/bash
sge_o_workdir: /nfs/share/sge
sge_o_host: camilla
account: sge
cwd: /nfs/share/sge
mail_list: ***@camilla
notify: FALSE
job_name: SA
jobshare: 0
shell_list: NONE:/bin/sh
env_list:
script_file: HistDisCaCO31.sh
job-array tasks: 1-4200:1
usage 1: cpu=00:05:20, mem=105.16135 GBs, io=0.01537, vmem=1.110G, maxvmem=1.110G
usage 2: cpu=00:04:17, mem=179.44371 GBs, io=0.01395, vmem=3.643G, maxvmem=3.643G
usage 3: cpu=00:04:37, mem=191.69532 GBs, io=0.01394, vmem=3.657G, maxvmem=3.657G
usage 4: cpu=00:04:34, mem=188.12645 GBs, io=0.01394, vmem=3.655G, maxvmem=3.655G
usage 5: cpu=00:04:16, mem=180.18292 GBs, io=0.01394, vmem=3.636G, maxvmem=3.636G
usage 6: cpu=00:04:22, mem=183.47616 GBs, io=0.01394, vmem=3.644G, maxvmem=3.644G
usage 7: cpu=00:04:15, mem=179.89624 GBs, io=0.01400, vmem=3.640G, maxvmem=3.640G
usage 8: cpu=00:04:55, mem=207.28643 GBs, io=0.01394, vmem=3.669G, maxvmem=3.669G
usage 9: cpu=00:04:27, mem=184.86707 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
usage 10: cpu=00:04:14, mem=179.09446 GBs, io=0.01394, vmem=3.635G, maxvmem=3.635G
usage 11: cpu=00:04:47, mem=195.80372 GBs, io=0.01400, vmem=3.668G, maxvmem=3.668G
usage 12: cpu=00:04:49, mem=203.43895 GBs, io=0.01394, vmem=3.665G, maxvmem=3.665G
usage 13: cpu=00:04:45, mem=196.67175 GBs, io=0.01394, vmem=3.663G, maxvmem=3.663G
usage 14: cpu=00:04:24, mem=185.68047 GBs, io=0.01400, vmem=3.648G, maxvmem=3.648G
usage 15: cpu=00:04:40, mem=195.96253 GBs, io=0.01394, vmem=3.656G, maxvmem=3.656G
usage 16: cpu=00:04:11, mem=179.84016 GBs, io=0.01394, vmem=3.633G, maxvmem=3.633G
usage 17: cpu=00:04:43, mem=196.21689 GBs, io=0.01394, vmem=3.662G, maxvmem=3.662G
usage 18: cpu=00:04:37, mem=197.39875 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
usage 19: cpu=00:04:35, mem=191.55982 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
usage 20: cpu=00:04:26, mem=191.62928 GBs, io=0.01394, vmem=3.643G, maxvmem=3.643G
usage 21: cpu=00:04:42, mem=197.87398 GBs, io=0.01394, vmem=3.660G, maxvmem=3.660G
usage 22: cpu=00:04:36, mem=193.43107 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
usage 23: cpu=00:04:32, mem=193.12103 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
usage 24: cpu=00:04:25, mem=186.56485 GBs, io=0.01400, vmem=3.644G, maxvmem=3.644G
usage 25: cpu=00:04:51, mem=201.81706 GBs, io=0.01400, vmem=3.669G, maxvmem=3.669G
scheduling info: queue instance "***@camilla" dropped because it is full
queue instance "***@node0" dropped because it is full
All queues dropped because of overload or full
not all array task may be started due to 'max_aj_instances'

Do you know how this can be solved?
Reuti
2012-11-14 17:24:23 UTC
Post by jan roels
#!/bin/bash
#$-cwd
#$-N SA
#$-S /bin/sh
Don't run scripts as root. If something goes wrong, it might trash your machine(s).
Post by jan roels
/var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
stdin: is not a tty
It's just a warning - unless someone complains, I would suggest ignoring it.
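
For anyone who would rather silence the warning than ignore it: on Debian-style systems, a frequent source of "stdin: is not a tty" is a bare `mesg n` line in root's profile, which runs even when no terminal is attached. Whether that is the cause here is an assumption; the thread does not confirm it. A minimal sketch of a guarded version, where the function name `disable_write_messages` is invented for illustration:

```shell
#!/bin/sh
# Hypothetical replacement for a bare `mesg n` line in a profile file.
# It only touches terminal settings when stdin really is a terminal,
# so batch jobs started without a tty stay quiet.
disable_write_messages() {
    if [ -t 0 ]; then
        mesg n 2>/dev/null || true
    fi
}

disable_write_messages
```

Running the jobs as a regular user, as advised above, would avoid root's profile altogether.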
All queues dropped because of overload or full
not all array task may be started due to 'max_aj_instances'
The machine is just full.
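
To put numbers on "full": the qstat output shows 1 slot on camilla and 24 on node0, all in use, so only 25 of the 4200 tasks can run at once. The arithmetic below is just an illustration using those figures. The `max_aj_instances` line in the scheduling info refers to a cluster configuration limit on concurrently running tasks of one array job (see sge_conf(5)); with only 25 slots, the slot count appears to be the tighter limit here.

```shell
#!/bin/sh
# Slot and task counts taken from the qstat output in this thread.
slots_camilla=1
slots_node0=24
tasks=4200

slots=$((slots_camilla + slots_node0))
pending=$((tasks - slots))
# Rounds needed if every round fills all slots (ceiling division).
rounds=$(( (tasks + slots - 1) / slots ))

echo "$slots tasks run at once, $pending wait, about $rounds rounds in total"
```

The 4175 waiting tasks match the pending range 26-4200 shown by qstat. They start on their own as slots free up, so nothing needs fixing unless more slots or hosts are added.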

-- Reuti
jan roels
2012-11-22 08:19:13 UTC
Hi,

Do you know what this error could be?

error reason 2: 11/22/2012 11:12:25 [0:31220]: execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
error reason 3: 11/22/2012 11:12:25 [0:31221]: execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool

This goes on for as long as it is running, and my job state went to:

69 0.50000 SA root Eqw 11/22/2012 09:12:05 1 1-500:1
69 0.00000 SA root qw 11/22/2012 09:12:05 1 501-4200:1

This is the script i was running:

#!/bin/bash
#$-cwd
#$-N SA
#$-t 1-4200:1

/var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"

Hope somebody can fix the problem.

Kind Regards
jan roels
2012-11-22 11:30:45 UTC
Hi,

qstat -j <jobid> didn't show the full error message. This is the full
error message:

11/22/2012 12:26:11| main|camilla|E|shepherd of job 76.226 exited with exit status = 27
11/22/2012 12:26:11| main|camilla|E|can't open usage file "active_jobs/76.226/usage" for job 76.226: No such file or directory
11/22/2012 12:26:11| main|camilla|E|11/22/2012 12:26:10 [0:11412]: execvlp(/var/spool/gridengine/execd/camilla/job_scripts/76, "/var/spool/gridengine/execd/camilla/job_scripts/76") failed: No such file or directory
Reuti
2012-11-22 13:26:04 UTC
Hi,
11/22/2012 12:26:11| main|camilla|E|shepherd of job 76.226 exited with exit status = 27
11/22/2012 12:26:11| main|camilla|E|can't open usage file "active_jobs/76.226/usage" for job 76.226: No such file or directory
11/22/2012 12:26:11| main|camilla|E|11/22/2012 12:26:10 [0:11412]: execvlp(/var/spool/gridengine/execd/camilla/job_scripts/76, "/var/spool/gridengine/execd/camilla/job_scripts/76") failed: No such file or directory
It could be a permission problem. Everyone needs read access to this directory, as the job script is executed from there.
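
One way to check this is to walk up from the job_scripts directory and look at the "others" permission bits of each level, since every parent directory also has to be traversable. This is a minimal sketch, not a definitive diagnosis: the path is the one from the log above, and the helper name `world_readable_dir` is invented for this example.

```shell
#!/bin/sh
# Directory taken from the execvlp error message in this thread.
spool_dir=/var/spool/gridengine/execd/camilla/job_scripts

# world_readable_dir DIR (hypothetical helper): succeed if "others" have
# both read and search permission on DIR, judged from the `ls -ld` mode
# string (characters 8-10 are the bits for others).
world_readable_dir() {
    mode=$(ls -ld "$1" 2>/dev/null | cut -c8-10)
    case "$mode" in
        r?x|r?t) return 0 ;;
        *)       return 1 ;;
    esac
}

# Report each directory from the spool directory up to the root.
dir=$spool_dir
while [ -n "$dir" ] && [ "$dir" != "." ]; do
    if world_readable_dir "$dir"; then
        echo "ok     $dir"
    else
        echo "check  $dir"
    fi
    if [ "$dir" = "/" ]; then
        break
    fi
    dir=$(dirname "$dir")
done
```

Any line marked "check" is either missing or not readable and searchable by everyone, and is worth inspecting with `ls -ld` on the affected host.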

-- Reuti
Hi,
error reason 2: 11/22/2012 11:12:25 [0:31220]: execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
error reason 3: 11/22/2012 11:12:25 [0:31221]: execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
69 0.50000 SA root Eqw 11/22/2012 09:12:05 1 1-500:1
69 0.00000 SA root qw 11/22/2012 09:12:05 1 501-4200:1
#!/bin/bash
#$-cwd
#$-N SA
#$-t 1-4200:1
/var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
Hope somebody can fix the problem.
Kind Regards
I got it working again: an execd process was already running and had to be killed before restarting the services.
#!/bin/bash
#$-cwd
#$-N SA
#$-S /bin/sh
Don't run scripts as root. If something goes wrong it might trash your machine(s).
/var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
stdin: is not a tty
It's just a warning - unless someone complains I would suggest ignoring it.
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
35 0.50000 SA root r 11/14/2012 09:57:47 1 1
---------------------------------------------------------------------------------
35 0.50000 SA root r 11/14/2012 09:57:47 1 2
35 0.50000 SA root r 11/14/2012 09:57:47 1 3
35 0.50000 SA root r 11/14/2012 09:57:47 1 4
35 0.50000 SA root r 11/14/2012 09:57:47 1 5
35 0.50000 SA root r 11/14/2012 09:57:47 1 6
35 0.50000 SA root r 11/14/2012 09:57:47 1 7
35 0.50000 SA root r 11/14/2012 09:57:47 1 8
35 0.50000 SA root r 11/14/2012 09:57:47 1 9
35 0.50000 SA root r 11/14/2012 09:57:47 1 10
35 0.50000 SA root r 11/14/2012 09:57:47 1 11
35 0.50000 SA root r 11/14/2012 09:57:47 1 12
35 0.50000 SA root r 11/14/2012 09:57:47 1 13
35 0.50000 SA root r 11/14/2012 09:57:47 1 14
35 0.50000 SA root r 11/14/2012 09:57:47 1 15
35 0.50000 SA root r 11/14/2012 09:57:47 1 16
35 0.50000 SA root r 11/14/2012 09:57:47 1 17
35 0.50000 SA root r 11/14/2012 09:57:47 1 18
35 0.50000 SA root r 11/14/2012 09:57:47 1 19
35 0.50000 SA root r 11/14/2012 09:57:47 1 20
35 0.50000 SA root r 11/14/2012 09:57:47 1 21
35 0.50000 SA root r 11/14/2012 09:57:47 1 22
35 0.50000 SA root r 11/14/2012 09:57:47 1 23
35 0.50000 SA root r 11/14/2012 09:57:47 1 24
35 0.50000 SA root r 11/14/2012 09:57:47 1 25
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
35 0.50000 SA root qw 11/14/2012 09:57:38 1 26-4200:1
==============================================================
job_number: 35
exec_file: job_scripts/35
submission_time: Wed Nov 14 09:57:38 2012
owner: root
uid: 0
group: root
gid: 0
sge_o_home: /root
sge_o_log_name: root
sge_o_path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
sge_o_shell: /bin/bash
sge_o_workdir: /nfs/share/sge
sge_o_host: camilla
account: sge
cwd: /nfs/share/sge
notify: FALSE
job_name: SA
jobshare: 0
shell_list: NONE:/bin/sh
script_file: HistDisCaCO31.sh
job-array tasks: 1-4200:1
usage 1: cpu=00:05:20, mem=105.16135 GBs, io=0.01537, vmem=1.110G, maxvmem=1.110G
usage 2: cpu=00:04:17, mem=179.44371 GBs, io=0.01395, vmem=3.643G, maxvmem=3.643G
usage 3: cpu=00:04:37, mem=191.69532 GBs, io=0.01394, vmem=3.657G, maxvmem=3.657G
usage 4: cpu=00:04:34, mem=188.12645 GBs, io=0.01394, vmem=3.655G, maxvmem=3.655G
usage 5: cpu=00:04:16, mem=180.18292 GBs, io=0.01394, vmem=3.636G, maxvmem=3.636G
usage 6: cpu=00:04:22, mem=183.47616 GBs, io=0.01394, vmem=3.644G, maxvmem=3.644G
usage 7: cpu=00:04:15, mem=179.89624 GBs, io=0.01400, vmem=3.640G, maxvmem=3.640G
usage 8: cpu=00:04:55, mem=207.28643 GBs, io=0.01394, vmem=3.669G, maxvmem=3.669G
usage 9: cpu=00:04:27, mem=184.86707 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
usage 10: cpu=00:04:14, mem=179.09446 GBs, io=0.01394, vmem=3.635G, maxvmem=3.635G
usage 11: cpu=00:04:47, mem=195.80372 GBs, io=0.01400, vmem=3.668G, maxvmem=3.668G
usage 12: cpu=00:04:49, mem=203.43895 GBs, io=0.01394, vmem=3.665G, maxvmem=3.665G
usage 13: cpu=00:04:45, mem=196.67175 GBs, io=0.01394, vmem=3.663G, maxvmem=3.663G
usage 14: cpu=00:04:24, mem=185.68047 GBs, io=0.01400, vmem=3.648G, maxvmem=3.648G
usage 15: cpu=00:04:40, mem=195.96253 GBs, io=0.01394, vmem=3.656G, maxvmem=3.656G
usage 16: cpu=00:04:11, mem=179.84016 GBs, io=0.01394, vmem=3.633G, maxvmem=3.633G
usage 17: cpu=00:04:43, mem=196.21689 GBs, io=0.01394, vmem=3.662G, maxvmem=3.662G
usage 18: cpu=00:04:37, mem=197.39875 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
usage 19: cpu=00:04:35, mem=191.55982 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
usage 20: cpu=00:04:26, mem=191.62928 GBs, io=0.01394, vmem=3.643G, maxvmem=3.643G
usage 21: cpu=00:04:42, mem=197.87398 GBs, io=0.01394, vmem=3.660G, maxvmem=3.660G
usage 22: cpu=00:04:36, mem=193.43107 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
usage 23: cpu=00:04:32, mem=193.12103 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
usage 24: cpu=00:04:25, mem=186.56485 GBs, io=0.01400, vmem=3.644G, maxvmem=3.644G
usage 25: cpu=00:04:51, mem=201.81706 GBs, io=0.01400, vmem=3.669G, maxvmem=3.669G
All queues dropped because of overload or full
not all array task may be started due to 'max_aj_instances'
The machine is just full.
-- Reuti
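With 25 tasks running and the rest pending, the slots are simply used up, as said above. Should the 'max_aj_instances' message ever need to be ruled out as well, the limit can be read from the global cluster configuration. The sketch below assumes the "name value" line layout printed by "qconf -sconf"; the helper name conf_value is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one parameter from
# "name   value" configuration text read on stdin.
conf_value() {
    # Compare the first field with the requested key, print the second
    # field of the first matching line, then stop reading.
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# Usage on the master node:
#   qconf -sconf | conf_value max_aj_instances
```

If the value printed is well above the number of slots in the queue, the limit is not what keeps the remaining tasks in "qw".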
You guys know how this can be solved?
jan roels
2012-11-22 13:29:14 UTC
Permalink
I tried it with the root account and with another account; both give the same error.
Reuti
2012-11-22 13:34:41 UTC
Permalink
I tried it with the root-account and with another account... both the same error
Is the directory local on "camilla", and is the node name unique?

-- Reuti
jan roels
2012-11-22 13:42:24 UTC
Permalink
I work on an NFS share that is also available on the node. I'm currently testing with only one node, so it's unique...
* Restarting Sun Grid Engine Execution Daemon sge_execd
error: can't resolve host name
Post by jan roels
Post by jan roels
Post by jan roels
Post by jan roels
error: can't get configuration from qmaster -- backgrounding
What can i do to fix this?
Any firewall on the machines? Ports 6444 and 6445 need to be
excluded.
Post by jan roels
Post by jan roels
Post by jan roels
-- Reuti
Post by jan roels
_______________________________________________
users mailing list
https://gridengine.org/mailman/listinfo/users
Reuti
2012-11-22 14:07:25 UTC
Post by jan roels
I work on an NFS share that is also available on the node. I'm currently testing with only one node, so it's unique...
Just be aware that in this case SGE sends the job script to the execd on the node, which in turn stores it on the NFS server (which might be the same machine as the master).

I'm not sure about the error message: is the share mounted with "noexec", or exported with "all_squash"/"root_squash"?

But in those cases the error should be "permission denied".

-- Reuti
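
The client side of that question can be checked on the exec node with something like the sketch below. It assumes a Linux node where /proc/mounts exists; the function name mount_has_noexec is made up for illustration. The squash options are set on the server in /etc/exports, so they would not show up in this check.

```shell
#!/bin/sh
# Report whether a mount point carries the "noexec" option.
# Reads /proc/mounts-style lines on stdin:
#   device mountpoint fstype options dump pass
mount_has_noexec() {
    awk -v mp="$1" '
        $2 == mp {
            found = 1
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "noexec") { print "noexec"; exit }
            print "exec-ok"
            exit
        }
        END { if (!found) print "not-mounted" }
    '
}

# Usage on the exec node (mount point taken from this thread):
#   mount_has_noexec /nfs/share < /proc/mounts
```

If this reports "not-mounted", the share is not visible on the node at all, which would be a different problem from the one asked about.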
I tried it with the root-account and with another account... both the same error
Is the directory local on "camilla", and is the node name unique?
-- Reuti
Hi,
11/22/2012 12:26:11| main|camilla|E|shepherd of job 76.226 exited with exit status = 27
11/22/2012 12:26:11| main|camilla|E|can't open usage file "active_jobs/76.226/usage" for job 76.226: No such file or directory
11/22/2012 12:26:11| main|camilla|E|11/22/2012 12:26:10 [0:11412]: execvlp(/var/spool/gridengine/execd/camilla/job_scripts/76, "/var/spool/gridengine/execd/camilla/job_scripts/76") failed: No such file or directory
Could be a permission problem. Everyone needs read-access to this directory as the jobscript is executed from there.
-- Reuti
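
One way to test the permission theory, as a rough sketch only: look at the last octal digit of the directory mode and see whether "other" users may both read and enter the directory. The helper name world_can_read_dir is hypothetical, and the `stat -c %a` call in the usage comment assumes GNU coreutils.

```shell
#!/bin/sh
# Given an octal permission string (e.g. 755), say whether "other" users
# have both read (4) and search/execute (1) permission on a directory.
world_can_read_dir() {
    mode=$1
    other=${mode#"${mode%?}"}        # last character = digit for "other"
    if [ $((other & 5)) -eq 5 ]; then
        echo yes
    else
        echo no
    fi
}

# Usage on the exec node (path taken from the error message above):
#   world_can_read_dir "$(stat -c %a /var/spool/gridengine/execd/camilla/job_scripts)"
```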
Hi,
error reason 2: 11/22/2012 11:12:25 [0:31220]: execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
error reason 3: 11/22/2012 11:12:25 [0:31221]: execvlp(/var/spool/gridengine/execd/node0/job_scripts/69, "/var/spool
69 0.50000 SA root Eqw 11/22/2012 09:12:05 1 1-500:1
69 0.00000 SA root qw 11/22/2012 09:12:05 1 501-4200:1
#!/bin/bash
#$-cwd
#$-N SA
#$-t 1-4200:1
/var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
Hope somebody can fix the problem.
Kind Regards
I got it working again: there was already an execd process running that needed to be killed before restarting the services.
#!/bin/bash
#$-cwd
#$-N SA
#$-S /bin/sh
Don't run scripts as root. If something goes wrong, it might trash your machine(s).
/var/software/packages/Mathematica/7.0/Executables/math -run "teller=$SGE_TASK_ID;<< ModelCaCO31.m"
stdin: is not a tty
It's just a warning - unless someone complains, I would suggest ignoring it.
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
35 0.50000 SA root r 11/14/2012 09:57:47 1 1
---------------------------------------------------------------------------------
35 0.50000 SA root r 11/14/2012 09:57:47 1 2
35 0.50000 SA root r 11/14/2012 09:57:47 1 3
35 0.50000 SA root r 11/14/2012 09:57:47 1 4
35 0.50000 SA root r 11/14/2012 09:57:47 1 5
35 0.50000 SA root r 11/14/2012 09:57:47 1 6
35 0.50000 SA root r 11/14/2012 09:57:47 1 7
35 0.50000 SA root r 11/14/2012 09:57:47 1 8
35 0.50000 SA root r 11/14/2012 09:57:47 1 9
35 0.50000 SA root r 11/14/2012 09:57:47 1 10
35 0.50000 SA root r 11/14/2012 09:57:47 1 11
35 0.50000 SA root r 11/14/2012 09:57:47 1 12
35 0.50000 SA root r 11/14/2012 09:57:47 1 13
35 0.50000 SA root r 11/14/2012 09:57:47 1 14
35 0.50000 SA root r 11/14/2012 09:57:47 1 15
35 0.50000 SA root r 11/14/2012 09:57:47 1 16
35 0.50000 SA root r 11/14/2012 09:57:47 1 17
35 0.50000 SA root r 11/14/2012 09:57:47 1 18
35 0.50000 SA root r 11/14/2012 09:57:47 1 19
35 0.50000 SA root r 11/14/2012 09:57:47 1 20
35 0.50000 SA root r 11/14/2012 09:57:47 1 21
35 0.50000 SA root r 11/14/2012 09:57:47 1 22
35 0.50000 SA root r 11/14/2012 09:57:47 1 23
35 0.50000 SA root r 11/14/2012 09:57:47 1 24
35 0.50000 SA root r 11/14/2012 09:57:47 1 25
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
35 0.50000 SA root qw 11/14/2012 09:57:38 1 26-4200:1
==============================================================
job_number: 35
exec_file: job_scripts/35
submission_time: Wed Nov 14 09:57:38 2012
owner: root
uid: 0
group: root
gid: 0
sge_o_home: /root
sge_o_log_name: root
sge_o_path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
sge_o_shell: /bin/bash
sge_o_workdir: /nfs/share/sge
sge_o_host: camilla
account: sge
cwd: /nfs/share/sge
notify: FALSE
job_name: SA
jobshare: 0
shell_list: NONE:/bin/sh
script_file: HistDisCaCO31.sh
job-array tasks: 1-4200:1
usage 1: cpu=00:05:20, mem=105.16135 GBs, io=0.01537, vmem=1.110G, maxvmem=1.110G
usage 2: cpu=00:04:17, mem=179.44371 GBs, io=0.01395, vmem=3.643G, maxvmem=3.643G
usage 3: cpu=00:04:37, mem=191.69532 GBs, io=0.01394, vmem=3.657G, maxvmem=3.657G
usage 4: cpu=00:04:34, mem=188.12645 GBs, io=0.01394, vmem=3.655G, maxvmem=3.655G
usage 5: cpu=00:04:16, mem=180.18292 GBs, io=0.01394, vmem=3.636G, maxvmem=3.636G
usage 6: cpu=00:04:22, mem=183.47616 GBs, io=0.01394, vmem=3.644G, maxvmem=3.644G
usage 7: cpu=00:04:15, mem=179.89624 GBs, io=0.01400, vmem=3.640G, maxvmem=3.640G
usage 8: cpu=00:04:55, mem=207.28643 GBs, io=0.01394, vmem=3.669G, maxvmem=3.669G
usage 9: cpu=00:04:27, mem=184.86707 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
usage 10: cpu=00:04:14, mem=179.09446 GBs, io=0.01394, vmem=3.635G, maxvmem=3.635G
usage 11: cpu=00:04:47, mem=195.80372 GBs, io=0.01400, vmem=3.668G, maxvmem=3.668G
usage 12: cpu=00:04:49, mem=203.43895 GBs, io=0.01394, vmem=3.665G, maxvmem=3.665G
usage 13: cpu=00:04:45, mem=196.67175 GBs, io=0.01394, vmem=3.663G, maxvmem=3.663G
usage 14: cpu=00:04:24, mem=185.68047 GBs, io=0.01400, vmem=3.648G, maxvmem=3.648G
usage 15: cpu=00:04:40, mem=195.96253 GBs, io=0.01394, vmem=3.656G, maxvmem=3.656G
usage 16: cpu=00:04:11, mem=179.84016 GBs, io=0.01394, vmem=3.633G, maxvmem=3.633G
usage 17: cpu=00:04:43, mem=196.21689 GBs, io=0.01394, vmem=3.662G, maxvmem=3.662G
usage 18: cpu=00:04:37, mem=197.39875 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
usage 19: cpu=00:04:35, mem=191.55982 GBs, io=0.01394, vmem=3.653G, maxvmem=3.653G
usage 20: cpu=00:04:26, mem=191.62928 GBs, io=0.01394, vmem=3.643G, maxvmem=3.643G
usage 21: cpu=00:04:42, mem=197.87398 GBs, io=0.01394, vmem=3.660G, maxvmem=3.660G
usage 22: cpu=00:04:36, mem=193.43107 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
usage 23: cpu=00:04:32, mem=193.12103 GBs, io=0.01394, vmem=3.652G, maxvmem=3.652G
usage 24: cpu=00:04:25, mem=186.56485 GBs, io=0.01400, vmem=3.644G, maxvmem=3.644G
usage 25: cpu=00:04:51, mem=201.81706 GBs, io=0.01400, vmem=3.669G, maxvmem=3.669G
All queues dropped because of overload or full
not all array task may be started due to 'max_aj_instances'
The machine is just full.
-- Reuti
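
For reference, max_aj_instances is a parameter of the global cluster configuration (see sge_conf(5)), so its current value can be read from the output of `qconf -sconf`. A minimal sketch, where conf_value is a made-up helper name and the input is assumed to be "name value" pairs, one per line:

```shell
#!/bin/sh
# Print the value of one parameter from "name value" lines on stdin,
# the layout assumed here for the output of `qconf -sconf`.
conf_value() {
    awk -v key="$1" '$1 == key { print $2; exit }'
}

# Usage on the master or any admin host:
#   qconf -sconf | conf_value max_aj_instances
```

With 25 tasks already running, the limit being hit here is more likely the number of free slots, as said above.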
You guys know how this can be solved?
Hi,
http://verahill.blogspot.be/2012/06/setting-up-sun-grid-engine-with-three.html on how to install the SGE. It all went fine on my masternode but on my exec node i have some troubles.
11/13/2012 13:44:43| main|node0|E|communication error for "node0/execd/1" running on port 6445: "can't bind socket"
Is there already something running on this port - any older version of the execd?
11/13/2012 13:44:44| main|node0|E|commlib error: can't bind socket (no additional information available)
11/13/2012 13:45:12| main|node0|C|abort qmaster registration due to communication errors
11/13/2012 13:45:14| main|node0|W|daemonize error: child exited before sending daemonize state
/etc/init.d/gridengine-exec restart
* Restarting Sun Grid Engine Execution Daemon sge_execd error: can't resolve host name
error: can't get configuration from qmaster -- backgrounding
What can i do to fix this?
Any firewall on the machines? Ports 6444 and 6445 need to be excluded.
-- Reuti
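
To answer the "is something already bound to this port" question before restarting the daemon, a sketch like the following could be used. It assumes a Linux node with `ss` from iproute2; port_in_use is a hypothetical helper name, and it only parses the listing, it does not open any connection.

```shell
#!/bin/sh
# Read `ss -ltn`-style output on stdin and report whether any listening
# socket uses the given TCP port. Column 4 is "Local Address:Port".
port_in_use() {
    awk -v port="$1" '
        NR > 1 {
            n = split($4, a, ":")    # port is the part after the last colon
            if (a[n] == port) hit = 1
        }
        END { print (hit ? "in-use" : "free") }
    '
}

# Usage on the exec node (6445 is the execd port from the log above):
#   ss -ltn | port_in_use 6445
# A leftover daemon would also show up with:
#   pgrep -l sge_execd
```

If the port is reported as in use while no execd should be running, an old sge_execd process is a likely candidate, which is what turned out to be the case later in this thread.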
_______________________________________________
users mailing list
https://gridengine.org/mailman/listinfo/users
jan roels
2012-11-22 14:23:07 UTC
/nfs/share 192.168.100.0/255.255.255.0(rw,async,no_root_squash,no_subtree_check)

These are the options in the /etc/exports file, and I mount the share without any special options, just:

mount -t nfs 192.168.100.20:/nfs/share /nfs/share
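
For the squash part of the question, the export options can be read off mechanically. The sketch below classifies one option list from /etc/exports; squash_mode is a hypothetical helper, and it relies on root_squash being the default on Linux NFS servers when no squash option is given.

```shell
#!/bin/sh
# Classify the squash behaviour named in one /etc/exports option list.
squash_mode() {
    case ",$1," in
        *,all_squash,*)     echo "all_squash" ;;
        *,no_root_squash,*) echo "no_root_squash" ;;
        *)                  echo "root_squash" ;;   # default if nothing else is set
    esac
}

# Option list copied from the export line above:
squash_mode "rw,async,no_root_squash,no_subtree_check"   # prints: no_root_squash
```

So with this export neither root nor ordinary users are squashed, which fits the observation that the root account and another account both hit the same error.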
Dave Love
2012-11-18 23:16:08 UTC
Post by jan roels
Hi,
http://verahill.blogspot.be/2012/06/setting-up-sun-grid-engine-with-three.html on how to install the SGE. It all went fine on my masternode but on my exec node i have some troubles.
I strongly recommend not using the Debian packages, which that tutorial
seems to be about. One of the problems is that the packaging doesn't deal
with an entire cluster on updates, e.g. shutting down execds.

The latest SGE snapshot has simple Debian packaging for add-on packages
installing into /opt, as is traditional:
<http://arc.liv.ac.uk/downloads/SGE/snapshots/>. There's also an
updated version of the complex packaging provided by Debian at
<https://arc.liv.ac.uk/trac/SGE/browser/gridengine.debian> (which
doubtless needs more fixing and the latest snapshot).

I'd be interested in feedback.
--
Community Grid Engine: http://arc.liv.ac.uk/SGE/