Hi,
our lcg-CE (lcg-CE-3.0.23-0) seems to cancel any job you submit.
[root@ce05 root]# head -n1 /var/log/messages
Apr 20 04:04:13 ce05 syslogd 1.4.1: restart.
[root@ce05 root]# tail -n1 /var/log/messages
Apr 21 10:08:45 ce05 GRAM gatekeeper[30470]: Got connection 142.90.90.46 at Mon Apr 21 10:08:45 2008
# grep -c "added to DEQUEUE list" /var/log/messages
1276
^^^^
Logs for a single job, from the CE to our batch system (Torque/Maui):
CE:
Apr 20 21:06:37 ce05 gridinfo[11193]: JMA 2008/04/20 21:06:37 GATEKEEPER_JM_ID 2008-04-20.21:06:27.0000003453.0000000022 has GRAM_SCRIPT_JOB_ID 1208718397:lcgpbs:internal_2177507569:11193.1208718392 manager type lcgpbs
Apr 20 21:07:42 ce05 gridinfo: [11413-12965] Submitted job 1208718397:lcgpbs:internal_2177507569:11193.1208718392 to batch system lcgpbs with ID 4151819.pbs01.pic.es
Apr 20 21:10:39 ce05 gridinfo: [11413-11413] Job 1208718397:lcgpbs:internal_2177507569:11193.1208718392 added to DEQUEUE list
Apr 20 21:10:39 ce05 gridinfo: [11413-19604] Job 1208718397:lcgpbs:internal_2177507569:11193.1208718392 (batch ID 4151819.pbs01.pic.es) REMOVED from batch system ok
PBS:
Torque:
04/20/2008 21:07:42;0100;PBS_Server;Job;4151819.pbs01.pic.es;enqueuing into gshort, state 1 hop 1
04/20/2008 21:07:42;0008;PBS_Server;Job;4151819.pbs01.pic.es;Job Queued at request of [log in to unmask], owner = [log in to unmask], job name = STDIN, queue = gshort
04/20/2008 21:09:35;0008;PBS_Server;Job;4151819.pbs01.pic.es;Job Modified at request of [log in to unmask]
04/20/2008 21:09:35;0008;PBS_Server;Job;4151819.pbs01.pic.es;Job Run at request of [log in to unmask]
04/20/2008 21:09:35;0008;PBS_Server;Job;4151819.pbs01.pic.es;Job Modified at request of [log in to unmask]
04/20/2008 21:09:35;0008;PBS_Server;Job;4151819.pbs01.pic.es;MOM rejected modify request, error: 15001
04/20/2008 21:10:39;0008;PBS_Server;Job;4151819.pbs01.pic.es;Job deleted at request of [log in to unmask]
04/20/2008 21:10:39;0100;PBS_Server;Job;4151819.pbs01.pic.es;dequeuing from gshort, state EXITING
Maui:
04/20 21:09:35 INFO: job '4151819' loaded: 1 ops001 ops 86400 Idle 0 1208718462 [NONE] [NONE] [NONE] >= 0 >= 0 [slc4] 1208718575
04/20 21:09:35 MRMJobStart(4151819,Msg,SC)
04/20 21:09:35 MPBSJobStart(4151819,base,Msg,SC)
04/20 21:09:35 MPBSJobModify(4151819,Resource_List,Resource,td006.pic.es)
04/20 21:09:35 MPBSJobModify(4151819,Resource_List,Resource,1)
04/20 21:09:35 WARNING: cannot set job '4151819.pbs01.pic.es' attr 'Resource_List:neednodes' to '1' (rc: 15001 'Unknown Job Id')
04/20 21:09:35 INFO: job '4151819' successfully started
04/20 21:11:36 INFO: active PBS job 4151819 has been removed from the queue. assuming successful completion
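To check whether every job is dequeued after a similar delay, the CE gridinfo lines can be paired by job ID. What follows is a minimal Python sketch, not part of any LCG tooling. The function name `dequeue_delays` is hypothetical, the regular expressions are derived only from the "Submitted job" and "added to DEQUEUE list" lines quoted above, and the year 2008 is assumed because syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Patterns inferred from the two gridinfo lines quoted in this post
# (assumption: other CE versions may format these messages differently).
SUBMIT_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ gridinfo: \[[\d-]+\] "
    r"Submitted job (\S+) to batch system \S+ with ID (\S+)"
)
DEQUEUE_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ gridinfo: \[[\d-]+\] "
    r"Job (\S+) added to DEQUEUE list"
)


def parse_ts(stamp, year=2008):
    """Parse a syslog timestamp; the year is an assumption (syslog omits it).

    %b is locale-dependent, so this expects English month names.
    """
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def dequeue_delays(lines):
    """Map batch ID -> seconds between submission and the DEQUEUE entry."""
    submitted = {}  # GRAM job ID -> (submit time, batch ID)
    delays = {}
    for line in lines:
        m = SUBMIT_RE.match(line)
        if m:
            submitted[m.group(2)] = (parse_ts(m.group(1)), m.group(3))
            continue
        m = DEQUEUE_RE.match(line)
        if m and m.group(2) in submitted:
            start, batch_id = submitted[m.group(2)]
            delays[batch_id] = (parse_ts(m.group(1)) - start).total_seconds()
    return delays
```

Called as `dequeue_delays(open("/var/log/messages"))`, it returns one entry per job that was both submitted and dequeued within the log. For the job shown above (submitted 21:07:42, dequeued 21:10:39) the delay is 177 seconds.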
So, could anyone help us determine what causes a CE to cancel jobs?
Yesterday we reconfigured the host (using yaim), but the error persisted;
then, after a reboot, it started working again... Maybe a hardware failure?
TIA,
Arnau