Arnau Bria wrote:
> our lcg-CE (lcg-CE-3.0.23-0) seems to cancel any job you submit.
>
> [root@ce05 root]# head -n1 /var/log/messages
> Apr 20 04:04:13 ce05 syslogd 1.4.1: restart.
> [root@ce05 root]# tail -n1 /var/log/messages
> Apr 21 10:08:45 ce05 GRAM gatekeeper[30470]: Got connection 142.90.90.46 at Mon Apr 21 10:08:45 2008
>
> # grep -c "added to DEQUEUE list" /var/log/messages
> 1276
> ^^^^
>
>
> Logs for a single job, from the CE through to our batch system (Torque/Maui):
>
>
> CE:
> Apr 20 21:06:37 ce05 gridinfo[11193]: JMA 2008/04/20 21:06:37 GATEKEEPER_JM_ID 2008-04-20.21:06:27.0000003453.0000000022 has GRAM_SCRIPT_JOB_ID 1208718397:lcgpbs:internal_2177507569:11193.1208718392 manager type lcgpbs
> Apr 20 21:07:42 ce05 gridinfo: [11413-12965] Submitted job 1208718397:lcgpbs:internal_2177507569:11193.1208718392 to batch system lcgpbs with ID 4151819.pbs01.pic.es
> Apr 20 21:10:39 ce05 gridinfo: [11413-11413] Job 1208718397:lcgpbs:internal_2177507569:11193.1208718392 added to DEQUEUE list
> Apr 20 21:10:39 ce05 gridinfo: [11413-19604] Job 1208718397:lcgpbs:internal_2177507569:11193.1208718392 (batch ID 4151819.pbs01.pic.es) REMOVED from batch system ok
>
>
> PBS:
>
> Torque:
> 04/20/2008 21:07:42;0100;PBS_Server;Job;4151819.pbs01.pic.es;enqueuing into gshort, state 1 hop 1
> 04/20/2008 21:07:42;0008;PBS_Server;Job;4151819.pbs01.pic.es;Job Queued at request of [log in to unmask], owner = [log in to unmask], job name = STDIN, queue = gshort
> 04/20/2008 21:09:35;0008;PBS_Server;Job;4151819.pbs01.pic.es;Job Modified at request of [log in to unmask]
> 04/20/2008 21:09:35;0008;PBS_Server;Job;4151819.pbs01.pic.es;Job Run at request of [log in to unmask]
> 04/20/2008 21:09:35;0008;PBS_Server;Job;4151819.pbs01.pic.es;Job Modified at request of [log in to unmask]
> 04/20/2008 21:09:35;0008;PBS_Server;Job;4151819.pbs01.pic.es;MOM rejected modify request, error: 15001
> 04/20/2008 21:10:39;0008;PBS_Server;Job;4151819.pbs01.pic.es;Job deleted at request of [log in to unmask]
> 04/20/2008 21:10:39;0100;PBS_Server;Job;4151819.pbs01.pic.es;dequeuing from gshort, state EXITING
>
>
> Maui:
>
> 04/20 21:09:35 INFO: job '4151819' loaded: 1 ops001 ops 86400 Idle 0 1208718462 [NONE] [NONE] [NONE] >= 0 >= 0 [slc4] 1208718575
> 04/20 21:09:35 MRMJobStart(4151819,Msg,SC)
> 04/20 21:09:35 MPBSJobStart(4151819,base,Msg,SC)
> 04/20 21:09:35 MPBSJobModify(4151819,Resource_List,Resource,td006.pic.es)
> 04/20 21:09:35 MPBSJobModify(4151819,Resource_List,Resource,1)
> 04/20 21:09:35 WARNING: cannot set job '4151819.pbs01.pic.es' attr 'Resource_List:neednodes' to '1' (rc: 15001 'Unknown Job Id')
> 04/20 21:09:35 INFO: job '4151819' successfully started
> 04/20 21:11:36 INFO: active PBS job 4151819 has been removed from the queue. assuming successful completion
>
>
> So, could anyone help us determine what causes a CE to cancel jobs?
The lcgpbs job manager code in
/opt/globus/lib/perl/Globus/GRAM/JobManager/lcgpbs.pm
cancels any job that the batch system reports in 'W' (waiting) state:
--------------------------------------------------------------
if(/Q|W|T/)
{
    if ($status_line eq "W")
    {
        $self->cancel();
        $state = Globus::GRAM::JobState::FAILED;
    }
    else
    {
        $state = Globus::GRAM::JobState::PENDING;
    }
}
--------------------------------------------------------------
There must have been a good reason for that...
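
To check whether that branch is what is hitting your jobs, you could look for jobs that Torque currently reports in state 'W'. Below is a minimal sketch, not something taken from the lcgpbs code: the helper name list_waiting_jobs is made up for illustration, and it assumes the usual "qstat -f" layout, where each job starts with a "Job Id: <id>" line followed by indented "attribute = value" lines that include "job_state = <letter>".

```shell
# Hypothetical helper: reads "qstat -f" style output on stdin and prints
# the ID of every job whose job_state attribute is W.
# Assumes the "Job Id: <id>" / "job_state = <letter>" layout described above.
list_waiting_jobs() {
    awk '
        /^Job Id:/ { id = $3 }                          # remember the current job ID
        $1 == "job_state" && $3 == "W" { print id }     # report jobs in state W
    '
}
```

On the Torque server this would be run as "qstat -f | list_waiting_jobs". If the IDs it prints match batch IDs from the "REMOVED from batch system" lines in the CE log, that would point at the 'W' branch above, though a job may only sit in that state briefly, so an empty result does not rule it out.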