Can you please check whether qsub from a pool account on the gCE is working?
And please run job-logging-info -v 2 on the UI to find out exactly what failure reason is reported.
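For example, something along these lines (the pool account, queue name, and job ID below are placeholders, not taken from your setup):
[dteam004@gce dteam004]$ echo /bin/hostname | qsub -q dteam
and then, on a gLite UI, for one of the failed grid jobs:
glite-job-logging-info -v 2 <jobID>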
Nuno
-----Original Message-----
From: LHC Computer Grid - Rollout on behalf of Antun Balaz
Sent: Wed 7/11/2007 5:03 PM
To: [log in to unmask]
Subject: [LCG-ROLLOUT] Problems with gCE after latest updates
Hi,
After applying all the latest updates yesterday on all nodes, and
reconfiguring them, we are experiencing significant problems with the gCE, and
only with this node.
Our site setup has the lcg-CE and the gCE share the same TORQUE server,
installed on the lcg-CE. After the update and reconfiguration of all nodes,
jobs sent to the lcg-CE are executed successfully, while those sent to the gCE
fail. They actually reach the batch system, where they are apparently sent to
some WN (the same WNs to which the successful lcg-CE jobs go), but on the WN
there are no traces of the gCE jobs arriving in the pbs_mom logs (while the
lcg-CE jobs are logged correctly).
On the TORQUE server I see the following traces of such a job sent to the gCE
(I am mapped to dteam004):
[root@ce root]# showq -r| grep dteam004
29138+ R DEF ------ 1.0 qo dteam004 dteam wn02 1 2:02:00:00 Wed Jul 11 15:50:00
[root@ce root]# tracejob 29138
/var/spool/pbs/mom_logs/20070711: No such file or directory
/var/spool/pbs/sched_logs/20070711: No such file or directory
Job: 29138.ce.phy.bg.ac.yu
07/11/2007 15:49:59 S enqueuing into dteam, state 1 hop 1
07/11/2007 15:49:59 S Job Queued at request of [log in to unmask],
owner = [log in to unmask], job name = blahjob_fm2961, queue = dteam
07/11/2007 15:49:59 A queue=dteam
07/11/2007 15:50:00 S Job Modified at request of [log in to unmask]
07/11/2007 15:50:00 S Job Run at request of [log in to unmask]
After some time, the job is put into the Hold state:
[root@ce root]# showq | grep 29138
29138 dteam004 Hold 1 2:02:00:00 Wed Jul 11 15:49:59
No additional information is logged in the pbs_server logs, nor are there any
traces of job 29138 on wn02 (where the job supposedly went).
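For reference, I checked roughly like this (standard TORQUE log locations assumed):
[root@ce root]# grep 29138 /var/spool/pbs/server_logs/20070711
[root@wn02 root]# grep 29138 /var/spool/pbs/mom_logs/20070711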
In gram_job_mgr_2420.log I see a series of messages, identical (apart from the
timestamps) to the ones below, which seem fairly normal to me.
Wed Jul 11 16:14:45 2007 JM_SCRIPT: New Perl JobManager created.
Wed Jul 11 16:14:45 2007 JM_SCRIPT: polling job 2479
7/11 16:14:45 JMI: while return_buf = GRAM_SCRIPT_JOB_STATE = 2
7/11 16:14:45 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1
7/11 16:14:55 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
7/11 16:14:55 JMI: testing job manager scripts for type fork exist and
permissions are ok.
7/11 16:14:55 JMI: completed script validation: job manager type is fork.
7/11 16:14:55 JMI: in globus_gram_job_manager_poll()
7/11 16:14:55 JMI: local stdout filename =
/home/dteam004/.globus/.gass_cache/local/md5/4a/058426fba8e47fb45d3e95957ff6ea/md5/94/7c4fdd29d7104121ad40f4b6d4a524/data.
7/11 16:14:55 JMI: local stderr filename = /dev/null.
7/11 16:14:55 JMI: poll: seeking: https://gce.phy.bg.ac.yu:20046/2420/1184161703/
7/11 16:14:55 JMI: poll_fast: ******** Failed to find
https://gce.phy.bg.ac.yu/2420/1184161703/
7/11 16:14:55 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
7/11 16:14:55 JMI: cmd = poll
7/11 16:14:55 JMI: returning with success
Other files in /home/dteam004 look OK; e.g., SchedLog and MasterLog do not
reveal anything. The mapping is OK, etc.
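(By "mapping is OK" I mean a simple check along these lines, with the file
locations following the standard LCG layout:
[root@gce root]# grep -i dteam /etc/grid-security/grid-mapfile
[root@gce root]# ls /etc/grid-security/gridmapdir | grep dteam004
)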
What is strange is the following: we have a bunch of SL4 gLite-3.0 WNs
installed in compatibility mode, and another bunch of native SL4 gLite-3.1 WNs.
We first updated and reconfigured (as requested in the release notes) the
lcg-CE, the gCE, and the gLite-3.0 WNs, and ran into these problems with the
gCE. Then we verified that there were no problems if gCE jobs ended up on the
gLite-3.1 WNs (which had not been updated). Once those nodes were updated as
well, they too stopped working for gCE jobs.
What I do see is that there are no gahp processes on the gCE, although they
are supposed to be there when everything works. So maybe this is, after all, a
Condor configuration problem?
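This is what I base that on (just a process listing; the exact gahp binary name
may differ between releases):
[root@gce root]# ps -ef | grep -i gahp
[root@gce root]# ps -ef | grep condor_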
Any help would be appreciated.
Thanks, Antun
-----
Antun Balaz
Research Assistant
E-mail: [log in to unmask]
Web: http://scl.phy.bg.ac.yu/
Phone: +381 11 3713152
Fax: +381 11 3162190
Scientific Computing Laboratory
Institute of Physics, Belgrade, Serbia
-----