JISCMail - TB-SUPPORT Archives

Hi All,

I'm looking for advice on a problem affecting our CE and/or its
ability to communicate with the WMS/RB.

We intermittently fail SAM tests with the error "Got a job held event,
reason: Globus error 131: the user proxy expired (job is still
running)". We also seem to fail Steve Lloyd's ATLAS tests, or rather,
they show up in a yellow state with the job status "Running". The SAM
tests are submitted through rb113.cern.ch and Steve's tests through
lcgwms01.gridpp.rl.ac.uk, so, with two different brokers affected, the
problem must be at our end.

I've tracked the jobs through the batch system and they run and
complete with no error. The failure is at the next stage. The job
monitor on the gatekeeper fails with the following type of error:

10/28 13:18:39 JMI: local stdout filename = /grid/home/opssgm/.globus/.gass_cache/local/md5/7f/ab42624c324da55977b89cc3a446d8/md5/14/0d388effa9126ec877fa959f535c41/data.
10/28 13:18:39 JMI: local stderr filename = /dev/null.
10/28 13:18:39 JMI: poll: seeking: https://pc90.hep.ucl.ac.uk:20200/26501/1225199667/
10/28 13:18:39 JMI: poll_fast: ******** Failed to find https://pc90.hep.ucl.ac.uk/26501/1225199667/
10/28 13:18:39 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
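
In case it helps anyone compare against their own gatekeeper, here is a
minimal sketch for counting how often each job contact hits this
"Failed to find" condition. It assumes the log lines look exactly like
the excerpt above; the function name count_poll_failures and the regular
expression are my own illustration, not part of Globus.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above:
# 10/28 13:18:39 JMI: poll_fast: ******** Failed to find https://host/26501/1225199667/
FAILED_RE = re.compile(
    r"^\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} JMI: poll_fast: "
    r"\*+ Failed to find (?P<contact>\S+)"
)


def count_poll_failures(lines):
    """Return a Counter mapping job contact URL -> number of failed polls.

    `lines` is any iterable of log lines (an open file works).
    Lines that do not match the assumed format are ignored.
    """
    failures = Counter()
    for line in lines:
        match = FAILED_RE.match(line.strip())
        if match:
            failures[match.group("contact")] += 1
    return failures
```

To use it, open the job manager log (wherever your gatekeeper writes it),
pass the file object to count_poll_failures(), and print the result of
.most_common() to see which job contacts are polled unsuccessfully most
often.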

Is anyone familiar with this error? I haven't managed to uncover
anything useful from the Globus forums yet.
It does not seem to happen every time, but the growing number of
job-monitor processes left running is doing us no favours in terms of
load on the machine.

Thanks,
Gianfranco

--
Dr. Gianfranco Sciacca Tel: +44 (0)20 7679 3044
Dept of Physics and Astronomy Internal: 33044
University College London D15 - Physics Building
London WC1E 6BT