Hi All,

I'm seeking advice on a problem affecting our CE and/or its ability to
communicate with the WMS/RB.

We intermittently fail SAM tests with the error "Got a job held event,
reason: Globus error 131: the user proxy expired (job is still
running)". We also seem to fail Steve Lloyd's ATLAS tests, or rather,
they show up in the yellow state with job status "Running". SAM
tests are submitted through rb113.cern.ch and Steve's tests through
lcgwms01.gridpp.rl.ac.uk; since two separate brokers are affected, the
problem must be at our end.

I've tracked the jobs through the batch system: they run and complete
with no error. The failure occurs at the next stage, where the job
monitor on the gatekeeper fails with errors of the following type:

10/28 13:18:39 JMI: local stdout filename = /grid/home/opssgm/.globus/.gass_cache/local/md5/7f/ab42624c324da55977b89cc3a446d8/md5/14/0d388effa9126ec877fa959f535c41/data.
10/28 13:18:39 JMI: local stderr filename = /dev/null.
10/28 13:18:39 JMI: poll: seeking: https://pc90.hep.ucl.ac.uk:20200/26501/1225199667/
10/28 13:18:39 JMI: poll_fast: ******** Failed to find https://pc90.hep.ucl.ac.uk/26501/1225199667/
10/28 13:18:39 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
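
For anyone who wants to check their own gatekeeper log for the same
pattern, here is a minimal sketch in Python. It assumes the log is plain
text and that the "poll: seeking:" and "poll_fast: ... Failed to find"
lines look exactly like the ones in the excerpt above. The function and
variable names are hypothetical, not part of Globus. It pairs each failure
with the preceding "seeking" line for the same job path and flags cases
where the port present in the sought URL is absent from the failed one, as
in the excerpt. Whether that difference is relevant to the failure is not
something I can confirm.

```python
import re
from urllib.parse import urlsplit

# Patterns modelled on the two JMI log lines quoted in the excerpt (assumed format).
SEEK_RE = re.compile(r"JMI: poll: seeking: (\S+)")
FAIL_RE = re.compile(r"JMI: poll_fast: \*+ Failed to find (\S+)")


def job_key(url):
    """Return the path part of a job contact URL, e.g. '/26501/1225199667/'."""
    return urlsplit(url).path


def find_poll_failures(lines):
    """Pair each 'Failed to find' line with the last 'seeking' line for the same job path."""
    sought = {}    # job path -> URL most recently reported as being sought
    failures = []
    for line in lines:
        match = SEEK_RE.search(line)
        if match:
            sought[job_key(match.group(1))] = match.group(1)
            continue
        match = FAIL_RE.search(line)
        if match:
            failed = match.group(1)
            wanted = sought.get(job_key(failed))
            failures.append({
                "sought": wanted,
                "failed": failed,
                # True when the sought URL carried a port and the failed one did not.
                "port_dropped": (
                    wanted is not None
                    and urlsplit(wanted).port is not None
                    and urlsplit(failed).port is None
                ),
            })
    return failures


# The two relevant lines from the excerpt, used here as sample input.
SAMPLE = [
    "10/28 13:18:39 JMI: poll: seeking: https://pc90.hep.ucl.ac.uk:20200/26501/1225199667/",
    "10/28 13:18:39 JMI: poll_fast: ******** Failed to find https://pc90.hep.ucl.ac.uk/26501/1225199667/",
]

if __name__ == "__main__":
    for failure in find_poll_failures(SAMPLE):
        print(failure)
```

To run it against a real log, pass an open file object in place of SAMPLE,
since the function only iterates over lines.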

Is anyone familiar with this error? I haven't managed to uncover  
anything useful from Globus forums yet.
This does not seem to happen every time, but the growing number of
job-monitor processes left running is doing us no favours in terms of
load on the machine.

Thanks,
Gianfranco

-- 
Dr. Gianfranco Sciacca			Tel: +44 (0)20 7679 3044
Dept of Physics and Astronomy		Internal: 33044
University College London		D15 - Physics Building
London WC1E 6BT