Hi all,
A status report from RAL:
The RAL firewall is now open for the RB on ports 7772 and 9000, so in theory
sites using our RB should be in business. Please give it a whirl.
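As a quick sanity check from a UI at your site (plain telnet will do - both
ports should connect rather than time out):

    telnet lcgrb01.gridpp.rl.ac.uk 7772
    telnet lcgrb01.gridpp.rl.ac.uk 9000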
I discovered a misconfiguration in our networking settings that caused the
nodes to use the wrong machine as primary DNS. This slowed the whole system
down: every lookup waited for a completely innocent batch worker in the
main farm to respond to DNS requests - something it wasn't doing (as you'd
expect!) - before falling back to the secondary nameserver. This is now
fixed and the system is much quicker.
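For reference, the fix amounted to putting a real nameserver first in
resolv.conf on the nodes (addresses below are placeholders, not our actual
servers):

    # /etc/resolv.conf on the farm nodes (illustrative addresses)
    # The first entry must be a machine actually running DNS; the resolver
    # only falls back to the second entry after a timeout on the first.
    nameserver 130.246.0.1
    nameserver 130.246.0.2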
However: the CE is still playing up. Both Steve and I have been scratching
our heads over this for a while and have come up with the following:
1/ Some job requests fail to set up the correct job payload directory on
the CE in /home/dteamNNN, so when the job request is sent to the WN, the WN
tries and fails to copy the payload and the job is requeued (seemingly
forever, though I think it only tries three times before giving up). Most
of the failing jobs seemed to be Trevor's monitor jobs, though the logs
show some local jobs failing too.
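If anyone wants to watch for this on their own CE, something along these
lines shows whether the payload actually appeared (paths as on our setup,
with dteam pool accounts under /home - substitute the right account):

    # On the CE: a job about to fail on the WN shows up as a missing or
    # empty payload directory under the pool account it mapped to.
    ls -lR /home/dteamNNN
    # Cross-check against what PBS thinks is queued:
    qstat -an | grep dteam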
2/ Now it seems to be running better - single edg jobs run through OK, as
do jobs submitted locally with qsub. However, if I submit two or more local
edg jobs in quick succession through the RB, the second and subsequent ones
fail with a hand-of-god message (a reproduction sketch follows the output
below):
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job :
https://lcgrb01.gridpp.rl.ac.uk:9000/i8Zjp6Hg2PXoFXgpmW3F4A
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: lcgce01.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-short
reached on: Thu Sep 25 13:51:47 2003
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job :
https://lcgrb01.gridpp.rl.ac.uk:9000/xcFTN1MU_Xik8Z0gH9BKOA
Current Status: Done (Cancelled)
Exit code: 0
Status Reason: Cannot read JobWrapper output, both from Condor and from
Maradona.
Destination: lcgce01.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-short
reached on: Thu Sep 25 13:56:51 2003
*************************************************************
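For the record, reproducing this takes nothing more than the following
(options from memory, and test.jdl is just the usual trivial test job - a
sketch of both is below):

    // test.jdl - minimal test job (sketch)
    Executable    = "/bin/hostname";
    StdOutput     = "std.out";
    StdError      = "std.err";
    OutputSandbox = {"std.out", "std.err"};

    # Submit twice, back to back, forcing our CE; the second job is the
    # one that comes back Done (Cancelled) with the Maradona message:
    edg-job-submit -r lcgce01.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-short test.jdl
    edg-job-submit -r lcgce01.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-short test.jdl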
3/ I'm still getting failures of all offsite jobs, for various reasons.
First, one where the RB chose the destination at random:
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job :
https://lcgrb01.gridpp.rl.ac.uk:9000/YMfrFf4I7b49yKJxhDP9jg
Current Status: Aborted
Status Reason: Job RetryCount (3) hit for
https://lcgrb01.gridpp.rl.ac.uk:9000/YMfrFf4I7b49yKJxhDP9jg
Destination:
wn-02-29-a.cr.cnaf.infn.it:2119/jobmanager-lcgpbs-infinite
reached on: Thu Sep 25 13:57:42 2003
And a job sent directly to CERN:
*************************************************************
BOOKKEEPING INFORMATION:
Printing status info for the Job :
https://lcgrb01.gridpp.rl.ac.uk:9000/jzeFQEGAAfXk_-WWENh0TQ
Current Status: Aborted
Status Reason: Cannot plan (a helper failed)
Destination: adc0015.cern.ch:2119/jobmanager-lcgpbs-short
reached on: Thu Sep 25 13:15:01 2003
*************************************************************
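If anyone fancies digging further, the full event history for any of these
should be retrievable from the LB with something like (verbosity flag as I
remember it):

    edg-job-get-logging-info -v 2 https://lcgrb01.gridpp.rl.ac.uk:9000/YMfrFf4I7b49yKJxhDP9jg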
I'm beginning to conclude that an in-place update is not a feasible
approach when config changes take place - the update broke the RB
completely, and the problems we've had since do not inspire confidence.
Changing the RB ports was a bad move for firewalled sites. I still don't
have an official list of required ports for all the node types, so I'm
about to resort to asking for a firewall log - or capturing one myself, see
below - to see which ports are actually in use.
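Failing a log, a packet capture on the RB/CE would show the same thing
(interface name assumed; the filter keeps only incoming connection
attempts):

    # Show the destination port of every inbound TCP SYN:
    tcpdump -n -i eth0 'tcp[tcpflags] == tcp-syn'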
If anyone has any ideas about where I go from here to track down the
problems, please get in touch. I'm not about to reinstall the boxes without
understanding why they are faulty, nor without sure knowledge that a
reinstall will fix the problem.
Cheers, Martin.