Hi Chris,
> I'm not sure but certainly the NGI_UK nagios server was submitting jobs
> against it which were succeeding and from when I was looking in the logs
> for any evidence of a problem other jobs were coming in and ending up on
> the queue.
>
> [...]
> >
> >So is the CE blacklisted only for the submissions done by a specific WMS
> >(while everything works properly for jobs submitted by other WMSes) ?
>
> As far as I can tell, yes. That's what makes it so hard to detect and
> debug - if it wasn't the CERN WMSes used for the Experiment SAM tests and
> the CMS JobRobot I would not have spotted it until someone ticketed me.
There definitely is a problem with your CEs as seen from CERN:
---------------------------------------------------------------------------
$ ./chk-svc-cert heplnx206.pp.rl.ac.uk 8443
socket: Connection timed out
connect:errno=29
unable to load certificate
23598:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:647:
Expecting: TRUSTED CERTIFICATE
---------------------------------------------------------------------------
$ ./chk-svc-cert heplnx142.pp.rl.ac.uk 8443
getaddrinfo: Name or service not known
connect:errno=2
unable to load certificate
23669:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:647:
Expecting: TRUSTED CERTIFICATE
---------------------------------------------------------------------------
And a bit later:
---------------------------------------------------------------------------
$ ./chk-svc-cert heplnx206.pp.rl.ac.uk 8443
write:errno=104
unable to load certificate
23749:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:647:
Expecting: TRUSTED CERTIFICATE
---------------------------------------------------------------------------
Error 104 == ECONNRESET.
Compare with another CE in your area (lines wrapped for clarity):
---------------------------------------------------------------------------
$ chk-svc-cert t2ce06.physics.ox.ac.uk 8443
depth=2 /C=UK/O=eScienceRoot/OU=Authority/CN=UK e-Science Root
verify return:1
depth=1 /C=UK/O=eScienceCA/OU=Authority/CN=UK e-Science CA
verify return:1
depth=0 /C=UK/O=eScience/OU=Oxford/L=OeSC/CN=t2ce06.physics.ox.ac.uk/
[log in to unmask]
verify return:1
DONE
notBefore=Sep 7 10:22:43 2010 GMT
notAfter=Oct 7 10:22:43 2011 GMT
---------------------------------------------------------------------------
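The four outcomes above (timeout on connect, failed name lookup, reset after connect, and a clean certificate dump) can be told apart mechanically from the output text. A minimal sketch, assuming the messages are worded exactly as in the transcripts above and that errno 104 means ECONNRESET as on Linux; `classify_failure` is a hypothetical helper, not part of the attached script:

```shell
#!/bin/sh
# Hypothetical helper: read the combined stdout/stderr of a check on stdin
# and print a one-word classification.
# Assumption: the patterns below are copied from the transcripts above;
# other OpenSSL or libc versions may word these messages differently.
classify_failure()
{
    input=`cat`
    case $input in
    *"getaddrinfo: Name or service not known"*)
        echo dns-failure        # host name did not resolve
        ;;
    *"Connection timed out"*)
        echo connect-timeout    # TCP connect never completed
        ;;
    *"errno=104"*)
        echo connection-reset   # ECONNRESET on Linux, after connect
        ;;
    *notAfter=*)
        echo ok                 # certificate dates were printed
        ;;
    *)
        echo unknown
        ;;
    esac
}
```

It would be used along the lines of `./chk-svc-cert heplnx206.pp.rl.ac.uk 8443 2>&1 | classify_failure`; the `2>&1` matters because the error text is written to stderr.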
The script is attached.
This looks like a network problem close to your site: traffic to/from port
8443 has a very bad quality of service or may even be dropped altogether
once the connection has been established...
#!/bin/sh

usage()
{
    echo "Usage: $0 { host:port | host port }" >&2
    exit 2
}

test $# = 1 || test $# = 2 || usage

case $1${2+:$2} in
*:[1-9]*[0-9])
    : # OK
    ;;
*)
    usage
esac

uid=`id -u`

if test "X$uid" = X0
then
    d=/etc/grid-security
    cert=$d/hostcert.pem
    key=$d/hostkey.pem
else
    : ${X509_USER_PROXY:=/tmp/x509up_u$uid}
    PATH=$PATH:$GLOBUS_LOCATION/bin

    grid-proxy-info -exists || {
        echo "$0: you need to have a valid proxy" >&2
        exit 1
    }

    cert=$X509_USER_PROXY
    key=$X509_USER_PROXY
    cafile="-CAfile $X509_USER_PROXY"
fi

: ${X509_CERT_DIR:=/etc/grid-security/certificates/}

openssl s_client -ssl3 -cert $cert -key $key $cafile \
    -CApath $X509_CERT_DIR -connect "$1"${2+:"$2"} < /dev/null |
    openssl x509 -noout -dates
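
Because the failures are intermittent (a timeout first, a reset a bit later), repeating the check gives a rough failure rate for a given CE. A minimal sketch, assuming the check command exits non-zero on failure, which the script above should do since `openssl x509` fails when it cannot load a certificate; `count_failures` is a hypothetical helper, not part of the attachment:

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and print "failures/attempts".
# Usage: count_failures N command [args...]
# Assumption: the command signals failure through a non-zero exit status.
count_failures()
{
    n=$1
    shift
    i=0
    failed=0
    while test "$i" -lt "$n"
    do
        # Discard the output; only the exit status is counted.
        "$@" > /dev/null 2>&1 || failed=$((failed + 1))
        i=$((i + 1))
    done
    echo "$failed/$n"
}
```

For example, `count_failures 20 ./chk-svc-cert heplnx206.pp.rl.ac.uk 8443` run from CERN and from another site would show whether the loss depends on the network path.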