Heya Elena, Raul and Stephen, thanks for the replies.
> it's only comp**.private.dns.zone WN's.
> There is no problem with wn*** wn's.
>
> Perhaps you can compare settings for these two clusters.
> http://panda.cern.ch/server/pandamon/query?jobsummary=site&site=UKI-NORTHGRID-LANCS-HEP
comp* nodes are on an LSF cluster (bane of my life). The nodes are
CentOS5 rather than SL5, but this hasn't bitten us before. The CE is an
old glite 3.2 machine that's in the last few weeks of its life (see the
thread we have going concerning it on LCG-ROLLOUT).
The comp* nodes are tarball nodes using version glite-WN-3.2.12-1; the
wn* nodes are using the (yikes, out of date :-S) glite-WN-3.2.4-0.
They're separate software mounts on different machines, and the CAs and
CRLs are distributed separately. However, I doubt software mount
problems are the cause: with only a dozen jobs running on the comp*
nodes I think we can safely rule out load.
Following Stephen's suggestion I looked up the CRLs on the two clusters
and they appear to be the same. No badness can be seen in the fetch-crl
logs or when running it by hand.
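
For the record, this is roughly the freshness check I scripted while I
was at it. A rough sketch only - it assumes the standard
/etc/grid-security/certificates layout with PEM *.r0 CRL files and the
python "cryptography" library, so adjust paths for the tarball nodes:

    #!/usr/bin/env python
    # Sketch: report the next-update times of the CRLs in a
    # certificates directory and flag any that have gone stale.
    import glob
    from datetime import datetime

    from cryptography import x509
    from cryptography.hazmat.backends import default_backend

    CERT_DIR = "/etc/grid-security/certificates"  # adjust for tarball WNs

    now = datetime.utcnow()
    for path in sorted(glob.glob(CERT_DIR + "/*.r0")):
        with open(path, "rb") as f:
            crl = x509.load_pem_x509_crl(f.read(), default_backend())
        stale = crl.next_update < now
        print("%-45s next update %s%s" % (
            path, crl.next_update, "  <-- STALE" if stale else ""))

Running that against both clusters' certificate areas and diffing the
output is a quick way to spot a CRL that fetch-crl quietly failed to
refresh on one of them.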
>
> Also the problem is seen for prod jobs only.
We had a spate of analysis failures with "file not found" errors that I
tried to investigate (all I managed to establish was that the SE wasn't
the cause). It could be that these errors were disguised in the
analysis job output.
>
> Perhaps it's a problem with user DN on one of ce's?
This could be the case, but I'm unsure what to look for beyond
eyeballing and comparing the two. I noticed yesterday that some
production jobs were passing, and that there didn't seem to be any
pattern. (Jobs passed and failed for both the robot's DN & Alastair's,
sometimes on the same machine).
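
For comparing the DN mappings on the two CEs I've not got beyond
something like this - a rough sketch, assuming the usual
'"<DN>" account' grid-mapfile format and that I've scp'd a copy off
each CE:

    #!/usr/bin/env python
    # Sketch: diff the DN -> account mappings of two grid-mapfiles.
    import re
    import sys

    def load_mappings(path):
        """Return {DN: account} from a grid-mapfile."""
        maps = {}
        with open(path) as f:
            for line in f:
                m = re.match(r'"(.+)"\s+(\S+)', line.strip())
                if m:
                    maps[m.group(1)] = m.group(2)
        return maps

    a = load_mappings(sys.argv[1])  # mapfile from the working CE
    b = load_mappings(sys.argv[2])  # mapfile from abaddon's CE
    for dn in sorted(set(a) | set(b)):
        if a.get(dn) != b.get(dn):
            print("%s: %s vs %s" % (dn, a.get(dn), b.get(dn)))

If anyone has a smarter way of spotting a bad DN entry I'm all ears.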
There was one change this week: I restarted the CREAM services after
installing a new certificate on the CE. However, that would be a
server-side certificate problem, whereas these failures look to be in
or near user space. What are others' opinions on this?
I had a look at the proxies in the sandbox/group/user/proxy directory;
they all seemed in order under some light openssl scrutiny (still in
date, subjects correct).
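
In case it's useful to anyone, the openssl one-liners boiled down to
something like this - a sketch, assuming PEM proxy bundles with the
proxy certificate first and the python "cryptography" library again:

    #!/usr/bin/env python
    # Sketch: print subject and expiry of the first (proxy) certificate
    # in each PEM bundle under a given directory.
    import glob
    import sys
    from datetime import datetime

    from cryptography import x509
    from cryptography.hazmat.backends import default_backend

    for path in sorted(glob.glob(sys.argv[1] + "/*")):
        with open(path, "rb") as f:
            cert = x509.load_pem_x509_certificate(f.read(),
                                                  default_backend())
        in_date = cert.not_valid_after > datetime.utcnow()
        print("%s\n  subject: %s\n  expires: %s %s" % (
            path, cert.subject.rfc4514_string(), cert.not_valid_after,
            "(ok)" if in_date else "(EXPIRED)"))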
Raul had seen this problem before when he ran out of pool accounts,
which sadly wasn't the case here (we have 50 prdatlas pools and have
only used 10). He's also seen it when Torque used up all the file
descriptors. Something similar could happen with LSF; I'm looking into
that (quick check below).
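
The quick check I'm running on the nodes in the meantime - a sketch
reading the kernel's file handle accounting from /proc, so it should
apply whatever the batch system:

    #!/usr/bin/env python
    # Sketch: compare allocated file handles against the system-wide
    # limit. /proc/sys/fs/file-nr holds three counts: allocated file
    # handles, free (allocated but unused), and the maximum.
    with open("/proc/sys/fs/file-nr") as f:
        allocated, free, maximum = (int(x) for x in f.read().split())
    print("file handles in use: %d of %d (%.1f%%)" % (
        allocated, maximum, 100.0 * allocated / maximum))
    if allocated > 0.9 * maximum:
        print("WARNING: close to the system file handle limit")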
Thanks again,
Matt
>
> Cheers,
> Elena
>
> On 30 Aug 2012, at 17:08, Matt Doidge wrote:
>
>> Hello,
>> First up sorry for the cross post, my apologies to all who end up getting this message twice. Desperate times create desperate admin.
>>
>> Lancaster's having a bunch of atlas production jobs failing with the unhelpful error message:
>> Get error: Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2702, Bad credentials)
>>
>> By a bunch I mean 95% of all jobs that run on one of our clusters (the infamous HEC, abaddon.hec.lancs.ac.uk). Our other cluster is working fine.
>>
>> A link to one of the failures:
>> http://panda.cern.ch/server/pandamon/query?job=1589012289
>>
>> Various sources from google suggested a few fixes, such as checking for clock skew (there wasn't any), checking the CA certificates on the workers (they seem okay, I redistributed them just in case) and checking the load on the NAT (which is fine, nothing odd going on there that I can see; in fact as we're in test mode things are very quiet). Other cases of this error message suggested problems at the LFC end, but as things are working for our other cluster and everyone else I don't think this is the case.
>>
>> Has anyone been plagued by these or similar errors?
>>
>> Thanks in advance all,
>> Matt
>
> __________________________________________________
> Dr Elena Korolkova
> Tel.: +44 (0)114 2223553
> Fax: +44 (0)114 2223555
> Department of Physics and Astronomy
> University of Sheffield
> Sheffield, S3 7RH, United Kingdom