Hello,
>> (Jobs passed and failed for both the robot's DN & Alastair's,
>> sometimes on the same machine).
>
> That's sounding like some kind of network problem, on the basis that it's hard to see what else could be different (I suppose your comment about file descriptors could be similar).
It does, but if it is, it's not load related, and I would have expected a
few failures with some other mode if our network had gotten "dodgy".
The file descriptor angle is looking promising. On our compute nodes we
have an open-file limit of 1024, for a cluster with over 2000 job slots.
It's a little naive, but running `lsof | wc -l` on our compute nodes
pulls up between 1500 & 2500 open files.
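A less naive check might look at descriptors per process rather than
node-wide, since `lsof` counts everything on the node while `ulimit -n`
applies to each process separately. The following is only a rough sketch: it
assumes Linux with /proc, assumes the 1024 figure is the per-process soft
limit (I haven't confirmed which limit is actually being hit), and the
function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: compare per-process descriptor counts with the per-process limit.
# Assumes Linux with /proc; the function names are hypothetical.

# Soft limit on open files for this shell (a job would normally inherit
# this unless the batch system changes it).
nofile_limit() {
    ulimit -n
}

# Number of file descriptors held by a given PID.
# Prints 0 if the PID does not exist or we are not allowed to look.
fd_count() {
    { ls "/proc/$1/fd" 2>/dev/null || true; } | wc -l | tr -d ' '
}

# Print "count pid" for the heaviest descriptor users we can inspect,
# highest first. Run as root to see processes owned by other users.
top_fd_users() {
    for d in /proc/[0-9]*; do
        pid=${d#/proc/}
        echo "$(fd_count "$pid") $pid"
    done | sort -rn | awk -v n="${1:-5}" 'NR <= n'
}

echo "soft limit: $(nofile_limit)"
top_fd_users 5
```

If any single process shows a count close to the soft limit, that would
support the descriptor theory better than the node-wide total does.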
Looks like we may have a winner?
Cheers,
Matt
>
>> There was one change this week, I restarted the cream services after
>> installing a new certificate on the CE. However, this doesn't seem like a
>> server-side certificate problem to me; the problems look to be in or
>> near user space. What are others' opinions on this?
>
> The LFC lookup should be done on the WN with user credentials, so I don't think the CE is relevant. It could be either that the client doesn't recognise the LFC host cert, hence CAs, CRLs etc, or that the LFC doesn't recognise the user proxy - but then it would be the same everywhere. (Unless the CE is somehow mangling it in transit?)
>
> Stephen
>