Apologies for replying to myself,
> The file descriptor angle is looking promising. On our compute nodes we
> have an open-file limit of 1024, for a cluster with over 2000 job slots.
>
> It's a little naive, but running `lsof | wc -l` on our compute nodes
> pulls up between 1500 & 2500 open files.
>
> Looks like we may have a winner?
But maybe we don't. Having a look at our working cluster I see a very
similar number of open files for the same file descriptor ulimit.
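For what it's worth, the limit from `ulimit -n` applies per process, whereas
`lsof | wc -l` counts entries across the whole node (including things like
memory-mapped files that aren't descriptors at all), so the two numbers
aren't directly comparable. A rough sketch of a per-process check, assuming
Linux with /proc and run as root to see other users' processes; the function
names are made up for illustration:

```shell
#!/bin/sh
# Rough sketch: compare each process's open descriptors with its own soft limit.
# Assumes Linux with /proc; without root, other users' fd directories are unreadable
# and will simply show a count of 0.

# Count the entries in /proc/<pid>/fd, i.e. descriptors the process holds open.
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Soft "Max open files" limit for the process, read from /proc/<pid>/limits.
fd_limit() {
    awk '/^Max open files/ {print $4}' "/proc/$1/limits" 2>/dev/null
}

# Print "pid count limit" for every visible process, busiest first.
fd_report() {
    for dir in /proc/[0-9]*; do
        pid=${dir#/proc/}
        limit=$(fd_limit "$pid")
        [ -n "$limit" ] || continue   # process vanished or limits unreadable
        echo "$pid $(fd_count "$pid") $limit"
    done | sort -k2,2nr
}

fd_report | head -n 5
```

If nothing near the top of that list is anywhere close to its limit, the
descriptor theory is probably dead on both clusters.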
>>> There was one change this week: I restarted the CREAM services after
>>> installing a new certificate on the CE. However, this doesn't seem like a
>>> server-side certificate problem to me; the problems look to be in or
>>> near user space. What are others' opinions on this?
>>
>> The LFC lookup should be done on the WN with user credentials, so I
>> don't think the CE is relevant. It could be either that the client
>> doesn't recognise the LFC host cert, hence CAs, CRLs etc, or that the
>> LFC doesn't recognise the user proxy - but then it would be the same
>> everywhere. (Unless the CE is somehow mangling it in transit?)
It could be that the CE is mangling the proxy, although I would be
surprised if it is. I wish the error message were more helpful (not the
first time any of us has wished that) and made clear whether the problem
was with the user's credential or the LFC's.
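One way to at least rule the proxy in or out might be to dump what a test
job actually sees on the WN. A rough sketch, assuming openssl is available
on the node and that the proxy is a PEM file at $X509_USER_PROXY (falling
back to the conventional /tmp/x509up_u<uid>); the function names are just
for illustration:

```shell
#!/bin/sh
# Rough sketch: summarise the proxy a job sees on the WN.
# Assumes openssl is installed and the proxy is a PEM file.

# Print subject, issuer and expiry of the first certificate in a PEM file.
proxy_summary() {
    openssl x509 -in "$1" -noout -subject -issuer -enddate
}

# Succeed only if that certificate is still valid an hour from now.
proxy_still_valid() {
    openssl x509 -in "$1" -noout -checkend 3600 >/dev/null
}

# Assumed location: the environment variable if set, else the usual default.
proxy=${X509_USER_PROXY:-/tmp/x509up_u$(id -u)}
if [ -r "$proxy" ]; then
    proxy_summary "$proxy"
    proxy_still_valid "$proxy" || echo "proxy expires within the hour (or already has)"
else
    echo "no readable proxy at $proxy"
fi
```

If the subject and issuer look sane and the proxy hasn't expired, that would
point back towards the CAs/CRLs on the WN rather than the CE.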
Thanks,
Matt
>
>
>
>>
>> Stephen
>>