On 05/02/14 16:40, Matt Raso-Barnett wrote:
> I've found something else today on the WNs which looks to be perhaps the
> problem.
>
> I turned on maximum log output for glexec on the WN earlier (somehow I
> missed this variable when looking through /etc/glexec.conf before) and
> immediately saw the following:
>
> glexec[51695] 20140205T145808Z: Reading in
> GLEXEC_CLIENT_CERT='/mnt/lustre/grid/users/pilatl01/home_cream_445503617/cream_445503617.proxy'.
>
> glexec[51695] 20140205T145808Z: Could not lock file during reading of
> proxy
> /mnt/lustre/grid/users/pilatl01/home_cream_445503617/cream_445503617.proxy.
> glexec[51695] 20140205T145808Z: Reading proxy failed.
> glexec[51695] 20140205T145808Z: Failed to lock
> $GLEXEC_CLIENT_CERT=/mnt/lustre/grid/users/pilatl01/home_cream_445503617/cream_445503617.proxy,
> $GLEXEC_SOURCE_PROXY=(NULL) or destination proxy.
>
> I'm not sure yet why this is failing, but these messages occur at
> the time the nagios check fails, so they are likely the reason.
Sorry to reply to myself, but this looks very likely to be the issue for
me: testing flock with the lock file on our Lustre file system fails,
while the same test against a local disk such as /tmp works fine.
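A check of this kind can be sketched in Python as follows. This is only an illustration, not the exact test used above: the function name can_flock and the scratch-file prefix are hypothetical, and it only exercises flock(), which may not be the same locking call glexec makes internally. It creates a scratch file in each directory given, tries to take a non-blocking exclusive lock on it, and reports the result.

```python
import errno
import fcntl
import os
import sys
import tempfile


def can_flock(directory):
    """Try an exclusive flock() on a scratch file in `directory`.

    Returns (True, None) on success, or (False, description) if the
    lock call fails. The name `can_flock` is illustrative only.
    """
    # Scratch file is created inside the directory under test.
    fd, path = tempfile.mkstemp(prefix="flock_test_", dir=directory)
    try:
        try:
            # Non-blocking, so a failure is reported instead of hanging.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            return False, "%s (%s)" % (name, exc.strerror)
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True, None
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    # Directories to test are taken from the command line;
    # with no arguments, only the system temp directory is checked.
    for d in sys.argv[1:] or [tempfile.gettempdir()]:
        ok, err = can_flock(d)
        print("%s: %s" % (d, "flock OK" if ok else "flock FAILED: " + err))
```

Running it with a directory on the Lustre mount and /tmp as arguments would show whether the two file systems behave differently for flock.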
It seems from some initial googling that I need to tweak the way we
mount lustre to support flock.
Does this sound familiar to anyone else (Chris W maybe)?
Cheers,
Matt