On 5 Feb 2014, at 16:52, Matt Raso-Barnett wrote:
> On 05/02/14 16:40, Matt Raso-Barnett wrote:
>> I've found something else today on the WNs which looks to be perhaps the
>> problem.
>>
>> I turned on maximum log output for glexec on the WN earlier (somehow I
>> missed this variable when looking through /etc/glexec.conf before) and
>> immediately saw the following:
>>
>> glexec[51695] 20140205T145808Z: Reading in
>> GLEXEC_CLIENT_CERT='/mnt/lustre/grid/users/pilatl01/home_cream_445503617/cream_445503617.proxy'.
>>
>> glexec[51695] 20140205T145808Z: Could not lock file during reading of
>> proxy
>> /mnt/lustre/grid/users/pilatl01/home_cream_445503617/cream_445503617.proxy.
>> glexec[51695] 20140205T145808Z: Reading proxy failed.
>> glexec[51695] 20140205T145808Z: Failed to lock
>> $GLEXEC_CLIENT_CERT=/mnt/lustre/grid/users/pilatl01/home_cream_445503617/cream_445503617.proxy,
>> $GLEXEC_SOURCE_PROXY=(NULL) or destination proxy.
>>
>> I'm not sure yet why this is failing, but these messages are
>> occurring at the time the nagios check fails, so they are likely the cause.
>
> Sorry to reply to myself, but this definitely looks like the issue for me -- testing flock fails when the lock file is written to our Lustre file system, but works fine against a local disk such as /tmp.
>
> It seems from some initial googling that I need to tweak the way we mount lustre to support flock.
>
> Does this sound familiar to anyone else (Chris W maybe)?
>
> Cheers,
> Matt
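The flock test Matt describes can be reproduced with a short probe like the one below; the function name and the /tmp path are illustrative, and the error behaviour assumes a filesystem that rejects flock() (as Lustre reportedly does when mounted without the flock option):

```python
import fcntl
import os
import tempfile

def supports_flock(directory):
    """Try to take an exclusive flock on a scratch file in `directory`.

    Returns True if the lock is granted, False if the filesystem
    refuses flock() (e.g. a Lustre client mounted without flock).
    """
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.flock(fd, fcntl.LOCK_UN)
            return True
        except OSError:
            return False
    finally:
        os.close(fd)
        os.remove(path)

# A local disk should always support flock; run the same probe
# against a directory on the Lustre mount to compare.
print(supports_flock("/tmp"))
```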
Not sure whether flock is needed for glexec, but I recently made it the default mount option for Lustre so that we could run HDF5 parallel I/O, which needs flock.
Most nodes have had Lustre remounted since then, but not all, including the grid nodes. You'll need to unmount and then remount Lustre while no jobs are running for it to take effect.
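The remount described above might look like the following on a worker node; this is a sketch only, and the mgsnode@tcp:/fsname mount source is a placeholder for the site's actual MGS NID and filesystem name:

```shell
# Drain jobs first; umount fails with EBUSY while files are open
umount /mnt/lustre

# Remount the client with flock enabled
# (mgsnode@tcp:/fsname is a placeholder source)
mount -t lustre -o flock mgsnode@tcp:/fsname /mnt/lustre

# Confirm the option took effect: "flock" should appear in the options
mount | grep lustre
```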
Jeremy