On 02/05/2014 04:40 PM, Matt Raso-Barnett wrote:
> Thanks Gareth, I think I do get some mappings on the Argus server:
>
> ...
> 3147133 -rw-r--r-- 1 root root 0 Dec 13 2012 pilops09
> 3147134 -rw-r--r-- 1 root root 0 Dec 13 2012 pilops10
> 3147132 -rw-r--r-- 2 root root 0 Dec 18 23:13 pilops08
> 3147132 -rw-r--r-- 2 root root 0 Dec 18 23:13
> %2fc%3duk%2fo%3descience%2fou%3doxford%2fl%3doesc%2fcn%3dkashif%20mohammad:pilops:ops
> 3147126 -rw-r--r-- 2 root root 0 Feb 5 16:29 pilops02
> 3147126 -rw-r--r-- 2 root root 0 Feb 5 16:29
> %2fc%3duk%2fo%3descience%2fou%3doxford%2fl%3doesc%2fcn%3dkashif%20mohammad%2fcn%3drobot%3agridclient:pilops:ops
>
That's Kashif's tests running. The mapping is symbolised by a hard link
to the same inode, 3147126. You can rm the long link, and it should be
remade the next time that DN is mapped; it may then point to a different
pool account, e.g. pilops03.
> Is this the only place I should expect this? There is no gridmapdir on
> the WNs but there is on the cream server.
It depends how you have set it up. On your cream servers, you can do this:
[root@hepgrid5 glitecfg]# cat /root/glitecfg/services/glite-creamce
...........
USE_ARGUS=yes
ARGUS_PEPD_ENDPOINTS="https://hepgrid9.ph.liv.ac.uk:8154/authz"
CREAM_PEPC_RESOURCEID="http://ph.liv.ac.uk/hepgrid5"
This CREAM server uses ARGUS, so it doesn't need a gridmapdir.
If a CREAM server doesn't have those settings, it could be using a
gridmapdir instead. If you have several servers (ARGUS, CREAM1, CREAM2...)
using a gridmapdir, it should be shared from one of them with (say) NFS,
so the mappings stay coherent.
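Roughly, the sharing looks like this (hostnames and the export options are
just illustrative here, assuming the ARGUS box holds the master copy):

# /etc/exports on the server that owns the gridmapdir
/etc/grid-security/gridmapdir  cream01.example.ac.uk(rw,sync,no_root_squash)
# on each of the other servers that need the same mappings
mount -t nfs argus.example.ac.uk:/etc/grid-security/gridmapdir /etc/grid-security/gridmapdir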
>
> I've found something else today on the WNs which looks to be perhaps
> the problem.
>
> I turned on maximum log output for glexec on the WN earlier (somehow I
> missed this variable when looking through /etc/glexec.conf before) and
> immediately saw the following:
>
> glexec[51695] 20140205T145808Z: Reading in
> GLEXEC_CLIENT_CERT='/mnt/lustre/grid/users/pilatl01/home_cream_445503617/cream_445503617.proxy'.
> glexec[51695] 20140205T145808Z: Could not lock file during reading of
> proxy
> /mnt/lustre/grid/users/pilatl01/home_cream_445503617/cream_445503617.proxy.
> glexec[51695] 20140205T145808Z: Reading proxy failed.
> glexec[51695] 20140205T145808Z: Failed to lock
> $GLEXEC_CLIENT_CERT=/mnt/lustre/grid/users/pilatl01/home_cream_445503617/cream_445503617.proxy,
> $GLEXEC_SOURCE_PROXY=(NULL) or destination proxy.
>
Hm... that looks bad. Dunno much about Lustre, sorry, but it could
account for why a dteam proxy in /tmp works while the job proxy on
Lustre does not.
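About the only generic thing I can think of trying (nothing
glexec-specific, just a crude check that locking works at all on that
mount; the lock file names are made up) is:

# take an exclusive lock on a scratch file on the Lustre area
flock -x /mnt/lustre/grid/users/pilatl01/locktest -c 'echo lustre lock ok'
# same thing on a local filesystem for comparison
flock -x /tmp/locktest -c 'echo local lock ok'

If the first one fails or hangs, the Lustre client mount options
(flock / localflock) would be the place to start looking.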
> I'm not sure yet why this is failing, but these messages are
> occurring at the time the Nagios check fails, so they are likely the
> reason.
You can test it by doing the dteam test, but putting the dteam proxy in
(say) /mnt/lustre/grid/users/pilatl01/home_cream_445503617/MyDteamProxy
as if it were a job proxy. If it fails there, yet works when it's in /tmp,
you're very close to nailing this issue. How to fix it is another matter,
and needs Lustre expertise.
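In other words, something along these lines, run as the pilot pool
account on a WN (GLEXEC_CLIENT_CERT and GLEXEC_SOURCE_PROXY are the
variables from your log; the glexec path may differ on your install):

# point glexec at a dteam proxy parked on the Lustre filesystem
export GLEXEC_CLIENT_CERT=/mnt/lustre/grid/users/pilatl01/home_cream_445503617/MyDteamProxy
export GLEXEC_SOURCE_PROXY=$GLEXEC_CLIENT_CERT
/usr/sbin/glexec /usr/bin/id
# then repeat with the same proxy copied into /tmp and compare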
Steve
--
Steve Jones
System Administrator office: 220
High Energy Physics Division tel (int): 42334
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
University of Liverpool http://www.liv.ac.uk/physics/hep/