Hi *,
We ran into a problem here with the new cleanup-jobdirs package on the
CE. It probably doesn't affect most of you, and we may even be the
only site affected. But we only found it by chance, so other sites
could have the same problem without having noticed it yet. So, read on.
What does cleanup-jobdirs do? This is explained well in the
release notes. It "looked harmless enough" :-) Basically it looks in
your gridmapdir for all your pool accounts, then one by one it cd's
into the home of each pool account and cleans up one specific subdir
where a lot of temp files can accumulate. This is done via a cron
job, every six hours. It is useful because so many temp files pile up
there that they sometimes cause resource exhaustion on the CE.
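To make the access pattern concrete, here is a rough Python sketch of the walk described above. It is not the actual cleanup-jobdirs code: the function names are made up, the subdir is left as a parameter because it depends on what the release notes specify, and I am assuming that pool accounts show up in gridmapdir as plain file names while entries starting with '%' are encoded-DN links to be skipped. It only reports the target directories and deletes nothing.

```python
import os
import pwd


def pool_accounts(gridmapdir):
    """List pool account names found in gridmapdir.

    Assumption: each pool account is a plain file named after the account,
    and entries starting with '%' are encoded-DN lease links to skip.
    """
    return sorted(n for n in os.listdir(gridmapdir) if not n.startswith("%"))


def home_from_passwd(account):
    """Look up the account's home directory in the password database."""
    return pwd.getpwnam(account).pw_dir


def cleanup_targets(gridmapdir, subdir, home_of=home_from_passwd):
    """Return (account, target_dir, exists) for every pool account.

    The os.path.isdir() call is what touches each home directory; on an
    automounted /home that access alone is enough to trigger a mount.
    """
    targets = []
    for account in pool_accounts(gridmapdir):
        target = os.path.join(home_of(account), subdir)
        targets.append((account, target, os.path.isdir(target)))
    return targets
```

The point of the sketch is that one pass visits every pool home back to back, whether or not there is anything to clean in it.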
However, at our site the pool account homes are automounted. Every
time somebody does a "cd ~atlb021", for example, a new NFS mount is
created (unless one already exists for that particular account's home
directory). The cleanup script goes through all our pool homes in a
short amount of time, and there are 2300 of these pool accounts. So on
each CE (we have three of them), at exactly the same time, 2300
separate NFS mounts are attempted within a short period. This
exhausts the number of allowed mounts, and other mounts start failing
while the script is running. We did not anticipate this consequence
of the cleanup-jobdirs script.
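If you want to see how many pool homes are mounted at a given moment, something like the following Python sketch could help. It assumes the usual /proc/mounts layout (device, mount point, filesystem type, ...) and that your pool homes live under /home/; count_nfs_mounts is just an illustrative name.

```python
def count_nfs_mounts(mounts_text, prefix="/home/"):
    """Count NFS mounts whose mount point starts with prefix.

    mounts_text is the content of /proc/mounts, one mount per line:
    device, mount point, filesystem type, options, ...
    """
    count = 0
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        if fs_type in ("nfs", "nfs4") and mount_point.startswith(prefix):
            count += 1
    return count
```

Feed it the contents of /proc/mounts on the CE, once before and once while the cron job runs, and compare the two numbers.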
We found this here because I had some private cron jobs that failed
every six hours. The symptom was that they could not find files
located in my (automounted) home directory on the CE machine.
You can check whether you have the problem: look in /var/log/messages
on your CE machines, and see whether between 06:47 and 06:50 there are
messages like:
Oct 18 06:47:12 gazon kernel: RPC: Can't bind to reserved port (98).
Oct 18 06:47:12 gazon kernel: RPC: can't bind to reserved port.
Oct 18 06:47:12 gazon kernel: RPC: error 5 connecting to server schuur.nikhef.nl
Oct 18 06:47:12 gazon kernel: RPC: Can't bind to reserved port (98).
Oct 18 06:47:12 gazon kernel: RPC: can't bind to reserved port.
Oct 18 06:47:12 gazon kernel: RPC: error 5 connecting to server schuur.nikhef.nl
Oct 18 06:47:12 gazon automount[17574]: >> mount: schuur.nikhef.nl:/project/share/pool/atlas/atlas156: can't read superblock
Oct 18 06:47:12 gazon automount[17574]: mount(nfs): nfs: mount failure schuur.nikhef.nl:/project/share/pool/atlas/atlas156 on /home/atlas156
Oct 18 06:47:12 gazon automount[17574]: failed to mount /home/atlas156
I am assuming here (by saying 06:47) that the cron job run time is
hard-coded and not randomly generated by YAIM. You can find out when
yours is set to run by looking in /etc/cron.d on the CE.
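Rather than eyeballing the log, you could count the automount failures per minute with a small Python sketch like the one below. It only assumes the syslog line format shown in the excerpt above ("... automount[PID]: failed to mount PATH"); the function name is made up, and your log may be formatted differently.

```python
import re
from collections import Counter

# Matches lines such as:
#   Oct 18 06:47:12 gazon automount[17574]: failed to mount /home/atlas156
FAILED_MOUNT = re.compile(
    r"^(?P<minute>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+"
    r"automount\[\d+\]: failed to mount (?P<path>\S+)"
)


def failed_mounts_per_minute(lines):
    """Count 'failed to mount' automount messages, grouped by minute."""
    per_minute = Counter()
    for line in lines:
        match = FAILED_MOUNT.match(line)
        if match:
            per_minute[match.group("minute")] += 1
    return per_minute
```

Run it over the lines of /var/log/messages; a burst that repeats every six hours at the cron job's run time would point to the same problem we had.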
You probably will not have this problem unless your setup is like
ours, where each pool account home is a separate mount via e.g. an
automount map.
Hope this helps somebody!
J "/home/templon/.signature : file not found" T