Yo,
I have spent an entirely enjoyable week bouncing between production
system maintenance tasks and trying to beat the estimated response
time system into a deployable state. It has been quite instructive. I
thought I'd share one of the experiences because:
- it is amusing (well, at least I laughed)
- it illustrates lots of pitfalls that I hope others can avoid
- it provides proof of a claim I often make here, that I truly am an idiot.
I am cc'ing the glite-discuss list as a heads-up for what awaits them
when they hit the big time.
It all started about a month ago when Rod Walker was seeing lots of
problems at NIKHEF, and pointed out that his pool directory here had
expanded beyond all reasonable proportions, and perhaps this was
starting to make things creak in the job submission chain.
Seeing that yes indeed, the number of leftover files there was
approaching a hundred thousand, I decided to write a cleanup script. A
colleague had once scolded me for using 'find' to do this -- "that's why
they made tmpwatch". So I ran a tmpwatch command (after a few tests on
an active pool directory) over the whole pool account space, telling it
to clean up stuff more than 14 days old.
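For the record, the command was of roughly this shape -- the path is
hypothetical and the details are from memory, so read it as a sketch
of the idea rather than a transcript of what actually ran:

    # remove anything older than 14 days (336 hours) under the pool
    # account homes; by default tmpwatch judges age by access time
    tmpwatch 336 /home/pool*

If memory serves, tmpwatch also has a --test flag that prints what it
would remove without removing anything; the rest of this story is an
argument for always running that first.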
It cleaned Rod's space all right, and correctly cleaned up all the
active accounts. It also:
- completely deleted all pool account directories (the entire thing)
that had not been used in the last 14 days
- completely deleted all the .ssh subdirectories for all the pool
accounts, as they had not been changed since creation of the accounts,
with the exception of one famous example.
Augh. So I wrote a script to put back all the deleted accounts and to
recreate the .ssh directories and autogenerated keys. Furthermore, I
modified the cleanup script to touch the critical files in .ssh, and
to touch the .globus directory, before running tmpwatch, so that these
critical files and directories would not be deleted.
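The prevention half looked more or less like this (paths hypothetical
again); note the .globus line, which will matter later:

    # refresh the timestamps so the 14-day test never fires on these
    for home in /home/pool*; do
        touch "$home"/.ssh "$home"/.ssh/* 2>/dev/null
        touch -a "$home"/.globus   # the seed of a later disaster
    done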
A couple of weeks later, I noticed lots of .nfs000312328-type files in
a biomed account. Strange. I asked the guys about it; they didn't
know, so I cleaned the files up and moved on. About the same time, I
started seeing increasing numbers of globus-job-manager processes
hanging around. At some point there were so many of these that I
investigated. There were posts about this on LCG-ROLLOUT; the bottom
line is that my tmpwatch command was checking one of the three 'time'
parameters on the file (I forget which), but the globus GASS mechanism
always sets this time to midnight on 1 January 1970 ... so these files
are older than fourteen days, and get thrown away. The jobmanager has
its file thrown away while it is still open, and hence hangs on a
stale NFS file handle. Hence lots of jobmanagers hanging around
forever. And hence many .nfs90902493-looking files in the pool home
dirs. Augh. Another bug fix.
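You can see the smoking gun with stat, which prints all three
timestamps side by side (path hypothetical). And one plausible
defence, assuming it was the default access-time test that tripped
over the epoch dates, is to make tmpwatch judge age by ctime, which as
far as I know a program cannot back-date with utime():

    # Access, Modify and Change times at a glance; for a GASS cache
    # file, one of the three reads 1 Jan 1970
    stat /home/pool001/.globus/.gass_cache/some_file

    # judge age by ctime instead of the default atime test
    tmpwatch --ctime 336 /home/pool*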
Just today, a colleague comes in and asks 'do you have any idea
what happened to all the .globus directories in the pool account
space?'
Turns out we were starting to get globus error messages about not
being able to create files in the .globus space.
Yes. I forgot to create the .globus dirs when I created the .ssh dirs.
The first fix to my script then did 'touch -a .globus' to preserve
this important but non-existent directory. touch -a does not, it turns
out, create a directory if the thing being touched does not exist; it
creates a zero-length file. And it is usually quite difficult to write
to a subdirectory of a zero-length file.
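What the repair script should have done, of course, is create the
directories for real, something along these lines (ownership details
hypothetical):

    # actually create the per-account directories, with the right owner
    for home in /home/pool*; do
        owner=$(basename "$home")
        mkdir -p "$home"/.globus "$home"/.ssh
        chown "$owner": "$home"/.globus "$home"/.ssh
    done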
Now I am back to a sequence of find commands for my cleanup script, and
learning all sorts of interesting things about how to defend against
filename attacks ... see e.g.
http://www.unix.org.ua/orelly/networking/puis/ch11_05.htm
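The current incarnation is built around something of this shape (paths
hypothetical, as ever), which at least refuses to delete the home
directories themselves:

    # -mindepth 1 keeps each home directory itself off the kill list,
    # -depth removes contents before their parent directory,
    # -print0 / xargs -0 stop whitespace and newlines in hostile
    #   filenames from splitting into extra arguments, and the
    #   trailing -- stops rm from reading a leading '-' as an option
    find /home/pool*/ -mindepth 1 -depth -mtime +14 -print0 |
        xargs -0 -r rm -rf --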
Have a good weekend and think three times before hitting return ...
J "never can tell how things will interact" T
ps: I will spare you the story of attacks that are possible when you
'eval' a python expression coming from an external source.