Hi Peter,
I get the feeling you miss the point, even though you correctly
identified two of the things I wanted to convey:
> 1. think before you code
> 2. test before you go live
While these are both things I did, I certainly did not do them enough,
otherwise I would have sent a much less amusing message.
The thing you seem to miss is that this has everything to do with
deployment realities. Firstly, GASS is supposed to clean up after
itself; it doesn't always. On a production system running on the order
of a half million jobs per year, a minor leak in a system can leave
behind a major mess. On a testbed, you might not even bother or notice.
Another thing is that on a testbed the user population is small, so you
can do the cleanup by hand (like we did in EDG every so often after
Mario and Ingo had run their Nth job storm). By hand it is much easier
to get it right: cd into the directory and do 'rm' on the junk. It is
much harder to write a script to do the right thing.
A script is definitely necessary on the production system. There we
have about 1600 pool accounts, compared to only 60 on the PPS. The
first time I ran the script, it deleted well over two million files.
Next, who would think to teach your cleanup script that some files,
despite claiming to be over 35 years old, were really created only an
hour ago? A tmpwatch script cleaning up old files and leading to an
exploding job population falls into the class of 'weird interaction
bugs'. In a completely different context I got an email from someone I
know from university days, who now heads the development team for
Hotmail. He said this:
==============
The Hotmail system serves hundreds of millions of users and runs on
thousands of machines. While the Hotmail application is not particularly
complex (hey, it's just email), the system itself is very complex. ALL
of the hard bugs are 'weird interaction' bugs, and we've seen many, many
bugs that occur only at scale.
==================
The last point, 'only at scale', is relevant here. The interaction
problem between
1) a bug in GASS leaving behind junk, requiring a cleanup script
2) GASS also creating files with a 1970 timestamp
3) the cleanup script dutifully removing these old files
4) job managers hanging on stale NFS file handles
would likely have been completely missed if we'd been doing it on the
PPS. Why? The job manager exits about five minutes after all jobs from
the particular user are gone. PPS jobs so far have never lasted more
than 23 seconds, and there are very few of them late at night when
cleanup scripts run. Looking at the log files, I can't see a single job
in the PPS history that would have hung due to a stale NFS file handle;
the job-manager lifetimes of five minutes and 23 seconds never
overlapped with the period in which the cleanup script ran. In contrast, this past
week the production system averaged about a hundred running jobs on any
given late evening.
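
The timestamp trap in point 2 can be sidestepped if the cleanup judges
age by the newer of a file's modification time and inode-change time,
since a file stamped 1970 an hour ago still has a recent ctime. A
minimal Python sketch, assuming a POSIX filesystem; the function names
and the one-day threshold are made up for illustration, and this is not
the script that was actually run:

```python
import os
import time


def file_age_seconds(path, now=None):
    """Age of a file, measured from the newer of its mtime and ctime.

    On POSIX systems a program can set mtime to anything (including
    1970), but ctime is updated by the kernel whenever the inode changes,
    so the newer of the two is a safer guess at "last touched".
    """
    now = time.time() if now is None else now
    st = os.lstat(path)  # lstat: do not follow symlinks in spool dirs
    return now - max(st.st_mtime, st.st_ctime)


def is_stale(path, max_age_seconds, now=None):
    """True only if the file is old by both mtime and ctime."""
    return file_age_seconds(path, now) > max_age_seconds


if __name__ == "__main__":
    import sys

    # Dry run: only report candidates, never delete.
    for candidate in sys.argv[1:]:
        if is_stale(candidate, max_age_seconds=24 * 3600):
            print("would remove", candidate)
```

Running it in dry-run mode first, as above, shows what would go before
anything is actually removed.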
About the stupidity of touching a non-existent file and expecting it to
magically become a directory, you are right of course (and I did claim I
am an idiot) ... but this is also deployment, as this kind of stuff
also happens when people interact with software in ways never
suspected by the developers. Yesterday I saw something like the
following on a gLite 1.2 CE at startup:
--secure : invalid host name
A colleague wanted to install a CE in test mode but didn't want to turn
on R-GMA because he didn't want to expose it to the rest of the world
yet. So he left the R-GMA hostname blank in the XML config. Touch a
nonexistent file, connect to a nonexistent host ... the startup ran
start_service --host --secure
faithfully filling in an empty string for the hostname.
oooops ;-)
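
The empty-hostname case is cheap to catch before the command line is
ever assembled. A minimal Python sketch of that check; only
'start_service', '--host' and '--secure' come from the output above,
while the function name and the example host are hypothetical:

```python
def build_start_command(host, secure=True):
    """Build the argument list, refusing an empty or blank host name.

    Passing arguments as a list (rather than pasting strings together)
    also means a blank value cannot silently vanish and let the next
    flag be read as the host name.
    """
    if host is None or not host.strip():
        raise ValueError("host name is empty; refusing to start the service")
    args = ["start_service", "--host", host.strip()]
    if secure:
        args.append("--secure")
    return args


if __name__ == "__main__":
    print(build_start_command("rgma.example.org"))
    try:
        build_start_command("")  # what a blank XML field would give
    except ValueError as err:
        print("config error:", err)
```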
My message wasn't directed at senior people like you who obviously know
all this. It also wasn't directed specifically at gLite; I am kind of
surprised you thought it necessary to reply. There are lots of people
in our project who haven't been at scale yet; new site admins still at
the ten WN stage, younger members of the gLite team. I had hoped that
they might get a laugh out of it but also be a bit better prepared
for what is coming as the scale continues to increase.
And finally, I am not sure what you meant by your last comment about
not laughing, but personally when I get to the point that I can't laugh
at my own boneheadedness, I usually stop and go to yoga.
J "five days and counting" T