hiyya jeff
this has not much to do with your title 'deployment realities'.
it has simply to do with sysadmins putting together scripts
on a live system without
a) proper design
b) proper testing
i had experiences like that when i was sysadmining our
university boxes 15 years ago. the lessons to be learned:
1. think before you code
2. test before you go live
sort of obvious, but this is still 95% of the cause of trouble.
and even if you do it right, there will be problems from possibilities
you didn't think of, just as you found out.
i'm not claiming we are doing it right either in glite, not because
of lack of knowledge or experience but because of lack of people
and of time. and knowing that you do the 'suboptimal' thing
usually does not make you laugh.
ciao
peter
----- Original Message -----
From: "Jeff Templon" <[log in to unmask]>
To: "LHC Computer Grid - Rollout" <[log in to unmask]>;
<[log in to unmask]>
Sent: Saturday, July 30, 2005 12:28 AM
Subject: deployment realities
> Yo,
>
> I have spent an entirely enjoyable week bouncing between production system
> maintenance tasks, and trying to beat the estimated response time system
> into a deployable state. It has been quite instructive. I thought I'd
> share one of the experiences because:
>
> - it is amusing (well, at least I laughed)
> - it illustrates lots of pitfalls that I hope others can avoid
> - it provides proof to a claim I often make here, that I truly am an
> idiot.
>
> I am cc'ing the glite-discuss list as a heads-up for what awaits them when
> they hit the big time.
>
> It all started about a month ago when Rod Walker was seeing lots of
> problems at NIKHEF, and pointed out that his pool directory here had
> expanded beyond all reasonable proportions, and perhaps this was starting
> to make things creak in the job submission chain.
>
> Seeing that yes indeed, the number of leftover files there was approaching
> a hundred thousand, I decided to write a cleanup script. A colleague had
> once scolded me for using 'find' to do this -- "that's why they made
> tmpwatch". So I ran a tmpwatch command (after a few tests on an active
> pool directory) over the whole pool account space, telling it to clean up
> stuff more than 14 days old.
>
> It cleaned Rod's space all right, along with correctly cleaning up all the
> active accounts. It also:
>
> - completely deleted all pool account directories (the entire thing) that
> had not been used in the last 14 days
> - completely deleted all the .ssh subdirectories for all the pool
> accounts, as they had not been changed since creation of the accounts,
> with the exception of one famous example.
>
> Augh. So I wrote a script to put back all the deleted accounts, and to
> create the .ssh directories and autogenerated keys. Furthermore, I
> modified the cleanup script to touch critical files in .ssh, and to touch
> the .globus directory, before running tmpwatch, in order to avoid deleting
> these critical files/directories.
>
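
A minimal sketch of the "protect the critical stuff" idea, written as a dry run instead of a touch-then-tmpwatch sequence. It is not the script described above. The names PROTECTED and find_stale_files are made up for illustration, and the assumptions are that modification time is the right thing to check and that directories should never be removed automatically.

```python
import os
import stat
import time

# Hypothetical list of directory names that must never be cleaned.
PROTECTED = {".ssh", ".globus"}


def find_stale_files(root, max_age_days=14, now=None):
    """Return regular files under root not modified for max_age_days.

    Directories are never returned, and nothing inside a PROTECTED
    directory is ever considered.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune protected directories in place so os.walk never enters them.
        dirnames[:] = [d for d in dirnames if d not in PROTECTED]
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: do not follow symlinks
            if stat.S_ISREG(st.st_mode) and st.st_mtime < cutoff:
                stale.append(path)
    return sorted(stale)
```

The function only builds a list. Printing that list and reading it before anything is deleted is the cheap version of "test before you go live".
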
> A couple weeks later, I noticed lots of .nfs000312328 type files in a
> biomed account. Strange. Ask the guys about it, they didn't know, I
> clean them and move on. About the same time, I start seeing increasing
> numbers of globus-job-manager processes hanging around. At some point
> there were so many of these that I investigated. There were posts about
> this on LCG-ROLLOUT; the bottom line is that my tmpwatch command was
> checking one of the three 'time' parameters on the file, I forget which,
> but the globus GASS mechanism always sets this time to midnight on 1
> January 1970 ... so these files are older than fourteen days, and get
> thrown away. The file is thrown away while the jobmanager still has it
> open, and hence the jobmanager hangs on a stale NFS file handle. Hence
> lots of jobmanagers hanging around forever. And hence many .nfs90902493
> looking files in the pool home dirs. Augh. Another bug fix.
>
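
One way to defend against a timestamp that has been forced to the epoch, sketched under assumptions: the message does not say which of the three times was being checked, so this takes the newest of access, modification, and change time and only calls a file stale if all three are old. The helper names are hypothetical. It does nothing about files that are old but still held open by a running process, which is a separate problem.

```python
import os
import time


def newest_timestamp(path):
    """Return the most recent of atime, mtime and ctime for path."""
    st = os.lstat(path)
    return max(st.st_atime, st.st_mtime, st.st_ctime)


def is_stale(path, max_age_days=14, now=None):
    """True only if every timestamp on path is older than max_age_days."""
    now = time.time() if now is None else now
    return newest_timestamp(path) < now - max_age_days * 86400
```

On POSIX systems the change time is updated by the kernel and cannot be set by a user program, so a file whose access and modification times were reset to 1 January 1970 a minute ago still counts as fresh here.
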
> Just today, a colleague comes in and asks "do you have any idea
> what happened to all the .globus directories in the pool account space?"
> Turns out we were starting to get globus error messages about not being
> able to create files in the .globus space.
>
> Yes. I forgot to create the .globus dirs when I created the .ssh dirs.
> The first fix to my script then did 'touch -a .globus' to preserve this
> important but non-existent directory. touch -a does not usually create a
> directory if the thing being touched does not exist. and it is usually
> quite difficult to write to a subdirectory of a zero-length file.
>
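
A small sketch of what the fix needed to do: make sure the thing is a directory before refreshing its timestamps, and refuse to continue if a plain file is already sitting under that name. The function name and the 0o700 mode are assumptions for illustration. Note that os.utime with None updates both access and modification time, unlike 'touch -a', which only updates access time.

```python
import os


def ensure_dir_and_touch(path, mode=0o700):
    """Create path as a directory if it is missing, then refresh its times."""
    if os.path.lexists(path) and not os.path.isdir(path):
        # Something else (for example a zero-length file) has this name.
        raise NotADirectoryError(f"{path} exists but is not a directory")
    os.makedirs(path, mode=mode, exist_ok=True)
    os.utime(path, None)  # set atime and mtime to the current time
```
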
> Now I am back to a sequence of find commands for my cleanup script, and
> learning all sorts of interesting things about how to defend against
> filename attacks ... see e.g.
>
> http://www.unix.org.ua/orelly/networking/puis/ch11_05.htm
>
> Have a good weekend and think three times before hitting return ...
>
> J "never can tell how things will interact" T
>
> ps: I will spare you the story of attacks that are possible when you
> 'eval' a python expression coming from an external source.
>
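
For the 'eval' case, a hedged sketch of the usual alternative when the external input is only supposed to be data (numbers, strings, lists, dicts): ast.literal_eval from the Python standard library accepts literals and rejects function calls and attribute access. The wrapper name is hypothetical, and this does not help if the input really has to be an arbitrary expression.

```python
import ast


def parse_external_value(text):
    """Parse text as a plain Python literal, rejecting anything executable."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"not a plain Python literal: {text!r}") from exc
```

Given "[1, 2, 3]" this returns a list; given "__import__('os').system('echo hi')" it raises ValueError without running anything.
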