
Peter Gronbech wrote:
> We use a combination of yum (to load updated rpms), yumit to advise us
> on the patch status, and copy of an ssh key in
[snip]
> I can then spot that systems need patching with yumit, check what is
> required (ie which patches are missing, again with yumit) and then type
> on t2nodes yum -y update 
> I'm sure there are many other solutions to this problem but this works
> for me.

Wow, that just made me realise "yet another complication of grid 
computing".  Pete's system sounds excellent.  Very simple to manage.  I 
would imagine there could be big implications for running jobs, though, 
if the software they are using is changing under their feet.

Are there risks that this might throw off software execution?  I 
certainly imagine it might.  We (LHCb) do a bunch of software version 
checks at the start of execution (and not just for LHCb/Physics 
software).  Weird failures are one thing, but a failure is probably 
better than silently producing bad results, with no errors or output 
inconsistencies to flag that the software changed between two steps of 
a job.
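The kind of start-of-job check I mean could be sketched roughly like this (paths, names, and the use of SHA-1 checksums are all my own assumptions, not LHCb's actual tooling): snapshot the checksums of critical files when the job starts, then re-verify between steps so any change fails loudly.

```python
import hashlib

# Sketch (hypothetical helper, not LHCb's real check): record checksums
# of critical software files at job start, so later steps can detect
# that the software changed underneath them and abort loudly rather
# than silently produce bad results.

def snapshot(paths):
    """Map each path to the SHA-1 of its current contents."""
    sums = {}
    for p in paths:
        with open(p, "rb") as f:
            sums[p] = hashlib.sha1(f.read()).hexdigest()
    return sums

def changed_since(start_snapshot, paths):
    """Return the files whose contents differ from the start-of-job snapshot."""
    now = snapshot(paths)
    return [p for p in paths if now[p] != start_snapshot.get(p)]
```

A job would call `snapshot()` once at startup and `changed_since()` before each step, aborting if the returned list is non-empty.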

What happens if a library is updated?  I don't know enough about how 
link resolution and inodes work to say whether dynamic libraries are 
all "referenced" at the start of execution, so that the OS holds inode 
references to the old library even if the physical file changes later 
(while some process is still referencing it).
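For what it's worth, on Linux the inode behaviour can be demonstrated directly: replacing a file on disk (package managers like rpm unlink the old file and install a new one, rather than overwriting in place) creates a fresh inode, while anything that already had the old file open keeps reading the old contents. A small sketch (the `libfoo.so` name is just illustrative):

```python
import os
import tempfile

# Sketch: replacing a file the way a package manager does (unlink old,
# install new) leaves processes that already opened it reading the old
# inode; the old data survives until the last reference is dropped.
d = tempfile.mkdtemp()
path = os.path.join(d, "libfoo.so")   # hypothetical library file

with open(path, "w") as f:
    f.write("old library contents")

fd = open(path)                        # a "running process" holds the file open
old_inode = os.fstat(fd.fileno()).st_ino

os.unlink(path)                        # package manager removes the old file...
with open(path, "w") as f:             # ...and installs the new version
    f.write("new library contents")

new_inode = os.stat(path).st_ino       # a brand-new inode on disk
content = fd.read()                    # open handle still sees the old data
fd.close()
```

So a process that mapped all its libraries before the update would keep running against the old versions; the risk is a job that loads additional libraries (or re-execs) partway through, picking up a mix of old and new.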

I suppose ideally it would be good to "inject" update jobs into the 
queue, but then three problems arise:

1. This means syncing on both (or all 4) processors, which will almost 
certainly mean significant wasted CPU (50% of average job length on 
duals, and more on quads, I guess).

2. Not very nice to have different nodes running different software, and 
perhaps even impossible if it is an update which relates to the 
grid/cluster infrastructure.  This would imply certain types of update 
probably require a full site "sync".

3. How to make sure those update/admin jobs get run exactly once on 
every node.  Oh, I suppose cluster software must have a way of doing 
this, as it would be a common problem.
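On problem 3, even without special cluster support, a crude approach would be to make the update job itself idempotent, so it can be submitted to every node (even repeatedly) but does the real work at most once. A sketch, with the stamp directory and update label being my own assumptions:

```shell
#!/bin/sh
# Sketch (hypothetical paths/labels): an idempotent "update job" that
# applies a given update at most once per node, using a stamp file
# keyed on an admin-chosen update label.
UPDATE_ID="patch-level-3"                      # assumption: admin-chosen label
STAMP_DIR="${STAMP_DIR:-/var/tmp/update-stamps}"
STAMP="$STAMP_DIR/$UPDATE_ID"

mkdir -p "$STAMP_DIR"

if [ -e "$STAMP" ]; then
    echo "update $UPDATE_ID already applied on this node; nothing to do"
else
    # yum -y update                            # the real work would go here
    touch "$STAMP"
    echo "applied update $UPDATE_ID"
fi
```

That still leaves problems 1 and 2 (draining the node and keeping the site consistent), but it makes "exactly once per node" a non-issue.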

What are people's thoughts on that?

Cheers,

Ian



-- 
Ian Stokes-Rees                 [log in to unmask]
Particle Physics, Oxford        http://www-pnp.physics.ox.ac.uk/~stokes