Dr D J Colling wrote:
> Unfortunately we had yet another air conditioning failure at around 12.00
> UK time today. This time it was caused by the engineers who were
> investigating our earlier failure who shut off the air conditioning
> without any notice. I believe that ~45 LHCb jobs were lost for which I
> apologise.
>
> On a positive note, the engineers claim that they have now fixed the
> original cause of the problem (a blocked filter) and that they now expect
> us to have no further problems in the future.
Out of curiosity, how long did it take for the cluster to overheat?
Did it shut down "nicely", did nodes just start to fail and then go
off-line, or did some "watchdog" say "too hot" and cut power to the
nodes?

I have often wondered whether it would be possible to use APM "stuff" to
respond to this kind of condition and do a "suspend to disk",
followed, at the appropriate time, by a "resume". Is there any
particular reason this is not possible?
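
A minimal sketch of the idea, assuming a Linux node that exposes its
temperature and power state through sysfs rather than through APM. The
sensor path, the 40 C threshold and all function names are illustrative
assumptions, not anything the cluster is known to run. Only the
"suspend" half is shown: a hibernated node cannot run this loop, so the
"resume" would have to be triggered from outside.

```python
import time
from pathlib import Path

# Assumed sysfs locations; the thermal zone index varies between machines.
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
POWER_STATE_PATH = Path("/sys/power/state")

# Hypothetical threshold in degrees C; a real value depends on the hardware.
SUSPEND_AT_C = 40.0


def read_temp_celsius(temp_path=TEMP_PATH):
    """Read a sysfs thermal zone, which reports millidegrees Celsius."""
    return int(Path(temp_path).read_text().strip()) / 1000.0


def should_suspend(temp_c, threshold_c=SUSPEND_AT_C):
    """Return True once the temperature reaches the threshold."""
    return temp_c >= threshold_c


def suspend_to_disk(state_path=POWER_STATE_PATH):
    """Request hibernation; needs root and a kernel with hibernation set up."""
    Path(state_path).write_text("disk")


def watch(poll_seconds=30.0):
    """Poll the sensor forever and hibernate whenever the node is too hot."""
    while True:
        if should_suspend(read_temp_celsius()):
            suspend_to_disk()  # returns only after the node has been resumed
        time.sleep(poll_seconds)
```

Calling watch() as root on each worker node would start the polling
loop. Whether running jobs survive a suspend and resume cleanly (open
network connections, batch system timeouts) is a separate question that
this sketch does not address.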
Cheers,
Ian.
--
Ian Stokes-Rees
Particle Physics, Oxford http://www-pnp.physics.ox.ac.uk/~stokes