On Wed, Nov 07, 2007 at 02:19:46PM -0000, Jensen, J (Jens) wrote:
> Since this user ought to have stopped this activity over 7 days ago, I've
> traced all interactions of him with our SE to their originating hosts,
> since 20071105030430.985562Z, so just my last two log files. There
> are still, till 2150 UTC on November 6th 543 TYPE=STOR operations,
> in about 42 hours, involving 75 hosts spread all over the grid. Indeed
> most of the UK ones have stopped, except for:
>
> dgc-grid-40.brunel.ac.uk 2
> dgc-grid-44.brunel.ac.uk 4
> fal-pygrid-19.lancs.ac.uk 16
> lcg.shef.ac.uk 8
> wd44.hep.ph.ic.ac.uk 20
For wd44: the node had been taken out of the batch system (before the
incident), but some jobs were left running on it. Since the batch system
was no longer there to enforce wallclock/CPU time limits, and since the
biomed job was a pilot job, none of its individual processes ever hit
the CPU time limit :(
Lesson learned: if the batch system is not there to kill jobs, don't
expect them to ever end...
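
To illustrate the point about wallclock enforcement, here is a minimal sketch (not what the batch system actually does) of a guard that limits a job by elapsed time and kills its whole process group, so the short-lived children of a pilot go down with it. The function name and the limits are hypothetical, and it assumes a POSIX host:

```python
import os
import signal
import subprocess
import sys


def run_with_wallclock_limit(cmd, limit_seconds):
    """Run cmd in its own session and enforce a wallclock limit on it.

    Hypothetical helper for illustration only. Returns (exit_code, timed_out).
    """
    # start_new_session=True makes the child a session and process-group
    # leader, so its process-group ID equals its PID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=limit_seconds), False
    except subprocess.TimeoutExpired:
        try:
            # Signal the whole group: a per-process CPU limit never fires
            # for a pilot that keeps spawning short-lived children.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group vanished between the timeout and the kill
        return proc.wait(), True


if __name__ == "__main__":
    # Stand-in "job" that would otherwise run for 30 seconds.
    job = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_wallclock_limit(job, limit_seconds=1))
```

The limit is on elapsed time for the group as a whole, which is the check that was missing once the node left the batch system.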
Kostas