On Wed, Nov 07, 2007 at 02:19:46PM -0000, Jensen, J (Jens) wrote:

> Since this user ought to have stopped this activity over 7 days ago, I've
> traced all of his interactions with our SE to their originating hosts
> since 20071105030430.985562Z, i.e. just my last two log files. Up to
> 2150 UTC on November 6th there were still 543 TYPE=STOR operations
> in about 42 hours, involving 75 hosts spread all over the grid. Indeed,
> most of the UK ones have stopped, except for:
> 
>  dgc-grid-40.brunel.ac.uk                 2
>  dgc-grid-44.brunel.ac.uk                 4
>  fal-pygrid-19.lancs.ac.uk                16
>  lcg.shef.ac.uk                           8
>  wd44.hep.ph.ic.ac.uk                     20

For wd44, the node was out of the batch system (before the incident) but
some jobs were left running. Since the batch system was not there to
enforce wallclock/CPU time limits, and since the biomed job was a pilot
job, none of the individual processes hit the CPU time limit :(

Lesson learned: if the batch system is not there to kill jobs, don't
expect them to ever end...

Kostas