We've had the same problem at RAL.
I `qdel'ed the jobs, and Torque clears the working directory on deletion.
The log file was logESD, which made it easy to search. I've retained one
of the log files for Rod.
We've tried various methods of terminating jobs that exceed their memory
allocation, with mixed success.
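
For concreteness, one such method would be a small script that polls
qstat -f and qdels anything well over its memory request; a rough sketch
only (the qstat parsing and the 20% margin are illustrative assumptions,
not what we actually run):

#!/usr/bin/env python
# Rough sketch: qdel Torque jobs whose resident memory exceeds their
# request.  The qstat -f field names, the 20% tolerance and the
# assumption that every job sets Resource_List.mem are illustrative.
import re
import subprocess

UNITS = {'b': 1, 'kb': 1024, 'mb': 1024 ** 2, 'gb': 1024 ** 3}

def to_bytes(s):
    """Convert a Torque size string such as '1048576kb' or '2gb' to bytes."""
    m = re.match(r'(\d+)(b|kb|mb|gb)?$', s.strip().lower())
    return int(m.group(1)) * UNITS[m.group(2) or 'b'] if m else None

def over_limit_jobs(tolerance=1.2):
    """Yield (jobid, used, requested) for jobs using > tolerance * request."""
    out = subprocess.check_output(['qstat', '-f']).decode()
    for record in out.split('\n\n'):            # one record per job
        jobid = used = asked = None
        for line in record.splitlines():
            if line.startswith('Job Id:'):
                jobid = line.split(':', 1)[1].strip()
            elif 'resources_used.mem =' in line:
                used = to_bytes(line.split('=', 1)[1])
            elif 'Resource_List.mem =' in line:
                asked = to_bytes(line.split('=', 1)[1])
        if jobid and used and asked and used > tolerance * asked:
            yield jobid, used, asked

if __name__ == '__main__':
    for jobid, used, asked in over_limit_jobs():
        print('deleting %s: using %d bytes, requested %d' % (jobid, used, asked))
        subprocess.call(['qdel', jobid])
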
Martin.
> -----Original Message-----
> From: LHC Computer Grid - Rollout
> [mailto:[log in to unmask]] On Behalf Of Jeff Templon
> Sent: 14 April 2005 15:20
> To: [log in to unmask]
> Subject: [LCG-ROLLOUT] possible problems at NIKHEF
>
>
> Hi *,
>
> Not sure many people would have been affected, since mostly ATLAS jobs
> are running here, but just in case:
>
> we had several instances of failed jobs here today. There are some
> ATLAS jobs that are getting into an infinite loop, then running out of
> memory, then printing 'out of memory' until the job's working dir fills
> up (I've seen as much as 90 GB of 'no memory' messages).
>
> When the WD fills up, it means there is no space for other jobs to run
> on this machine (all WDs are on the same device via the TMPDIR patch).
> So the node becomes a black hole. We had about sixty of the suckers a
> couple of hours ago.
>
> Anyway, I hope nobody but ATLAS lost any jobs, and if you did, we know
> what the problem was and are working on the fix (qdel all the jobs,
> then check 100 worker nodes for big TMPDIRs and delete them. ugh)
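
For anyone stuck doing the same cleanup, a sweep of the per-job
directories on each WN could look roughly like the sketch below; the
/scratch location and the 5 GB threshold are guesses for illustration,
so check what it finds before letting it delete anything:

#!/usr/bin/env python
# Sketch: find (and optionally remove) oversized job work directories on
# a worker node.  The /scratch location and 5 GB threshold are guesses;
# point it at wherever the per-job TMPDIRs actually live.
import os
import shutil
import sys

SCRATCH = '/scratch'            # parent of the per-job TMPDIRs (site-specific)
LIMIT = 5 * 1024 ** 3           # 5 GB, in bytes

def dir_size(path):
    """Total size in bytes of all regular files below path."""
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass            # file vanished while we were walking
    return total

if __name__ == '__main__':
    delete = '--delete' in sys.argv
    for entry in sorted(os.listdir(SCRATCH)):
        path = os.path.join(SCRATCH, entry)
        if not os.path.isdir(path):
            continue
        size = dir_size(path)
        if size > LIMIT:
            print('%s: %.1f GB' % (path, size / float(1024 ** 3)))
            if delete:
                shutil.rmtree(path, ignore_errors=True)
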
>
> This brings up the point once again: we need space management on
> worker nodes, and a way to blow a job out of the water if it exceeds
> its assigned space.
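
Agreed. As a stopgap until there is proper support for this, one could
imagine a small per-job watchdog that qdels the job once its TMPDIR
passes an assigned quota. A rough sketch only (the quota, the poll
interval, and the idea of starting it from the job prologue are all
assumptions, not an existing LCG tool):

#!/usr/bin/env python
# Sketch of a per-job disk watchdog: qdel the job once its TMPDIR grows
# past an assigned quota.  Quota, poll interval and the prologue idea
# are assumptions, not an existing LCG mechanism.
import subprocess
import sys
import time

def used_kb(path):
    """Disk usage of path in kB, as reported by 'du -sk'."""
    out = subprocess.check_output(['du', '-sk', path]).decode()
    return int(out.split()[0])

def watch(jobid, tmpdir, quota_kb, interval=60):
    while True:
        try:
            usage = used_kb(tmpdir)
        except subprocess.CalledProcessError:
            return                              # TMPDIR gone, job finished
        if usage > quota_kb:
            subprocess.call(['qdel', jobid])    # blow it out of the water
            return
        time.sleep(interval)

if __name__ == '__main__':
    # usage: wd_watchdog.py <jobid> <tmpdir> <quota-in-GB>
    jobid, tmpdir, quota_gb = sys.argv[1], sys.argv[2], float(sys.argv[3])
    watch(jobid, tmpdir, int(quota_gb * 1024 * 1024))
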
>
> JT
>
> ps the ATLAS guys (mostly Rod Walker here) have been real nice about
> it, asking us to send some more info so they can cancel the whole set
> of jobs to avoid further grief ...
>