hi
we had the problem that the machine was so far out of memory that the
TMPDIR patch in Torque could not malloc enough memory to store the path
name that needed to be fed to 'rm', so the remove failed.
Has anybody looked into per-job ulimits? Miron was right, we need
circuit breakers everywhere ...
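
To make the per-job ulimit idea concrete, here is a minimal sketch, not anything Torque or the TMPDIR patch actually does. It assumes a hypothetical wrapper, `run_with_limits`, that starts the job's payload on a POSIX worker node; the function name and the limit values are made up for illustration. Note that RLIMIT_FSIZE caps the size of each individual file, not the total space a job uses, so it would stop one runaway log file but not a job that writes many smaller files.

```python
import resource
import subprocess


def run_with_limits(cmd, max_file_bytes=None, max_mem_bytes=None, cwd=None):
    """Run cmd with per-process rlimits applied in the child (hypothetical helper)."""

    def apply_limits():
        # Runs in the child between fork and exec, so the limits apply
        # to the job only and not to the process that launches it.
        if max_file_bytes is not None:
            # Largest size any single file written by the job may reach.
            resource.setrlimit(resource.RLIMIT_FSIZE,
                               (max_file_bytes, max_file_bytes))
        if max_mem_bytes is not None:
            # Cap on the job's virtual address space.
            resource.setrlimit(resource.RLIMIT_AS,
                               (max_mem_bytes, max_mem_bytes))

    return subprocess.run(cmd, cwd=cwd, preexec_fn=apply_limits)
```

A launcher could call it as `run_with_limits(["./job.sh"], max_file_bytes=2 * 1024**3, max_mem_bytes=4 * 1024**3)`, where `./job.sh` and both limits are placeholders. A job that hits the file-size cap either gets a write error or is killed by SIGXFSZ, so it ends with a non-zero status instead of filling the device.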
JT
Bly, MJ (Martin) wrote:
> We've had the same problem at RAL.
>
> I `qdel'ed the jobs and Torque cleared the working directory. The log
> file was logESD, which made it easy to search for. I've retained one of
> the log files for Rod.
>
> We've tried various methods of terminating jobs that exceed their memory
> allocation with mixed success.
>
> Martin.
>
>
>
>>-----Original Message-----
>>From: LHC Computer Grid - Rollout
>>[mailto:[log in to unmask]] On Behalf Of Jeff Templon
>>Sent: 14 April 2005 15:20
>>To: [log in to unmask]
>>Subject: [LCG-ROLLOUT] possible problems at NIKHEF
>>
>>
>>Hi *,
>>
>>Not sure many people would have been affected, since mostly ATLAS jobs
>>are running here, but just in case:
>>
>>we had several instances of failed jobs here today. There are some
>>ATLAS jobs that are getting into an infinite loop, then running out of
>>memory, then printing 'out of memory' until the job's working dir
>>expires (I've seen as much as 90 GB of 'no memory' messages).
>>
>>When the WD expires, it means there is no space for other jobs to run
>>on this machine (all WDs on same device via TMPDIR patch). So the node
>>becomes a black hole. We had about sixty of the suckers a couple hours
>>ago.
>>
>>Anyway I hope nobody but ATLAS lost any jobs, and if you did, we know
>>what the problem was and are working on the fix (qdel all the jobs and
>>then check 100 worker nodes for big TMPDIRs and delete. ugh)
>>
>>This brings up the point once again: we need space management on worker
>>nodes, and a way to blow a job out of the water if it exceeds its
>>assigned space.
>>
>> JT
>>
>>ps the ATLAS guys (mostly Rod Walker here) have been real nice about it,
>>asking us to send some more info so they can cancel the whole set of
>>jobs to avoid further grief ...
>>