Some of the jobs (from a non-LHC VO) running at QMUL look like they have
a memory leak, and indeed they appear to have crashed some machines.
I've now put the following limits in my SGE queue config for the grid
queues:
s_data 4G
h_data 4.1G
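For anyone wanting to do the same: those attributes sit in the cluster queue definition, which can be edited with `qconf -mq <queue>`. A fragment might look like the following (the queue name is made up; only the two limit lines are from this message):

```
# excerpt of an SGE cluster queue definition (qconf -sq grid_queue)
s_data                4G      # soft limit: job gets SIGXCPU-style warning
h_data                4.1G    # hard limit: job is killed
```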
I hope that will make things better for the other VOs.
I know that ATLAS have put measures in place in their job submission
scripts to kill jobs when they exceed a memory threshold, so they know
when this is what has gone wrong.
Could those scripts be posted as an example of good practice for smaller
VOs, please?
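The threshold-kill approach described above can be sketched roughly as follows. This is an illustration only, not the actual ATLAS wrapper: the function name is made up, and it assumes a Linux worker node where `/proc/<pid>/status` is readable.

```python
import subprocess
import time

def run_with_memory_limit(cmd, limit_kb, poll_interval=0.5):
    """Run cmd, killing it if resident memory (VmRSS) exceeds limit_kb.

    A minimal sketch of the kill-on-threshold technique; Linux-only,
    since it polls /proc/<pid>/status. Returns "killed" or "finished".
    """
    proc = subprocess.Popen(cmd)
    while proc.poll() is None:
        try:
            with open("/proc/%d/status" % proc.pid) as status:
                for line in status:
                    if line.startswith("VmRSS:"):
                        rss_kb = int(line.split()[1])  # value is in kB
                        if rss_kb > limit_kb:
                            # Over threshold: kill the payload before it
                            # can exhaust the worker node's memory.
                            proc.terminate()
                            proc.wait(timeout=10)
                            return "killed"
                        break
        except FileNotFoundError:
            break  # process exited between poll() and the read
        time.sleep(poll_interval)
    proc.wait()
    return "finished"
```

A wrapper like this would be invoked around the payload in the job script, with the threshold set a little below the batch system's hard limit so the job dies cleanly rather than taking the machine with it.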
Chris