------------------------------------------------------------------------------------
Publication from : Maarten Litmaath 1689 (CERN)
This mail has been sent using the broadcasting tool available at http://cic.in2p3.fr
------------------------------------------------------------------------------------
Dear colleagues,
in the past 2 weeks there have been serious problems with CERN
production RBs due to file systems filling up completely with
huge output sandboxes. The worst example:
-----------------------------------------------------------------------------
total 59492104
-rw-rw---- 1 cms002 edguser 60860395520 Sep 18 03:36 ORCA_000097.stderr
-rw-rw---- 1 cms002 edguser 13778 Sep 17 16:32 ORCA_000097.stdout
-----------------------------------------------------------------------------
Indeed: a 60 _GB_ file! Filled with the same error message over and over.
Obviously we need to do something about it fast.
The next version of the RB code, currently being tested, will limit the
size of an output sandbox to a maximum value set by the RB admin.
The job wrapper sorts the files in the output sandbox by size and copies
to the RB those files whose combined size does not exceed the limit.
The space left under the limit (the limit minus that combined size) is
divided by the number of remaining files, and each remaining file is
truncated to the resulting value before it is copied to the RB. An event is logged
for each file that needed to be truncated or was not found. In that case
edg-job-status will show that the job is "Done (with errors)" and as usual
edg-job-get-logging-info -v 1 will have the details.
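
To illustrate the size allocation described above, here is a minimal Python
sketch. It is not the RB job wrapper code: the function name plan_sandbox_copy
is hypothetical, the sort is assumed to be smallest-first, and 100 MB is taken
as 100 * 10**6 bytes. The file sizes are those of the example listing.

```python
def plan_sandbox_copy(sizes, limit):
    """Return {file name: number of bytes to copy} for one output sandbox.

    sizes maps file names to sizes in bytes; limit is the maximum
    combined size in bytes. Sorting smallest-first is an assumption.
    """
    ordered = sorted(sizes.items(), key=lambda item: item[1])
    plan = {}
    used = 0
    index = 0
    # Copy whole files for as long as the combined size stays within the limit.
    while index < len(ordered) and used + ordered[index][1] <= limit:
        name, size = ordered[index]
        plan[name] = size
        used += size
        index += 1
    # Divide what is left of the limit evenly among the remaining files.
    remaining = ordered[index:]
    if remaining:
        share = (limit - used) // len(remaining)
        for name, size in remaining:
            plan[name] = min(size, share)
    return plan


plan = plan_sandbox_copy(
    {"ORCA_000097.stderr": 60860395520, "ORCA_000097.stdout": 13778},
    limit=100 * 10**6,
)
```

In this sketch the small stdout file would be copied whole and the 60 GB
stderr file would be cut down to whatever is left under the limit.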
Each output sandbox globus-url-copy to the RB is tried in a loop:
if it fails, the problem is assumed to be temporary (e.g. network down)
and the operation is retried after a delay that is doubled each time,
starting at 5 minutes; the job wrapper will give up after 5 hours.
An event is logged for any globus-url-copy problem.
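
The retry behaviour can be sketched roughly as follows. This is an
illustration only, not the job wrapper itself: retry_delays and
copy_with_retry are hypothetical names, the actual copy is passed in as a
callable instead of invoking globus-url-copy, and the sketch assumes that
"give up after 5 hours" means the accumulated waiting time may not exceed
5 hours.

```python
import time


def retry_delays(initial=300, give_up_after=5 * 3600):
    """Return the waits in seconds: 5 minutes, doubled after each failure."""
    delays = []
    delay = initial
    elapsed = 0
    # Stop as soon as the next wait would go past the give-up time.
    while elapsed + delay <= give_up_after:
        delays.append(delay)
        elapsed += delay
        delay *= 2
    return delays


def copy_with_retry(copy_once, sleep=time.sleep, log=print):
    """Call copy_once() until it returns True or the delays run out."""
    delays = iter(retry_delays())
    while True:
        if copy_once():
            return True
        log("copy to RB failed")  # stands in for the logged event
        delay = next(delays, None)
        if delay is None:
            return False  # no waits left: give up
        sleep(delay)
```

Under these assumptions the waits are 5, 10, 20, 40 and 80 minutes, i.e. at
most six attempts; the next wait of 160 minutes would go past the 5 hour mark.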
The maximum output sandbox size should be set to a small value,
e.g. 10 MB, as for the input sandbox. An RB is _not_ an SE.
However, to smooth the transition we propose to start with 100 MB.
To mitigate the problem on the RBs right now we have launched a
continuous cleanup job with the following characteristics:
- any sandbox file older than 3 weeks is deleted;
- any sandbox file larger than 100 MB is truncated to 100 MB;
- any sandbox file larger than 10 MB whose name matches one of the
  following patterns is truncated to 10 MB:
    *.out
    *.err
    *.log
    *.stdout
    *.stderr
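
The cleanup rules above can be expressed as a small decision function. This
is a sketch for discussion, not the cleanup job actually running on the RBs:
cleanup_action is a hypothetical name, 1 MB is taken as 10**6 bytes, 3 weeks
as 21 days, and the order in which the rules are checked is an assumption.

```python
from fnmatch import fnmatch

MB = 10**6  # assumption: decimal megabytes
LOG_PATTERNS = ("*.out", "*.err", "*.log", "*.stdout", "*.stderr")


def cleanup_action(name, size, age_days):
    """Return ("delete", None), ("truncate", new_size) or ("keep", None)."""
    # Rule 1: anything older than 3 weeks is deleted.
    if age_days > 21:
        return ("delete", None)
    # Rule 3 is checked before rule 2: for a log-like file the 10 MB cap
    # is the stricter one, so it determines the final size.
    if size > 10 * MB and any(fnmatch(name, p) for p in LOG_PATTERNS):
        return ("truncate", 10 * MB)
    # Rule 2: any other file larger than 100 MB is truncated to 100 MB.
    if size > 100 * MB:
        return ("truncate", 100 * MB)
    return ("keep", None)


action = cleanup_action("ORCA_000097.stderr", 60860395520, age_days=1)
```

With these rules the 60 GB stderr file from the example would end up
truncated to 10 MB.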
Comments?