Hi,
I added
# 5 GB max file size to prevent runaway logs
# (note: bash's ulimit -f counts 1024-byte blocks, so 5 GB = 5120000)
ulimit -f 5120000
to my runscript.
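(On Linux a process that exceeds this limit gets SIGXFSZ, which terminates
it by default, so a runaway job dies instead of filling the disk; children
of the runscript inherit the limit.)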
It might be friendlier to cut the middle out of large log files rather
than chop off the bottom, e.g.
for f in *.stdout *.stderr *.log; do    # adjust the glob to taste
  bytes=$(stat -c %s "$f")
  if [ "$bytes" -gt 10000000 ]; then
    echo "Trimming $f with size $bytes bytes"
    head -n 100000 "$f" > "$f.trim"     # keep the first 100000 lines...
    echo "[... middle of $f trimmed ...]" >> "$f.trim"
    tail -n 100000 "$f" >> "$f.trim"    # ...and the last 100000
    mv "$f.trim" "$f"
  fi
done
There's often important stuff at the bottom.
Cheers,
Rod.
On Wed, 21 Sep 2005, EGEE BROADCAST wrote:
> ------------------------------------------------------------------------------------
> Publication from : Maarten Litmaath 1689 <[log in to unmask]> (CERN)
> This mail has been sent using the broadcasting tool available at http://cic.in2p3.fr
> ------------------------------------------------------------------------------------
>
> Dear colleagues,
> in the past 2 weeks there have been serious problems with CERN
> production RBs due to file systems filling up completely with
> huge output sandboxes. The worst example:
>
> -----------------------------------------------------------------------------
> total 59492104
> -rw-rw---- 1 cms002 edguser 60860395520 Sep 18 03:36 ORCA_000097.stderr
> -rw-rw---- 1 cms002 edguser 13778 Sep 17 16:32 ORCA_000097.stdout
> -----------------------------------------------------------------------------
>
> Indeed: a 60 _GB_ file! Filled with the same error message over and over.
>
> Obviously we need to do something about it fast.
>
> The next version of the RB code, currently being tested, will limit the
> size of an output sandbox to a maximum value set by the RB admin.
>
> The job wrapper sorts the files in the output sandbox by size and copies
> to the RB those files whose combined size does not exceed the limit;
> the difference between that combined size and the limit is divided by
> the number of remaining files, and each such file is truncated to the
> resulting value, after which it is copied to the RB. An event is logged
> for each file that needed to be truncated or was not found. In that case
> edg-job-status will show that the job is "Done (with errors)" and as usual
> edg-job-get-logging-info -v 1 will have the details.
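For illustration, that size policy could look roughly like this in shell
(a sketch only, not the actual wrapper code; LIMIT, the sandbox path, and
the final copy step are made-up placeholders):

  LIMIT=10485760                      # hypothetical 10 MB sandbox limit
  used=0; keep=""; cut=""
  for f in $(ls -Sr sandbox/*); do    # smallest first; assumes no spaces in names
    s=$(stat -c %s "$f")
    if [ $((used + s)) -le "$LIMIT" ]; then
      used=$((used + s)); keep="$keep $f"   # fits: copy whole
    else
      cut="$cut $f"                   # does not fit: will be truncated
    fi
  done
  n=$(echo $cut | wc -w)
  if [ "$n" -gt 0 ]; then
    share=$(( (LIMIT - used) / n ))   # leftover budget per remaining file
    for f in $cut; do
      head -c "$share" "$f" > "$f.part" && mv "$f.part" "$f"
    done
  fi
  # ...all files are then copied to the RB with globus-url-copy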
>
> Each output sandbox globus-url-copy to the RB is tried in a loop:
> if it fails, the problem is assumed to be temporary (e.g. network down)
> and the operation is retried after a delay that is doubled each time,
> starting at 5 minutes; the job wrapper will give up after 5 hours.
> An event is logged for any globus-url-copy problem.
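The retry policy is easy to picture with a sketch like this (again just an
illustration; src and dest stand for the sandbox file and its RB
destination URL):

  delay=300                                # first retry after 5 minutes
  deadline=$(( $(date +%s) + 18000 ))      # give up after 5 hours
  until globus-url-copy "$src" "$dest"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "giving up on $src" >&2; break
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))                 # back off: 5, 10, 20, ... minutes
  done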
>
> The maximum output sandbox size should be set to a small value,
> e.g. 10 MB like for the input sandbox. An RB is _not_ an SE.
> However, to smooth the transition we propose to start with 100 MB.
>
> To mitigate the problem on the RBs right now we have launched a
> continuous cleanup job with the following characteristics:
>
> - any sandbox file older than 3 weeks is deleted;
>
> - any sandbox file larger than 100 MB is truncated to 100 MB;
>
> - any sandbox file larger than 10 MB whose name matches the following
> patterns is truncated to 10 MB:
>
> *.out
> *.err
> *.log
> *.stdout
> *.stderr
>
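Something along these lines would do that cleanup with GNU find and
coreutils truncate (a sketch, interpreting age as mtime; /path/to/SandboxDir
is a placeholder for the real sandbox area):

  D=/path/to/SandboxDir
  find "$D" -type f -mtime +21 -delete
  find "$D" -type f -size +100M -exec truncate -s 100M {} +
  find "$D" -type f -size +10M \
      \( -name '*.out' -o -name '*.err' -o -name '*.log' \
         -o -name '*.stdout' -o -name '*.stderr' \) \
      -exec truncate -s 10M {} +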
> Comments?
>
--
Rod Walker +1 6042913051