Carlos Borrego Iglesias wrote:
> /opt/condor/bin/condor_q | tail:
>
> [root@lcgrb02 root]# /opt/condor/bin/condor_q |tail
> 536966.0 edguser 1/26 07:57 0+00:01:58 H 0 0.0 JobWrapper.https_3
> 536967.0 edguser 1/26 07:57 0+00:02:27 H 0 0.0 JobWrapper.https_3
> 536968.0 edguser 1/26 07:58 0+00:06:38 H 0 0.0 JobWrapper.https_3
> 536972.0 edguser 1/26 07:59 0+00:02:57 H 0 0.0 JobWrapper.https_3
> 536976.0 edguser 1/26 08:01 0+00:02:03 H 0 0.0 JobWrapper.https_3
> 536977.0 edguser 1/26 08:01 0+00:03:43 H 0 0.0 JobWrapper.https_3
> 536979.0 edguser 1/26 08:01 0+00:00:00 I 0 0.0 JobWrapper.https_3
> 536981.0 edguser 1/26 08:02 0+00:01:46 H 0 0.0 JobWrapper.https_3
>
> 5396 jobs; 667 idle, 92 running, 4637 held
Your RB has run more than 536981 jobs, and a database size of 14 GB is
about what we expect for that many. We see that many jobs are put on hold,
quite probably because one of the MySQL tables has reached a hard limit:
either one of the files in /var/lib/mysql/lbserver20 has reached 4 GB,
or a table has reached its maximum number of rows. In LCG-2_4_0 we will
set those limits to much higher values. In your case we have 3 options:
a. Preserve the database and enlarge the tables: that may require a
downtime of many hours. We once did that for the test zone RB at CERN
and it took a whole day; probably our recipe was inefficient, but it
is the only recipe that has been tested.
b. Allow current jobs to drain from the RB, but no longer accept new jobs.
Then, after a grace period of a few days, remake the database with new
limits that we will provide. This means the job history will be lost,
but probably that is acceptable. In my next message I will attach a
script to drain an RB, and another to reopen it.
c. Switch off the RB daemons and mysqld, remake the database, and restart
the daemons. This would cause all current jobs to be lost.
What do you say?
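
To check whether the file-size limit is the culprit, you could look for
files in the database directory that are at or near 4 GB. The following is
only a minimal sketch, not a tested tool: the function name
files_near_limit and the 95% threshold are my own choices for
illustration, and I am assuming "4 GB" means 4 GiB (2^32 bytes).

```python
import os

# Assumption: the 4 GB limit mentioned above is 4 GiB (2**32 bytes).
LIMIT_BYTES = 4 * 1024**3


def files_near_limit(db_dir, limit=LIMIT_BYTES, fraction=0.95):
    """Return (name, size) for regular files in db_dir whose size is
    at least fraction * limit, largest first."""
    hits = []
    for entry in os.scandir(db_dir):
        if entry.is_file():
            size = entry.stat().st_size
            if size >= fraction * limit:
                hits.append((entry.name, size))
    # Largest files first, so the most likely offender is at the top.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

Run as a user that can read the MySQL data directory, for example
files_near_limit("/var/lib/mysql/lbserver20"). An empty result would
suggest that the row-count limit, rather than the file size, is the more
likely cause.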
>>>We are experiencing a problem caused by the exponential growth of the
>>>/var/lib/mysql/ directory on our RB. In one week the used disk space has
>>>increased by 3 GB, causing the machine to reboot. At the moment
>>>/var/lib/mysql/ is almost 14 GB. Is this normal? Is there any way to
>>>reduce what is being written to the database?