Thanks all, I've set innodb_buffer_pool_size=512M and straight away I'm 
seeing better performance. It still looks like mysql is unhappy (that 
rogue query still isn't completing) but the CPU consumption has halved. 
Hopefully it will be more robust now.
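
For reference, the change amounts to something like this - a minimal sketch,
assuming the stock /etc/my.cnf location (adjust the path for your own install):

  [mysqld]
  # give InnoDB a bigger buffer pool; left unset it falls back to
  # MySQL's small default
  innodb_buffer_pool_size = 512M

After restarting mysqld you can check that it took effect with:

  mysql> SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

(the value is reported in bytes, so 512M shows up as 536870912).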

Cheers all,
Matt

Sam Skipsey wrote:
> On 23 June 2011 10:56, Matt Doidge <[log in to unmask]> wrote:
>> Heyup,
>> Sadly I spoke too soon and we still have load issues (although it takes
>> about 8 hours for the cream to grind to a halt now, rather than 4...),
>> despite stopping jobs from the Glasgow factory and other factors.
> 
>>> If you have any tuning suggestions that would be much appreciated! Though
>>> this might resolve itself if the multitude of leases for Graeme is the root
>>> cause.
>>>
>> I, like some naive numpty, never bothered with any basic mysql tuning for
>> our cream. I'm planning on rectifying that today (probably using some of the
>> dpm tricks). I'll let y'all know how that pans out.
>>
> 
> Increasing the innodb_buffer_pool_size should be all you need to do,
> really. For some reason, CREAM, like DPM, never properly sets this
> important variable...
> 
> Sam
> 
>> Cheers,
>> Matt
>>
>>
>> Andrew Washbrook wrote:
>>> Hi Matt,
>>>
>>> Thanks for your summary of the Cream related issues at Lancaster - it is
>>> good to know we are not a lone voice in the wilderness!
>>>
>>> On the face of it I get the same kind of pattern here:
>>> mysql> select userId, count(*) from job_lease group by userId;
>>>
>>> +------------------------------------------------------------------------------------------------+----------+
>>> | userId                                                                                           | count(*) |
>>> +------------------------------------------------------------------------------------------------+----------+
>>> | _C_UK_O_eScience_OU_Glasgow_L_Compserv_CN_graeme_stewart_atlas_Role_pilot_Capability_NULL       |      244 |
>>> | _C_UK_O_eScience_OU_Glasgow_L_Compserv_CN_graeme_stewart_atlas_Role_production_Capability_NULL  |      689 |
>>> +------------------------------------------------------------------------------------------------+----------+
>>>
>>> [mysql output might be mangled above but you get the picture]
>>>
>>> but I don't have "bupdater_loop_interval" defined in
>>> /opt/glite/etc/blah.config. What should this be set to and what does it do?
>>>
>>> For the other mysql check:
>>> mysql> select count(*), name from command group by name;
>>> +----------+----------------+
>>> | count(*) | name           |
>>> +----------+----------------+
>>> |        1 | JOB_START      |
>>> |        1 | SET_JOB_STATUS |
>>> +----------+----------------+
>>> 2 rows in set (0.00 sec)
>>>
>>> Though I had to reboot the machine this morning due to memory swapping
>>> (quelle surprise), so this might not be representative of normal running.
>>>
>>> Other things we have noticed:
>>>
>>> - CPU is continually occupied, with the main offender being the
>>> BUpdaterSGE process, and sge_helper every so often. (see attached ganglia
>>> graph)
>>>
>>> - I periodically have to run "./JobDBAdminPurger.sh -u cream -p XXX -f
>>> cancelled-joblist"
>>> every couple of weeks to clear the CreamDB of cancelled jobs. I get the
>>> joblist from the messages that build up in /opt/glite/var/log/glite-ce-cream.log:
>>>
>>> 22 Jun 2011 14:03:25,219 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.LeaseManager
>>> (LeaseManager.java:343) - (TIMER) Job has been cancelled. jobId = CREAM875625916
>>> 22 Jun 2011 14:03:25,222 INFO
>>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.LeaseManager
>>> (LeaseManager.java:343) - (TIMER) Job has been cancelled. jobId = CREAM875801373
>>> + 3500 other similar messages (repeats every 10 minutes)
>>>
>>> Not very satisfactory! Just to note, this is running under a VM so some
>>> performance issues could stem from hypervisor overhead (and Ewan suggested
>>> that disk access under a VM could be a factor as well).
>>>
>>> If you have any tuning suggestions that would be much appreciated! Though
>>> this might resolve itself if the multitude of leases for Graeme is the root
>>> cause.
>>>
>>> Cheers,
>>> Andy.
>>>
>>>
>>> On 21 Jun 2011, at 15:57, Matt Doidge wrote:
>>>
>>>> Heya all, as promised in the meeting our experiences dealing with cream
>>>> load troubles.
>>>>
>>>> Although we're not out of the woods yet we've been hammering the cream
>>>> load problems for a while. The load seems to come from two linked
>>>> sources - the mysql daemon and the BUpdater process. The cream support
>>>> guys have squinted long and hard at our problems and think it's down to
>>>> two causes:
>>>>
>>>> 1) Too many "leases" being created by the atlas pilot factories - one
>>>> lease should be created per user (i.e. factory) but one of the factories
>>>> seems to be creating one lease per job.
>>>>   - this can be seen using this query in the creamdb:
>>>>   select userId, count(*) from job_lease group by userId;
>>>>
>>>> (as I understand it the count per user should be 1. For us it was in the
>>>> hundreds for one user - Graeme's. Peter tracked down the factory condor
>>>> versions for us and Glasgow's factory appears to be on condor-7.5.5-1,
>>>> whilst the other factories are on condor-7.5.6-1). I've asked Peter to
>>>> stop submission to us from Glasgow and will see what happens.
>>>>
>>>> 2) The above problem is compounded by us not having
>>>> "bupdater_loop_interval" set in our blah.config. This causes us to use
>>>> the default of 5 (seconds), which is apparently a bit low. We've set
>>>> ours to 20 to see if that makes a difference. I don't think this
>>>> variable is currently supported in yaim (or at least support for it is
>>>> fairly new).
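>>>>
>>>> For anyone wanting to make the same change, a minimal sketch, assuming
>>>> the usual key=value format of blah.config (check your own file's
>>>> conventions):
>>>>
>>>>   # /opt/glite/etc/blah.config - hypothetical excerpt
>>>>   # raise the BUpdater polling interval from the 5 second default to 20
>>>>   bupdater_loop_interval=20
>>>>
>>>> BUpdaterSGE presumably needs restarting to pick the new value up.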
>>>>
>>>> These two factors combine to cause a backlog of requests to mysql,
>>>> ramping up the load and generally causing badness. This backlog can be
>>>> seen with the creamdb query:
>>>> select count(*), name from command group by name;
>>>>
>>>> Hopefully this will be useful to you chaps.
>>>>
>>>> Cheers,
>>>> Matt
>>>>