On 23 June 2011 10:56, Matt Doidge <[log in to unmask]> wrote:
> Heyup,
> Sadly I spoke too soon and we still have load issues (although it takes
> about 8 hours for the cream to grind to a halt now, rather than 4...),
> despite stopping jobs from the Glasgow factory and other mitigations.
>
>> If you have any tuning suggestions that would be much appreciated! Though
>> this might resolve itself if the multitude of leases for Graeme is the root
>> cause.
>>
>
> I, like some naive numpty, never bothered with any basic mysql tuning for
> our cream. I'm planning on rectifying that today (probably using some of the
> dpm tricks). I'll let y'all know how that pans out.
>
Increasing the innodb_buffer_pool_size should be all you need to do,
really. For some reason, CREAM, like DPM, never properly sets this
important variable...
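
For the record, a minimal /etc/my.cnf sketch (the 512M figure is just an
illustration -- size it to whatever RAM you can spare on the box, it's not
an official CREAM recommendation):

```ini
# /etc/my.cnf -- restart mysqld after changing this
[mysqld]
# The stock default is tiny (8MB on MySQL 5.0, as shipped with gLite),
# so InnoDB ends up hitting disk for almost every lease/job lookup.
innodb_buffer_pool_size = 512M
```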
Sam
> Cheers,
> Matt
>
>
> Andrew Washbrook wrote:
>>
>> Hi Matt,
>>
>> Thanks for your summary of the Cream related issues at Lancaster - it is
>> good to know we are not a lone voice in the wilderness!
>>
>> On the face of it I get the same kind of pattern here:
>> mysql> select userId, count(*) from job_lease group by userId;
>>
>> +------------------------------------------------------------------------------------------------+----------+|
>> userId
>> | count(*) |
>>
>> +------------------------------------------------------------------------------------------------+----------+|
>> _C_UK_O_eScience_OU_Glasgow_L_Compserv_CN_graeme_stewart_atlas_Role_pilot_Capability_NULL
>> | 244 |
>> |
>> _C_UK_O_eScience_OU_Glasgow_L_Compserv_CN_graeme_stewart_atlas_Role_production_Capability_NULL
>> | 689
>> |+------------------------------------------------------------------------------------------------+----------+
>>
>> [mysql output might be mangled above but you get the picture]
>>
>> but I don't have "bupdater_loop_interval" defined in
>> /opt/glite/etc/blah.config. What should this be set to, and what does it do?
>>
>> For the other mysql check:
>> mysql> select count(*), name from command group by name;
>> +----------+----------------+
>> | count(*) | name |
>> +----------+----------------+
>> | 1 | JOB_START |
>> | 1 | SET_JOB_STATUS |
>> +----------+----------------+
>> 2 rows in set (0.00 sec)
>>
>> Though I had to reboot the machine this morning due to memory swapping
>> (quelle surprise) so this might not be representative of normal running.
>>
>> Other things we have noticed:
>>
>> - CPU is continually occupied, with the main offender being the
>> BUpdaterSGE process, and sge_helper every so often (see attached ganglia
>> graph).
>>
>> - I periodically have to run "./JobDBAdminPurger.sh -u cream -p XXX -f
>> cancelled-joblist"
>> every couple of weeks to clear the CreamDB of cancelled jobs. I get the
>> joblist from the message build up in /opt/glite/var/log/glite-ce-cream.log:
>>
>> 22 Jun 2011 14:03:25,219 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.LeaseManager
>> (LeaseManager.java:343) - (TIMER) Job has been cancelled. jobId =
>> CREAM875625916
>> 22 Jun 2011 14:03:25,222 INFO
>> org.glite.ce.creamapi.jobmanagement.cmdexecutor.LeaseManager
>> (LeaseManager.java:343) - (TIMER) Job has been cancelled. jobId =
>> CREAM875801373
>> + 3500 other similar messages (repeats every 10 minutes)
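>> In case it helps, a sketch of how one could build that joblist straight
>> from the log (the grep/sed pattern is based on the messages quoted above,
>> and the "cancelled-joblist" file name matches the purger invocation --
>> adjust both if your log format differs):

```shell
# Pull the jobIds out of the cancellation messages and de-duplicate
# them into the file fed to JobDBAdminPurger.sh.
grep 'Job has been cancelled' /opt/glite/var/log/glite-ce-cream.log \
  | sed -n 's/.*jobId = \(CREAM[0-9]*\).*/\1/p' \
  | sort -u > cancelled-joblist
```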
>>
>> Not very satisfactory! Just to note, this is running under a VM, so some
>> performance issues could stem from hypervisor overhead (and Ewan suggested
>> that disk access under a VM could be a factor as well).
>>
>> If you have any tuning suggestions that would be much appreciated! Though
>> this might resolve itself if the multitude of leases for Graeme is the root
>> cause.
>>
>> Cheers,
>> Andy.
>>
>>
>>
>>
>> On 21 Jun 2011, at 15:57, Matt Doidge wrote:
>>
>>> Heya all, as promised in the meeting our experiences dealing with cream
>>> load troubles.
>>>
>>> Although we're not out of the woods yet we've been hammering the cream
>>> load problems for a while. The load seems to come from two linked
>>> sources - the mysql daemon and the BUpdater process. The cream support
>>> guys have squinted long and hard at our problems and think it's down to
>>> two sources:
>>>
>>> 1) Too many "leases" being created by the atlas pilot factories - one
>>> lease should be created per user (i.e. factory) but one of the factories
>>> seems to be creating one lease per job.
>>> - this can be seen using this query in the creamdb:
>>> select userId, count(*) from job_lease group by userId;
>>>
>>> (as I understand it the count per user should be 1. For us it was in the
>>> hundreds for one user - Graeme's. Peter tracked down the factory condor
>>> versions for us and Glasgow's factory appears to be on condor-7.5.5-1,
>>> whilst the other factories are on condor-7.5.6-1). I've asked Peter to
>>> stop submission to us from Glasgow and will see what happens.
>>>
>>> 2) The above problem is compounded by us not having
>>> "bupdater_loop_interval" set in our blah.config. This causes us to use
>>> the default of 5 (seconds), which is apparently a bit low. We've set
>>> ours to 20 to see if that makes a difference. I don't think this
>>> variable is currently supported in YAIM (or at least support for it is
>>> fairly new).
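>>>
>>> (For reference, the line to add by hand is just the one below -- 20 is
>>> the value we picked, not an official recommendation:)

```shell
# /opt/glite/etc/blah.config
# Seconds between BUpdater polling loops; defaults to 5 if unset.
bupdater_loop_interval=20
```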
>>>
>>> These two factors combine to cause a backlog of requests to mysql,
>>> ramping up the load and generally causing badness. This backlog can be
>>> seen with the creamdb query:
>>> select count(*), name from command group by name;
>>>
>>> Hopefully this will be useful to you chaps.
>>>
>>> Cheers,
>>> Matt
>>>
>>
>