Heya all, as promised in the meeting, here are our experiences dealing
with cream load troubles.
Although we're not out of the woods yet, we've been hammering away at
the cream load problems for a while. The load seems to come from two
linked processes - the mysql daemon and the BUpdater process. The cream
support guys have squinted long and hard at our problems and think it's
down to two causes:
1) Too many "leases" being created by the atlas pilot factories - one
lease should be created per user (i.e. factory) but one of the factories
seems to be creating one lease per job.
- this can be seen using this query in the creamdb:
select userId, count(*) from job_lease group by userId;
(as I understand it the count per user should be 1. For us it was in the
hundreds for one user - Graeme's. Peter tracked down the factory condor
versions for us and Glasgow's factory appears to be on condor-7.5.5-1,
whilst the other factories are on condor-7.5.6-1). I've asked Peter to
stop submission to us from Glasgow and will see what happens.
2) The above problem is compounded by us not having
"bupdater_loop_interval" set in our blah.config. This causes us to use
the default of 5 (seconds), which is apparently a bit low. We've set
ours to 20 to see if that makes a difference. I don't think this
variable is currently supported in yaim (or at least support for it is
fairly new).
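For anyone wanting to try the same change, the setting in blah.config
would look something like the line below. This is only a sketch: the
plain key=value form is my assumption about the usual blah.config
syntax, so check it against the entries already in your file, and 20 is
just the value we are experimenting with, not a recommended one.

```
# assumed key=value syntax; interval is in seconds (the default is 5)
bupdater_loop_interval=20
```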
These two factors combine to cause a backlog of requests to mysql,
ramping up the load and generally causing badness. This backlog can be
seen with this creamdb query:
select count(*), name from command group by name;
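If it helps, here are slightly tidier variants of the two queries above.
They are untested sketches that assume nothing beyond the table and
column names already mentioned (job_lease.userId and command.name); the
"leases" and "n" aliases are just labels I've made up for the counts.

```sql
-- only show users holding more than one lease, worst offenders first
select userId, count(*) as leases
from job_lease
group by userId
having count(*) > 1
order by leases desc;

-- command backlog per command name, largest first
select name, count(*) as n
from command
group by name
order by n desc;
```

On a healthy CE the first query should, as I understand it, return no
rows at all.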
Hopefully this will be useful to you chaps.
Cheers,
Matt