Gentlepersons,
Whilst Glasgow has been doing reasonably well under the kind of mixed
load that ATLAS's STEP09 framework provides, the last week showed that
our (DPM) storage was getting very stressed by the user analysis jobs.
Load on some pool nodes was reaching 60 or higher, mainly due to
backed-up rfcp transactions from (we think) WMS user analysis jobs
doing native rfio transfers. Although we didn't see overwhelming
failures, this did result in the job efficiency for analysis jobs
dropping to less than 10%. We tuned the read-ahead buffer sizes, which
reduced the peak load on our pool servers, but not by enough to really
improve the jobs' efficiency - it appears that WMS user analysis jobs
were entering some strange state where they'd keep pulling data for 12
or even 24 hours, just hanging around and otherwise dead to the world.
This was clearly not a good thing, and it was impossible to tell what
relative effect the other analysis jobs were having due to the mass of
different job types running at once. We tried limiting Panda jobs to
100, but the effects, if any, were swamped by the WMS analysis, and
limiting WMS analysis was complicated by the "dead" jobs.
So. This morning we limited the number of WMS user analysis jobs on
the cluster to 1, cleared out the "dead" jobs that were still trying
to do rfcps, and capped Panda analysis at 200 concurrent jobs.
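For anyone wanting to do the same clear-out, what we did amounts to
hunting for rfcp processes that have been running far too long on the
workers. A minimal sketch, assuming POSIX ps with `-o etime=,comm=`
output (the 12-hour threshold is just the cut-off suggested by the
hangs we saw; adjust to taste):

```python
import subprocess

def etime_to_seconds(etime):
    """Convert a ps(1) ETIME field ("[[dd-]hh:]mm:ss") into seconds."""
    days, rest = 0, etime
    if "-" in etime:
        d, rest = etime.split("-", 1)
        days = int(d)
    parts = [int(p) for p in rest.split(":")]
    while len(parts) < 3:          # pad to hh:mm:ss
        parts.insert(0, 0)
    h, m, s = parts
    return ((days * 24 + h) * 3600) + m * 60 + s

def hung_transfers(ps_lines, command="rfcp", max_hours=12):
    """Return (etime, seconds) for `command` processes older than max_hours."""
    hung = []
    for line in ps_lines:
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[1].strip() == command:
            secs = etime_to_seconds(fields[0])
            if secs > max_hours * 3600:
                hung.append((fields[0], secs))
    return hung

def main():
    # `ps -eo etime=,comm=` prints elapsed time and command name, no header.
    out = subprocess.run(["ps", "-eo", "etime=,comm="],
                         capture_output=True, text=True).stdout
    for etime, secs in hung_transfers(out.splitlines()):
        print(f"rfcp running for {etime} ({secs} s)")
```

Once you trust the list it produces, killing the flagged PIDs (add
`pid=` to the ps format) is the easy part.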
The plan is:
Today and some of tomorrow: ramp up only the Panda user analysis jobs
allowed, through 300 to 500 (and hopefully up to 1000), watching to
see if we can get stable storage with reasonable efficiency at each
limit. (We're at 500 now, waiting for the job count to stabilise and
for enough data to accumulate at this setting.)
We have some tools to parse our batch system logs, so it should be
easy to compare efficiencies historically, too.
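The efficiency number itself is just CPU time over walltime per job. A
minimal sketch of pulling that out of an accounting record, assuming
Torque/PBS-style "E" records (the `resources_used.cput` /
`resources_used.walltime` field names and the sample record below are
illustrative assumptions, not our exact log format):

```python
import re

def hms_to_seconds(hms):
    """Parse an "hh:mm:ss" resources_used value into seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def record_efficiency(record):
    """CPU/wall efficiency for one accounting record, or None if absent."""
    cput = re.search(r"resources_used\.cput=(\d+:\d+:\d+)", record)
    wall = re.search(r"resources_used\.walltime=(\d+:\d+:\d+)", record)
    if not (cput and wall):
        return None
    wall_s = hms_to_seconds(wall.group(1))
    if wall_s == 0:
        return None
    return hms_to_seconds(cput.group(1)) / wall_s

# Illustrative record, trimmed to the fields we care about:
rec = ("05/26/2009 10:00:00;E;1234.svr;user=atlas001 "
       "resources_used.cput=01:30:00 resources_used.walltime=15:00:00")
print(f"efficiency = {record_efficiency(rec):.0%}")
# -> efficiency = 10%  (the sort of figure we were seeing)
```

Averaging that per job type and per day gives the historical
comparison mentioned above.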
Thursday: ramp down Panda user analysis, and try ramping up the WMS
analysis jobs instead. I expect we'll do this in steps smaller than
100, since those jobs seem to be much more stressful.
At the end, hopefully, we'll have useful numbers for the maximum load
of each job type we can sustain at reasonable efficiency and storage
infrastructure load, and a useful comparator for the relative stress
caused by the two types of job.
Of course, this would be even more useful if other sites (UK for
starters) could do something similar, so we could compare data across
storage and cluster implementations too.
Sam