I believe Manchester is running only Panda analysis due to an oversight,
so our problems haven't been caused by WMS jobs. We started with no
limit on the number of Panda analysis jobs, which for us meant around
800 per cluster, with disastrous results: a 97% failure rate. We then
reduced the number of concurrent jobs to 400, which was a bit better,
but jobs were still hanging for hours and many timed out (~60-70%).
Yesterday I reduced the limit to 200 jobs per cluster, and this morning
the failure rate was down to 7%.
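For anyone who wants to apply the same kind of cap: on our setup it is
just a queue limit in the batch system. A minimal sketch, assuming a
Torque/PBS server and a hypothetical queue name:

    # Cap concurrently *running* jobs in the analysis queue at 200;
    # anything beyond that waits in the queue instead of hitting storage.
    qmgr -c "set queue atlas_analysis max_running = 200"

    # Inspect the queue settings to confirm.
    qmgr -c "list queue atlas_analysis"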
I also set the read-ahead buffer to 32 MB, which didn't make much
difference, but I still have to experiment with other values.
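For reference, we made the buffer change on the worker nodes through the
rfio client configuration. I believe the directive below is the relevant
one, but treat it as an assumption and check the DPM client
documentation for the exact name and units before copying it:

    # /etc/shift.conf on each worker node (assumed: the DPM rfio client
    # takes its buffer size from here; 33554432 bytes = 32 MB).
    RFIO IOBUFSIZE 33554432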
We are going to implement channel bonding as the next step, and then
I'll start all over again; a sketch of the bonding setup is below.
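For the bonding we intend to use the stock Linux bonding driver on the
pool nodes. A rough sketch for an SL/RHEL-style machine follows; the
interface names, address, and mode are assumptions, and mode 802.3ad
needs matching support on the switch:

    # /etc/modprobe.conf
    alias bond0 bonding

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    IPADDR=192.0.2.10                        # example address
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    BONDING_OPTS="mode=802.3ad miimon=100"   # LACP link aggregation

    # /etc/sysconfig/network-scripts/ifcfg-eth0 (likewise for eth1)
    DEVICE=eth0
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none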
Sam Skipsey wrote:
> Whilst Glasgow has been doing reasonably well under the kind of mixed
> load that ATLAS's STEP09 framework provides, the last week showed that
> our (DPM) storage was getting very stressed by the user analysis jobs.
> Load averages on some pool nodes were reaching 60 or higher, mainly
> due to backed-up rfcp transactions from (we think) WMS user analysis
> jobs doing native rfio transfers. Although we didn't see overwhelming
> failures, this did result in the job efficiency for analysis jobs
> dropping to less than 10%. We played with the read-ahead buffer
> sizes, which reduced the peak load on our pool servers, but not
> enough to really improve the efficiency of the jobs; it appears that
> WMS user analysis jobs were entering some strange state where they'd
> continue to pull data for up to 12 or even 24 hours, just hanging
> around and otherwise dead to the world.
> This was clearly not a good thing, and it was impossible to tell what
> relative effect the other analysis jobs were having due to the mass of
> different job types running at once. We tried limiting Panda jobs to
> 100, but the effects, if any, were swamped by the WMS analysis, and
> limiting WMS analysis was complicated by the "dead" jobs.
> So: we limited the number of WMS user analysis jobs on the cluster
> to 1, cleared out the "dead" jobs that were still trying to do rfcps
> this morning, and limited Panda analysis to 200 concurrent jobs.
> The plan is:
> Today and some of tomorrow: ramp up the limit on Panda user analysis
> jobs only, through 300 to 500 (and hopefully up to 1000), watching to
> see if we can get stable storage with reasonable efficiency at each
> limit. (We're at 500 now, waiting for the number of jobs to
> stabilise and for enough data to accumulate at this setting.)
> We have some tools to parse our batch system logs, so it should be
> easy to compare efficiencies historically, too.
> Thursday: ramp down Panda user analysis, and try ramping up the WMS
> analysis jobs instead. I expect that we'll do this in smaller
> increments than 100, since the WMS jobs seem to be much more stressful.
> At the end, hopefully, we'll have numbers for the maximum load of
> each job type that we can sustain at reasonable job efficiency and
> storage load, and a useful comparator for the relative stress caused
> by the two types of job.
> Of course, this would be even more useful if other sites (UK for
> starters) could do something similar, so we could compare data across
> storage and cluster implementations too.
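On Sam's point about parsing batch system logs: for sites that want to
produce comparable efficiency numbers, something along these lines
would do. This is only a sketch, assuming Torque-style accounting
records; the log path and date are examples:

    # Print per-job CPU efficiency (cput / walltime) from a Torque
    # accounting file; 'E' records mark job completion.
    awk -F';' '$2 == "E" {
        cput = wall = ""
        n = split($4, kv, " ")
        for (i = 1; i <= n; i++) {
            v = kv[i]
            if (v ~ /^resources_used\.cput=/)     { sub(/^[^=]*=/, "", v); cput = v }
            if (v ~ /^resources_used\.walltime=/) { sub(/^[^=]*=/, "", v); wall = v }
        }
        if (cput != "" && wall != "") {
            split(cput, c, ":"); split(wall, w, ":")
            cs = c[1]*3600 + c[2]*60 + c[3]
            ws = w[1]*3600 + w[2]*60 + w[3]
            if (ws > 0) printf "%s %.1f%%\n", $3, 100 * cs / ws
        }
    }' /var/spool/torque/server_priv/accounting/20090610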