Hi all,
Since just before the weekend the Birmingham Tier 2 has been suffering
from a strange problem. We currently have ~1200 job slots available but
we're only running about 600-700 jobs. We have zero queued jobs (every
time a job is sent to us, it is scheduled and run straight away).
We're just not being sent enough jobs to fill us up.
Running lcg-infosites gives:
# CPU   Free   Total Jobs   Running   Waiting   ComputingElement
-----------------------------------------------------------------------------------------
 1160    507          288       205        83   epgr02.ph.bham.ac.uk:8443/cream-pbs-long
 1160    507          103        94         9   epgr02.ph.bham.ac.uk:8443/cream-pbs-short
so we're advertising the correct number of slots but they're not being filled.
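
For anyone who wants to cross-check those numbers, here is a minimal Python sketch that parses the output above and works out how many slots are actually occupied. The function name parse_infosites and the field names are made up for illustration. It assumes both queues report the same shared CPU pool (they show identical CPU and Free values) and that the Running and Waiting columns count only the jobs of the VO that was queried.

```python
# Sketch only: parse lcg-infosites CE output and compute slot occupancy.
# parse_infosites and FIELDS are illustrative names, not part of any grid tool.

SAMPLE = """\
# CPU Free Total Jobs Running Waiting ComputingElement
----------------------------------------------------------------
1160 507 288 205 83 epgr02.ph.bham.ac.uk:8443/cream-pbs-long
1160 507 103 94 9 epgr02.ph.bham.ac.uk:8443/cream-pbs-short
"""

FIELDS = ("cpu", "free", "total", "running", "waiting")


def parse_infosites(text):
    """Return one dict per computing element.

    Tokens are collected across lines, so output that has been
    wrapped (numbers on one line, CE name on the next) still parses.
    """
    tokens = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, the '#' header and the dashed separator.
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        tokens.extend(line.split())

    rows = []
    # Each record is five integers followed by the CE endpoint.
    for i in range(0, len(tokens) - 5, 6):
        *nums, ce = tokens[i:i + 6]
        row = dict(zip(FIELDS, map(int, nums)))
        row["ce"] = ce
        rows.append(row)
    return rows


rows = parse_infosites(SAMPLE)

# Assumption: both queues advertise the same shared pool, so take it once.
cpu, free = rows[0]["cpu"], rows[0]["free"]
busy = cpu - free
print(f"Busy slots: {busy} of {cpu} ({busy / cpu:.1%})")

# Per-VO totals summed over the two queues.
print("Running (this VO):", sum(r["running"] for r in rows))
print("Waiting (this VO):", sum(r["waiting"] for r in rows))
```

With the figures above this gives 653 busy slots out of 1160 (56.3%), which matches the 600-700 running jobs, and 92 waiting jobs across the two queues.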
I can't see anything in the CREAM, ARGUS, TORQUE or Maui logs to
suggest any problem and we're green on all of the dashboards (nagios,
panda etc).
We have about 100 jobs in waiting status but even if I kill these
(since waiting jobs are usually caused by expired proxies) the number
of jobs just increases back to the same level again.
The number of jobs we're running is extremely stable, as seen in the
attached screenshot. The places where the red W line drops to zero are
where I've manually cleaned out the waiting jobs. Each time this causes
a temporary jump in running jobs, which quickly levels off again. The
number of queued jobs is barely visible: it is just the blue dots at
the bottom.
Has anyone else had this problem, or does anyone know where I should
look to try to resolve it?
Cheers,
Matt Williams
University of Birmingham