Dear All,
On Tue, 17 Feb 2009 Elena Korolkova wrote:
> we are struggling with a problem that it is taking forever for jobs,
> including SAM test jobs, to run through.
> ....
> the jobs aren't getting to the pbs system
Bristol is seeing this too; it started suddenly on lcgce01. At about 1am on
Friday a CMS job got stuck: still in Q state but assigned to a job slot.
A bunch of OPS SAM tests then failed for lcgce01 (not unexpected).
That CMS job was cancelled and since then, although user jobs are running
well, neither the regular OPS SAM jobs nor any of my test jobs submitted from
a UI run on that job slot reserved for the ops and short queues.
They don't even reach PBS.
Yet the CE isn't failing OPS SAM tests (but it's failing all CMS SAM tests &
has started failing lhcb SAM tests).
I submitted opssgm jobs from the SAM Admin portal: they didn't even reach PBS.
Over 4900 cms jobs were found queued on lcgce01. Perhaps a limit/load issue.
There were a few thousand globus-gma jobs; restart makes them go away.
CMS authorized cancelling most of those queued jobs. Done.
Still, neither SAM Admin portal jobs nor my UI test jobs reach PBS.
Today, still no OPS SAM jobs. I rebooted the CE and the SAM Admin portal job
reached PBS and ran; 3 hours later, SAM Admin still says "Running" (so the
delay is now in getting the output back to publish). Lots more CMS jobs have
queued up (1400), so THOSE are reaching PBS just fine - why?!
Not many globus-gma processes.
And my short-queue test jobs from UI don't even reach PBS. At all.
Sometimes I get "Got a job held event, reason: Globus error 10: data transfer
to the server failed" or "user proxy expired".
Nothing's changed on lcgce01 recently AFAIK (except the latest kernel); we
avoid yaim as it trashes too much config.
Any advice on debugging?