Dear ALL,
after some time of rather quiet WMS operation we have (again) one WMS
server that is stuck in changing job states. It might be just by chance
the one server that we upgraded to the most recent WMS release 3.1.29
(following the details in the release notes, of course). A few test
jobs went through, but after a period of production work the machine seems to
be stuck.
There are some jobs on the WMS that are still in the state running or
scheduled although we know that they are done (even the output sandbox
is on the WMS!). New jobs stay in the infamous Ready/unavailable status
for many hours. Trying to understand what's happening to those jobs, we
see that they make it as far as the actual Condor submit but remain in
the Condor state Idle for whatever reason.
The services have been restarted several times, so the deadlock must be
persistent somewhere. The LB server (on a different machine) can be
excluded almost for sure since it serves other WMS machines without
problems.
Does anyone have an idea, or a pointer to a recipe, for getting the WMS going again?
A good weekend to everyone!
Cheers, Christoph