Thanks Henry,
You were right, the problem was with one of the worker nodes. It took a
while to find out which worker node, as the PBS server logs don't indicate
which WN the server is attempting to run the job on, and the logs on the
dodgy WN have nothing to indicate a problem.
It is surprising that the PBS server doesn't try to resubmit to an alternative
WN when it detects a problem.
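
For anyone else trying to track down which worker node is at fault, here is
a rough sketch of one possible approach. It assumes that the PBS/Torque
accounting records (not the server logs mentioned above) are available and
that the "E" (job end) records carry start=, end= and exec_host= fields in
the usual semicolon-separated layout; that may not hold for every site or
version. The function name, the 60 second threshold, and the host names and
job IDs in the sample data are made up for illustration.

```python
from collections import Counter


def short_jobs_per_host(lines, max_runtime=60):
    """Count ended jobs per host that ran for at most max_runtime seconds.

    Assumes each record looks like: "date time;TYPE;jobid;key=value key=value ..."
    """
    counts = Counter()
    for line in lines:
        parts = line.strip().split(";", 3)
        # Only job-end ("E") records carry both start and end times.
        if len(parts) != 4 or parts[1] != "E":
            continue
        # Turn the space-separated key=value tokens into a dict.
        fields = dict(kv.split("=", 1) for kv in parts[3].split() if "=" in kv)
        try:
            runtime = int(fields["end"]) - int(fields["start"])
        except (KeyError, ValueError):
            continue  # incomplete or malformed record
        if runtime > max_runtime:
            continue
        # exec_host is assumed to look like "wn042/0+wn042/1"; keep host names only.
        hosts = {h.split("/", 1)[0] for h in fields.get("exec_host", "").split("+") if h}
        counts.update(hosts)
    return counts


# Hypothetical sample records, not taken from a real cluster.
SAMPLE = [
    "09/29/2008 10:14:00;Q;101.server;queue=short",
    "09/29/2008 10:15:02;E;101.server;user=a start=1222683300 exec_host=wn042/0 end=1222683302",
    "09/29/2008 10:16:02;E;102.server;user=a start=1222683360 exec_host=wn042/1 end=1222683361",
    "09/29/2008 11:16:02;E;103.server;user=a start=1222683360 exec_host=wn007/0 end=1222686960",
]

if __name__ == "__main__":
    for host, n in short_jobs_per_host(SAMPLE).most_common():
        print(host, n)
```

A node that shows up with a large number of very short jobs would be a
candidate for taking off-line and checking by hand, along the lines Henry
suggested.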
Anyway, thanks again for your idea.
Dave
Henry Nebrensky wrote:
> On Mon, 29 Sep 2008, David Robson wrote:
>
>> We've just recently had a long period where all our PBS jobs queue,
>> and then, one minute later, dequeue.
> ...
>> Does anyone know of reasons why only jobs coming from the gatekeeper immediately
>> get dequeued?? Can anyone suggest any debugging techniques to get to the bottom
>> of this?
>
> Could be a ropy ("Black Hole") worker node accepting and trashing fresh
> jobs.
>
> Our PBS tends to fill jobs in from the highest numbered node. If things
> are working properly now and you have some free nodes, try taking them
> off-line and see if things keep working after the queues fill up a bit.
>
> Also check the usual suspects - NTP, SSH/scp within the cluster, free disk
> space, etc.
>
> Thanks
>
> Henry
>
--
___________________________________________________
David Robson
CODAS & IT Department, UKAEA Culham
Culham Science Centre, Abingdon, OXON, OX14 3DB, UK
Voice: +44(0)1235-46-4569, Fax: 4404