Happy New Year first of all, and then back to our usual problems :(
On 31 Dec multiple cancellations occurred on
lcgrb01.gridpp.rl.ac.uk, and more than 22000 jobs piled up on this machine
without being dispatched. Yesterday I had to ban new jobs from coming to
lcgrb01 while waiting for the rest to be processed.
The rate of processing is ~10k jobs / 24hrs so I have to wait for
another day or so.
Meanwhile, the /var/edgwl/logging/status.log file contains mostly
entries like the following:
...
JOBID=https://lcgrb01.gridpp.rl.ac.uk:9000/oDRUTAX4lByCp9398XpTLw
OWNER=xxxxxxxxxx BKSERVER=lcgrb01.gridpp.rl.ac.uk:9000
NETWORKSERVER=lcgrb01.gridpp.rl.ac.uk:7772 VO=alice
LASTUPDATETIME=1199332938 STATENAME=Ready STATEENTERTIME=1199332934
CONDORID= DESTINATION= EXITCODE=0
DONECODE=0 STATUSREASON=Submitting job(s)
ERROR:
GSS Major Status: General failure
GSS Minor Status Error Chain:
import_cred.c:160: gss_import_cred: Unable to read credential for
import: Couldn''t open the file:
/opt/edg/var/spool/edg-wl-renewd/58207d78acb2ed48d0c40d656e89a7e9.3625
...
JOBID=https://lcgrb01.gridpp.rl.ac.uk:9000/rxzJOHtbVsscijiKxLDMrg
OWNER=xxxxxxxxxxx BKSERVER=lcgrb01.gridpp.rl.ac.uk:9000
NETWORKSERVER=lcgrb01.gridpp.rl.ac.uk:7772 VO=alice
LASTUPDATETIME=1199334187 STATENAME=Aborted STATEENTERTIME=1199334187
CONDORID= DESTINATION= EXITCODE=0 DONECODE=0 STATUSREASON=Submission to
condor failed.
...
(I replaced the real DN with 'xxxxxxxx'.)
So I assume all these thousands of jobs will fail in the same manner.
My question: is there anything that can be done to correct things, or to
speed up the clearing process?
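
To check whether the remaining jobs really all end up the same way, one
option is to tally the last recorded state per job in status.log. The
following is only a rough sketch: it assumes every record starts with
"JOBID=" and carries an unbroken STATENAME=<word> field, as in the excerpt
above. The helper names latest_states and tally are made up for this
example, and the real file layout may differ.

```python
import re
from collections import Counter

# Assumption: each record begins with "JOBID=<url>" and contains a
# "STATENAME=<word>" field somewhere before the next "JOBID=".
STATE = re.compile(r"STATENAME=(\w+)")


def latest_states(log_text):
    """Map each job ID to the last STATENAME seen for it in the log text."""
    states = {}
    # Everything before the first "JOBID=" is not a record, so skip it.
    for chunk in log_text.split("JOBID=")[1:]:
        fields = chunk.split(None, 1)
        if not fields:
            continue
        job_id = fields[0]
        match = STATE.search(chunk)
        if match:
            # Later records for the same job overwrite earlier ones.
            states[job_id] = match.group(1)
    return states


def tally(log_text):
    """Count jobs per final state, e.g. how many ended up Aborted."""
    return Counter(latest_states(log_text).values())
```

Reading /var/edgwl/logging/status.log into a string and passing it to
tally() would then give a count per state (Ready, Aborted, and so on), which
shows how many jobs are still pending and how many have already aborted.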
Many thanks,
Catalin