Hi Catalin,
> On one glite-WMS at RAL the 'condor_q' tool reports thousands of held
> jobs (some since January). I checked the log files for information
> about one such job:
>
> [root@lcgwms01 glite]# grep 403336 /var/glite/logmonitor/CondorG.log/*
> /var/glite/logmonitor/CondorG.log/CondorG.1239381360.log:000
> (403336.000.000) 04/10 19:25:05 Job submitted from host:
> <130.246.183.215:50825>
> /var/glite/logmonitor/CondorG.log/CondorG.1239381360.log:017
> (403336.000.000) 04/10 19:25:15 Job submitted to Globus
> /var/glite/logmonitor/CondorG.log/CondorG.1239381360.log:027
> (403336.000.000) 04/10 19:25:15 Job submitted to grid resource
> /var/glite/logmonitor/CondorG.log/CondorG.1239381360.log:012
> (403336.000.000) 04/11 04:14:18 Job was held.
> /var/glite/logmonitor/CondorG.log/CondorG.1239381360.log:008
> (403336.000.000) 04/11 04:14:22 JC: 1 - Job cancelled from queue
> /var/glite/logmonitor/CondorG.log/CondorG.1239381360.log:008
> (403336.000.000) 04/11 04:24:30 JC: 3 - Cannot cancel job from queue
> /var/glite/logmonitor/CondorG.log/CondorG.1239381360.log:008
> (403336.000.000) 04/11 04:34:41 JC: 3 - Cannot cancel job from queue
> /var/glite/logmonitor/CondorG.log/CondorG.1239381360.log:008
> (403336.000.000) 04/11 04:44:55 JC: 3 - Cannot cancel job from queue
> /var/glite/logmonitor/CondorG.log/CondorG.1239381360.log:008
> (403336.000.000) 04/11 04:55:04 JC: 3 - Cannot cancel job from queue
> /var/glite/logmonitor/CondorG.log/CondorG.1239381360.log:008
> (403336.000.000) 04/11 05:05:15 JC: 3 - Cannot cancel job from queue
> /var/glite/logmonitor/CondorG.log/CondorG.1239381360.log:008
> (403336.000.000) 04/11 05:15:23 JC: 3 - Cannot cancel job from queue
> /var/glite/logmonitor/CondorG.log/CondorG.1239381360.log:012
> (403336.000.000) 04/11 07:45:09 Job was held.
>
> There is no entry in SandboxDir related to this job, and nothing in
> /var/glite/jobcontrol/condorio or /var/glite/jobcontrol/submit either.
> I tried to find something in the 'lbproxy' database, but couldn't find
> any reference to this job.
>
> However, 'condor_q' gets information about this held job from somewhere.
> Does anyone know more details?
The info is remembered in the /var/local/condor/spool/history* files.
> [...]
>
> I'd like to use condor_rm to remove these old held jobs from the Condor
> job queue, but I'd like to know how it works. Note that the normal
> purging mechanism on the WMS is working, but it seems to have no effect
> on these held jobs.
There have always been cases where jobs do not get purged completely,
due to various bugs. The latest WMS version should be doing better.
Just condor_rm the junk that got left behind.
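
As a rough sketch of how that cleanup could look (not a tested recipe
for your setup): condor_q and condor_rm both accept a -constraint
expression, JobStatus == 5 means "held", and EnteredCurrentStatus is the
time at which the job entered that state. The helper name
held_constraint and the 30-day threshold are my own assumptions for
illustration. The script only prints the commands, so the selection can
be reviewed before anything is removed.

```shell
#!/bin/sh
# Sketch: select jobs that have been held for more than MIN_DAYS days.

# Build the ClassAd constraint (hypothetical helper).
# Arguments: minimum age in days, current time in epoch seconds.
held_constraint() {
    days=$1
    now=$2
    cutoff=$((now - days * 86400))
    # JobStatus == 5 is "held"; EnteredCurrentStatus is an epoch timestamp.
    echo "JobStatus == 5 && EnteredCurrentStatus < $cutoff"
}

MIN_DAYS=30   # assumed threshold, adjust as needed
CONSTRAINT=$(held_constraint "$MIN_DAYS" "$(date +%s)")

# Dry run: print the commands instead of executing them.
# Run the condor_q line first to check what would be removed.
echo "condor_q -constraint '$CONSTRAINT'"
echo "condor_rm -constraint '$CONSTRAINT'"
```

If some jobs then linger in the removed ("X") state because the remote
cancel keeps failing, as in the "Cannot cancel job from queue" lines
above, condor_rm -forcex with the same constraint should drop them from
the local queue.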