On Tue, 14 Jun 2011 10:21:55 +0100
Gonçalo Borges wrote:
> Hi All...
Hi Gonçalo,
> 2./ Indeed, I have a very high number of jobs in the "old" directory:
>
> # ll /var/glite/jobcontrol/jobdir/new/ | wc -l
> 87
> # ll /var/glite/jobcontrol/jobdir/old/ | wc -l
> 3053
>
> Nevertheless, the entries in /var/glite/jobcontrol/jobdir/old/ are
> not so old (30 minutes or so) and I guess they reflect unmatched jobs
> which according to my settings (ExpiryPeriod = 3600; MatchRetryPeriod
> = 600) are retried every 10 m during an hour:
>
> # date; ll -tr /var/glite/jobcontrol/jobdir/old/ | head -n 2
> Tue Jun 14 10:12:49 WEST 2011
> total 10660
> -rw-r--r-- 1 glite glite 165 Jun 14 09:46 20110614T084634.949002_3086826048
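If your guess is right, nothing in old/ should be much older than the ExpiryPeriod. A quick way to check could be something like the sketch below. The function name count_older_than is mine (it is not a gLite tool), the 60-minute threshold just mirrors the ExpiryPeriod = 3600 you quoted, and it assumes a find that supports -maxdepth and -mmin (GNU find does).

```shell
#!/bin/bash
# Hypothetical helper (not part of gLite): count regular files in a
# directory whose modification time is more than N minutes in the past.
count_older_than() {
    local dir=$1 minutes=$2
    find "$dir" -maxdepth 1 -type f -mmin +"$minutes" | wc -l | tr -d ' '
}

# Example: entries in old/ that are older than one hour.
# count_older_than /var/glite/jobcontrol/jobdir/old 60
```

A result well above zero would suggest the entries are not simply being retried and expired as expected.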
Option 1: I had similar problems on our WMS. Under high load, it stopped
seeing BDII resources and jobs were not able to start. If that is the
case here, you will find a descriptive message in
workload_manager_events.log. To solve it, we installed
google_perf_tools (you will find the recipe in the WMS known_issues).
Option 2: Have you recently upgraded LB? If so, make sure
glite-lb-authz.conf has the correct values.
* You could also install WMSMonitor. It is a good tool for a quick check.
> ---*---
>
> 3./ If I run condor_q, I see 4000 jobs in held state.
>
> # condor_q
> (...)
> 4028 jobs; 0 idle, 4 running, 4024 held
>
>
> I do not think it is normal to have such a number of held jobs in the
> condor queue, and I wonder if somehow the problem I'm seeing is not
> a consequence of that. Is there anything else I could check?
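Maarten sent me this script, which must be run from cron:
# cat /usr/local/sbin/clean_condor_jobs.sh
#!/bin/bash
# Remove every held glite job from the Condor queue.
CONDOR_BIN=/opt/condor-c/bin
# The first column of "condor_q -hold" is the job ID; keep only glite jobs.
CONDOR_CRAP=$("$CONDOR_BIN/condor_q" -hold | awk '/glite/ {print $1}')
for JOB_ID in $CONDOR_CRAP
do
    echo "Removing job: $JOB_ID"
    "$CONDOR_BIN/condor_rm" "$JOB_ID"
    # sleep 2
    # Force local removal of jobs left in the "X" (removed) state.
    "$CONDOR_BIN/condor_rm" -forcex "$JOB_ID"
done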
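Before putting it in cron you may want to preview what it would remove. A minimal sketch, assuming (as the script does) that the first column of the condor_q -hold output is the job ID; the function name held_glite_ids is just for illustration:

```shell
#!/bin/bash
# Hypothetical dry-run helper: read "condor_q -hold" style output on
# stdin and print the first field of every line mentioning "glite".
# Same filter as the script above, but nothing is removed.
held_glite_ids() {
    awk '/glite/ {print $1}'
}

# Example:
# /opt/condor-c/bin/condor_q -hold | held_glite_ids
```

If the list looks sane, the real script can then go into cron.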
> Cheers
> Goncalo
HTH,
Arnau