On 06/13/2012 04:37 PM, John Hill wrote:
> While investigating the recent supposed CVMFS and analysis job issues
> at Cambridge, I came across PBS errors in /var/log/messages on the WNs
> which reported copy errors when getting files from the CREAM Sandbox
> area. Further digging has identified these as old pilot jobs (some
> from August last year!) which are still lurking in the PBS queue and
> are being periodically restarted. "showq" indicates that we have about
> 3500 of these relic jobs. I was wondering whether there was a
> recommended way to tidy up the queue?
>
> John
I opened a bug on this a year ago:
https://ggus.eu/tech/ticket_show.php?ticket=72506
No action yet :(
A way to deal with them is to run this script (see below) on your CREAM
server(s) - it will show you the jobs that are cycling.
Steve
--- SCRIPT ---
[root@hepgrid6 scripts]# cat stagein_check.pl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(usleep nanosleep);
# Get hostname
my $hostname;
open(CMD,"hostname|") or die("No open cmd, $!");
while(<CMD>) {
    my $l = $_; chomp($l);
    $hostname = $l;
}
close(CMD);
# Get list of jobs that are neither queued (Q) nor running (R)
my @jobs;
open(CMD,"qstat |") or die("No open cmd, $!");
while(<CMD>) {
    my $l = $_; chomp($l);
    # Skip the qstat header lines
    if (($l !~ /^Job/) and ($l !~ /^---/)) {
        if ($l !~ / Q /) {
            if ($l !~ / R /) {
                my @fields = split(" ",$l);
                push(@jobs,$fields[0]);
            }
        }
    }
}
close(CMD);
# Get the full qstat of every job, rejoining wrapped lines
foreach my $j (@jobs) {
    my $goodQstatText = '';
    usleep(200);    # brief pause (200 microseconds) between qstat calls
    open(CMD,"qstat -f $j |") or die("No open cmd, $!");
    while(<CMD>) {
        # A leading tab marks a continuation of the previous line
        if (/^\t/) {
            chomp($goodQstatText);
            s/^\t//;
        }
        $goodQstatText .= $_;
    }
    close(CMD);
    my $jobName = '';
    # Go over each line, getting name and list of stagein files
    my @goodQstatLines = split("\n",$goodQstatText);
    foreach my $l (@goodQstatLines) {
        if ($l =~ /Job Id: /) {
            $jobName = $l;
        }
        if ($l =~ /stagein = /) {
            # \Q...\E so the dots in the hostname match literally
            if ($l =~ /\Q$hostname\E/) {
                my @files = split(",",$l) ;
                for my $f (@files) {
                    $f =~ s/^.*://;
                    # Make sure each stagein file exists; grumble if not
                    if (! -f $f) {
                        print("$jobName - missing stagein: $f\n");
                    }
                }
            }
        }
    }
}
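
The script only reports the cycling jobs; it does not remove them. As a minimal sketch (not part of the original script), the report lines could be reduced to a list of unique job IDs for review. The helper name stale_job_ids and the sample lines below are made up for illustration; the only assumption about the real output is the "Job Id: <id> - missing stagein: <file>" shape printed above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the unique job IDs out of report lines of the
# form "Job Id: <id> - missing stagein: <file>", keeping first-seen order.
sub stale_job_ids {
    my @lines = @_;
    my %seen;
    my @ids;
    foreach my $l (@lines) {
        if ($l =~ /^Job Id:\s*(\S+)\s+-\s+missing stagein:/) {
            push(@ids, $1) unless $seen{$1}++;
        }
    }
    return @ids;
}

# Made-up sample report lines, for illustration only
my @sample = (
    "Job Id: 1234.ce.example.org - missing stagein: /tmp/sandbox/a",
    "Job Id: 1234.ce.example.org - missing stagein: /tmp/sandbox/b",
    "Job Id: 5678.ce.example.org - missing stagein: /tmp/sandbox/c",
);

# Print one candidate job ID per line
print "$_\n" foreach stale_job_ids(@sample);
```

In real use the lines would come from the output of stagein_check.pl rather than from a hard-coded array. The resulting IDs are only candidates: check them by hand before passing any of them to qdel.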
--
Steve Jones
System Administrator office: 220
High Energy Physics Division tel (int): 42334
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
University of Liverpool http://www.liv.ac.uk/physics/hep/