Hi Kashif,
On 07/17/2013 10:24 AM, Kashif Mohammad wrote:
> This error never went away completely and I can still see 10-20 error almost every day in our log file.
I've looked at this, and found that it is a bug, with two parts.
Below I've written up what I know about it. It leads to failed jobs.
Steve
:--- Bug Write Up -------------
Part 1 of the bug causes "Cannot assign requested address (99) in
send_job". Looking at a section of log with that error, I see this,
typically,
in the server logs (some fields omitted):
Job;434560.hepgrid96.ph.liv.ac.uk;Job Queued at request of
[log in to unmask]
Job;450013.hepgrid96.ph.liv.ac.uk;enqueuing into long, state 1 hop 1
Job;450013.hepgrid96.ph.liv.ac.uk;Job Queued at request of
[log in to unmask]
Job;450013.hepgrid96.ph.liv.ac.uk;Job Run at request of
[log in to unmask]
LOG_ERROR::Cannot assign requested address (99) in send_job, send_job
failed to c0a81510 port 15002
Job;450013.hepgrid96.ph.liv.ac.uk;unable to run job, MOM rejected/rc=1
Job;450013.hepgrid96.ph.liv.ac.uk;Job Run at request of
[log in to unmask]
Job;450013.hepgrid96.ph.liv.ac.uk;Exit_status=0 resources_used.cput=00:01:41
Job;450013.hepgrid96.ph.liv.ac.uk;dequeuing from long, state COMPLETE
In that case, job 434560 arrived at the torque server. It got queued and
"run" was initiated.
But a network error caused "Cannot assign requested address (99) in
send_job," and
"unable to run job, MOM rejected/rc=1". But the problem was not fatal -
"run" was initiated
again soon later, and the job went well, ending with state COMPLETE. The
vast majority
of "Cannot assign requested address (99)" events end like this.
However, a tiny minority
of the events end poorly. This is part 2 of the bug. Events that follow
this path
leave entries in the server logs like this:
Job;469438.hepgrid96.ph.liv.ac.uk;enqueuing into long, state 1 hop 1
Job;469438.hepgrid96.ph.liv.ac.uk;Job Queued at request of
[log in to unmask]
Job;469438.hepgrid96.ph.liv.ac.uk;Job Run at request of
[log in to unmask]
LOG_ERROR::Cannot assign requested address (99) in send_job, send_job
failed to c0a81003 port 15002
469438.hepgrid96.ph.liv.ac.uk;unable to run job, MOM rejected/rc=1
469438.hepgrid96.ph.liv.ac.uk;Job deleted at request of
[log in to unmask]
Job;469438.hepgrid96.ph.liv.ac.uk;dequeuing from long, state EXITING
I've removed the timestamps, to fit the records in, but there were
hours/days between the
rejection and when I noticed the "stuck" job and qdel'd it myself. No
attempt was made to
restart these, as far as the logs show. This tiny portion of the events
happens because the
exec_host (shown by qstat -f) is assigned and can't be changed after -
that is what makes
these events different from the others.
BTW: It would be good to know if you have some of these stuck jobs at
your site - I've written a script that can find them (see below). To
invoke it:
cd /var/log/torque/server_logs; cat `ls -rt | tail -9` |
~/scripts/jobs_with_mom_rej.pl | grep not
Script (jobs_with_mom_rej.pl):
#!/usr/bin/perl
use strict;
my %jobRejectedHash;
while(<STDIN>) {
if (/.*PBS_Server;Job;(\d*).*MOM rejected\/rc\=1/) {
my $jobNo = $1;
if (!(defined($jobRejectedHash{$jobNo}))) {
$jobRejectedHash{$jobNo} = -1;
}
}
if (/.*PBS_Server;Job;(\d*).*Job Run at request of/) {
my $jobNo = $1;
if (defined($jobRejectedHash{$jobNo})) {
$jobRejectedHash{$jobNo} = 1;
}
}
}
# List any job that was rejected, then later run
foreach my $j (keys(%jobRejectedHash)) {
if ($jobRejectedHash{$j} == 1) {
print("Job $j - rejected and was later run\n");
}
else {
print("Job $j - rejected and not later run\n");
}
}
--
Steve Jones [log in to unmask]
System Administrator office: 220
High Energy Physics Division tel (int): 42334
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
University of Liverpool http://www.liv.ac.uk/physics/hep/
|