Hi All,
I'd be very interested to hear whether anyone else has seen something
like this (see below).
For background, please read this:
http://northgrid-tech.blogspot.co.uk/2014/04/kernel-problems-at-liverpool.html
Cheers,
Steve
-------- Original Message --------
Date: Tue, 08 Jul 2014 10:20:52 +0100
From: Stephen Jones <[log in to unmask]>
To: sys <[log in to unmask]>
Subject: timeouts
All,
John noticed a few hung_task timeouts on the cluster.
[root@hepgrid8 sjones]# for n in `pbsnodes | grep ^r | sort`; do
  echo -n $n" ";
  ssh -o ConnectTimeout=5 $n dmesg | grep -i hung_task_timeout_secs | wc -l;
done | grep -v " "0$
r25-n12.ph.liv.ac.uk 1
r26-n01.ph.liv.ac.uk 11
r26-n02.ph.liv.ac.uk 2
The signature of the error is similar to the kernel problems we
encountered earlier this year, i.e. hung_task_timeout_secs messages in
dmesg and very high, disruptive load, but:
1) the events are much rarer,
2) recovery appears to be quicker and
3) it has only been seen in ILC jobs so far.
So, in short, some kernel bug between 2.6.32-358... and 2.6.32-431...
is probably still present but far less active.
At this level of incidence, it seems overkill to launch a big
investigation. The last effort (which killed off 99% of these problems)
took a month or more. It would be too expensive to repeat for a handful
of jobs.
We'll continue to monitor these events with, for example, ganglia and
the command above, and hope that some whizz-kid Linux hacker eventually
gets to the bottom of the problem. If we see a spike, we'll do more.
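For anyone who wants to run the same check routinely, here is a rough sketch of how the one-liner above might be wrapped up so that a spike stands out. The function names (count_hung, report, scan_cluster) and the SPIKE_THRESHOLD value of 10 are made up for illustration; they are not part of our setup, and the threshold would need tuning per site.

```shell
#!/bin/bash
# Sketch only: per-node count of hung_task messages, flagging busy nodes.
# SPIKE_THRESHOLD and all function names are illustrative assumptions.

SPIKE_THRESHOLD=${SPIKE_THRESHOLD:-10}

# Count hung_task_timeout_secs lines in dmesg text read from stdin.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_hung() {
    grep -ci hung_task_timeout_secs || true
}

# Read "node count" lines on stdin, drop nodes with a zero count, and
# mark nodes whose count is at or above the threshold.
report() {
    awk -v t="$SPIKE_THRESHOLD" \
        '$2 > 0 { print $1, $2, ($2 >= t ? "SPIKE" : "ok") }'
}

# Poll every node listed by pbsnodes whose name starts with "r",
# as in the command above. Unreachable nodes show up as a zero count.
scan_cluster() {
    for n in $(pbsnodes | grep '^r' | sort); do
        echo "$n $(ssh -o ConnectTimeout=5 "$n" dmesg < /dev/null 2>/dev/null | count_hung)"
    done | report
}
```

Calling scan_cluster from the torque server (e.g. from cron) would give one line per affected node; the functions do nothing until called, so the file can be sourced safely.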
Cheers,
Steve
--
Steve Jones [log in to unmask]
System Administrator office: 220
High Energy Physics Division tel (int): 42334
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
University of Liverpool http://www.liv.ac.uk/physics/hep/