On 09/19/2014 03:19 PM, Winnie Lacesso wrote:
> Happy Friday!
>
> About 1/4 Bristol's WN kernel panic'd today & it looks like the culprit
> are user jobs, guess, overloading or other badness on the WN - the kernel
> panic mentions gfortran & the 8-core WN load hits about 29 before it
> bails. The jobs are via cmspil004.
29 on an 8-core box is no catastrophe: that's roughly 3.6 runnable tasks
contending for each CPU - high, but far from a disaster unless it lasts for
(say) 15 or 30 minutes. That sort of load does not usually cause a panic,
in my opinion.
What kernel is it, by the way? Any SL6 kernel after the first 2.6.32-431.*
and before 2.6.32-431.23.3 is flaky in my opinion, i.e. 2.6.32-431.23.3 is
the first non-flaky kernel in that range (this applies mostly to SuperMicro
boards).
> Some cmspil004 jobs are still running & we seem unable in any cmspil004
> working dir to find the real CMS user DN, & we're usually pretty good at
> being able to do that (I can for lhcb pilot jobs no problem).
I see.
> We emailed some CMS contacts & they said the real user DN must be in the
> glexec logs. On the WN, /var/log/glexec/* files are all empty. On the
> CREAM CE /var/log/glexec does not exist &
> /var/blah/user_blah_job_registry.bjr/registry.proxydir points to pilot
> proxies - again no real user DN info.
The /var/log/glexec logs should indeed be on the worker node, but the ones
on mine are empty too.
Maybe you could work it out from this file on the ARGUS server, matching up
the timestamps?
/var/log/argus/pepd/process.log
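If the pepd is logging the subject DN of each decision in there (I'm not
certain of the format, so treat the pattern as a guess to adjust), something
like this should pull out the CMS DNs around the time in question:

ARGUS# grep -i 'cn=' /var/log/argus/pepd/process.log | grep -i cms | less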
And here's a horrid way to approximately find out.
ARGUS# cd /etc/grid-security/gridmapdir
ARGUS# for inode in $(ls -i *cms* | grep ' [a-z]' | sed -e "s/ .*//"); do
           ls -lrti | grep "$inode" | grep '2 r'
       done
It lists every CMS entry that currently has a link (hence the '2 r', i.e. a
link count of 2). The resulting list will contain the "culprit".
You may find out by the dates, but beware - once a link is made it is not
remade, so it may have been made ages ago. Or maybe not. If the jobs are
still coming in, delete all the links; the link will be remade and you'll
see the DN next time. If the jobs are not coming in and the links are all
old (stale), then you'll have to try something else.
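A slightly less horrid sketch of the date check (assumes GNU find; note that
making a hard link bumps a file's ctime rather than its mtime, so ctime is
probably the timestamp you want):

ARGUS# cd /etc/grid-security/gridmapdir
ARGUS# find . -maxdepth 1 -type f -links 2 -printf '%CY-%Cm-%Cd %CH:%CM  %i  %f\n' | sort

That prints every leased entry (pool account and DN side, paired by inode
number) with its last status-change time, oldest first. It covers every VO,
not just CMS, so cross-reference the inode numbers with the loop above if
you need to narrow it down.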
> I'm not very familiar with the argus server & don't see any logs in
> /var/log/argus/* that look like they contain DNs. But said logs must be
> *somewhere* on it.....?
Did you check /var/log/argus/pepd/process.log?
> So we should be able to trace via glexec info from the pilot job arriving
> at CE, to WN, & find out the DN of a real user's job; the pool account
> cmspil004 can run jobs for many CMS users, we just want to identify this
> one....
Then do that trick of removing all the links in
/etc/grid-security/gridmapdir.
The next one made is your "man".
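A rough sketch of that, on the assumption that "the links" are the
URL-encoded DN entries (the filenames starting with %2f) and not the pool
account files themselves - and do think twice before running rm in that
directory:

ARGUS# cd /etc/grid-security/gridmapdir
ARGUS# ls %*                                  # eyeball the DN leases first
ARGUS# rm -i %*                               # drop them; they get recreated on the next mapping
ARGUS# watch -n 30 'ls -lc %* 2>/dev/null'    # the first new entry to appear is the DN you want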
> Is there some (ideally clear & easy) guidance "out there" for how to do
> this? I've been away from LCG support for 2 yrs so may've missed it if
> it's well known "out there" somewhere.
Dunno, but it raises a documentation issue. For audit purposes, a way to
track this MUST be available. I'll put it on my to-do list.
Cheers,
Steve
> Winnie Lacesso / Bristol University Particle Physics Computing Systems
> HH Wills Physics Laboratory, Tyndall Avenue, Bristol, BS8 1TL, UK
--
Steve Jones [log in to unmask]
System Administrator office: 220
High Energy Physics Division tel (int): 42334
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
University of Liverpool http://www.liv.ac.uk/physics/hep/