Happy Friday!
About 1/4 of Bristol's WNs kernel-panicked today & it looks like the culprit
is user jobs, at a guess overloading or otherwise causing badness on the WN -
the kernel panic mentions gfortran & the 8-core WN load hits about 29 before
it bails. The jobs are via cmspil004.
Some cmspil004 jobs are still running, but we can't find the real CMS user
DN in any cmspil004 working dir, & we're usually pretty good at being able
to do that (I can for lhcb pilot jobs, no problem).
We emailed some CMS contacts & they said the real user DN must be in the
glexec logs. On the WN, /var/log/glexec/* files are all empty. On the
CREAM CE /var/log/glexec does not exist &
/var/blah/user_blah_job_registry.bjr/registry.proxydir points to pilot
proxies - again no real user DN info.
I'm not very familiar with the argus server & don't see any logs in
/var/log/argus/* that look like they contain DNs. But said logs must be
*somewhere* on it.....?
So we should be able to use the glexec info to trace the pilot job from its
arrival at the CE through to the WN, & find out the DN of the real user's
job. The pool account cmspil004 can run jobs for many CMS users; we just
want to identify this one.
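
For what it's worth, here is a minimal sketch of the kind of search I have in mind, in case it helps frame the question. It rests on assumptions I have not been able to confirm: that wherever glexec/argus actually log (possibly syslog rather than /var/log/glexec), a single log line carries both the pool account name and the user DN, and that the DN is either quoted or runs to the end of the line. The function name find_dns and the log line layout are made up for illustration.

```python
import re
import sys
from collections import Counter

# Matches X.509-style subject DNs such as /DC=ch/DC=cern/CN=Some User.
# Assumption: the DN is quoted or runs to the end of the line; unquoted
# trailing text after the DN would be swallowed into the match.
DN_PATTERN = re.compile(r'(?:/[A-Za-z]+=[^/"\n]+)+')


def find_dns(lines, account):
    """Count DN-like strings on lines that mention the given pool account."""
    counts = Counter()
    for line in lines:
        if account not in line:
            continue
        for dn in DN_PATTERN.findall(line):
            counts[dn.strip()] += 1
    return counts


def main(argv):
    # argv[1] is the pool account, the rest are log files to scan.
    account, paths = argv[1], argv[2:]
    totals = Counter()
    for path in paths:
        with open(path, errors="replace") as handle:
            totals.update(find_dns(handle, account))
    for dn, n in totals.most_common():
        print(n, dn)


# Only runs when an account and at least one log file are given.
if __name__ == "__main__" and len(sys.argv) > 2:
    main(sys.argv)
```

It would be run with the pool account followed by whichever log files turn out to be the right ones, e.g. cmspil004 and a syslog file on the WN or the argus server; which files those are is exactly what I'm asking.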
Is there some (ideally clear & easy) guidance "out there" for how to do
this? I've been away from LCG support for 2 yrs so may've missed it if
it's well known.
Winnie Lacesso / Bristol University Particle Physics Computing Systems
HH Wills Physics Laboratory, Tyndall Avenue, Bristol, BS8 1TL, UK