Hi,
I've been bugging Jisc about this bug for ages, so thought I'd try
looking at it myself. I didn't make much headway, but perhaps enough to
let someone who knows what they're doing fix it :)
The failure mode is that tids processes do not die, and instead sit
around chewing 100% CPU - over time you have enough of these to bring
your IdP to its knees. We've bodged round this by having a cron job do
system tids restart ever 2 hours :(
strace on a spinning tids produces no output (suggesting no system calls
are being made), gdb (with moonshot-trust-router-dbg installed) always
looks roughly like:
Attaching to program: /usr/bin/tids, process 2374
[New LWP 2375]
[New LWP 2376]
[New LWP 2377]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fac5b1804c2 in log4shib::Category::getChainedPriority() const ()
from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
(gdb) bt
#0 0x00007fac5b1804c2 in log4shib::Category::getChainedPriority() const ()
from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
#1 0x00007fac5b17fc79 in log4shib::Category::isPriorityEnabled(int)
const ()
from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
#2 0x00007fac5b1809fc in log4shib::Category::info(char const*, ...) ()
from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
#3 0x00007fac5d9a63c2 in shibsp::SPConfig::term() ()
from /usr/lib/x86_64-linux-gnu/libshibsp.so.6
#4 0x00007fac5d9a6a88 in shibsp::SPInternalConfig::term() ()
from /usr/lib/x86_64-linux-gnu/libshibsp.so.6
#5 0x00007fac5dd6fd7e in shibresolver::ShibbolethResolver::term() ()
from /usr/lib/x86_64-linux-gnu/libshibresolver.so.1
#6 0x00007fac5e1c9189 in gssEapLocalAttrProviderFinalize ()
from /usr/lib/x86_64-linux-gnu/gss/mech_eap.so
#7 0x00007fac5e1c1174 in ?? () from
/usr/lib/x86_64-linux-gnu/gss/mech_eap.so
#8 0x00007fac5f5ebff8 in __run_exit_handlers (status=status@entry=0,
listp=0x7fac5f9755f8 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#9 0x00007fac5f5ec045 in __GI_exit (status=status@entry=0) at exit.c:104
#10 0x00000000004059b7 in tids_accept (tids=0x190e200, listen=<optimized
out>)
at tid/tids.c:485
#11 0x0000000000405dec in tids_start (tids=tids@entry=0x190e200,
req_handler=req_handler@entry=0x403dc0 <tids_req_handler>,
auth_handler=auth_handler@entry=0x403d70 <auth_handler>,
hostname=<optimized out>, port=port@entry=12309,
cookie=cookie@entry=0x1904dc0) at tid/tids.c:546
#12 0x0000000000403a94 in main (argc=<optimized out>, argv=<optimized out>)
at tid/example/tids_main.c:389
tid/tids.c:485 is
exit(0); /* exit to kill forked child process */
...so it appears to be a bug in something's exit handlers?
getChainedPriority does have a loop in it:
const Category* c = this;
while(c->getPriority() >= Priority::NOTSET) {
c = c->getParent();
}
...which makes me wonder if something is being incorrectly initialised,
but I'm rather clutching at straws here.
Debian/Ubuntu don't ship a log4shib library with debugging symbols
installed.
I then installed moonshot-gss-eap-dbg and the problem seems much slower
to recur (to the point that I'd thought it had caused the problem to
entirely go away); now a bt looks like:
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fd74ece84c2 in log4shib::Category::getChainedPriority() const ()
from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
(gdb) bt
#0 0x00007fd74ece84c2 in log4shib::Category::getChainedPriority() const ()
from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
#1 0x00007fd74ece7c79 in log4shib::Category::isPriorityEnabled(int)
const ()
from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
#2 0x00007fd74ece89fc in log4shib::Category::info(char const*, ...) ()
from /usr/lib/x86_64-linux-gnu/liblog4shib.so.1
#3 0x00007fd75150e3c2 in shibsp::SPConfig::term() ()
from /usr/lib/x86_64-linux-gnu/libshibsp.so.6
#4 0x00007fd75150ea88 in shibsp::SPInternalConfig::term() ()
from /usr/lib/x86_64-linux-gnu/libshibsp.so.6
#5 0x00007fd7518d7d7e in shibresolver::ShibbolethResolver::term() ()
from /usr/lib/x86_64-linux-gnu/libshibresolver.so.1
#6 0x00007fd751d310f7 in gss_eap_shib_attr_provider::finalize ()
at util_shib.cpp:481
#7 0x00007fd751d31189 in gssEapLocalAttrProviderFinalize (
minor=minor@entry=0x7ffeac0d31f4) at util_shib.cpp:550
#8 0x00007fd751d29174 in (anonymous
namespace)::finalize_class::~finalize_class (this=<optimized out>,
__in_chrg=<optimized out>) at util_attr.cpp:101
#9 0x00007fd753153ff8 in __run_exit_handlers (status=status@entry=0,
listp=0x7fd7534dd5f8 <__exit_funcs>,
run_list_atexit=run_list_atexit@entry=true) at exit.c:82
#10 0x00007fd753154045 in __GI_exit (status=status@entry=0) at exit.c:104
#11 0x00000000004059b7 in tids_accept (tids=0x968200, listen=<optimized
out>)
at tid/tids.c:485
#12 0x0000000000405dec in tids_start (tids=tids@entry=0x968200,
req_handler=req_handler@entry=0x403dc0 <tids_req_handler>,
auth_handler=auth_handler@entry=0x403d70 <auth_handler>,
hostname=<optimized out>, port=port@entry=12309,
cookie=cookie@entry=0x95edc0) at tid/tids.c:546
#13 0x0000000000403a94 in main (argc=<optimized out>, argv=<optimized out>)
at tid/example/tids_main.c:389
This is an Ubuntu Xenial system, but I've seen the runaway-tids problem
basically since I started looking at the moonshot pilot back at my
previous job (where we were running Debian).
Regards,
Matthew
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
|