Hi,
Today is the second time we have experienced problems with our top BDII nodes. It first happened a week or so ago, and I believed it was a transient issue, but it was not.
On one of our nodes, df shows the tmpfs on /var/lib/bdii/db mounted eight times:
[root@lcgbdii04 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       672G  2.9G  635G   1% /
tmpfs           7.8G     0  7.8G   0% /dev/shm
/dev/sda1       248M   87M  149M  37% /boot
tmpfs           1.5G     0  1.5G   0% /var/lib/bdii/db
tmpfs           1.5G     0  1.5G   0% /var/lib/bdii/db
tmpfs           1.5G     0  1.5G   0% /var/lib/bdii/db
tmpfs           1.5G     0  1.5G   0% /var/lib/bdii/db
tmpfs           1.5G     0  1.5G   0% /var/lib/bdii/db
tmpfs           1.5G     0  1.5G   0% /var/lib/bdii/db
tmpfs           1.5G     0  1.5G   0% /var/lib/bdii/db
tmpfs           1.5G     0  1.5G   0% /var/lib/bdii/db
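For anyone who wants to check their own nodes for the same repeated mount, here is a minimal sketch. It assumes the input is text in /proc/mounts format (device, mount point, filesystem type, ...). The function name stacked_mounts is my own and is not part of the bdii package.

```python
from collections import Counter


def stacked_mounts(mounts_text):
    """Return {mount_point: count} for mount points listed more than once.

    mounts_text is expected to be in /proc/mounts format, where the
    second whitespace-separated field of each line is the mount point.
    """
    counts = Counter()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            counts[fields[1]] += 1
    # Keep only the mount points that appear more than once.
    return {mp: n for mp, n in counts.items() if n > 1}
```

On a node it could be called as stacked_mounts(open("/proc/mounts").read()). Note that / may legitimately appear more than once on some systems (for example a rootfs entry), so only the count for /var/lib/bdii/db is of interest here.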
Also, in /var/log/messages:
...
Oct 4 09:51:29 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:51:29 lcgbdii04 abrt-server[23422]: Saved Python crash dump of pid 22113 to /var/spool/abrt/pyhook-2016-10-04-09:51:29-22113
Oct 4 09:51:29 lcgbdii04 abrtd: Directory 'pyhook-2016-10-04-09:51:29-22113' creation detected
Oct 4 09:51:29 lcgbdii04 abrtd: Package 'bdii' isn't signed with proper key
Oct 4 09:51:29 lcgbdii04 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2016-10-04-09:51:29-22113' exited with 1
Oct 4 09:51:29 lcgbdii04 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2016-10-04-09:51:29-22113'
Oct 4 09:51:40 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:51:40 lcgbdii04 abrt-server[23428]: Not saving repeating crash in '/usr/sbin/bdii-update'
Oct 4 09:51:52 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:51:52 lcgbdii04 abrt-server[23432]: Saved Python crash dump of pid 23431 to /var/spool/abrt/pyhook-2016-10-04-09:51:52-23431
Oct 4 09:51:52 lcgbdii04 abrtd: Directory 'pyhook-2016-10-04-09:51:52-23431' creation detected
Oct 4 09:51:52 lcgbdii04 abrtd: Package 'bdii' isn't signed with proper key
Oct 4 09:51:52 lcgbdii04 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2016-10-04-09:51:52-23431' exited with 1
Oct 4 09:51:52 lcgbdii04 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2016-10-04-09:51:52-23431'
Oct 4 09:54:05 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:54:05 lcgbdii04 abrt-server[23562]: Saved Python crash dump of pid 23561 to /var/spool/abrt/pyhook-2016-10-04-09:54:05-23561
Oct 4 09:54:05 lcgbdii04 abrtd: Directory 'pyhook-2016-10-04-09:54:05-23561' creation detected
Oct 4 09:54:05 lcgbdii04 abrtd: Package 'bdii' isn't signed with proper key
Oct 4 09:54:05 lcgbdii04 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2016-10-04-09:54:05-23561' exited with 1
Oct 4 09:54:05 lcgbdii04 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2016-10-04-09:54:05-23561'
Oct 4 09:54:05 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:54:05 lcgbdii04 abrt-server[23591]: Not saving repeating crash in '/usr/sbin/bdii-update'
Oct 4 09:54:06 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:54:06 lcgbdii04 abrt-server[23693]: Not saving repeating crash in '/usr/sbin/bdii-update'
Oct 4 09:55:10 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:55:10 lcgbdii04 abrt-server[25121]: Saved Python crash dump of pid 23694 to /var/spool/abrt/pyhook-2016-10-04-09:55:10-23694
Oct 4 09:55:10 lcgbdii04 abrtd: Directory 'pyhook-2016-10-04-09:55:10-23694' creation detected
Oct 4 09:55:10 lcgbdii04 abrtd: Package 'bdii' isn't signed with proper key
Oct 4 09:55:10 lcgbdii04 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2016-10-04-09:55:10-23694' exited with 1
Oct 4 09:55:10 lcgbdii04 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2016-10-04-09:55:10-23694'
Oct 4 09:56:02 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:56:02 lcgbdii04 abrt-server[26425]: Saved Python crash dump of pid 25124 to /var/spool/abrt/pyhook-2016-10-04-09:56:02-25124
Oct 4 09:56:02 lcgbdii04 abrtd: Directory 'pyhook-2016-10-04-09:56:02-25124' creation detected
Oct 4 09:56:02 lcgbdii04 abrtd: Package 'bdii' isn't signed with proper key
Oct 4 09:56:02 lcgbdii04 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2016-10-04-09:56:02-25124' exited with 1
Oct 4 09:56:02 lcgbdii04 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2016-10-04-09:56:02-25124'
Oct 4 09:56:13 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:56:13 lcgbdii04 abrt-server[26432]: Not saving repeating crash in '/usr/sbin/bdii-update'
Oct 4 09:56:25 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:56:25 lcgbdii04 abrt-server[26434]: Saved Python crash dump of pid 26433 to /var/spool/abrt/pyhook-2016-10-04-09:56:25-26433
Oct 4 09:56:25 lcgbdii04 abrtd: Directory 'pyhook-2016-10-04-09:56:25-26433' creation detected
Oct 4 09:56:25 lcgbdii04 abrtd: Package 'bdii' isn't signed with proper key
Oct 4 09:56:25 lcgbdii04 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2016-10-04-09:56:25-26433' exited with 1
Oct 4 09:56:25 lcgbdii04 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2016-10-04-09:56:25-26433'
Oct 4 09:58:39 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:58:39 lcgbdii04 abrt-server[26533]: Saved Python crash dump of pid 26532 to /var/spool/abrt/pyhook-2016-10-04-09:58:39-26532
Oct 4 09:58:39 lcgbdii04 abrtd: Directory 'pyhook-2016-10-04-09:58:39-26532' creation detected
Oct 4 09:58:39 lcgbdii04 abrtd: Package 'bdii' isn't signed with proper key
Oct 4 09:58:39 lcgbdii04 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2016-10-04-09:58:39-26532' exited with 1
Oct 4 09:58:39 lcgbdii04 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2016-10-04-09:58:39-26532'
Oct 4 09:58:39 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:58:39 lcgbdii04 abrt-server[26562]: Not saving repeating crash in '/usr/sbin/bdii-update'
Oct 4 09:58:39 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:58:39 lcgbdii04 abrt-server[26664]: Not saving repeating crash in '/usr/sbin/bdii-update'
Oct 4 09:59:38 lcgbdii04 abrt: detected unhandled Python exception in '/usr/sbin/bdii-update'
Oct 4 09:59:38 lcgbdii04 abrt-server[29345]: Saved Python crash dump of pid 26665 to /var/spool/abrt/pyhook-2016-10-04-09:59:38-26665
Oct 4 09:59:38 lcgbdii04 abrtd: Directory 'pyhook-2016-10-04-09:59:38-26665' creation detected
Oct 4 09:59:38 lcgbdii04 abrtd: Package 'bdii' isn't signed with proper key
Oct 4 09:59:38 lcgbdii04 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2016-10-04-09:59:38-26665' exited with 1
Oct 4 09:59:38 lcgbdii04 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2016-10-04-09:59:38-26665'
Oct 4 10:02:24 lcgbdii04 kernel: slapd[31093] general protection ip:7fa7e5e9381b sp:7fa759e793f0 error:0 in libc-2.12.so[7fa7e5e1b000+18a000]
...
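To compare nodes quickly, the messages above can be tallied with a short script. This is only a sketch: the two regular expressions are written against the exact line formats shown in this excerpt, and the function name summarize and the key prefixes are made up for illustration.

```python
import re
from collections import Counter

# Patterns matched against the syslog lines quoted in this mail.
CRASH_RE = re.compile(r"abrt: detected unhandled Python exception in '([^']+)'")
GPF_RE = re.compile(r"kernel: (\w+)\[\d+\] general protection")


def summarize(log_text):
    """Count Python crashes reported by abrt and general protection faults."""
    counts = Counter()
    for line in log_text.splitlines():
        m = CRASH_RE.search(line)
        if m:
            counts["python-exception:" + m.group(1)] += 1
            continue
        m = GPF_RE.search(line)
        if m:
            counts["general-protection:" + m.group(1)] += 1
    return counts
```

It could be run as summarize(open("/var/log/messages").read()) on each top BDII node to see how often bdii-update and slapd are failing.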
I also restarted another node (part of the lcgbdii.gridpp.rl.ac.uk alias), and during the first run of bdii-update the following was logged:
/var/log/messages
...
Oct 4 11:20:07 lcgbdii01 kernel: slapd[6245] general protection ip:7fe31fdd481b sp:7fe19affc3f0 error:0 in libc-2.12.so[7fe31fd5c000+18a000]
...
/var/log/bdii/bdii-update.log
...
2016-10-04 11:16:15,056: [DEBUG] Sorting Add Keys
2016-10-04 11:16:15,476: [DEBUG] Writing ldif_add to disk
2016-10-04 11:16:15,901: [DEBUG] Adding New Entries
2016-10-04 11:16:16,031: [DEBUG] Logging Errors
2016-10-04 11:19:03,273: [DEBUG] Logging Errors
2016-10-04 11:20:07,826: [DEBUG] Logging Errors
2016-10-04 11:20:07,845: [WARNING] ldap_result: Can't contact LDAP server (-1)
2016-10-04 11:20:07,845: [WARNING] ldapadd: update failed: GlueSALocalID=PRIMEDISKONLY:replica:online,GlueSEUniqueID=dcsrm.usatlas.bnl.gov,Mds-Vo-name=BNL-ATLAS,mds-vo-name=local,o=grid
2016-10-04 11:20:07,845: [WARNING] ldap_add: Can't contact LDAP server (-1)
2016-10-04 11:20:07,846: [WARNING] ldapadd: update failed: GlueVOViewLocalID=ops,GlueCEUniqueID=uagrid.org.ua:2811/nordugrid-SLURM-alice,Mds-Vo-name=UA_ICYB_ARC,mds-vo-name=local,o=grid
2016-10-04 11:20:07,846: [WARNING] ldap_add: Can't contact LDAP server (-1)
2016-10-04 11:20:07,846: [WARNING] ldapadd: update failed: GlueVOViewLocalID=ops,GlueCEUniqueID=cream-ge-8-kit.gridka.de:8443/cream-sge-sl6,Mds-Vo-name=FZK-LCG2,mds-vo-name=local,o=grid
.
.
.
2016-10-04 11:20:13,080: [INFO] PluginsTime: 0
2016-10-04 11:20:13,080: [INFO] ProvidersTime: 89
2016-10-04 11:20:13,080: [DEBUG] ldapadd o=infosys updatestats
ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)
2016-10-04 11:20:13,128: [INFO] Sleeping for 120 seconds
[root@lcgbdii01 ~]# rpm -qa|grep bdii
bdii-5.2.23-1.el6.noarch
bdii-config-top-1.0.10-1.el6.noarch
emi-bdii-top-1.0.2-2.el6.noarch
glite-yaim-bdii-4.3.15-1.el6.noarch
Are other sites running a top BDII seeing similar problems?
Any ideas on how to fix them?
Many thanks in advance,
Catalin Condurache
RAL Tier-1