RAL certainly saw problems with connectivity to CERN yesterday afternoon.
John
-----Original Message-----
From: [log in to unmask] [mailto:[log in to unmask]] On Behalf Of Tiziana Ferrari
Sent: 13 April 2012 17:10
To: NGI Operations Centre managers
Subject: [Noc-managers] site-bdii instability on April 12 around 16 CET
Dear NGIs,
as you can see in the ticket linked in the message below, several sites
in IberGrid, and also in the Italian NGI, were malfunctioning yesterday,
April 12, around 16:00 CET. Some of these site-BDIIs were reported to be
running gLite 3.2, but IberGrid is currently assessing whether the problem
also affected EMI releases.
A GEANT backbone connectivity problem was also reported at the same
time, and it is still not clear whether the two issues are related.
I recommend that you get in contact with your site managers to
assess the status in your NGI. Please report any problems at the next
operations meeting, which will take place on Monday next week.
Please remind your site administrators that gLite 3.2 reaches end of
life at the end of the month, and that UMD 1.6 includes two important
upgrades, BDII site 1.1.0 and BDII core 1.3.0, which fix a number of
known instability issues [*].
Best wishes
Tiziana
[*] http://repository.egi.eu/2012/04/02/release-umd-1-6-0/
- BDII site 1.1.0: this update enables OpenLDAP 2.4 by default for the
site-BDII; previous OpenLDAP versions caused instability issues.
- BDII core 1.3.0: this version reduces the disk and memory footprint
to improve the stability of both the site-BDII and top-BDII services. In
this release the top-BDII cache is enabled by default, and information is
cached for 12 hours (these parameters are configurable in YAIM).
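To check what is currently installed before upgrading, a couple of
standard rpm queries should be enough (a minimal sketch; exact package
names may differ slightly between the gLite and EMI repositories):

    # List installed BDII packages and the OpenLDAP server version
    rpm -qa | grep -i bdii
    rpm -q openldap-servers

    # Confirm which package provides the running slapd binary
    rpm -qf /usr/sbin/slapd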
-------- Original Message --------
Subject: [Operations] GEANT failure impacted site-bdii performance
Date: Fri, 13 Apr 2012 12:44:23 +0100
From: Gonçalo Borges <[log in to unmask]>
Organisation: LIP
To: [log in to unmask], "[log in to unmask]"
<[log in to unmask]>,
"[log in to unmask]" <[log in to unmask]>
Hi Tiziana, Peter,
Yesterday the Portuguese NREN informed us that there was a problem with
the GEANT network, in particular with a router in Geneva (12th April 2012,
around 16:00 CET).
Coincidence or not, at the same time the NGI_IBERGRID infrastructure
faced huge problems with the site-BDII services at all sites. Several
sites complained that yesterday (12th April, around 16:00 CET) it was
impossible to restart the site-BDII service: immediately after every
restart it went to 100% CPU usage and did not answer queries.
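For reference, we checked whether the services answered at all with a
direct LDAP query against port 2170 (the standard site-BDII port, as
seen in the slapd command line further below); hostname and site name
are anonymised placeholders here:

    # Query the site-BDII directly; a healthy service replies within
    # seconds, a hung one times out or returns nothing
    ldapsearch -x -LLL -H ldap://site-bdii.example.org:2170 \
        -b mds-vo-name=SITENAME,o=grid '(objectClass=GlueSite)' GlueSiteName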
Today the GEANT situation has been resolved, but the site-BDIIs at the
majority of sites did not recover on their own and had to be restarted
manually. Moreover, sites are still complaining that the site-BDIIs are
consuming a lot of memory. Here are 3 examples from 3 different sites:
IFIC in Valencia:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME COMMAND
13712 ldap 18 0 4577m 1.0g 1.0g S 2.3 13.0 18:25.16 slapd
UB-LCG2 in Barcelona (slapd is using 4 GB of virtual memory, of which ~840 MB is resident):
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ldap 30483 0.2 84.0 4690792 861984 ? Ssl 09:37 0:13 /usr/sbin/slapd -f
/etc/bdii/bdii-slapd.conf -h ldap://0.0.0.0:2170 -u ldap
ldap 30493 0.0 0.3 124460 3276 ? S 09:37 0:02 /usr/bin/python
/usr/sbin/bdii-update -c /etc/bdii/bdii.conf -d
LIP in Lisbon:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9227 ldap 18 0 4671m 1.0g 1.0g S 0.0 51.0 1:47.80 slapd
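For reference, the same snapshot can be reproduced at any site with a
standard ps invocation (nothing site-specific here; VSZ/RSS are reported
in kilobytes):

    # CPU/memory snapshot of all slapd processes
    ps -C slapd -o pid,user,%cpu,%mem,vsz,rss,etime,cmd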
Therefore, we do not understand:
1) why the service (which is supposed to be internal) is affected by
such a GEANT issue;
2) why the site-BDII did not recover once the GEANT issue was fixed;
3) why so much memory is consumed even after the manual restart.
Advice is needed in order to understand the situation and to avoid
future problems; the service should not be affected by this kind of
problem.
I've opened ticket
https://ggus.eu/ws/ticket_info.php?ticket=81235
to track the issue
Best Regards
Goncalo Borges
--
Tiziana Ferrari
EGI.eu Operations
Science Park 140, 1098 XG Amsterdam, NL
m: 0031 (0)6 3037 2691