From the ATLAS elog:
>>>
From: Zbigniew Baranowski <[log in to unmask]>
Date: Thu, Apr 12, 2012 at 20:54
Subject: Network disruption in the grid
To: "grid-service-databases (LCG distributed deployment of databases)" <[log in to unmask]>
Dear All,
All replication links between T0 and the T1s have been restored after a big network glitch that started around 4pm today and lasted until about 7pm.
The LHCb replicas are already up to date, but ATLAS is still catching up.
Best regards,
Zbyszek
<<<
Elena
On 13 Apr 2012, at 17:58, John Gordon wrote:
> RAL certainly saw problems with connectivity to CERN yesterday afternoon.
>
> John
>
> -----Original Message-----
> From: [log in to unmask] [mailto:[log in to unmask]] On Behalf Of Tiziana Ferrari
> Sent: 13 April 2012 17:10
> To: NGI Operations Centre managers
> Subject: [Noc-managers] site-bdii instability on April 12 around 16 CET
>
> Dear NGIs
>
> as you can see in the ticket linked in the message below, several sites
> in IberGrid, but also in the Italian NGI, were malfunctioning yesterday,
> April 12, around 16 CET. Some of these site-BDIIs were reported to be
> running gLite 3.2, but IberGrid is currently assessing whether the problem
> also affected EMI releases.
>
> A GEANT backbone connectivity problem was also reported at the same
> time, and it is still not clear if the two issues are related.
>
> I recommend that you contact your site managers to assess the status in
> your NGI. Please report any problems at the next operations meeting, which
> will take place on Monday next week.
>
> Please remind your site administrators that gLite 3.2 is reaching end of
> life at the end of the month, and UMD 1.6 includes two important
> upgrades: BDII site 1.1.0 and BDII core 1.3.0, which fix a number of
> known instability issues [*].
>
> Best wishes
> Tiziana
>
> [*] http://repository.egi.eu/2012/04/02/release-umd-1-6-0/
> - BDII site 1.1.0: This update enables OpenLDAP 2.4 by default for the
> Site-BDII; previous OpenLDAP versions caused instability issues.
> - BDII core 1.3.0: This version reduces the disk and memory footprint
> to improve the stability of both Site-BDII and Top-BDII services. In
> this release the Top-BDII cache is enabled by default, and information is
> cached for 12 hours (these parameters are configurable in YAIM).
>
> -------- Original Message --------
> Subject: [Operations] GEANT failure impacted site-bdii performance
> Date: Fri, 13 Apr 2012 12:44:23 +0100
> From: Gonçalo Borges <[log in to unmask]>
> Organisation: LIP
> To: [log in to unmask], "[log in to unmask]"
> <[log in to unmask]>,
> "[log in to unmask]" <[log in to unmask]>
>
> Hi Tiziana, Peter,
>
> Yesterday, the Portuguese NREN informed us that there was a problem with
> the GÉANT network, in particular with a router in Geneva (12th April 2012,
> around 16 CET).
>
> Coincidence or not, the NGI_IBERGRID infrastructure faced, at the same
> time, huge problems with the site-BDII services at all sites. Several
> sites reported that yesterday (12th April, around 16h CET) it was
> impossible to restart the site-BDII service: immediately after every
> restart, it went to 100% CPU usage and did not answer queries.
>
> Today the GÉANT connectivity has been re-established, but the site-BDIIs
> at the majority of the sites did not recover on their own and had to be
> restarted manually. Moreover, sites are still reporting that the
> site-BDIIs are consuming a lot of memory. Here are three examples from
> three different sites:
>
> IFIC in Valencia:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME COMMAND
> 13712 ldap 18 0 4577m 1.0g 1.0g S 2.3 13.0 18:25.16 slapd
>
>
> UB-LCG2 in Barcelona (slapd is using about 4.5 GB of virtual memory, of
> which about 840 MB is resident, RSS):
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> ldap 30483 0.2 84.0 4690792 861984 ? Ssl 09:37 0:13 /usr/sbin/slapd -f
> /etc/bdii/bdii-slapd.conf -h ldap://0.0.0.0:2170 -u ldap
> ldap 30493 0.0 0.3 124460 3276 ? S 09:37 0:02 /usr/bin/python
> /usr/sbin/bdii-update -c /etc/bdii/bdii.conf -d
>
>
> LIP in Lisbon:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 9227 ldap 18 0 4671m 1.0g 1.0g S 0.0 51.0 1:47.80 slapd
>
> Therefore, we do not understand:
> 1) why the service (which is supposed to be internal) is affected by
> such a GÉANT issue;
> 2) why the site-BDIIs did not recover on their own once the GÉANT issue
> was resolved;
> 3) why so much memory is consumed even after the manual restart.
>
> Advice is needed in order to understand the situation and to avoid
> future problems. The service should not be affected in this way by this
> kind of problem.
>
> I've opened ticket
> https://ggus.eu/ws/ticket_info.php?ticket=81235
> to track the issue
>
> Best Regards
> Goncalo Borges
>
>
>
> --
> Tiziana Ferrari
> EGI.eu Operations
> Science Park 140, 1098 XG Amsterdam, NL
> m: 0031 (0)6 3037 2691
> _______________________________________________
> Noc-managers mailing list
> [log in to unmask]
> https://mailman.egi.eu/mailman/listinfo/noc-managers
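As a side note on the memory figures quoted in Gonçalo's message: the `ps aux`-style output shown there can be checked automatically. The following is a minimal, hypothetical sketch (not part of the original report) that parses such output and flags slapd processes whose resident memory (RSS) exceeds a threshold; the 500 MB limit, the function name, and the hard-coded sample lines are illustrative assumptions.

```python
# Hypothetical sketch: flag slapd processes whose resident memory (RSS)
# exceeds a threshold, given `ps aux`-style output. The 500 MB limit is
# an assumed alert threshold, not a value from the report.

RSS_LIMIT_KB = 500 * 1024  # 500 MB, expressed in KB as ps reports RSS

def high_memory_slapd(ps_output, rss_limit_kb=RSS_LIMIT_KB):
    """Return (pid, rss_kb) pairs for slapd processes over the limit."""
    offenders = []
    for line in ps_output.splitlines():
        fields = line.split()
        # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        if len(fields) >= 11 and "slapd" in fields[10]:
            pid, rss_kb = int(fields[1]), int(fields[5])
            if rss_kb > rss_limit_kb:
                offenders.append((pid, rss_kb))
    return offenders

# Example using the UB-LCG2 figures quoted above (VSZ and RSS in KB):
sample = (
    "ldap 30483 0.2 84.0 4690792 861984 ? Ssl 09:37 0:13 /usr/sbin/slapd -f "
    "/etc/bdii/bdii-slapd.conf\n"
    "ldap 30493 0.0 0.3 124460 3276 ? S 09:37 0:02 /usr/bin/python "
    "/usr/sbin/bdii-update"
)
print(high_memory_slapd(sample))  # the 861984 KB slapd is flagged
```

Feeding it live data would be a matter of piping in `ps aux` output; the point is only that the symptom reported above (slapd resident memory in the hundreds of MB) is easy to detect before the service stops answering queries.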
__________________________________________________
Dr Elena Korolkova
Email: [log in to unmask]
Tel.: +44 (0)114 2223553
Fax: +44 (0)114 2223555
Department of Physics and Astronomy
University of Sheffield
Sheffield, S3 7RH, United Kingdom