On 29/08/13 11:36, Catalin Condurache wrote:
> Well, in the end it appeared that the DB Updating process sometimes can be very slow.
> And during that time, the BDII service offered by that node is unavailable, and SAM tests might start to fail which is a bit annoying. Yesterday, we had all three topBDII nodes affected by this symptom at the same time.
Last time I remember seeing this, it was worse than the service being
unavailable - which should cause a failover - it returned a response
saying there were no resources.
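
That difference can be checked from the client side. Here is a minimal sketch in Python, assuming the ldapsearch output for the top BDII has already been captured as text. The helper name classify_bdii_reply and its three labels are made up for illustration; they are not part of any BDII tool.

```python
def classify_bdii_reply(ldif_text, query_succeeded=True):
    """Label a captured ldapsearch reply (hypothetical helper).

    'unavailable': the query itself failed, so clients can fail over.
    'empty':       the server answered but returned no entries.
    'ok':          at least one entry came back.
    """
    if not query_succeeded:
        return "unavailable"
    # In LDIF every entry starts with a "dn:" line ("dn::" when base64-encoded).
    entries = [line for line in ldif_text.splitlines()
               if line.lower().startswith("dn:")]
    return "ok" if entries else "empty"


if __name__ == "__main__":
    print(classify_bdii_reply("dn: o=glue\n"))
    print(classify_bdii_reply(""))
    print(classify_bdii_reply("", query_succeeded=False))
```

The 'empty' case is the one a plain "does port 2170 answer" probe would miss.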
>
> What I also noticed was that repeated 'bdii stop/start' could result in a faster bdii-update process (according to the 'strace' output for that process).
>
> I'll file a GGUS ticket the next time we are hit.
Chris
>
> Regards,
> Catalin
>
>> -----Original Message-----
>> From: LHC Computer Grid - Rollout [mailto:[log in to unmask]]
>> On Behalf Of Maria Alandes Pradillo
>> Sent: 29 August 2013 10:15
>> To: [log in to unmask]
>> Subject: Re: [LCG-ROLLOUT] topBDII issues
>>
>> Dear Catalin,
>>
>> I think I haven't understood the problem. Are you saying that the
>> bdii-update process is hanging?
>>
>> Could we follow this up in a GGUS ticket? Please specify, as usual, the OS, the bdii
>> package version you are running, the contents of /etc/bdii/bdii.conf and the
>> output of the top command (just the first 6 lines).
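
The items requested above could be gathered in one go. This is a rough sketch only, assuming a Python 3 interpreter and an RPM-based node with 'rpm' and 'top' on the PATH. The helper names (first_lines, run, read_file, build_report) are invented here, and the report layout is just a suggestion.

```python
import platform
import subprocess


def first_lines(text, n=6):
    """Return the first n lines of text, joined back into one string."""
    return "\n".join(text.splitlines()[:n])


def run(cmd):
    """Run a command and return its stdout, or a note if it cannot be run."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True,
                              check=False).stdout
    except OSError as exc:
        return "(could not run %s: %s)" % (" ".join(cmd), exc)


def read_file(path):
    """Return the file contents, or a note if the file cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError as exc:
        return "(could not read %s: %s)" % (path, exc)


def build_report():
    """Collect OS, bdii package version, bdii.conf and the top header."""
    sections = [
        ("OS", platform.platform()),
        ("bdii package", run(["rpm", "-q", "bdii"]).strip()),
        ("/etc/bdii/bdii.conf", read_file("/etc/bdii/bdii.conf")),
        # 'top -b -n 1' prints one batch-mode snapshot and exits.
        ("top (first 6 lines)", first_lines(run(["top", "-b", "-n", "1"]))),
    ]
    return "\n\n".join("== %s ==\n%s" % (title, body)
                       for title, body in sections)


if __name__ == "__main__":
    print(build_report())
```

Check bdii.conf for anything that should not go into a public ticket before pasting the output.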
>>
>> Thanks a lot,
>> Maria
>>
>>> -----Original Message-----
>>> From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
>>> On Behalf Of Catalin Condurache
>>> Sent: 28 August 2013 15:23
>>> To: [log in to unmask]
>>> Subject: Re: [LCG-ROLLOUT] topBDII issues
>>>
>>> As an update to my issue
>>>
>>> While the node appeared to be stuck during 'Logging Errors', the only 'active'
>>> related process was
>>>
>>> /usr/bin/python /usr/sbin/bdii-update -c /etc/bdii/bdii.conf -d
>>>
>>>
>>> [root@lcgbdii03 ~]# ps axfww|grep bdii
>>> 5523 pts/0 S+ 0:00 \_ grep bdii
>>> 31271 ? Ssl 0:32 /usr/sbin/slapd -f /etc/bdii/bdii-top-slapd.conf -h ldap://0.0.0.0:2170 -u ldap
>>> 31278 ? S 1:00 /usr/bin/python /usr/sbin/bdii-update -c /etc/bdii/bdii.conf -d
>>> 4470 ? S 0:00 \_ sh -c ldapadd -d 256 -x -c -h localhost -p 2170 -D o=glue -w d3VY8cwlr >/dev/null 2>/var/lib/bdii/add.err
>>>
>>> [root@lcgbdii03 ~]# ps axfww|grep slap
>>> 5526 pts/0 S+ 0:00 \_ grep slap
>>> 31271 ? Ssl 0:32 /usr/sbin/slapd -f /etc/bdii/bdii-top-slapd.conf -h ldap://0.0.0.0:2170 -u ldap
>>>
>>> [root@lcgbdii03 ~]# strace -p 31271
>>> Process 31271 attached - interrupt to quit
>>> futex(0x7f96f28bd9d0, FUTEX_WAIT, 31283, NULL^C <unfinished ...>
>>> Process 31271 detached
>>>
>>> [root@lcgbdii03 ~]# strace -p 31278
>>> Process 31278 attached - interrupt to quit
>>> write(4, "ronmentappname: VO-atlas-AtlasPh"..., 4096) = 4096
>>> write(4, "info: InfoProviderHost=creamce.i"..., 4096) = 4096
>>> write(4, "therinfo: InfoProviderHost=baaf0"..., 4096) = 4096
>>> write(4, "n-16.6.2.2-i686-slc5-gcc43-opt_l"..., 4096) = 4096
>>> write(4, "as-production-17.2.0.4-i686-slc5"..., 4096) = 4096
>>> write(4, "08-28T12:34:53Z\nglue2entityother"..., 4096) = 4096
>>> write(4, "Z\nglue2entityotherinfo: InfoProv"..., 4096) = 4096
>>> write(4, "7383,GLUE2GroupID=resource,GLUE2"..., 4096) = 4096
>>> write(4, "ntity\nobjectclass: GLUE2Applicat"..., 4096) = 4096
>>> write(4, "vironment\nglue2entitycreationtim"..., 4096) = 4096
>>> write(4, "E2ResourceID=clrccece02.in2p3.fr"..., 4096) = 4096
>>> write(4, "lement_Manager\nglue2applicatione"..., 4096) = 4096
>>> write(4, "-lcg.cr.cnaf.infn.it\nglue2applic"..., 4096) = 4096
>>> write(4, "2013-08-28T12:48:57Z\nglue2entity"..., 4096^C <unfinished ...>
>>> Process 31278 detached
>>>
>>> [root@lcgbdii03 ~]# strace -p 4470
>>> Process 4470 attached - interrupt to quit
>>> wait4(-1, ^C <unfinished ...>
>>> Process 4470 detached
>>>
>>> [root@lcgbdii03 ~]#
>>>
>>>
>>> Regards,
>>> Catalin
>>>
>>>
>>>
>>>> -----Original Message-----
>>>> From: LHC Computer Grid - Rollout
>>>> [mailto:[log in to unmask]]
>>>> On Behalf Of Catalin Condurache
>>>> Sent: 28 August 2013 13:23
>>>> To: [log in to unmask]
>>>> Subject: [LCG-ROLLOUT] topBDII issues
>>>>
>>>> Hi,
>>>>
>>>> I am experiencing problems with the topBDII service at RAL. Two out
>>>> of three nodes (part of the lcgbdii.gridpp.rl.ac.uk alias) are
>>>> apparently hanging while 'logging errors'
>>>> (/var/log/bdii/bdii-update.log) and are not accessible for ldap queries.
>>>>
>>>> 2013-08-28 13:09:50,772: [DEBUG] Doing Fix
>>>> 2013-08-28 13:10:09,882: [DEBUG] Writing new_ldif to disk
>>>> 2013-08-28 13:10:10,486: [INFO] Reading old LDIF file ...
>>>> 2013-08-28 13:10:10,486: [DEBUG] Starting Diff
>>>> 2013-08-28 13:10:29,248: [DEBUG] Finished Diff
>>>> 2013-08-28 13:10:29,249: [DEBUG] Sorting Add Keys
>>>> 2013-08-28 13:10:30,551: [DEBUG] Writing ldif_add to disk
>>>> 2013-08-28 13:10:32,184: [DEBUG] Adding New Entries
>>>> 2013-08-28 13:10:32,520: [DEBUG] Logging Errors
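
The timestamps in that excerpt already show how long each stage took before the update stalled. Here is a rough sketch for pulling that out of bdii-update.log, assuming the 'timestamp: [LEVEL] message' layout seen above. The function names are made up for illustration, and SAMPLE is just the excerpt from this mail.

```python
from datetime import datetime

LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

SAMPLE = """\
2013-08-28 13:09:50,772: [DEBUG] Doing Fix
2013-08-28 13:10:09,882: [DEBUG] Writing new_ldif to disk
2013-08-28 13:10:10,486: [INFO] Reading old LDIF file ...
2013-08-28 13:10:10,486: [DEBUG] Starting Diff
2013-08-28 13:10:29,248: [DEBUG] Finished Diff
2013-08-28 13:10:29,249: [DEBUG] Sorting Add Keys
2013-08-28 13:10:30,551: [DEBUG] Writing ldif_add to disk
2013-08-28 13:10:32,184: [DEBUG] Adding New Entries
2013-08-28 13:10:32,520: [DEBUG] Logging Errors
""".splitlines()


def parse_line(line):
    """Split 'YYYY-MM-DD HH:MM:SS,mmm: [LEVEL] message' into (time, message)."""
    stamp, rest = line.split(": ", 1)
    message = rest.split("] ", 1)[1]
    return datetime.strptime(stamp, LOG_FORMAT), message


def stage_durations(lines):
    """Return (message, seconds until the next log line) for each stage.

    The last stage has no following line, so its duration is None.
    """
    parsed = [parse_line(line) for line in lines if line.strip()]
    result = []
    for i, (when, message) in enumerate(parsed):
        if i + 1 < len(parsed):
            gap = (parsed[i + 1][0] - when).total_seconds()
            result.append((message, gap))
        else:
            result.append((message, None))
    return result


def seconds_stuck(lines, now):
    """Seconds between the last log line and 'now' (a naive datetime)."""
    last = [line for line in lines if line.strip()][-1]
    return (now - parse_line(last)[0]).total_seconds()


if __name__ == "__main__":
    for message, seconds in stage_durations(SAMPLE):
        print(message, seconds)
```

Calling seconds_stuck(lines, datetime.now()) on the live log gives a number to quote in a ticket, provided the log uses the node's local time.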
>>>>
>>>>
>>>> Restarting the service (or even rebooting the nodes) didn't improve
>>>> the situation.
>>>>
>>>> In the past we correlated similar behaviour to network disruptions,
>>>> but no such thing today (as far as we know), and also a 'bdii restart'
>>>> used to work in the past.
>>>>
>>>> I am running bdii-5.2.17-2.el6.noarch
>>>>
>>>> Any help or idea much appreciated.
>>>>
>>>> Many thanks,
>>>> Catalin Condurache
>>>> RAL Tier1 Grid Services
>>>>
>>>> --
>>>> Scanned by iCritical.