Well, in the end it turned out that the DB update process can sometimes be very slow.
During that time the BDII service offered by that node is unavailable, and SAM tests may start to fail, which is a bit annoying. Yesterday all three topBDII nodes were affected by this symptom at the same time.
I also noticed that repeated 'bdii stop/start' could result in a faster bdii-update process (according to the 'strace' output for that process).
I'll file a GGUS ticket the next time we are hit.
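
In case it helps to quantify the slowness for the ticket, here is a minimal sketch (not part of the bdii package) that works out how long each update phase took from the timestamps in /var/log/bdii/bdii-update.log. It assumes the line format seen in the log excerpt quoted further down; the function names parse_line and phase_durations are made up for illustration.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "2013-08-28 13:10:32,520"
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_line(line):
    """Split '<timestamp>: [LEVEL] message' into (datetime, message)."""
    stamp, _, rest = line.strip().partition(": [")
    _level, _, message = rest.partition("] ")
    return datetime.strptime(stamp, LOG_TIME_FORMAT), message


def phase_durations(lines, now=None):
    """Return [(message, seconds)] for each logged phase.

    A phase lasts until the next log line. The last phase has no
    successor, so it is only reported when `now` is given, which is
    how a phase that is still running (or hanging) shows up.
    """
    entries = [parse_line(line) for line in lines if ": [" in line]
    durations = [
        (message, (next_time - start).total_seconds())
        for (start, message), (next_time, _) in zip(entries, entries[1:])
    ]
    if now is not None and entries:
        last_start, last_message = entries[-1]
        durations.append((last_message, (now - last_start).total_seconds()))
    return durations
```

Feeding it the lines of the log file and passing now=datetime.now() would show how long the node has been sitting in the last logged phase, e.g. 'Logging Errors'.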
Regards,
Catalin
> -----Original Message-----
> From: LHC Computer Grid - Rollout [mailto:[log in to unmask]]
> On Behalf Of Maria Alandes Pradillo
> Sent: 29 August 2013 10:15
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] topBDII issues
>
> Dear Catalin,
>
> I think I haven't understood the problem. Are you saying that the
> bdii-update process is hanging?
>
> Could we follow this up in a GGUS ticket? Please specify, as usual, the OS, the
> bdii package version you are running, the contents of /etc/bdii/bdii.conf and the
> output of the top command (just the first 6 lines).
>
> Thanks a lot,
> Maria
>
> > -----Original Message-----
> > From: LHC Computer Grid - Rollout [mailto:LCG-[log in to unmask]]
> > On Behalf Of Catalin Condurache
> > Sent: 28 August 2013 15:23
> > To: [log in to unmask]
> > Subject: Re: [LCG-ROLLOUT] topBDII issues
> >
> > As an update to my issue
> >
> > While the node appears to be stuck during 'Logging Errors', the only
> > 'active' related process was
> >
> > /usr/bin/python /usr/sbin/bdii-update -c /etc/bdii/bdii.conf -d
> >
> >
> > [root@lcgbdii03 ~]# ps axfww|grep bdii
> > 5523 pts/0 S+ 0:00 \_ grep bdii
> > 31271 ? Ssl 0:32 /usr/sbin/slapd -f /etc/bdii/bdii-top-slapd.conf -h ldap://0.0.0.0:2170 -u ldap
> > 31278 ? S 1:00 /usr/bin/python /usr/sbin/bdii-update -c /etc/bdii/bdii.conf -d
> > 4470 ? S 0:00 \_ sh -c ldapadd -d 256 -x -c -h localhost -p 2170 -D o=glue -w d3VY8cwlr >/dev/null 2>/var/lib/bdii/add.err
> >
> > [root@lcgbdii03 ~]# ps axfww|grep slap
> > 5526 pts/0 S+ 0:00 \_ grep slap
> > 31271 ? Ssl 0:32 /usr/sbin/slapd -f /etc/bdii/bdii-top-slapd.conf -h ldap://0.0.0.0:2170 -u ldap
> >
> > [root@lcgbdii03 ~]# strace -p 31271
> > Process 31271 attached - interrupt to quit
> > futex(0x7f96f28bd9d0, FUTEX_WAIT, 31283, NULL^C <unfinished ...>
> > Process 31271 detached
> >
> > [root@lcgbdii03 ~]# strace -p 31278
> > Process 31278 attached - interrupt to quit
> > write(4, "ronmentappname: VO-atlas-AtlasPh"..., 4096) = 4096
> > write(4, "info: InfoProviderHost=creamce.i"..., 4096) = 4096
> > write(4, "therinfo: InfoProviderHost=baaf0"..., 4096) = 4096
> > write(4, "n-16.6.2.2-i686-slc5-gcc43-opt_l"..., 4096) = 4096
> > write(4, "as-production-17.2.0.4-i686-slc5"..., 4096) = 4096
> > write(4, "08-28T12:34:53Z\nglue2entityother"..., 4096) = 4096
> > write(4, "Z\nglue2entityotherinfo: InfoProv"..., 4096) = 4096
> > write(4, "7383,GLUE2GroupID=resource,GLUE2"..., 4096) = 4096
> > write(4, "ntity\nobjectclass: GLUE2Applicat"..., 4096) = 4096
> > write(4, "vironment\nglue2entitycreationtim"..., 4096) = 4096
> > write(4, "E2ResourceID=clrccece02.in2p3.fr"..., 4096) = 4096
> > write(4, "lement_Manager\nglue2applicatione"..., 4096) = 4096
> > write(4, "-lcg.cr.cnaf.infn.it\nglue2applic"..., 4096) = 4096
> > write(4, "2013-08-28T12:48:57Z\nglue2entity"..., 4096^C <unfinished ...>
> > Process 31278 detached
> >
> > [root@lcgbdii03 ~]# strace -p 4470
> > Process 4470 attached - interrupt to quit
> > wait4(-1, ^C <unfinished ...>
> > Process 4470 detached
> >
> > [root@lcgbdii03 ~]#
> >
> >
> > Regards,
> > Catalin
> >
> >
> >
> > > -----Original Message-----
> > > From: LHC Computer Grid - Rollout
> > > [mailto:[log in to unmask]]
> > > On Behalf Of Catalin Condurache
> > > Sent: 28 August 2013 13:23
> > > To: [log in to unmask]
> > > Subject: [LCG-ROLLOUT] topBDII issues
> > >
> > > Hi,
> > >
> > > I am experiencing problems with the topBDII service at RAL. Two out
> > > of three nodes (part of the lcgbdii.gridpp.rl.ac.uk alias) are
> > > apparently hanging while 'logging errors'
> > > (/var/log/bdii/bdii-update.log) and are not accessible for ldap queries.
> > >
> > > 2013-08-28 13:09:50,772: [DEBUG] Doing Fix
> > > 2013-08-28 13:10:09,882: [DEBUG] Writing new_ldif to disk
> > > 2013-08-28 13:10:10,486: [INFO] Reading old LDIF file ...
> > > 2013-08-28 13:10:10,486: [DEBUG] Starting Diff
> > > 2013-08-28 13:10:29,248: [DEBUG] Finished Diff
> > > 2013-08-28 13:10:29,249: [DEBUG] Sorting Add Keys
> > > 2013-08-28 13:10:30,551: [DEBUG] Writing ldif_add to disk
> > > 2013-08-28 13:10:32,184: [DEBUG] Adding New Entries
> > > 2013-08-28 13:10:32,520: [DEBUG] Logging Errors
> > >
> > >
> > > Restarting the service (or even rebooting the nodes) didn't improve
> > > the situation.
> > >
> > > In the past we correlated similar behaviour to network disruptions,
> > > but no such thing today (as far as we know), and also a 'bdii restart'
> > > used to work in the past.
> > >
> > > I am running bdii-5.2.17-2.el6.noarch
> > >
> > > Any help or idea much appreciated.
> > >
> > > Many thanks,
> > > Catalin Condurache
> > > RAL Tier1 Grid Services
> > >
> > > --
> > > Scanned by iCritical.