Hello *,
Despite upgrading our glite-WN to glite-WN-version-3.1.27-0.x86_64, we are
still observing hundreds of lcg-cp segfaults per day across the
cluster.
Interestingly, only some of the VOs are affected by this issue, and until
now we did not know why.
Here are the facts:
- the majority of segfaults affect the biomed VO
- lcg-cp segfaults only intermittently, not on every run
- biomed forces the use of lcg-bdii.cern.ch
We found that some of the BDII machines in the pool are causing the crashes.
I have isolated one of them, and here is the result:
(Note: no user proxy is needed, as the failure happens before authorization.)
# export LCG_GFAL_INFOSYS=128.142.198.40:2170
# lcg-cp -vv --vo biomed \
    sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53 \
    file:/tmp/test33
Using grid catalog type: UNKNOWN
[INFO] BDII server: 128.142.198.40:2170/o=grid
[INFO] BDII filter: (&(GlueServiceType=lcg-file-catalog)(|
(GlueServiceAccessControlBaseRule=VO:biomed)
(GlueServiceAccessControlBaseRule=biomed)
(GlueServiceAccessControlRule=biomed)))
[INFO] Trying to use BDII: 128.142.198.40:2170/o=grid (timeout 60)
Using grid catalog : lfc-biomed.in2p3.fr
VO name: biomed
Checksum type: None
Trying SURL
sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53 ...
[INFO] BDII filter: (|(GlueSEUniqueID=se01.isabella.grnet.gr)
(&(GlueServiceType=srm*)(GlueServiceEndpoint=*://se01.isabella.grnet.gr:*)))
[INFO] Trying to use BDII: 128.142.198.40:2170/o=grid (timeout 60)
[INFO] BDII filter: (&(GlueSEAccessProtocolType=*)
(GlueChunkKey=GlueSEUniqueID=se01.isabella.grnet.gr))
[INFO] Trying to use BDII: 128.142.198.40:2170/o=grid (timeout 60)
Segmentation fault (core dumped)
Oops. Let's see what GDB can tell us:
[biomed070@n10-4-31 ~]$ gdb --args lcg-cp -vv --vo biomed \
    sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53 \
    file:/tmp/test33
GNU gdb Red Hat Linux (6.3.0.0-1.159.el4rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols
found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".
(gdb) r
Starting program: /opt/lcg/bin/lcg-cp -vv --vo biomed
sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53
file:/tmp/test33
Using grid catalog type: UNKNOWN
[INFO] BDII server: 128.142.198.40:2170/o=grid
[INFO] BDII filter: (&(GlueServiceType=lcg-file-catalog)(|
(GlueServiceAccessControlBaseRule=VO:biomed)
(GlueServiceAccessControlBaseRule=biomed)
(GlueServiceAccessControlRule=biomed)))
[INFO] Trying to use BDII: 128.142.198.40:2170/o=grid (timeout 60)
[Thread debugging using libthread_db enabled]
[New Thread 182930638688 (LWP 15898)]
Using grid catalog : lfc-biomed.in2p3.fr
VO name: biomed
Checksum type: None
Trying SURL
sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53 ...
[INFO] BDII filter: (|(GlueSEUniqueID=se01.isabella.grnet.gr)
(&(GlueServiceType=srm*)(GlueServiceEndpoint=*://se01.isabella.grnet.gr:*)))
[INFO] Trying to use BDII: 128.142.198.40:2170/o=grid (timeout 60)
[INFO] BDII filter: (&(GlueSEAccessProtocolType=*)
(GlueChunkKey=GlueSEUniqueID=se01.isabella.grnet.gr))
[INFO] Trying to use BDII: 128.142.198.40:2170/o=grid (timeout 60)
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 182930638688 (LWP 15898)]
0x00000036d1468cf4 in _int_free () from /lib64/tls/libc.so.6
(gdb) bt
#0 0x00000036d1468cf4 in _int_free () from /lib64/tls/libc.so.6
#1 0x00000036d1469546 in free () from /lib64/tls/libc.so.6
#2 0x00000039a240c556 in ldap_msgfree () from /usr/lib64/libldap-2.2.so.7
#3 0x0000002a958d22fc in bdii_query_free (ld_ptr=0x7fbfffd850,
reply_ptr=0x545000)
at /home/condor/execute/dir_7875/userdir/org.glite.data.gfal/src/mds_ifce.c:310
#4 0x0000002a958d4038 in get_seap_info (host=Variable "host" is not
available.
)
at /home/condor/execute/dir_7875/userdir/org.glite.data.gfal/src/mds_ifce.c:1084
#5 0x0000002a958f384e in sfn_turlsfromsurls (nbfiles=1, sfns=0x7fbfffdc78,
protocols=0x7fbfffdb30, statuses=0x7fbfffdb28, errbuf=0x7fbfffdd30 "",
errbufsz=1024)
at /home/condor/execute/dir_7875/userdir/org.glite.data.gfal/src/sfn_ifce.c:148
#6 0x0000002a958f3a57 in sfn_getfilemd (nbfiles=672, surls=0x545000,
statuses=0x523190, errbuf=0x20090 <Address 0x20090 out of bounds>,
errbufsz=8, timeout=0)
at /home/condor/execute/dir_7875/userdir/org.glite.data.gfal/src/sfn_ifce.c:60
#7 0x0000002a958b3994 in gfal_ls (req=0x5230f0, errbuf=0x7fbfffdd30 "",
errbufsz=1024)
at /home/condor/execute/dir_7875/userdir/org.glite.data.gfal/src/gfal.c:1623
#8 0x0000002a9555dcc1 in lcg_cp4
(src_file=0x7fbfffefa7 "sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53",
dest_file=0x7fbffff011 "file:/tmp/test33", defaulttype=TYPE_NONE,
srctype=TYPE_NONE, dsttype=TYPE_NONE, nobdii=0, vo=0x7fbfffefa0 "biomed",
nbstreams=1,
conf_file=0x0, insecure=0, verbose=2, timeout=0,
src_spacetokendesc=0x5237b0 "rfio", dest_spacetokendesc=0x0,
cksmtype=GFAL_CKSM_NONE, errbuf=0x0, errbufsz=0)
at /home/condor/execute/dir_26429/userdir/org.glite.data.dm-util/src/lcg_cp.c:221
#9 0x00000000004015ec in main ()
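The crash happens while freeing the reply to the last BDII query, so it may help to send that same query directly and compare the replies of a crashing and a non-crashing BDII. A minimal sketch, assuming the OpenLDAP ldapsearch client is installed on the WN; the helper names seap_filter and bdii_seap_query are made up for illustration and are not part of gLite or GFAL:

```shell
#!/bin/sh
# Hypothetical helpers (not part of gLite/GFAL) for sending the last BDII
# query seen in the "lcg-cp -vv" output directly to a given BDII.

# Print the LDAP filter taken from the last "[INFO] BDII filter" line
# before the crash, for the storage element given as $1.
seap_filter() {
    se="$1"
    printf '(&(GlueSEAccessProtocolType=*)(GlueChunkKey=GlueSEUniqueID=%s))\n' "$se"
}

# Run the query against one BDII endpoint (host:port) with an anonymous
# simple bind, printing plain LDIF.
bdii_seap_query() {
    endpoint="$1"
    se="$2"
    ldapsearch -x -LLL -H "ldap://$endpoint" -b o=grid "$(seap_filter "$se")"
}

# Usage (needs network access to the BDII):
#   bdii_seap_query 128.142.198.40:2170 se01.isabella.grnet.gr > suspect.ldif
```

Running the same query against a BDII that does not trigger the crash and diffing the two LDIF files may narrow down what is different in the reply. This is only a diagnostic aid; the backtrace alone does not show whether the reply content or earlier heap corruption is to blame.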
------------------------------------------------------------------------------------
Please check whether you can reproduce this problem at your site by issuing
these commands:
# export LCG_GFAL_INFOSYS=128.142.198.40:2170
# lcg-cp -vv --vo biomed \
    sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53 \
    file:/tmp/test33
Please don't reinstall/reconfigure 128.142.198.40 until the problem has been
identified by the developers.
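To see which other pool members behave the same way, the test above can be repeated against every address behind the alias. A rough sketch, assuming the host utility (bind-utils) is installed and that the shell reports a process killed by SIGSEGV as exit status 139 (128 + 11); the function names are made up for illustration:

```shell
#!/bin/sh
# Hypothetical helpers for running the reproduction against every BDII
# behind a DNS alias. Not part of gLite; names are illustrative only.

SURL='sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53'

# Map a shell exit status to a short verdict. A process killed by
# SIGSEGV (signal 11) is reported by the shell as 128 + 11 = 139.
classify_exit() {
    case "$1" in
        0)   echo ok ;;
        139) echo segfault ;;
        *)   echo "error($1)" ;;
    esac
}

# List the unique IPv4 addresses the alias resolves to right now.
pool_ips() {
    host "$1" | awk '/has address/ {print $4}' | sort -u
}

# Run lcg-cp once per pool member and print "<ip> <verdict>".
probe_pool() {
    for ip in $(pool_ips "$1"); do
        LCG_GFAL_INFOSYS="$ip:2170" lcg-cp -vv --vo biomed \
            "$SURL" "file:/tmp/test33.$$" >/dev/null 2>&1
        rc=$?
        echo "$ip $(classify_exit "$rc")"
    done
}

# Usage: probe_pool lcg-bdii.cern.ch
```

Without a user proxy, the non-crashing BDIIs are expected to end in an ordinary error rather than "ok", so only the "segfault" verdict is meaningful here. Since lcg-cp segfaults only once in a while, and the alias may not return every pool member on each lookup, several runs may be needed.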
Best Regards
--
Lukasz Flis