On Sat, Dec 19, 2009 at 01:01:40PM +0100, Lukasz Flis wrote:
> Hello *,
>
> Despite upgrading our glite-WN to glite-WN-version-3.1.27-0.x86_64 we are
> still observing hundreds of lcg-cp segfaults per day around the
> cluster.
>
> What is interesting, only some of the VOs are affected by this issue and till
> now we didn't know why.
>
> Here are the facts:
> - majority of segfaults affect biomed vo
> - lcg-cp segfaults once in a while
> - biomed is forcing use of lcg-bdii.cern.ch
>
> We found that some of the BDII machines from the pool are causing crashes.
> I have isolated one of them and here's the result:
>
> (Note: you don't need any user proxy as failure happens before authorization)
>
> # export LCG_GFAL_INFOSYS=128.142.198.40:2170
> # lcg-cp -vv --vo biomed sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53 file:/tmp/test33

Hello Lukasz,

We are the administrators of se01.isabella.grnet.gr, and we have seen
this problem too, because it affected some of our non-biomed local
users. However, I have to disagree with your hypothesis: I don't think
the problem is related to the top-level BDII.

Instead, the segmentation fault occurs regardless of the top-level BDII
used, whenever the SURL is of the type 'sfn://...' (Classic SE SURL).
Because se01.isabella.grnet.gr was formerly a Classic SE that was
migrated to DPM, there are a lot of 'sfn://' type SURLs registered on
various LFCs that point to it.

We have found that there is already a Savannah bug for GFAL-client
regarding memory corruption with 'sfn://' SURLs:

https://savannah.cern.ch/bugs/?func=detailitem&item_id=56373

and we have also found that the problem is resolved if we downgrade
GFAL-client to version GFAL-client-1.11.6-2.slc4.

Example:
[kyrginis@ui01 ~]$ rpm -q GFAL-client
GFAL-client-1.11.6-2.slc4
[kyrginis@ui01 ~]$ export LCG_GFAL_INFOSYS=128.142.198.40:2170
[kyrginis@ui01 ~]$ lcg-cp -v --vo biomed sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53 file:/tmp/test33
[WARNING] specified VO and proxy VO are different!
Using grid catalog type: UNKNOWN
Using grid catalog : lfc-biomed.in2p3.fr
VO name: biomed
Checksum type: None
Trying SURL sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53 ...
Source SE type: Classic SE
Source URL: sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53
File size: 333571
Source URL for copy: gsiftp://se01.isabella.grnet.gr//storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53
Destination URL: file:/tmp/test33
# streams: 1
0 bytes 0.00 KB/sec avg 0.00 KB/sec inst
Transfer took 1000 ms
So lcg-cp succeeds when GFAL-client is 1.11.6-2.slc4. After upgrading
to the latest version:
[kyrginis@ui01 ~]$ rpm -q GFAL-client
GFAL-client-1.11.8-2.slc4
[kyrginis@ui01 ~]$ export LCG_GFAL_INFOSYS=128.142.198.40:2170
[kyrginis@ui01 ~]$ lcg-cp -v --vo biomed sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53 file:/tmp/test33
[WARNING] specified VO and proxy VO are different!
Using grid catalog type: UNKNOWN
Using grid catalog : lfc-biomed.in2p3.fr
VO name: biomed
Checksum type: None
Trying SURL sfn://se01.isabella.grnet.gr/storage/biomed/generated/2008-11-03/filecac6cd8a-663a-4d81-8037-2256a3ec1f53 ...
*** glibc detected *** free(): invalid next size (fast): 0x08ccdbe0 ***
Aborted
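
For anyone who wants to flag risky transfers on their own nodes, the two conditions above (the 1.11.8-2 build plus an 'sfn://' SURL) can be checked with a few lines of shell. This is only a rough sketch: the function names are made up for illustration and are not part of lcg_util or GFAL, and the version pattern covers only the build we actually saw abort, so other builds may be affected as well.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of lcg_util or GFAL.

# True if the SURL uses the Classic SE 'sfn://' scheme.
is_sfn_surl() {
    case "$1" in
        sfn://*) return 0 ;;
        *)       return 1 ;;
    esac
}

# True if the package string (as printed by `rpm -q GFAL-client`) is the
# build that aborted in the test above. Assumption: only 1.11.8-2 is
# matched, because that is the only failing build observed here.
is_affected_gfal() {
    case "$1" in
        GFAL-client-1.11.8-2.*) return 0 ;;
        *)                      return 1 ;;
    esac
}

# Prints "at risk" when both conditions hold, "ok" otherwise.
check_transfer() {
    if is_affected_gfal "$1" && is_sfn_surl "$2"; then
        echo "at risk"
    else
        echo "ok"
    fi
}
```

On a node it could be called as `check_transfer "$(rpm -q GFAL-client)" "$SURL"`, where SURL holds the source argument you are about to pass to lcg-cp.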
Cheers,
Kyriakos Ginis
HG-01-GRNET/HG-06-EKT admin team
P.S. Please note that currently se01.isabella.grnet.gr happens to be under very
high load because of biomed data transfers, so this might affect any
tests you might attempt.
--
Kyriakos Ginis
Software Engineering Laboratory
National Technical University of Athens