Hi Elena,
The correspondence below regards your LHCb file transfer problems, but
unfortunately offers no solution.
James
-------- Original Message --------
Subject: Re: [Gridpp #47831] Failed transfers from Sheffield
Date: Fri, 3 Jul 2009 15:26:54 +0200
From: Andrew C. Smith <[log in to unmask]>
To: Nandakumar, R (Raja) <[log in to unmask]>
CC: lhcb-dirac (LHCb Grid) <[log in to unmask]>
Hi Raja,
If it transfers 30MB and then stops, I doubt it is a problem with the
ports. No idea, to be honest.
Cheers,
Andrew
> Hi Andrew,
>
> Thanks. Any ideas to help our colleagues at Sheffield debug the
> problem?
> Could it be a problem with some ports?
>
> Cheers,
> Raja.
>
>> -----Original Message-----
>> From: Andrew C. Smith [mailto:[log in to unmask]]
>> Sent: 03 July 2009 14:13
>> To: Nandakumar, R (Raja)
>> Cc: lhcb-dirac (LHCb Grid)
>> Subject: Re: [Gridpp #47831] Failed transfers from Sheffield
>>
>> Hi Raja,
>>
>> I am afraid that if the transfer starts correctly then just stops,
>> this is an issue at the gfal/lcg_utils layer. Not sure what we can do
>> about it.
>>
>> Cheers,
>> Andrew
>>
>> On 3 Jul 2009, at 15:07, Nandakumar, R (Raja) wrote:
>>
>>> Hi,
>>>
>>> Posting to lhcb-grid, as I do not know if this is an operations or
>>> development issue.
>>>
>>> The problem here refers to failed transfers from Sheffield into RAL in
>>> production jobs. About 33% of the jobs failed in this mode at Sheffield
>>> about two weeks ago. No problems were found on the worker nodes at
>>> Sheffield. It may help to understand this problem as RAL is due to come
>>> out of downtime on Monday (on schedule still!) and pick up data from the
>>> Tier-2s again.
>>>
>>> Looking into the castor logs at RAL, Shaun found the following
>>> information - basically the transfer starts and runs for 5 seconds,
>>> transferring 25 - 27MB of data. It then stops and sleeps / whatever
>>> before the job is killed by the DIRAC watchdog. The files that were
>>> investigated by Shaun were :
>>>
>>> 2009-06-24 18:11:42 UTC dirac-jobexec.py/UploadOutputData INFO: Attempting
>>> rm.putAndRegister("/lhcb/MC/MC09/DST/00004837/0038/00004837_00380430_3.dst","/home/pillhb03/globus-tmp.wn083.17259.0/https_3a_2f_2fwms203.cern.ch_3a9000_2fGNnAPRtTgOoP3DM3FSG4lA/2974856/00004837_00380430_3.dst","RAL_MC-DST",guid=9011A304-E960-DE11-AAE8-00093D107A7F)
>>>
>>> 2009-06-24 17:40:07 UTC dirac-jobexec.py/UploadOutputData INFO: Attempting
>>> rm.putAndRegister("/lhcb/MC/MC09/DST/00004837/0038/00004837_00380370_3.dst","/home/prdlhb90/globus-tmp.wn013.14975.0/https_3a_2f_2fwms216.cern.ch_3a9000_2f9JCkw18_5fcJ16Nbtug3OolQ/2974796/00004837_00380370_3.dst","RAL_MC-DST",guid=8A42A4B4-E460-DE11-B9F6-00093D108F76)
>>>
>>> 2009-06-24 15:43:11 UTC dirac-jobexec.py/UploadOutputData INFO: Attempting
>>> rm.putAndRegister("/lhcb/MC/MC09/DST/00004837/0037/00004837_00377756_3.dst","/home/pillhb03/globus-tmp.wn099.25571.0/https_3a_2f_2frb03.pic.es_3a9000_2fOUO0MGjiKmErBS1HPrCNNA/2971932/00004837_00377756_3.dst","RAL_MC-DST",guid=566B9831-D460-DE11-9F60-00093D108D5C)
>>>
>>> Cheers,
>>> Raja.
>>>
>>> -----Original Message-----
>>> From: De Witt, S (Shaun) via RT [mailto:[log in to unmask]]
>>> Sent: 03 July 2009 12:39
>>> To: Nandakumar, R (Raja)
>>> Subject: [Gridpp #47831] Failed transfers from Sheffield
>>>
>>> For the 1st file, I can see the transfer successfully started. In the
>>> 1st 5 seconds 27,218,932 bytes were transferred and then no more bytes
>>> were transferred until the connection was aborted at 20:27 local time.
>>> This looks to me more like a client problem; either they are severely
>>> bandwidth limited or the client did not close the connection after the
>>> transfer (the latter is probably more likely).
>>>
>>> The second file is similar, with xfer starting at 18:40:31 and 25400689
>>> bytes xferred in the 1st 6 seconds and then no more data being sent
>>> until the connection aborts at 20:24.
>>>
>>> Third file: started xfer 16:43:29, 27683032 bytes in 5 secs, then
>>> nothing until abort at 18:31.
>>>
>>> All xfers to different disk servers.
>>>
>>> Since we seem to have a pretty consistent pattern emerging here I won't
>>> bother looking any further. Since the problem seems to originate at only
>>> 1 site, I suspect this is a client problem and not castor related. As
>>> such, I am closing this ticket for now.
>>>
>>>
>>> <URL: https://helpdesk.gridpp.rl.ac.uk:443/Ticket/Display.html?id=47831>
>>>
>>> --
>>> Scanned by iCritical.
>>
>>
--
James Cullen
NorthGrid Deputy Technical Coordinator