JISCMail - DIRAC-USERS Archives

Hi Jens,

My problems were towards the end of the week (Saturday), and in the end I
cancelled those jobs as they were not leading anywhere. So they will turn up as
"cancelled".

Best wishes,
Lydia



  On Tue, 12 Jan 2016, Jensen, Jens (STFC,RAL,SC) wrote:

> Hi Lydia,
>
> If you see the problem again, you should be able to see it in the
> ftsmon. There were four timeouts to three disk servers on Monday the 4th
> during 15:00-16:00 but I didn't see any since then. If we see those
> again we need to investigate more closely but it seems to me they were
> likely due to some networking problem or some related type of blip.
>
> The other failures within the past seven days were either a
> cancelled-by-user or the ones that were killed at 3600 seconds. Maybe
> your client didn't exit properly after the server had terminated the
> transfer?
>
> Send us the id if you see another one.
>
> Cheers
> --jens
>
> On 12/01/2016 09:12, Lydia Heck wrote:
>>
>> Hi Jens,
>>
>> I am asking now for 36000 seconds, as I had the problem with the large
>> file.
>>
>> So that cannot be the problem in these transfers that "hang" after
>> having transferred everything and then just sit there with a 0k
>> transfer rate.
>>
>> Lydia
>>
>>
>> On Mon, 11 Jan 2016, Jensen, Jens (STFC,RAL,SC) wrote:
>>
>>> Brian says it's just the timeout you ask for when you submit (with the
>>> --timeout switch). Or rather, 3600 is the timeout you get if you don't
>>> ask for one :-)
>>>
>>> It might be too low a limit but it is at least easy to fix by asking for
>>> a higher timeout.
>>>
>>> Cheers
>>> --jens
>>>
>>> On 11/01/2016 14:55, Lydia Heck wrote:
>>>>
>>>> Should this be forwarded to the support email?
>>>>
>>>> Lydia
>>>>
>>>>
>>>> On Mon, 11 Jan 2016, Jensen, Jens (STFC,RAL,SC) wrote:
>>>>
>>>>> Brian points out I missed a 3600 second timeout on the transfer (there
>>>>> is more than one type of timeout). So it follows that the successful
>>>>> transfers at the same time would have taken less than one hour?
>>>>>
>>>>> On 11/01/2016 13:36, Jensen, Jens (STFC,RAL,SC) wrote:
>>>>>> On 11/01/2016 12:23, Lydia Heck wrote:
>>>>>>> Once I had sent the previous response I realised that maybe I had
>>>>>>> not made myself clear: it is the last 3 or 4 cancelled jobs that are
>>>>>>> of note here.
>>>>>>>
>>>>>> There are some which are disk server timeouts, and they are
>>>>>> attempting
>>>>>> to go to:
>>>>>> 2016-01-04T15:07:26    ***    130.246.179.46
>>>>>> 2016-01-04T15:21:30    ***    130.246.179.44
>>>>>> 2016-01-04T15:35:47    ***    130.246.179.47
>>>>>> 2016-01-04T15:50:48    ***    130.246.179.44
>>>>>>
>>>>>> These are the ones which seem to time out (and have ~7500 seconds
>>>>>> between submit time and start time, just more than two hours):
>>>>>>
>>>>>> https://lcgfts3.gridpp.rl.ac.uk:8449/fts3/ftsmon/#/job/8abd37a1-e13f-45c5-9c98-7b3f2c475b8e
>>>>>>
>>>>>>
>>>>>> https://lcgfts3.gridpp.rl.ac.uk:8449/fts3/ftsmon/#/job/919d9423-9f71-4b72-91de-f3cf8c38d44b
>>>>>>
>>>>>>
>>>>>> and this one which says it was cancelled but is in the "FAILED" bucket
>>>>>> (maybe because it retried?)
>>>>>> https://lcgfts3.gridpp.rl.ac.uk:8449/fts3/ftsmon/#/job/d6b5299d-f2a7-433b-a22a-31add26682b3
>>>>>>
>>>>>>
>>>>>>
>>>>>> Looking at the logs they seem to transfer happily and then suddenly
>>>>>> time
>>>>>> out after precisely 60 minutes... to within a second (e.g.
>>>>>> starting at
>>>>>> 21:33:46 and getting killed at 22:33:46). Hmm...
>>>>>>
>>>>>> And the same log says
>>>>>>
>>>>>> Resetting global timeout thread to 33600 seconds
>>>>>>
>>>>>> so that's not it. And it's not the proxy because it's a happy long
>>>>>> lived
>>>>>> one.
>>>>>>
>>>>>> It certainly suggests a problem at the server end - getting killed in
>>>>>> the three thousand six hundredth second of the transfer is quite
>>>>>> suspicious...
>>>>>>
>>>>>> Cheers
>>>>>> -j
>>>>>
>>>
>
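
The "precisely 60 minutes" observation in the thread can be checked with a few lines of arithmetic. The following is a minimal sketch and not part of FTS: the function names and the one-second tolerance are assumptions made for illustration, while the 3600-second default and the 21:33:46 / 22:33:46 timestamps are taken from the messages above.

```python
from datetime import datetime, timedelta

# Default transfer timeout mentioned in the thread (applies when no
# --timeout is requested at submit time).
DEFAULT_TIMEOUT_S = 3600


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS log timestamps.

    Assumes the end falls less than 24 hours after the start, so an end
    time that looks earlier than the start is taken to be on the next day.
    """
    fmt = "%H:%M:%S"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        t1 += timedelta(days=1)
    return int((t1 - t0).total_seconds())


def hit_default_timeout(start: str, end: str, tolerance_s: int = 1) -> bool:
    """True if the transfer ran for the default timeout, within a tolerance.

    The tolerance is a hypothetical parameter chosen for this sketch.
    """
    return abs(elapsed_seconds(start, end) - DEFAULT_TIMEOUT_S) <= tolerance_s


if __name__ == "__main__":
    # Timestamps quoted in the thread: started 21:33:46, killed 22:33:46.
    print(elapsed_seconds("21:33:46", "22:33:46"))
    print(hit_default_timeout("21:33:46", "22:33:46"))
```

A transfer that dies exactly DEFAULT_TIMEOUT_S after it started points at the per-transfer timeout rather than at a network blip, which is the conclusion reached later in the thread.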