Hi Steve
Every few days we get one or two jobs that stay in the held state, apparently because the status file could not be found. It looks like a race condition. It happens infrequently and across VOs, so I never put any serious effort into it.
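In case it helps with tracking how often this happens, here is a rough sketch of a script that tallies these holds per day from ShadowLog lines. It assumes the log line layout seen in the excerpts quoted below (date, time, job id in parentheses, then "Job ... going into Hold state (code 13,2)"); the function names gahp_holds and holds_per_day are made up for this example, and the pattern may need adjusting for other Condor versions.

```python
import re
from collections import Counter

# Assumed ShadowLog layout, taken from the excerpts in this thread:
# 01/26/18 07:41:01 (10342571.0) (3558208): Job 10342571.0 going into Hold state (code 13,2): ...
HOLD_RE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{2}) \d{2}:\d{2}:\d{2} "
    r"\((?P<job>\d+\.\d+)\) \(\d+\): "
    r"Job \S+ going into Hold state \(code 13,2\)"
)


def gahp_holds(lines):
    """Map job id -> date for jobs held because .gahp_complete was missing."""
    held = {}
    for line in lines:
        # Only consider hold messages that mention the ARC status file.
        if ".gahp_complete" not in line:
            continue
        match = HOLD_RE.search(line)
        if match:
            held[match.group("job")] = match.group("date")
    return held


def holds_per_day(held):
    """Count affected jobs per date string (MM/DD/YY as written in the log)."""
    return Counter(held.values())
```

Feeding it the lines of /var/log/condor/ShadowLog* (for example via open(path, errors="replace")) and printing holds_per_day(gahp_holds(lines)) should give a quick view of whether the holds cluster on particular days.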
Cheers
Kashif
>>>>-----Original Message-----
>>>>From: Testbed Support for GridPP member institutes [mailto:TB-
>>>>[log in to unmask]] On Behalf Of Stephen Jones
>>>>Sent: 26 March 2018 17:49
>>>>To: [log in to unmask]
>>>>Subject: Re: .gahp
>>>>
>>>>Hi Kashif,
>>>>
>>>>Thank you so much for confirming what I thought. This is likely a bug
>>>>rather than a config error, since you set your system up entirely
>>>>independently of ours.
>>>>
>>>>I guess you'll have some stale (held) jobs in condor_q that you need to
>>>>remove by hand. Please confirm if you notice this.
>>>>
>>>>I'd guess it's a race condition, since it occurs so infrequently. It may take a
>>>>bit of time to zero in on the source of it.
>>>>
>>>>Cheers,
>>>>
>>>>Ste
>>>>
>>>>
>>>>On 26/03/18 17:41, Kashif Mohammad wrote:
>>>>> Hi Steve
>>>>>
>>>>> I tried this on an SL6 ARC CE and got a few instances like this; around 21
>>>>> in the last 30 days
>>>>>
>>>>> ShadowLog.old:01/26/18 07:41:01 (10342571.0) (3558208):
>>>>> ReliSock::put_file_with_permissions(): Failed to stat file
>>>>> '/var/spool/arc/grid01/778NDmkY9yrnD0VBFmzXO77mABFKDmABFKDm9r7VDmABFKDmsnieim/.gahp_complete':
>>>>> No such file or directory (errno: 2, si_error: 1)
>>>>> ShadowLog.old:01/26/18 07:41:01 (10342571.0) (3558208): DoUpload:
>>>>> (Condor error code 13, subcode 2) SHADOW at 163.1.5.50 failed to send
>>>>> file(s) to <163.1.5.112:44261>: error reading from
>>>>> /var/spool/arc/grid01/778NDmkY9yrnD0VBFmzXO77mABFKDmABFKDm9r7VDmABFKDmsnieim/.gahp_complete:
>>>>> (errno 2) No such file or directory; STARTER failed to receive
>>>>> file(s) from <163.1.5.50:21671>
>>>>> ShadowLog.old:01/26/18 07:41:01 (10342571.0) (3558208): Job 10342571.0
>>>>> going into Hold state (code 13,2): Error from
>>>>> [log in to unmask]: SHADOW at 163.1.5.50 failed to send file(s) to
>>>>> <163.1.5.112:44261>: error reading from
>>>>> /var/spool/arc/grid01/778NDmkY9yrnD0VBFmzXO77mABFKDmABFKDm9r7VDmABFKDmsnieim/.gahp_complete:
>>>>> (errno 2) No such file or directory; STARTER failed to receive
>>>>> file(s) from <163.1.5.50:21671>
>>>>>
>>>>>
>>>>> Cheers
>>>>>
>>>>> Kashif
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Testbed Support for GridPP member institutes [mailto:TB-
>>>>>>>>> [log in to unmask]] On Behalf Of Stephen Jones
>>>>>>>>> Sent: 26 March 2018 17:36
>>>>>>>>> To: [log in to unmask]
>>>>>>>>> Subject: .gahp
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Can someone who is running ARC/Condor please do this for me?
>>>>>>>>>
>>>>>>>>> # cd /var/log/condor/
>>>>>>>>> # grep .gahp ShadowLog*
>>>>>>>>>
>>>>>>>>> And let me know if anything like this pops out:
>>>>>>>>>
>>>>>>>>> ReliSock::put_file_with_permissions(): Failed to stat file
>>>>>>>>> '/var/spool/arc/grid/u2FNDmPzfKsnKbMCrqsOzK9nABFKDmABFKDmnMUaDm9BFKDmwLXxtm/.gahp_complete':
>>>>>>>>> No such file or directory (errno: 2, si_error: 1)
>>>>>>>>> DoUpload: (Condor error code 13, subcode 2) SHADOW at
>>>>>>>>> 192.168.178.105 failed to send file(s) to <192.168.26.14:27452>:
>>>>>>>>> error reading from
>>>>>>>>> /var/spool/arc/grid/u2FNDmPzfKsnKbMCrqsOzK9nABFKDmABFKDmnMUaDm9BFKDmwLXxtm/.gahp_complete:
>>>>>>>>> (errno 2) No such file or directory; STARTER failed to receive
>>>>>>>>> file(s) from <138.253.178.105:9618>
>>>>>>>>> Job 208640.0 going into Hold state (code 13,2): Error from
>>>>>>>>> [log in to unmask]: SHADOW at 192.168.178.105 failed to send
>>>>>>>>> file(s) to <192.168.26.14:27452>: error reading from
>>>>>>>>> /var/spool/arc/grid/u2FNDmPzfKsnKbMCrqsOzK9nABFKDmABFKDmnMUaDm9BFKDmwLXxtm/.gahp_complete:
>>>>>>>>> (errno 2) No such file or directory; STARTER failed to receive
>>>>>>>>> file(s) from <138.253.178.105:9618>
>>>>>>>>>
>>>>>>>>> PS: Using CentOS7, nordugrid-arc-5.4.1-1.el7.centos.x86_64 and
>>>>>>>>> condor-8.6.3-1.el7.x86_64
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Ste
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Steve Jones [log in to unmask]
>>>>>>>>> Grid System Administrator office: 220
>>>>>>>>> High Energy Physics Division tel (int): 43396
>>>>>>>>> Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
>>>>>>>>> University of Liverpool http://www.liv.ac.uk/physics/hep/
>>>>
>>>>
>>>>--
>>>>Steve Jones [log in to unmask]
>>>>Grid System Administrator office: 220
>>>>High Energy Physics Division tel (int): 43396
>>>>Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
>>>>University of Liverpool http://www.liv.ac.uk/physics/hep/