Hi, Kashif:
Before you upgraded, did you drain all the jobs in the queue? Or
did you just shut down the ARC daemons, run yum, and then start the
daemons again?
Cheers,Gang
On 29/06/2015 13:08, Kashif Mohammad wrote:
> Hi Raul/Gang
>
> Thanks for your useful suggestions. Fortunately ARC 15.03 update 1 appeared in the nick of time so I upgraded directly to nordugrid-arc-5.0.1-1. My observations are
>
> 1. Update was very smooth
>
> 2. Fix for WMS problem as mentioned in https://ggus.eu/index.php?mode=ticket_info&ticket_id=113745 is in this release
>
> 3. edited /usr/share/arc/Condor.pm # I think mainly LHCB needs this
>
> $lrms_queue{maxwalltime} = '6480';
> $lrms_queue{minwalltime} = '0';
> $lrms_queue{defaultwallt} = '2880';
> $lrms_queue{maxcputime} = '6480';
> $lrms_queue{mincputime} = '0';
> $lrms_queue{defaultcput} = '2880';
>
> 4. edited /usr/share/arc/glue-generator.pl
>
> GlueCECapability:Share=atlas:80
> GlueCECapability:Share=lhcb:10
> GlueCECapability:Share=other:10
>
>
> I haven't applied the workaround for the infoprovider crash yet. I want to see whether it happens again. Please let me know if you think I have missed anything.
>
>
> Cheers
>
> Kashif
>
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes [mailto:TB-
>> [log in to unmask]] On Behalf Of RAUL H C LOPES
>> Sent: 26 June 2015 16:07
>> To: [log in to unmask]
>> Subject: Re: Bug in /usr/share/arc/Condor.pm lead to ARC infoprovider crash
>>
>> Hi,
>>
>> I've only seen the infoprovider crash at Brunel once. As it was an isolated
>> case, I assumed I had made some mistake. I wonder if RAL has seen it.
>>
>> The glue-generator.pl problem is an old one. I assumed that we're all
>> patching it.
>>
>> raul
>>
>> On 26/06/15 15:21, qing wrote:
>>> Hi, Kashif:
>>>
>>> At Glasgow we just rebuilt the ARC-CE from scratch, because we are
>>> also changing the OS from SL6.4 to CentOS6.6. Several issues come to
>>> mind:
>>>
>>> 1. Concerning the WMS problem: in /usr/share/arc/submit-condor-job,
>>> you need to ensure that line 83 has a '.' before '*_condor_stdout$':
>>>
>>> if expr match "$joboption_stdout" '.*_condor_stdout$' > /dev/null; then
>>>
>>> Otherwise single-core jobs from WMS will run into problems. I think
>>> the ARC team will put this fix in the next release.
>>>
>>> 2. Ensure that $lrms_jobs{$id}{nodes} = [] is set in Condor.pm to
>>> avoid the infoprovider crash, as described in my previous message. The
>>> ARC team will put this bug fix in the next release.
>>>
>>> 3. If you want to publish the fairshare between VOs, you need to hack
>>> /usr/share/arc/glue-generator.pl. At Glasgow we just added 3 lines
>>> after the "GlueCECapability:
>>> CPUScalingReferenceSI00=$CPUSCALINGREFERENCESI00" line:
>>>
>>> GlueCECapability: Share=atlas:80
>>> GlueCECapability: Share=lhcb:10
>>> GlueCECapability: Share=other:10
>>>
>>> Cheers,Gang
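
For item 1 above, a minimal sketch of why the leading '.' matters. It assumes the unpatched pattern is the same expression without the '.' (the thread does not quote the original line), and it uses the portable "expr STRING : REGEX" form rather than "expr match". The helper name path_matches and the example path are illustrative only and are not part of ARC.

```shell
#!/bin/sh
# Hypothetical helper (not ARC code): report whether a path matches a basic
# regular expression. expr anchors the pattern at the start of the string.
path_matches() {
    # $1 = string to test, $2 = basic regular expression
    if expr "$1" : "$2" > /dev/null; then
        echo "match"
    else
        echo "no match"
    fi
}

path="/some/session/dir/_condor_stdout"   # illustrative path only

# With the leading '.', ".*" consumes any prefix, so the suffix test works.
path_matches "$path" '.*_condor_stdout$'

# Without it, a '*' at the start of a basic regex is a literal asterisk,
# so a path that begins with "/" cannot match.
path_matches "$path" '*_condor_stdout$'
```

The first call reports a match and the second does not, which is consistent with the condition never firing when the '.' is missing.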
>>>
>>> On 26/06/2015 14:50, RAUL H C LOPES wrote:
>>>> Hi Kashif,
>>>>
>>>> I've got 3 Arc-CEs in production. All on 5.0. The only problem was
>>>> that bug blocking submissions from WMS.
>>>> Solved.
>>>>
>>>> Thanks, raul
>>>>
>>>> On 26/06/15 14:28, Kashif Mohammad wrote:
>>>>> Hi
>>>>>
>>>>> On a related note, I am planning to upgrade from ARC 4.2 to ARC 5.0.
>>>>> Is there anything I should be aware of? I have looked at the
>>>>> release notes and it looks quite straightforward.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Kashif
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Testbed Support for GridPP member institutes [mailto:TB-
>>>>>> [log in to unmask]] On Behalf Of qing
>>>>>> Sent: 26 June 2015 12:02
>>>>>> To: [log in to unmask]
>>>>>> Subject: Bug in /usr/share/arc/Condor.pm lead to ARC infoprovider
>>>>>> crash
>>>>>>
>>>>>> Dear all:
>>>>>>
>>>>>> Some of you might have noticed that the BDII on the Glasgow ARC-CEs
>>>>>> sometimes disappears. This is due to random crashes of the ARC
>>>>>> infoprovider.
>>>>>>
>>>>>> After discussing this with the NorduGrid ARC team, we understand
>>>>>> that ARC does not handle some of the messages returned by Condor
>>>>>> well, which is why the infoprovider crashes at random.
>>>>>>
>>>>>> To fix this bug, a line in /usr/share/arc/Condor.pm needs to be
>>>>>> modified.
>>>>>> For ARC version 5.0.0 it's line 550, and for ARC version 4.2.0-1,
>>>>>> it's line 545.
>>>>>>
>>>>>> $lrms_jobs{$id}{nodes} = "";
>>>>>>
>>>>>> needs to be changed to:
>>>>>>
>>>>>> $lrms_jobs{$id}{nodes} = [];
>>>>>>
>>>>>> If you see "Can't use an undefined value as an ARRAY reference
>>>>>> at /usr/share/arc/ARC0mod.pm line 135." in infoprovider.log, it
>>>>>> means you are affected. Our site is heavily affected by this bug:
>>>>>> the infoprovider on our ARC-CEs crashes many times a day. We
>>>>>> applied this change yesterday morning, and during the past 24 hours,
>>>>>> with the site fully loaded, the infoprovider hasn't crashed a single
>>>>>> time on any of the 4 ARC-CEs. This assures me that the change fixed
>>>>>> the bug. However, since the crash happens randomly, the situation
>>>>>> may be different between sites, so I leave it to you to decide
>>>>>> whether to apply this bug fix or not.
>>>>>>
>>>>>> Cheers,Gang
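
The check described in the original message above can be sketched as a small shell function. This is only a sketch under stated assumptions: check_infoprovider is a hypothetical name, not an ARC tool. Both file paths are passed in by the caller because the location of infoprovider.log is site-specific. The Condor.pm line is assumed to be spaced exactly as quoted in the message.

```shell
#!/bin/sh
# Hypothetical helper (not ARC code): look for the crash signature in the
# infoprovider log and for the unpatched assignment in Condor.pm.
check_infoprovider() {
    log_file=$1     # your infoprovider.log (location is site-specific)
    condor_pm=$2    # e.g. /usr/share/arc/Condor.pm

    # Fixed-string search for the error text quoted in the thread.
    if grep -F -q -e "Can't use an undefined value as an ARRAY reference" "$log_file"; then
        echo "log: crash signature found"
    else
        echo "log: crash signature not found"
    fi

    # Fixed-string search for the original assignment. Assumes the spacing
    # matches the line as quoted in the thread.
    if grep -F -q -e '$lrms_jobs{$id}{nodes} = "";' "$condor_pm"; then
        echo "Condor.pm: old assignment still present"
    else
        echo "Condor.pm: old assignment not present"
    fi
}

# Usage (paths are examples; adjust to your site):
#   check_infoprovider /path/to/infoprovider.log /usr/share/arc/Condor.pm
```

Since the crash is random, a clean log does not prove a site is unaffected; the sketch only reports what is in the two files at the time it runs.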