Hi Raul/Gang
Thanks for your useful suggestions. Fortunately, ARC 15.03 update 1 appeared in the nick of time, so I upgraded directly to nordugrid-arc-5.0.1-1. My observations are:
1. The update was very smooth.
2. The fix for the WMS problem mentioned in https://ggus.eu/index.php?mode=ticket_info&ticket_id=113745 is in this release.
3. I edited /usr/share/arc/Condor.pm (I think it is mainly LHCb that needs this):
$lrms_queue{maxwalltime} = '6480';
$lrms_queue{minwalltime} = '0';
$lrms_queue{defaultwallt} = '2880';
$lrms_queue{maxcputime} = '6480';
$lrms_queue{mincputime} = '0';
$lrms_queue{defaultcput} = '2880';
4. I edited /usr/share/arc/glue-generator.pl to add:
GlueCECapability:Share=atlas:80
GlueCECapability:Share=lhcb:10
GlueCECapability:Share=other:10
I haven't applied the workaround for the infoprovider crash yet, as I want to see whether it happens again. Please let me know if you think I have missed something.
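In case it helps anyone checking their own CE after an upgrade, here is a rough sketch of how the edits discussed in this thread could be verified. It is only an illustration: the function names are made up, the file paths are passed in as arguments rather than hard-coded, and the fixed-string matches assume the lines appear exactly as quoted in this thread (spacing in a real file may differ).

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical; the strings searched for are
# the ones quoted in this thread and may differ in spacing in a real file.

# Succeeds if Condor.pm still has the old string initialisation of {nodes}.
condor_pm_needs_nodes_fix() {
    grep -qF '$lrms_jobs{$id}{nodes} = "";' "$1"
}

# Succeeds if submit-condor-job contains the '.*' form of the stdout match.
submit_script_has_wms_fix() {
    grep -qF "'.*_condor_stdout\$'" "$1"
}

# Prints the VO shares found in glue-generator.pl, one "vo:percent" per line.
list_published_shares() {
    sed -n 's/^[[:space:]]*GlueCECapability:[[:space:]]*Share=//p' "$1"
}

# Succeeds if the log contains the crash signature quoted by Gang.
log_shows_infoprovider_crash() {
    grep -qF "Can't use an undefined value as an ARRAY reference" "$1"
}
```

For example, `condor_pm_needs_nodes_fix /usr/share/arc/Condor.pm && echo "still unpatched"`. The location of infoprovider.log is not given in this thread, so pass whatever path your site uses.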
Cheers
Kashif
> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-
> [log in to unmask]] On Behalf Of RAUL H C LOPES
> Sent: 26 June 2015 16:07
> To: [log in to unmask]
> Subject: Re: Bug in /usr/share/arc/Condor.pm lead to ARC infoprovider crash
>
> Hi,
>
> I've only seen the infoprovider crash at Brunel once. As it was an isolated
> case, I assumed I had made some mistake. I wonder if RAL has seen it.
>
> The glue-generator.pl problem is an old one. I assumed that we're all
> patching it.
>
> raul
>
> On 26/06/15 15:21, qing wrote:
> > Hi, Kashif:
> >
> > At Glasgow we just rebuilt the ARC-CE from scratch because we are
> > also changing the OS from SL6.4 to CentOS6.6. Several issues come
> > to mind:
> >
> > 1. concerning the WMS problem, in /usr/share/arc/submit-condor-job,
> > you need to ensure in line 83 there is a '.' before '_condor_stdout$':
> >
> > if expr match "$joboption_stdout" '.*_condor_stdout$' > /dev/null; then
> >
> > otherwise single-core jobs from the WMS will run into problems. I think
> > the ARC team will include this fix in the next release.
> >
> > 2. Ensure that $lrms_jobs{$id}{nodes} = [] is set in Condor.pm to avoid
> > the infoprovider crash, as described in my previous message. The ARC
> > team will include this bug fix in the next release.
> >
> > 3. If you want to publish fairshare between VOs, you need to hack
> > /usr/share/arc/glue-generator.pl. At Glasgow we just added 3 lines
> > after the line
> > "GlueCECapability: CPUScalingReferenceSI00=$CPUSCALINGREFERENCESI00":
> >
> > GlueCECapability: Share=atlas:80
> > GlueCECapability: Share=lhcb:10
> > GlueCECapability: Share=other:10
> >
> > Cheers, Gang
> >
> > On 26/06/2015 14:50, RAUL H C LOPES wrote:
> >> Hi Kashif,
> >>
> >> I've got 3 Arc-CEs in production. All on 5.0. The only problem was
> >> that bug blocking submissions from WMS.
> >> Solved.
> >>
> >> Thanks, raul
> >>
> >> On 26/06/15 14:28, Kashif Mohammad wrote:
> >>> Hi
> >>>
> >>> On a related note, I am planning to upgrade from ARC 4.2 to ARC 5.0.
> >>> Is there anything I should be aware of? I have looked at the
> >>> release notes and it looks quite straightforward.
> >>>
> >>> Thanks
> >>>
> >>> Kashif
> >>>
> >>>> -----Original Message-----
> >>>> From: Testbed Support for GridPP member institutes [mailto:TB-
> >>>> [log in to unmask]] On Behalf Of qing
> >>>> Sent: 26 June 2015 12:02
> >>>> To: [log in to unmask]
> >>>> Subject: Bug in /usr/share/arc/Condor.pm lead to ARC infoprovider
> >>>> crash
> >>>>
> >>>> Dear all:
> >>>>
> >>>> Some of you might have noticed that the BDII on the Glasgow ARC-CEs
> >>>> sometimes disappears, which is due to random crashes of the ARC
> >>>> infoprovider.
> >>>>
> >>>> After discussing this with the NorduGrid ARC team, we understand
> >>>> that ARC does not handle some messages returned from Condor very
> >>>> well, which is why the infoprovider crashes appear random.
> >>>>
> >>>> To fix this bug, a line in /usr/share/arc/Condor.pm needs to be
> >>>> modified.
> >>>> For ARC version 5.0.0 it's line 550, and for ARC version 4.2.0-1,
> >>>> it's line 545.
> >>>>
> >>>> $lrms_jobs{$id}{nodes} = "";
> >>>>
> >>>> needs to be changed to:
> >>>>
> >>>> $lrms_jobs{$id}{nodes} = [];
> >>>>
> >>>> If you see "Can't use an undefined value as an ARRAY reference
> >>>> at /usr/share/arc/ARC0mod.pm line 135." in infoprovider.log, you
> >>>> are affected. Our site is heavily affected by this bug: the
> >>>> infoprovider on our ARC-CEs crashes many times a day. We applied
> >>>> this change yesterday morning, and during the past 24 hours, with
> >>>> the site fully loaded, the infoprovider has not crashed a single
> >>>> time on any of the 4 ARC-CEs. This assures me that the change fixed
> >>>> the bug. However, since the crash happens randomly, the situation
> >>>> may be different between sites, so I leave it to you to decide
> >>>> whether to apply this bug fix or not.
> >>>>
> >>>> Cheers, Gang