On Thu, 17 Nov 2016, Alessandra Forti wrote:
> It just occurred to me that even though you didn't change the BDII on purpose
> you changed the system moving from CREAM to ARC as well as changing the WNs.
>
> The numbers in REBUS started to fluctuate wildly in May; before that they
> were stable at 43243/2912 = 14.85. I've asked the REBUS people to check, but
> that might explain some of the differences.
>
Right, while what I mentioned before was just for the ARC CE, we also had
back then the cream ce for the old cluster (which doesn't exist anymore
now).
cheers,
Marcus
> cheers
> alessandra
>
>
> On 17/11/2016 14:33, Alessandra Forti wrote:
>> Hi Marcus,
>>
>> thanks for confirming. It is still not clear to me why REBUS sees these
>> wild variations for ECDF. I'll try to get an answer from them.
>>
>> cheers
>> alessandra
>>
>>
>> On 17/11/2016 14:20, Marcus Ebert wrote:
>> > Hi Alessandra,
>> >
>> >
>> > On Thu, 17 Nov 2016, Alessandra Forti wrote:
>> >
>> > > size of the site is manually inserted in the BDII. I agree in ECDF it
>> > > is variable but you really should put a meaningful value that averages
>> > > to a meaningful HS06 number. I thought you did that but ECDF is red
>> > > again. This time APEL is bigger than ATLAS. You seem to change the
>> > > capacity in the BDII every month [1] can you confirm that? You should
>> > > put values whose ratio is ~HS06 you publish.
>> > >
>> > No, I don't think it was changed every month. It was changed in October
>> > to make it consistent between the 2 numbers we report and to reflect the
>> > current worker node systems we run on (ringfenced nodes, general ECDF
>> > cluster, Openstack - all with different HepSpec and job slots/cores).
>> > (see below)
>> > This value should now reflect the different systems we are running on
>> > to a very good approximation.
>> >
>> > > aforti@vm7>site=UKI-SCOTGRID-ECDF; ldapsearch -LLL -x -h
>> > > top-bdii.tier2.hep.manchester.ac.uk:2170 -b
>> > > "mds-vo-name=${site},mds-vo-name=local,o=grid" | perl -p00e 's/\r?\n
>> > > //g'|egrep -i 'bench|spec|logical'
>> > > GlueHostBenchmarkSF00: 0
>> > > GlueHostBenchmarkSI00: 0
>> > > objectClass: GlueHostBenchmark
>> > > GlueHostProcessorOtherDescription: Cores=8, Benchmark=12.9-HEP-SPEC06
>> > > GlueSubClusterLogicalCPUs: 528
>> > >
>> > That's the updated correct one. It was updated in October, so I think we
>> > should wait for the November numbers once the whole month is over.
>> > Cores and Hepspec are averaged over the different systems we really run
>> > on, taking the different number of cores/machines into account.
>> >
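A minimal sketch of the averaging described above, assuming it is a core-weighted mean. The node classes and all their figures are invented placeholders, not the real ECDF inventory; the thread gives only the resulting published values (Cores=8, Benchmark=12.9-HEP-SPEC06).

```python
from dataclasses import dataclass


@dataclass
class NodeClass:
    """One group of identical worker nodes (hypothetical structure)."""
    name: str
    machines: int
    cores_per_machine: int
    hs06_per_core: float

    @property
    def cores(self) -> int:
        return self.machines * self.cores_per_machine


def weighted_hs06_per_core(classes):
    """HS06 per core averaged over all node classes, weighted by core count."""
    total_cores = sum(c.cores for c in classes)
    if total_cores == 0:
        raise ValueError("no cores to average over")
    total_hs06 = sum(c.cores * c.hs06_per_core for c in classes)
    return total_hs06 / total_cores


def average_cores_per_machine(classes):
    """Mean number of cores per machine across all node classes."""
    total_machines = sum(c.machines for c in classes)
    if total_machines == 0:
        raise ValueError("no machines to average over")
    return sum(c.cores for c in classes) / total_machines


# Placeholder inventory, made up purely for illustration.
EXAMPLE_CLASSES = [
    NodeClass("ringfenced", machines=10, cores_per_machine=8, hs06_per_core=10.0),
    NodeClass("cloud", machines=5, cores_per_machine=16, hs06_per_core=14.0),
]

if __name__ == "__main__":
    print(f"HS06 per core: {weighted_hs06_per_core(EXAMPLE_CLASSES):.2f}")
    print(f"cores per machine: {average_cores_per_machine(EXAMPLE_CLASSES):.2f}")
```

Weighting by cores rather than by machines keeps a few large nodes from being under-counted, which matters when the systems have different job slots per machine.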
>> >
>> > > ATM REBUS reports weird stuff not corresponding to 12.9
>> > >
>> > > October: 111945/9570 = 11.69 <-- atlas claims 11.884 until August
>> > > included
>> > > September: 74195/7040 = 10.54 <-- atlas see 10.5 from September
>> > > onward in line with these numbers
>> > > October: 76167/7291 = 10.44 <-- similar enough
>> > > November: 6811/528 = 12.89 <-- this is ok if ATLAS sees it, but I
>> > > suspect numbers are not updated that often and it might be a
>> > > discrepancy again.
>> > >
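The ratio checks in the list above are plain division: capacity in HS06 over logical CPUs should give the published HS06 per core. A small sketch of that check, using only figures quoted in this thread; the function names and the 2% tolerance are assumptions for illustration, not REBUS behaviour.

```python
# Published benchmark, from "Benchmark=12.9-HEP-SPEC06" in the ldapsearch output.
PUBLISHED_HS06_PER_CORE = 12.9

# (capacity in HS06, logical CPUs) as quoted in the thread.
REBUS_NUMBERS = {
    "September": (74195, 7040),
    "October": (76167, 7291),
    "November": (6811, 528),
}


def implied_hs06(capacity_hs06, logical_cpus):
    """HS06 per core implied by a capacity / logical CPU pair."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return capacity_hs06 / logical_cpus


def is_consistent(capacity_hs06, logical_cpus, published, rel_tol=0.02):
    """True if the implied ratio is within rel_tol of the published value.

    rel_tol is an arbitrary illustrative threshold.
    """
    ratio = implied_hs06(capacity_hs06, logical_cpus)
    return abs(ratio - published) <= rel_tol * published


def expected_capacity(logical_cpus, published):
    """Capacity one would expect if the published HS06 per core applied."""
    return logical_cpus * published


if __name__ == "__main__":
    for month, (capacity, cpus) in REBUS_NUMBERS.items():
        ratio = implied_hs06(capacity, cpus)
        ok = is_consistent(capacity, cpus, PUBLISHED_HS06_PER_CORE)
        print(f"{month}: {capacity}/{cpus} = {ratio:.2f} "
              f"({'ok' if ok else 'mismatch'})")
```

With the thread's figures, only the November pair comes out within tolerance of 12.9; September and October sit near 10.5.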
>> > Atlas sees 10.5 because that's what we reported by mistake. We didn't
>> > update the Glue value, only the one for APEL, when we added new
>> > worker nodes. 10.5 was the wrong value, too low. Since we have now
>> > updated the APEL and GLUE values to be consistent, there should be no
>> > reason/possibility that ATLAS sees something different for November.
>> >
>> > > so there are 3 points here
>> > >
>> > > 1) Do you update your numbers to maintain the HS06 ratio in the BDII
>> > > consistently? I don't think changing numbers monthly is a good idea
>> > > but they should at least match the HS06 value.
>> > No, we don't change monthly.
>> > We only looked into it because of the discrepancy you reported and found
>> > that a) the 2 different values we report, the APEL and Glue ones, were
>> > not consistent with each other, and b) neither reflected the new hardware
>> > we have been running on for a while for the SL6 analysis queue.
>> > That's why it was changed in October. Before that, I think the last
>> > change was in July when we got new machines to run on (configured
>> > differently for job slots than our ringfenced nodes, which made a change
>> > necessary).
>> > The change in October reflected the addition of the Openstack nodes for
>> > the SL6 queue.
>> >
>> > > 2) If you do that, why is REBUS reporting a different set of numbers?
>> > > For example, I'd expect October 7291*12.9 = 94053, not 76167.
>> > We don't do that.
>> > It was changed in October, so probably that's why it's different since
>> > it was not the same for the whole month?
>> > I would expect that November onwards it should now correspond to 12.9
>> >
>> > > 3) ATLAS doesn't seem to update the HS06 often enough to have such
>> > > frequent changes. And TBF most sites usually don't change their size
>> > > every month.
>> > >
>> > As I said, we also don't do that.
>> >
>> >
>> > I think we should wait until the end of November to see if it will be
>> > green then and consistent.
>> > In any case, we will look through the published data using the scripts
>> > you published to make sure it will be consistent in the future.
>> >
>> >
>> > Cheers,
>> > Marcus
>> >
>> > > [2] http://tinyurl.com/j2fylyx
>> > >
>> > > On 17/11/2016 12:05, Marcus Ebert wrote:
>> > > > Thanks Alessandra,
>> > > >
>> > > > I think I understand now, also from previous discussions in the list
>> > > > here.
>> > > > Basically, it only tests if 2 values published by a site, both defined
>> > > > in the bdii and put in manually by the site, agree or not, but doesn't
>> > > > say anything about the correctness of the HEPSPEC value used.
>> > > > So it seems what really meaningfully can be compared is just the
>> > > > wallclock work from Atlas and APEL, if it's not scaled at a site.
>> > > >
>> > > > Wouldn't it be better then to split the plot in 2 different ones,
>> > > > - one for the ratio of wallclock hours Atlas/APEL to have a site check
>> > > >   that both values published are consistent, and
>> > > > - second one only for the wallclock work ratio Atlas/APEL to see any
>> > > >   differences between the reported wallclock work in APEL and the
>> > > >   ATLAS records?
>> > > >
>> > > > If it shows for example "red" right now, it's not obvious just from the
>> > > > plot which of the 2 numbers are the problem.
>> > > >
>> > > >
>> > > > Cheers,
>> > > > Marcus
>> > > >
>> > > > On Tue, 8 Nov 2016, Alessandra Forti wrote:
>> > > >
>> > > > > Hi Marcus,
>> > > > >
>> > > > > > Thanks, I think I nearly understand it now. To fully understand,
>> > > > > > could you please explain how HS06 in Atlas wallclock work is
>> > > > > > determined? It isn't the same that is used in APEL wallclock
>> > > > > > work, is it?
>> > > > >
>> > > > > the presentation I gave yesterday at the HEPSYSMAN gives the details
>> > > > >
>> > > > > https://indico.cern.ch/event/577279/contributions/2353919/attachments/1367099/2071452/20161107_hepsysman-accounting.pdf
>> > > > >
>> > > > > in the specific today I've also started an FAQ
>> > > > >
>> > > > > https://twiki.cern.ch/twiki/bin/view/LCG/AccountingFAQ#How_are_the_ATLAS_numbers_in_SSB
>> > > > >
>> > > > > cheers
>> > > > > alessandra
>> > > > >
>> > > > > On 01/11/2016 09:52, Marcus Ebert wrote:
>> > > > > > Hi Alessandra,
>> > > > > >
>> > > > > > On Tue, 1 Nov 2016, Alessandra Forti wrote:
>> > > > > >
>> > > > > > > > > I'm not sure if I understand it or if it makes sense that way:
>> > > > > > > > > Basically what you are saying is that the initial number values
>> > > > > > > > > "HS06 on the atlas dashboard, HS06 in APEL, ratio, wallclock in
>> > > > > > > > > ATLAS, wallclock in APEL, wallclock ratio"
>> > > > > > > > > are really
>> > > > > > > > > "wallclock work in the Atlas, wallclock work in APEL, ratio,
>> > > > > > > > > wallclock work in Atlas (unscaled), wallclock work in APEL
>> > > > > > > > > (maybe scaled)",
>> > > > > > > > > isn't it?
>> > > > > > > the fields are
>> > > > > > >
>> > > > > > > > ATLAS wallclock work (HS06*hours), APEL wallclock work
>> > > > > > > > (HS06*hours), ratio, ATLAS wallclock (hours), APEL wallclock
>> > > > > > > > (hours maybe internally scale), ratio
>> > > > > >
>> > > > > > Thanks, I think I nearly understand it now. To fully understand,
>> > > > > > could you please explain how HS06 in Atlas wallclock work is
>> > > > > > determined? It isn't the same that is used in APEL wallclock
>> > > > > > work, is it?
>> > > > > >
>> > > > > > Cheers,
>> > > > > > Marcus
>> > >
>> > >
>> >
>>
>
>
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.