On Thu, 17 Nov 2016, Alessandra Forti wrote:

> It just occurred to me that even though you didn't change the BDII on purpose
> you changed the system moving from CREAM to ARC as well as changing the WNs.
>
> The numbers in REBUS started to fluctuate wildly in May, before they were
> stable on 43243/2912 = 14.85. I've asked REBUS people to check, but that
> might explain some of the differences.
>

Right, while what I mentioned before was just for the ARC CE, we also had
back then the CREAM CE for the old cluster (which doesn't exist anymore now).

cheers,
Marcus

> cheers
> alessandra
>
>
> On 17/11/2016 14:33, Alessandra Forti wrote:
>> Hi Marcus,
>>
>> thanks for confirming. It is still not clear to me why REBUS sees these
>> wild variations for ECDF. I'll try to get an answer from them.
>>
>> cheers
>> alessandra
>>
>>
>> On 17/11/2016 14:20, Marcus Ebert wrote:
>> > Hi Alessandra,
>> >
>> >
>> > On Thu, 17 Nov 2016, Alessandra Forti wrote:
>> >
>> > > size of the site is manually inserted in the BDII. I agree in ECDF it
>> > > is variable but you really should put a meaningful value that averages
>> > > to a meaningful HS06 number. I thought you did that but ECDF is red
>> > > again. This time APEL is bigger than ATLAS. You seem to change the
>> > > capacity in the BDII every month [1] can you confirm that? You should
>> > > put values whose ratio is ~HS06 you publish.
>> > >
>> > No, I don't think it was changed every month. It was changed in October
>> > to make it consistent between the 2 numbers we report and to reflect the
>> > current worker node systems we run on (ringfenced nodes, general ECDF
>> > cluster, Openstack - all with different HepSpec and job slots/cores).
>> > (see below)
>> > This value should reflect the different systems we are running on in
>> > very good approximation now.
>> >
>> > > aforti@vm7>site=UKI-SCOTGRID-ECDF; ldapsearch -LLL -x -h top-bdii.tier2.hep.manchester.ac.uk:2170 -b "mds-vo-name=${site},mds-vo-name=local,o=grid" | perl -p00e 's/\r?\n //g' | egrep -i 'bench|spec|logical'
>> > > GlueHostBenchmarkSF00: 0
>> > > GlueHostBenchmarkSI00: 0
>> > > objectClass: GlueHostBenchmark
>> > > GlueHostProcessorOtherDescription: Cores=8, Benchmark=12.9-HEP-SPEC06
>> > > GlueSubClusterLogicalCPUs: 528
>> > >
>> > That's the updated correct one. It was updated in October, so I think we
>> > should wait for the November numbers once the whole month is over.
>> > Cores and Hepspec are averaged over the different systems taking the
>> > different number of cores/machines into account we really run on.
>> >
>> >
>> > > ATM REBUS reports weird stuff not corresponding to 12.9
>> > >
>> > > October: 111945/9570 = 11.69 <-- atlas claims 11.884 until August
>> > > included
>> > > September: 74195/7040 = 10.54 <-- atlas sees 10.5 from September
>> > > onward in line with these numbers
>> > > October: 76167/7291 = 10.44 <-- similar enough
>> > > November: 6811/528 = 12.89 <-- this is ok if ATLAS sees it, but I
>> > > suspect numbers are not updated that often and it might be a
>> > > discrepancy again.
>> > >
>> > Atlas sees 10.5 because that's what we by mistake reported. We didn't
>> > update the Glue value and only the one for APEL when we added new
>> > worker nodes. 10.5 was the wrong, too low value. Since we updated now
>> > the APEL and GLUE value to be consistent, there should be no
>> > reason/possibility that ATLAS sees something different for November.
>> >
>> > > so there are 3 points here
>> > >
>> > > 1) Do you update your numbers to maintain the HS06 ratio in the BDII
>> > > consistently? I don't think changing numbers monthly is a good idea
>> > > but they should at least match the HS06 value.
>> > No, we don't change monthly.
>> > We only looked into it because of the discrepancy you reported and found
>> > a) that the 2 different values we report, Apel and Glue one, are
>> > not consistent with each other, b) both don't reflect the new hardware we
>> > are running on since a while for the SL6 analysis queue.
>> > That's why it was changed in October. Before I think the last change was
>> > in July when we got new machines to run on (differently configured for
>> > job slots than our ringfenced nodes which made a change necessary)
>> > The change in October reflected the addition of the Openstack nodes for
>> > the SL6 queue.
>> >
>> > > 2) If you do that why rebus is reporting a different set of numbers
>> > > for example I'd expect October 7291*12.9 = 94053 not 76167
>> > We don't do that.
>> > It was changed in October, so probably that's why it's different since
>> > it was not the same for the whole month?
>> > I would expect that November onwards it should now correspond to 12.9
>> >
>> > > 3) ATLAS doesn't seem to update the HS06 often enough to have such
>> > > frequent changes. And TBF most sites usually don't change their size
>> > > every month.
>> > >
>> > As I said, we also don't do that.
>> >
>> >
>> > I think we should wait until the end of November to see if it will be
>> > green then and consistent.
>> > In any case, we will look through the published data using the scripts
>> > you published to make sure it will be consistent in the future.
>> >
>> >
>> > Cheers,
>> > Marcus
>> >
>> > > [2] http://tinyurl.com/j2fylyx
>> > >
>> > > On 17/11/2016 12:05, Marcus Ebert wrote:
>> > > > Thanks Alessandra,
>> > > >
>> > > > I think I understand now, also from previous discussions in the list
>> > > > here.
>> > > > Basically, it only tests if 2 values published by a site, both
>> > > > defined in the bdii and put in manually by the site, agree or not,
>> > > > but doesn't say anything about the correctness of the HEPSPEC value
>> > > > used.
>> > > > So it seems what really meaningfully can be compared is just the
>> > > > wallclock work from Atlas and APEL, if it's not scaled at a site.
>> > > >
>> > > > Wouldn't it be better then to split the plot in 2 different ones,
>> > > > - one for the ratio of wallclock hours Atlas/APEL to have a site
>> > > >   check that both values published are consistent, and
>> > > > - second one only for the wallclock work ratio Atlas/APEL to see
>> > > >   any differences between the reported wallclock work in APEL and
>> > > >   the ATLAS records?
>> > > >
>> > > > If it shows for example "red" right now, it's not obvious just from
>> > > > the plot which of the 2 numbers are the problem.
>> > > >
>> > > >
>> > > > Cheers,
>> > > > Marcus
>> > > >
>> > > > On Tue, 8 Nov 2016, Alessandra Forti wrote:
>> > > >
>> > > > > Hi Marcus,
>> > > > >
>> > > > > > Thanks, I think I nearly understand it now. To fully
>> > > > > > understand, could you please explain how HS06 in Atlas
>> > > > > > wallclock work is determined? It isn't the same that is used
>> > > > > > in APEL wallclock work, is it?
>> > > > >
>> > > > > the presentation I gave yesterday at the HEPSYSMAN gives the
>> > > > > details
>> > > > >
>> > > > > https://indico.cern.ch/event/577279/contributions/2353919/attachments/1367099/2071452/20161107_hepsysman-accounting.pdf
>> > > > >
>> > > > > in the specific today I've also started an FAQ
>> > > > >
>> > > > > https://twiki.cern.ch/twiki/bin/view/LCG/AccountingFAQ#How_are_the_ATLAS_numbers_in_SSB
>> > > > >
>> > > > > cheers
>> > > > > alessandra
>> > > > >
>> > > > > On 01/11/2016 09:52, Marcus Ebert wrote:
>> > > > > > Hi Alessandra,
>> > > > > >
>> > > > > > On Tue, 1 Nov 2016, Alessandra Forti wrote:
>> > > > > >
>> > > > > > > > I'm not sure if I understand it or if it makes sense
>> > > > > > > > that way:
>> > > > > > > > Basically what you are saying is that the initial number
>> > > > > > > > values
>> > > > > > > > "HS06 on the atlas dashboard, HS06 in APEL, ratio,
>> > > > > > > > wallclock in ATLAS, wallclock in APEL, wallclock ratio"
>> > > > > > > > are really
>> > > > > > > > "wallclock work in the Atlas, wallclock work in APEL,
>> > > > > > > > ratio, wallclock work in Atlas (unscaled), wallclock
>> > > > > > > > work in APEL (maybe scaled)",
>> > > > > > > > isn't it?
>> > > > > > > the fields are
>> > > > > > >
>> > > > > > > ATLAS wallclock work (HS06*hours), APEL wallclock work
>> > > > > > > (HS06*hours), ratio, ATLAS wallclock (hours), APEL
>> > > > > > > wallclock (hours, maybe internally scaled), ratio
>> > > > > > >
>> > > > > > Thanks, I think I nearly understand it now. To fully
>> > > > > > understand, could you please explain how HS06 in Atlas
>> > > > > > wallclock work is determined? It isn't the same that is used
>> > > > > > in APEL wallclock work, is it?
>> > > > > >
>> > > > > > Cheers,
>> > > > > > Marcus

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
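
For anyone wanting to re-check the arithmetic discussed in this thread, here is a minimal Python sketch. It only divides the HS06 capacity by the logical CPU count for the number pairs quoted above and compares the result with the published Benchmark=12.9-HEP-SPEC06 value. The function names and the 5% tolerance are assumptions made for illustration; they are not taken from REBUS, APEL, or the scripts mentioned in the thread.

```python
# Sketch only: checks whether HS06 capacity / logical CPUs matches the
# published per-core HS06 value. Names and tolerance are hypothetical.

# (HS06 capacity, logical CPUs) pairs exactly as quoted in the thread.
REBUS_PAIRS = [
    (43243, 2912),
    (111945, 9570),
    (74195, 7040),
    (76167, 7291),
    (6811, 528),
]

# From the BDII query output: Benchmark=12.9-HEP-SPEC06
PUBLISHED_HS06_PER_CORE = 12.9


def hs06_per_core(hs06_capacity, logical_cpus):
    """Return the HS06-per-core value implied by a capacity/CPU pair."""
    return hs06_capacity / logical_cpus


def is_consistent(hs06_capacity, logical_cpus, published, tolerance=0.05):
    """True if the implied ratio is within a relative tolerance of the
    published value. The 5% default is an arbitrary illustrative choice."""
    implied = hs06_per_core(hs06_capacity, logical_cpus)
    return abs(implied - published) <= tolerance * published


if __name__ == "__main__":
    for hs06, cpus in REBUS_PAIRS:
        ratio = hs06_per_core(hs06, cpus)
        ok = is_consistent(hs06, cpus, PUBLISHED_HS06_PER_CORE)
        status = "consistent" if ok else "mismatch"
        print(f"{hs06}/{cpus} = {ratio:.2f} ({status} with 12.9)")
```

With these inputs only the 6811/528 pair falls within the assumed tolerance of 12.9, which matches the expectation in the thread that the numbers should line up from November onwards.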