JISCMail - TB-SUPPORT Archives

On Thu, 17 Nov 2016, Alessandra Forti wrote:

> It just occurred to me that even though you didn't change the BDII on purpose, 
> you changed the system, moving from CREAM to ARC as well as changing the WNs.
>
> The numbers in REBUS started to fluctuate wildly in May; before that they were 
> stable at 43243/2912 = 14.85. I've asked the REBUS people to check, but that 
> might explain some of the differences.
>
Right. What I mentioned before was just for the ARC CE, but back then we also 
had the CREAM CE for the old cluster (which no longer exists).

cheers,
  Marcus

> cheers
> alessandra
>
>
> On 17/11/2016 14:33, Alessandra Forti wrote:
>>  Hi Marcus,
>>
>>  thanks for confirming. It is still not clear to me why REBUS sees these
>>  wild variations for ECDF. I'll try to get an answer from them.
>>
>>  cheers
>>  alessandra
>> 
>>
>>  On 17/11/2016 14:20, Marcus Ebert wrote:
>> >  Hi Alessandra,
>> > 
>> > 
>> >  On Thu, 17 Nov 2016, Alessandra Forti wrote:
>> > 
>> > >  size of the site is manually inserted in the BDII. I agree that in ECDF 
>> > >  it is variable, but you really should put in a value that averages to a 
>> > >  meaningful HS06 number. I thought you did that, but ECDF is red 
>> > >  again. This time APEL is bigger than ATLAS. You seem to change the 
>> > >  capacity in the BDII every month [1]; can you confirm that? You should 
>> > >  put in values whose ratio is close to the HS06 you publish.
>> > > 
>> >  No, I don't think it was changed every month. It was changed in October 
>> >  to make the 2 numbers we report consistent with each other and to reflect 
>> >  the current worker node systems we run on (ringfenced nodes, general ECDF 
>> >  cluster, Openstack - all with different HepSpec and job slots/cores); 
>> >  see below.
>> >  This value should now reflect the different systems we are running on to 
>> >  a very good approximation.
>> > 
>> > >  aforti@vm7>site=UKI-SCOTGRID-ECDF; ldapsearch -LLL -x -h 
>> > >  top-bdii.tier2.hep.manchester.ac.uk:2170 -b 
>> > >  "mds-vo-name=${site},mds-vo-name=local,o=grid" | 
>> > >  perl -p00e 's/\r?\n //g' | egrep -i 'bench|spec|logical'
>> > >  GlueHostBenchmarkSF00: 0
>> > >  GlueHostBenchmarkSI00: 0
>> > >  objectClass: GlueHostBenchmark
>> > >  GlueHostProcessorOtherDescription: Cores=8, Benchmark=12.9-HEP-SPEC06
>> > >  GlueSubClusterLogicalCPUs: 528
>> > > 
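The two published values above can be cross-checked with a short script. This is a minimal sketch, assuming the unfolded ldapsearch output is available as text; the helper name `parse_glue` is hypothetical, and the attribute lines are copied from the output quoted above.

```python
import re

# Unfolded ldapsearch output as quoted in the thread.
LDIF = """\
GlueHostBenchmarkSF00: 0
GlueHostBenchmarkSI00: 0
objectClass: GlueHostBenchmark
GlueHostProcessorOtherDescription: Cores=8, Benchmark=12.9-HEP-SPEC06
GlueSubClusterLogicalCPUs: 528
"""


def parse_glue(text):
    """Return (hs06_per_core, logical_cpus) from Glue attribute lines.

    Hypothetical helper: it only understands the two attributes shown above.
    """
    bench = re.search(r"Benchmark=([0-9.]+)-HEP-SPEC06", text)
    cpus = re.search(r"^GlueSubClusterLogicalCPUs:\s*(\d+)", text, re.MULTILINE)
    if bench is None or cpus is None:
        raise ValueError("required Glue attributes not found")
    return float(bench.group(1)), int(cpus.group(1))


hs06_per_core, logical_cpus = parse_glue(LDIF)
# Total capacity implied by the two published values.
total_hs06 = hs06_per_core * logical_cpus
print(f"{logical_cpus} logical CPUs x {hs06_per_core} HS06 = {total_hs06:.1f} HS06")
```

With the values quoted above this gives 528 x 12.9 = 6811.2 HS06, which agrees with the 6811/528 pair listed for November further down.
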
>> >  That's the updated, correct one. It was updated in October, so I think we 
>> >  should wait for the November numbers once the whole month is over.
>> >  Cores and HepSpec are averaged over the different systems we actually run 
>> >  on, taking into account the different numbers of cores and machines.
>> > 
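A minimal sketch of the kind of averaging described above, assuming each node class is weighted by the cores it contributes. The machine counts, core counts and HS06 values below are invented for illustration and are not ECDF's real inventory; `averaged_values` is a hypothetical helper.

```python
# Hypothetical node classes: (name, machines, cores per machine, HS06 per core).
# All figures are invented for illustration only.
NODE_CLASSES = [
    ("ringfenced", 10, 16, 10.0),
    ("general", 20, 8, 12.0),
    ("openstack", 5, 8, 16.0),
]


def averaged_values(classes):
    """Return (avg cores per machine, core-weighted HS06 per core, total cores)."""
    machines = sum(n for _, n, _, _ in classes)
    cores = sum(n * c for _, n, c, _ in classes)
    # Weight each class's HS06 by the number of cores it contributes.
    hs06_total = sum(n * c * h for _, n, c, h in classes)
    return cores / machines, hs06_total / cores, cores


avg_cores, avg_hs06, total_cores = averaged_values(NODE_CLASSES)
print(f"average cores per machine: {avg_cores:.1f}")
print(f"average HS06 per core:     {avg_hs06:.2f}")
print(f"total logical CPUs:        {total_cores}")
```

Weighting by cores rather than by machines means that total logical CPUs times the averaged HS06 value reproduces the summed HS06 of all node classes.
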
>> > 
>> > >  ATM REBUS reports weird stuff not corresponding to 12.9
>> > > 
>> > >  October: 111945/9570 = 11.69 <-- ATLAS claims 11.884 up to and 
>> > >  including August
>> > >  September: 74195/7040 = 10.54 <-- ATLAS sees 10.5 from September 
>> > >  onward, in line with these numbers
>> > >  October: 76167/7291 = 10.44 <-- similar enough
>> > >  November: 6811/528 = 12.89 <-- this is ok if ATLAS sees it, but I 
>> > >  suspect the numbers are not updated that often and it might be a 
>> > >  discrepancy again.
>> > > 
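The ratios listed above can be recomputed directly from the quoted capacity and logical CPU pairs. A minimal sketch, using only the numbers from the list; the labels and the 0.1 tolerance are assumptions for illustration.

```python
PUBLISHED_HS06 = 12.9  # Benchmark value published in the BDII
TOLERANCE = 0.1        # assumed threshold for "close enough", illustrative only

# (label, total HS06, logical CPUs), copied from the list above.
REBUS = [
    ("October (first entry)", 111945, 9570),
    ("September", 74195, 7040),
    ("October (second entry)", 76167, 7291),
    ("November", 6811, 528),
]

# HS06 per logical CPU for each entry.
ratios = {label: hs06 / cpus for label, hs06, cpus in REBUS}

for label, ratio in ratios.items():
    status = "ok" if abs(ratio - PUBLISHED_HS06) <= TOLERANCE else "mismatch"
    print(f"{label:24s} {ratio:6.2f} HS06 per logical CPU  {status}")
```

Only the November pair lands within the assumed tolerance of the published 12.9.
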
>> >  Atlas sees 10.5 because that's what we reported by mistake. When we added 
>> >  new worker nodes we updated only the APEL value and not the Glue one. 
>> >  10.5 was the wrong value, too low. Since we have now updated the APEL and 
>> >  Glue values to be consistent, there should be no reason or possibility 
>> >  for ATLAS to see something different for November.
>> > 
>> > >  so there are 3 points here
>> > > 
>> > >  1) Do you update your numbers to maintain the HS06 ratio in the BDII 
>> > >  consistently? I don't think changing numbers monthly is a good idea 
>> > >  but they should at least match the HS06 value.
>> >  No, we don't change monthly.
>> >  We only looked into it because of the discrepancy you reported, and found 
>> >  that a) the 2 different values we report, the APEL and the Glue one, were 
>> >  not consistent with each other, and b) neither reflected the new hardware 
>> >  we have been running on for a while for the SL6 analysis queue.
>> >  That's why it was changed in October. Before that, I think the last change 
>> >  was in July, when we got new machines to run on (configured differently 
>> >  for job slots than our ringfenced nodes, which made a change necessary).
>> >  The change in October reflected the addition of the Openstack nodes for 
>> >  the SL6 queue.
>> > 
>> > >  2) If you do that, why is REBUS reporting a different set of numbers? 
>> > >  For example, I'd expect October 7291*12.9 = 94053, not 76167
>> >  We don't do that.
>> >  It was changed in October, so that's probably why it's different, since 
>> >  it was not the same for the whole month?
>> >  I would expect that from November onwards it should correspond to 12.9.
>> > 
>> > >  3) ATLAS doesn't seem to update the HS06 often enough to have such 
>> > >  frequent changes. And TBF most sites usually don't change their size 
>> > >  every month.
>> > > 
>> >  As I said, we also don't do that.
>> > 
>> > 
>> >  I think we should wait until the end of November to see if it will be 
>> >  green and consistent then.
>> >  In any case, we will look through the published data using the scripts 
>> >  you published to make sure it will be consistent in the future.
>> > 
>> > 
>> >  Cheers,
>> >   Marcus
>> > 
>> > >  [2] http://tinyurl.com/j2fylyx
>> > > 
>> > >  On 17/11/2016 12:05, Marcus Ebert wrote:
>> > > >   Thanks Alessandra,
>> > > > 
>> > > >  I think I understand now, also from previous discussions on the list 
>> > > >  here.
>> > > >  Basically, it only tests whether 2 values published by a site, both 
>> > > >  defined in the BDII and put in manually by the site, agree or not, 
>> > > >  but doesn't say anything about the correctness of the HEPSPEC value 
>> > > >  used.
>> > > >  So it seems that what can really be meaningfully compared is just 
>> > > >  the wallclock work from Atlas and APEL, if it's not scaled at a site.
>> > > > 
>> > > >  Wouldn't it be better then to split the plot into 2 different ones,
>> > > >  - one for the ratio of wallclock hours Atlas/APEL, to have a site 
>> > > >    check that both values published are consistent, and
>> > > >  - a second one only for the wallclock work ratio Atlas/APEL, to see 
>> > > >    any differences between the reported wallclock work in APEL and 
>> > > >    the ATLAS records?
>> > > > 
>> > > >  If it shows for example "red" right now, it's not obvious just from 
>> > > >  the plot which of the 2 numbers is the problem.
>> > > > 
>> > > > 
>> > > >   Cheers,
>> > > >    Marcus
>> > > > 
>> > > >   On Tue, 8 Nov 2016, Alessandra Forti wrote:
>> > > > 
>> > > > >   Hi Marcus,
>> > > > >
>> > > > > >   Thanks, I think I nearly understand it now. To fully understand,
>> > > > > >   could you please explain how HS06 in Atlas wallclock work is
>> > > > > >   determined? It isn't the same that is used in APEL wallclock
>> > > > > >   work, is it?
>> > > > >
>> > > > >   the presentation I gave yesterday at the HEPSYSMAN gives the details
>> > > > >
>> > > > >   https://indico.cern.ch/event/577279/contributions/2353919/attachments/1367099/2071452/20161107_hepsysman-accounting.pdf
>> > > > >
>> > > > >   in the specific today I've also started an FAQ
>> > > > >
>> > > > >   https://twiki.cern.ch/twiki/bin/view/LCG/AccountingFAQ#How_are_the_ATLAS_numbers_in_SSB
>> > > > >
>> > > > >   cheers
>> > > > >   alessandra
>> > > > >
>> > > > >   On 01/11/2016 09:52, Marcus Ebert wrote:
>> > > > > >   Hi Alessandra,
>> > > > > >
>> > > > > >   On Tue, 1 Nov 2016, Alessandra Forti wrote:
>> > > > > > > >   I'm not sure if I understand it or if it makes sense that way:
>> > > > > > > >   Basically what you are saying is that the initial number values
>> > > > > > > >   "HS06 on the atlas dashboard, HS06 in APEL, ratio, wallclock in
>> > > > > > > >   ATLAS, wallclock in APEL, wallclock ratio"
>> > > > > > > >   are really
>> > > > > > > >   "wallclock work in the Atlas, wallclock work in APEL, ratio,
>> > > > > > > >   wallclock work in Atlas (unscaled), wallclock work in APEL
>> > > > > > > >   (maybe scaled)",
>> > > > > > > >   isn't it?
>> > > > > > >
>> > > > > > >   the fields are
>> > > > > > >
>> > > > > > >   ATLAS wallclock work (HS06*hours), APEL wallclock work
>> > > > > > >   (HS06*hours), ratio, ATLAS wallclock (hours), APEL wallclock
>> > > > > > >   (hours maybe internally scale), ratio
>> > > > > >
>> > > > > >   Thanks, I think I nearly understand it now. To fully understand,
>> > > > > >   could you please explain how HS06 in Atlas wallclock work is
>> > > > > >   determined? It isn't the same that is used in APEL wallclock
>> > > > > >   work, is it?
>> > > > > >
>> > > > > >   Cheers,
>> > > > > >    Marcus
>> > > 
>> > > 
>> > 
>> 
>
>
