Hello 

I wonder if we could clarify what the method is for the SAM tests used for the ATLAS metrics
(I presume Steve is on the tb support list so I don't cc him, but his input would be useful).

Specifically, I think it would be useful to know how often the tests are run and whether a simple OR is being used to calculate the availability.

If one looks at the history of CE_all_nagios, there are days that appear red, yet when one looks at each CE individually most seem green.

I suspect that if one CE has no test within a period and a test on another CE fails, then the whole period is counted as failed. This means that a site with one very "unreliable" CE can be heavily penalised in this setup, but it would be good to know whether this suspicion is correct (see the sketch below). The ops Nagios tests are run very frequently and therefore give a much more refined picture of availability, as well as better alerts. Is there a link to a Nagios portal for the ATLAS tests?
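To make the suspicion concrete, here is a minimal Python sketch. It is entirely hypothetical: the CE names, time bins and both aggregation rules are assumptions of mine, not the documented SAM algorithm. It just contrasts a "whole bin fails if any test in it fails" rule with a per-CE average, to show how one flaky CE dominates the first figure:

```python
from collections import defaultdict

# Hypothetical per-test results: (time_bin, ce_name, passed).
# ce2 here plays the role of the single "unreliable" CE.
results = [
    (0, "ce1.example.org", True),
    (0, "ce2.example.org", False),
    (1, "ce1.example.org", True),    # ce2 not tested in this bin
    (2, "ce2.example.org", False),   # ce1 not tested in this bin
    (3, "ce1.example.org", True),
    (3, "ce2.example.org", True),
]

def availability_simple_or(results):
    """Suspected scheme: a time bin counts as available only if no test in it failed."""
    tested = {t for t, _, _ in results}
    failed = {t for t, _, ok in results if not ok}
    return len(tested - failed) / len(tested)

def availability_per_ce(results):
    """Alternative: average each CE's own pass fraction, so one flaky CE
    only lowers the site figure in proportion."""
    per_ce = defaultdict(list)
    for _, ce, ok in results:
        per_ce[ce].append(ok)
    return sum(sum(v) / len(v) for v in per_ce.values()) / len(per_ce)

print(availability_simple_or(results))  # 0.50 -- bins 0 and 2 lost entirely to ce2
print(availability_per_ce(results))     # 0.67 -- average of 1.0 (ce1) and 1/3 (ce2)
```

Under the first rule the site loses every bin in which the unreliable CE happened to be tested and failed, even though the other CE was fine throughout, which is exactly the pattern of red days over mostly green CEs described above.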

Furthermore, tests can be counted even if a CE is switched off for Panda jobs, or is in fact perfectly fine for pilots but has some problem for WMS jobs.
It is very strange that the WMS mechanism is used for availability when jobs actually submitted through the WMS would not be counted in the jobs column.
I could identify several periods where a site was unavailable according to this metric but ran large numbers of jobs; will those periods be discounted from the availability column, since the test is clearly not modelling actual availability in those cases?

Was any decision made on whether to drop this column and just use jobs run? That makes the most sense to me: you can't run jobs if you are unavailable, so the real availability is implicitly folded in.

Wahid




