Oh great, another web page to look at. And it's all in JavaScript, so I
cannot even try to parse it and generate alerts internally.
Oh well, I guess I'll just have to wait to be ticketed.
Chris.
> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-
> [log in to unmask]] On Behalf Of Elena Korolkova
> Sent: 29 February 2012 14:43
> To: [log in to unmask]
> Subject: Re: ATLAS Production Functional Testing (PFT)
>
> Hi Chris
>
> you can check
> http://panda.cern.ch/server/pandamon/query?dash=prod
> http://panda.cern.ch/server/pandamon/query?dash=analysis.
>
> Click on UK cloud.
>
> The procedure for exclusion will be in place from tomorrow.
>
> Cloud support will follow up on the problems.
>
> Cheers
> Elena
> On 29 Feb 2012, at 14:23, Chris Brew wrote:
>
> > Hi Alastair,
> >
> > Is there an easy place I can see whether my site is online or offline for
> production and/or analysis?
> >
> > Is there any way I could query it programmatically, e.g. with Nagios?
> >
> > Thanks,
> > Chris.
> >
> > On 28 Feb 2012, at 12:53, "Alastair Dewhurst"
> <[log in to unmask]> wrote:
> >
> >> Hi
> >>
> >> ATLAS are introducing an automatic test for their production queues.
> >>
> >> There are 5 test jobs currently running.
> >> - One GEANT 4 simulation job running under three different ATLAS
> >> software release versions
> >> - One Reconstruction job
> >> - One Event Generation job
> >>
> >> Every 30 minutes 4 booleans are calculated:
> >> P1: Last three jobs from any single test have failed
> >> P2: Last two jobs from any single test and the last job from another
> >> test have failed
> >> P3: The last job from each of three separate tests has failed
> >> P4: Last two jobs from all tests have succeeded
> >>
> >> If (P1 || P2 || P3) the site will be blacklisted.
> >> If (!P1 && !P2 && !P3 && P4) the site will be unblacklisted.
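The four booleans and the rule above can be sketched in a few lines of Python. This is only an illustration of the logic as described in this message, not the code ATLAS or HammerCloud actually run. The function names (`last_n_failed`, `evaluate_site`), the test names in the usage note and the representation of each test's history as a list of booleans (True for a successful job, most recent last) are all assumptions made for the example.

```python
from typing import Dict, List


def last_n_failed(history: List[bool], n: int) -> bool:
    """True if there are at least n results and the last n all failed."""
    return len(history) >= n and not any(history[-n:])


def evaluate_site(results: Dict[str, List[bool]], current_state: str) -> str:
    """Return the new site state ('test' or 'online') from per-test histories.

    results maps a (hypothetical) test name to its job outcomes,
    oldest first, True meaning the job succeeded.
    """
    # Tests whose last two jobs failed, and tests whose last job failed.
    two_failed = [name for name, h in results.items() if last_n_failed(h, 2)]
    last_failed = [name for name, h in results.items() if last_n_failed(h, 1)]

    # P1: last three jobs from any single test have failed.
    p1 = any(last_n_failed(h, 3) for h in results.values())
    # P2: last two jobs from one test and the last job from another have failed.
    p2 = any(other != name for name in two_failed for other in last_failed)
    # P3: last job from three separate tests have all failed.
    p3 = len(last_failed) >= 3
    # P4: last two jobs from all tests have succeeded.
    p4 = bool(results) and all(
        len(h) >= 2 and all(h[-2:]) for h in results.values()
    )

    if p1 or p2 or p3:
        return "test"      # blacklisted: only test jobs are submitted
    if p4:                 # p1, p2 and p3 are all False at this point
        return "online"    # unblacklisted
    return current_state   # neither rule fires, so nothing changes
```

For example, with five tests named `g4_a`, `g4_b`, `g4_c`, `reco` and `evgen` (made-up names), a site where `reco` has failed its last two jobs and `evgen` its last one satisfies P2 and would be set to 'test'. A single failed job in one test triggers neither rule, so the site keeps whatever state it already had.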
> >>
> >> Any blacklisted site will be set to 'test' mode which means normal
> jobs will not be submitted but test jobs will continue.
> >>
> >> These test jobs have been submitted to sites since the end of last
> year. The blacklisting has not been switched on yet but will be very
> shortly (the French Cloud was switched on today). To view your site's
> jobs:
> >>
> >> http://panda.cern.ch/server/pandamon/query?job=*&type=&days=1&jobsetID=any&jobStatus=&site=&cplot=yes&plot=yes&processingType=gangarobot-pft&cplot=yes&cloud=UK
> >>
> >> Then click on your site. Or you can just modify the URL; Sheffield, for example, is:
> >>
> >> http://panda.cern.ch/server/pandamon/query?job=*&type=&days=1&jobsetID=any&jobStatus=&site=&cplot=yes&plot=yes&processingType=gangarobot-pft&cplot=yes&cloud=UK&computingSite=UKI-NORTHGRID-SHEF-HEP
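If you would rather generate the per-site URL than edit it by hand, something like the following Python sketch should do. It only assembles the string; the query parameters are copied from the URL above, while the function name `pft_jobs_url` and the constant names are made up for the example. Nothing here says anything about what the monitor page returns or whether it can be parsed.

```python
from urllib.parse import urlencode

# Base address and fixed parameters, taken from the URL quoted in this message.
PANDA_QUERY = "http://panda.cern.ch/server/pandamon/query"
FIXED_PARAMS = [
    ("job", "*"),
    ("type", ""),
    ("days", "1"),
    ("jobsetID", "any"),
    ("jobStatus", ""),
    ("site", ""),
    ("cplot", "yes"),
    ("plot", "yes"),
    ("processingType", "gangarobot-pft"),
    ("cplot", "yes"),  # appears twice in the original URL, kept as-is
    ("cloud", "UK"),
]


def pft_jobs_url(computing_site: str) -> str:
    """Build the PFT job-listing URL for one site (hypothetical helper)."""
    params = FIXED_PARAMS + [("computingSite", computing_site)]
    # safe="*" keeps the literal '*' in job=* instead of encoding it as %2A.
    return PANDA_QUERY + "?" + urlencode(params, safe="*")


print(pft_jobs_url("UKI-NORTHGRID-SHEF-HEP"))
```

A list of pairs is used rather than a dict so that the repeated `cplot` parameter and the original parameter order are preserved.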
> >>
> >> If you are experienced with using the panda monitor, the job types
> >> you are looking for are: processingType=gangarobot-pft
> >>
> >> To see if your site would be blacklisted you can check:
> >> http://hammercloud.cern.ch/hc/app/atlas/robot/incidents/?site=UKI-NORTHGRID-SHEF-HEP&severity=&q=&hours=
> >>
> >> The official place where ATLAS record queue changes is still here:
> >> http://panda.cern.ch/server/pandamon/query?mode=site&site=UKI-NORTHGRID-SHEF-HEP
> >>
> >>
> >>
> >> In addition to this, ATLAS are also developing a method to
> automatically blacklist a site when it declares a (scheduled) downtime
> in the GOCDB. Currently, if you declare a downtime of your SE, ATLAS
> will automatically blacklist your space tokens, preventing transfers
> from them while you are down. In development (and being tested by
> RAL) is a procedure that will also blacklist your site if you declare
> an outage on the CE.
> >>
> >> Currently, if an outage is declared we have to rely on shifters to
> blacklist and then test and unblacklist sites. What will happen is
> that when you declare a downtime, the production queues for the site
> will be set offline (for ATLAS) 12 hours before it starts. Six hours
> before, the ANALY queues will also be set offline. This actually works
> quite well, as the normally shorter-running analysis jobs will fill the
> site's farm and waste as little CPU as possible before a downtime. When
> we tried this at RAL there were about 230 jobs still in the farm when
> our downtime started, and these were killed. This was less than 10% of
> the ATLAS jobs running 12 hours before.
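To make the timing concrete, here is a small Python sketch that works out when each queue type would go offline for a given downtime start. The 12-hour and 6-hour lead times are the ones quoted above; the function name `offline_schedule`, the dictionary keys and the example date are assumptions for illustration and do not reflect how the ATLAS procedure is implemented.

```python
from datetime import datetime, timedelta

# Lead times quoted in this message.
PRODUCTION_LEAD = timedelta(hours=12)
ANALYSIS_LEAD = timedelta(hours=6)


def offline_schedule(downtime_start: datetime) -> dict:
    """Return when each queue type would be set offline (hypothetical helper)."""
    return {
        "production_offline": downtime_start - PRODUCTION_LEAD,
        "analysis_offline": downtime_start - ANALYSIS_LEAD,
    }


# Example with a made-up downtime starting at 12:00 on 1 March 2012.
schedule = offline_schedule(datetime(2012, 3, 1, 12, 0))
for queue, when in schedule.items():
    print(queue, when.isoformat())
```

With that made-up start time, production queues would go offline at midnight and the ANALY queues at 06:00 on the same day.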
> >>
> >> Once the downtime is over, the site will be set to test and the
> automatic test jobs will set the site back online when you are passing
> jobs. (This is already the case for analysis queues at sites).
> Currently in discussion is a proposal to avoid this automatic procedure
> for downtimes that are under a certain length and any site feedback on
> what should happen would be welcome.
> >>
> >>
> >> Hope people find this useful.
> >>
> >> Alastair
>
> __________________________________________________
> Dr Elena Korolkova
> Email: [log in to unmask]
> Tel.: +44 (0)114 2223553
> Fax: +44 (0)114 2223555
> Department of Physics and Astronomy
> University of Sheffield
> Sheffield, S3 7RH, United Kingdom