Hi Chris
There is an API for the site status board which also has this information. I am trying to find out where the documentation is and whether it is accessible without an ATLAS grid certificate.
A once-a-day email is currently sent out to [log in to unmask] which lists all open GGUS and Savannah tickets. It also lists sites with space tokens that are close to full, and sites that are offline.
Alastair
On 29 Feb 2012, at 15:21, Chris Brew wrote:
> Oh great, another web page to look at. And it's all in javascript so I
> cannot even try to parse it and generate alerts internally.
>
> Oh well I guess, I'll just have to wait to be ticketed.
>
> Chris.
>
>> -----Original Message-----
>> From: Testbed Support for GridPP member institutes [mailto:TB-
>> [log in to unmask]] On Behalf Of Elena Korolkova
>> Sent: 29 February 2012 14:43
>> To: [log in to unmask]
>> Subject: Re: ATLAS Production Functional Testing (PFT)
>>
>> Hi Chris
>>
>> you can check
>> http://panda.cern.ch/server/pandamon/query?dash=prod
>> http://panda.cern.ch/server/pandamon/query?dash=analysis
>>
>> Click on UK cloud.
>>
>> The procedure for exclusion will be in place from tomorrow.
>>
>> Cloud support will follow up on the problems.
>>
>> Cheers
>> Elena
>> On 29 Feb 2012, at 14:23, Chris Brew wrote:
>>
>>> Hi Alastair,
>>>
>>> Is there an easy place I can see whether my site is on or offline for
>> production and/or analysis?
>>>
>>> Any way I could query it programmatically eg with nagios?
>>>
>>> Thanks,
>>> Chris.
>>>
>>> On 28 Feb 2012, at 12:53, "Alastair Dewhurst"
>> <[log in to unmask]> wrote:
>>>
>>>> Hi
>>>>
>>>> ATLAS are introducing an automatic test for their production queues.
>>>>
>>>> There are 5 test jobs currently running.
>>>> - One GEANT 4 simulation job running under three different ATLAS
>>>> software release versions
>>>> - One Reconstruction job
>>>> - One Event Generation job
>>>>
>>>> Every 30 minutes 4 booleans are calculated:
>>>> P1: Last three jobs from any single test have failed
>>>> P2: Last two jobs from any single test and the last job from another
>>>> test have failed
>>>> P3: Last job from three separate tests have all failed
>>>> P4: Last two jobs from all tests have succeeded
>>>>
>>>> If (P1 || P2 || P3) the site will be blacklisted.
>>>> If (!P1 && !P2 && !P3 && P4) the site will be unblacklisted.
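
[Editor's note: the four booleans above can be sketched in a few lines of Python. This is an illustration of the rule as described in the email, not ATLAS's actual implementation; the function name and test names are made up.]

```python
def blacklist_decision(histories):
    """Sketch of the blacklisting rule described above.

    histories: dict mapping a test name (e.g. "sim", "reco", "evgen")
    to that test's job results, newest first (True = success).
    Returns 'blacklist', 'unblacklist' or 'no change'.
    """
    def failed(name, n):
        # True if the last n jobs of this test all failed
        h = histories[name]
        return len(h) >= n and not any(h[:n])

    names = list(histories)

    # P1: last three jobs from any single test have failed
    p1 = any(failed(t, 3) for t in names)
    # P2: last two jobs from one test and the last job from another have failed
    p2 = any(failed(a, 2) and failed(b, 1)
             for a in names for b in names if a != b)
    # P3: last job from three separate tests have all failed
    p3 = sum(failed(t, 1) for t in names) >= 3
    # P4: last two jobs from all tests have succeeded
    p4 = all(len(histories[t]) >= 2 and all(histories[t][:2]) for t in names)

    if p1 or p2 or p3:
        return "blacklist"
    if p4:
        return "unblacklist"
    return "no change"
```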
>>>>
>>>> Any blacklisted site will be set to 'test' mode which means normal
>> jobs will not be submitted but test jobs will continue.
>>>>
>>>> These test jobs have been submitted to sites since the end of last year. The blacklisting has not been switched on yet but will be very shortly (the French Cloud was switched on today). To view your site's jobs:
>>>>
>>>> http://panda.cern.ch/server/pandamon/query?job=*&type=&days=1&jobsetID=any&jobStatus=&site=&cplot=yes&plot=yes&processingType=gangarobot-pft&cplot=yes&cloud=UK
>>>>
>>>> Then click on your site. Or you can just modify the URL; Sheffield for example is:
>>>>
>>>> http://panda.cern.ch/server/pandamon/query?job=*&type=&days=1&jobsetID=any&jobStatus=&site=&cplot=yes&plot=yes&processingType=gangarobot-pft&cplot=yes&cloud=UK&computingSite=UKI-NORTHGRID-SHEF-HEP
>>>>
>>>> If you are experienced with using the panda monitor, the job types
>>>> you are looking for are: processingType=gangarobot-pft
>>>>
>>>> To see if your site would be blacklisted you can check:
>>>> http://hammercloud.cern.ch/hc/app/atlas/robot/incidents/?site=UKI-NORTHGRID-SHEF-HEP&severity=&q=&hours=
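
[Editor's note: for Chris's question about querying this programmatically, e.g. from Nagios, a minimal plugin skeleton might look like the following. Only the URL and site name come from the thread; the parsing step is a loudly-labelled assumption, since the page's response format is not documented here and the panda pages are JavaScript-heavy.]

```python
#!/usr/bin/env python3
"""Hypothetical Nagios-style check against the HammerCloud incidents page.
The URL and site name are from the thread; everything else is illustrative."""
import urllib.request

SITE = "UKI-NORTHGRID-SHEF-HEP"
URL = ("http://hammercloud.cern.ch/hc/app/atlas/robot/incidents/"
       "?site=" + SITE + "&severity=&q=&hours=")

# Standard Nagios exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify(open_incidents):
    """Map a count of open blacklisting incidents to a Nagios state."""
    return OK if open_incidents == 0 else CRITICAL

def check(timeout=30):
    """Fetch the incidents page and return (exit_code, message).

    Counting occurrences of the site name in the body is a crude
    placeholder -- real parsing would require inspecting the actual
    (possibly JavaScript-rendered) page format."""
    try:
        body = urllib.request.urlopen(URL, timeout=timeout).read().decode(
            "utf-8", "replace")
    except Exception as exc:
        return UNKNOWN, "UNKNOWN: could not fetch incidents page: %s" % exc
    hits = body.count(SITE)
    state = classify(hits)
    label = "OK" if state == OK else "CRITICAL"
    return state, "%s: %d incident entries for %s" % (label, hits, SITE)
```

A cron or NRPE wrapper would call `check()` and `sys.exit()` with the returned state.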
>>>>
>>>> The official place where ATLAS record queue changes is still here:
>>>> http://panda.cern.ch/server/pandamon/query?mode=site&site=UKI-NORTHGRID-SHEF-HEP
>>>>
>>>>
>>>>
>>>> In addition to this, ATLAS are also developing a method to automatically blacklist a site when it declares a (scheduled) downtime in the GOCDB. Currently, if you declare a downtime of your SE, ATLAS will automatically blacklist your space tokens, preventing transfers from there while you are down. In development (and being tested by RAL) is a procedure that will also blacklist your site if you declare an outage on the CE.
>>>>
>>>> Currently, if an outage is declared we have to rely on shifters to blacklist, test, and unblacklist sites. What will happen is that when you declare a downtime, the production queues for the site will be set offline (for ATLAS) 12 hours beforehand; 6 hours beforehand the ANALY queues will also be set offline. This actually works quite well, as the normally shorter-running analysis jobs will fill the site's farm and waste as little CPU as possible before a downtime. When we tried this at RAL there were about 230 jobs still in the farm when our downtime started, and these were killed. This was less than 10% of the ATLAS jobs running 12 hours before.
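
[Editor's note: the 12-hour and 6-hour lead times above can be made concrete with a tiny sketch. The function and key names are made up for illustration; the offsets are the ones stated in the email.]

```python
from datetime import datetime, timedelta

# Lead times stated in the thread: production queues go offline 12 hours
# before a declared GOCDB downtime, ANALY queues 6 hours before.
PROD_LEAD = timedelta(hours=12)
ANALY_LEAD = timedelta(hours=6)

def offline_schedule(downtime_start):
    """Given a GOCDB downtime start time, return when the production
    and ANALY queues would be set offline under the scheme described."""
    return {
        "production_offline": downtime_start - PROD_LEAD,
        "analy_offline": downtime_start - ANALY_LEAD,
    }
```

For a downtime starting at noon, production queues would go offline at midnight and ANALY queues at 06:00.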
>>>>
>>>> Once the downtime is over, the site will be set to test, and the automatic test jobs will set the site back online once you are passing jobs. (This is already the case for analysis queues at sites.) Currently in discussion is a proposal to skip this automatic procedure for downtimes under a certain length, and any site feedback on what should happen would be welcome.
>>>>
>>>>
>>>> Hope people find this useful.
>>>>
>>>> Alastair
>>
>> __________________________________________________
>> Dr Elena Korolkova
>> Email: [log in to unmask]
>> Tel.: +44 (0)114 2223553
>> Fax: +44 (0)114 2223555
>> Department of Physics and Astronomy
>> University of Sheffield
>> Sheffield, S3 7RH, United Kingdom