Hi,

Here's the CMS report - for today it's in writing, because I am still finding my way around.

* General news:

CMS wants to roll out xrootd everywhere.
The official statement follows:

  In the very short term, we want to get *all* T2 sites to implement the xrootd fallback mechanism, as described at

https://twiki.cern.ch/twiki/bin/view/Main/ConfiguringFallback

It's really easy to do (just add a couple of lines to two files), and it is to your benefit, as it will protect jobs against storage problems at your site.  Do this now!  Let's thank the sites that already have it working:

T2_CH_CERN
T2_CN_Beijing
T2_DE_*
T2_EE_Estonia
T2_ES_CIEMAT
T2_IT_*
T2_UK_London_IC
T2_UK_SGrid_RALPP
T2_US_*

In the medium term (I'd say the next few months) we want all sites to join the xrootd federation, as described in the links at

https://twiki.cern.ch/twiki/bin/view/Main/CmsXrootdArchitecture

Please start to consider how this could be done at your site.  And in the longer term (after the end of the run) we want to think about changing our data management organization: we would define space as either managed or unmanaged, run the managed space by automatically subscribing any new dataset, and let Victor progressively clean up datasets that aren't getting used.  It's a bit of a paradigm shift and will require some more work to implement, but it could give us more flexibility and efficiency in analysis computing.

[end of official statement]
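
Two practical notes on the above. First, the fallback: from memory (the twiki above is authoritative, so please check it rather than trusting me), the "two files" are the site-local-config.xml and storage.xml in your SITECONF area, and once the edits are in you can test the fallback path itself from any machine with the xrootd client tools and a CMS proxy. The LFN below is a placeholder; use any file that exists on CMS storage elsewhere:

  # The two files to edit (paths from memory, see the twiki for the exact lines):
  #   $CMS_PATH/SITECONF/local/JobConfig/site-local-config.xml  (fallback <catalog> entry)
  #   $CMS_PATH/SITECONF/local/PhEDEx/storage.xml               (lfn-to-pfn rule with protocol="xrootd")
  #
  # Functional test: read a file through the EU redirector instead of local storage.
  voms-proxy-init -voms cms    # need a valid CMS proxy first
  xrdcp root://xrootd-cms.infn.it//store/<some-LFN-that-exists-elsewhere> /dev/null

Second, the federation: if memory serves, the old xrd client can ask a redirector which data servers export a given file, which doubles as a check of whether your site's door is visible once you have joined (again, the LFN is a placeholder):

  # Ask the redirector which servers claim to have this file; after joining,
  # files hosted at your site should list your xrootd door here.
  xrd xrootd-cms.infn.it locateall /store/<some-LFN-you-host>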

* UK news

I've enabled a PhEDEx (file transfer) debug instance from Imperial to QMUL, which is working. We already had one set up to RHUL, which works too after applying the magic DPM command

  dpns-ls -R /dpm/brunel.ac.uk/home/cms/store | awk -F':' '/dpm/ {print $1}' | xargs -i dpns-setacl -m d:g:cms/Role=cmsphedex:rwx,d:m:rwx,g:cms/Role=cmsphedex:rwx,m:rwx {}

I am going to set one up for Glasgow next, to first order so that I learn how to set these up from scratch. I'll let Glasgow know if/when I get anywhere. These debug transfers should not interfere with anything on the SE.
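
Broken down, that one-liner does the following (same commands, just spread out and commented):

  # dpns-ls -R prints every directory under the store area as a header line
  # ending in ':', e.g. /dpm/brunel.ac.uk/home/cms/store/mc:
  dpns-ls -R /dpm/brunel.ac.uk/home/cms/store |
    # keep only those header lines and strip the trailing ':'
    awk -F':' '/dpm/ {print $1}' |
    # for each directory, add both a normal and a default ACL (the d: entries,
    # inherited by anything created later) giving the cms/Role=cmsphedex group
    # rwx, plus masks (m:rwx) so the group entries actually take effect
    xargs -i dpns-setacl -m d:g:cms/Role=cmsphedex:rwx,d:m:rwx,g:cms/Role=cmsphedex:rwx,m:rwx {}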

Brunel recently came out with awful site performance wrt CMS. Disentangling what went wrong threw up
a bunch of interesting observations (some might be relevant to non-CMS sites, so keep reading; apologies if the Savannah tickets aren't readable to everyone):

* In its site readiness table, CMS treats the GOCDB status "Warning" as unscheduled downtime. (Personally, I don't know what the point of 'Warning' is - a site either works or it doesn't.) This will be corrected (going forward only) now that Raul has filed a ticket:
https://savannah.cern.ch/support/?133644
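
Incidentally, you can see what GOCDB itself records for a site via its programmatic interface; a sketch (the method and parameter names are from memory, and UKI-LT2-Brunel is my guess at Brunel's GOCDB name):

  # Query GOCDB for the downtimes (scheduled or otherwise) recorded for a site.
  curl -s 'https://goc.egi.eu/gocdbpi/public/?method=get_downtime&topentity=UKI-LT2-Brunel'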
     
* There has been a suggestion that checksumming large files causes a backlog in data transfers to Brunel, and that checksumming on the fly might solve the problem.
https://savannah.cern.ch/support/?133588
and
https://ggus.eu/tech/ticket_show.php?ticket=88431 (ticket by Raul against DPM)
This might need input from the DPM experts. How does Atlas deal with this?
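
To get a feel for the cost: the xrootd client tools ship a standalone adler32 utility, so you can time a post-transfer checksum of a typical file yourself (the path is obviously a placeholder):

  # Adler32 of a multi-GB file read back from disk; on a busy disk server this
  # competes with production I/O, which is presumably where the backlog comes from.
  time xrdadler32 /path/to/a/typical/2GB/file.root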

* CMS runs a nagios, but I haven't worked out how to get it to tell me when something is broken (does the Atlas nagios do this?). So right now we often only hear of a problem when we get a ticket from the (often badly trained - if you can read the ticket you will see what I mean) shifter. In this case we actually caught the error earlier by accident, but the shifters themselves clearly don't check the nagios logs.
https://savannah.cern.ch/support/?133738
It would help if this was automated.
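
By "automated" I mean even something as dumb as a cron job that polls the test results and mails us; a sketch, where the URL and the string to grep for are pure placeholders, since I haven't found a machine-readable interface yet:

  #!/bin/bash
  # Hypothetical: poll a (made-up) status URL for our site and mail the admins
  # if any test is reported critical.
  STATUS_URL="https://example.cern.ch/cms-sam/status?site=T2_UK_London_Brunel"
  if curl -s "$STATUS_URL" | grep -q "CRITICAL"; then
      echo "CMS nagios/SAM reports a problem, see $STATUS_URL" |
        mail -s "CMS site test failure" cms-support@example.ac.uk
  fi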

* CMS MC production seems to block a site even if the failure is limited to a subcluster (at least that's how I read the ticket above). I need to follow this up.


Cheers,

Daniela


On 13 November 2012 10:18, Jeremy Coles <[log in to unmask]> wrote:
Dear All

Today's ops meeting agenda can be found at http://indico.cern.ch/conferenceDisplay.py?confId=213691. In addition to the standing items we will look at the (initial) proposal for the post-EMI software lifecycle (mentioned by Alessandra at HEPSYSMAN), discuss the new perfSONAR end-point registration request, and note observed impacts from last week's T1 power cut with a view to formulating wider lessons learned.

For minutes: Mark=3 Sam=4 Ewan=4 Alessandra=4 Kashif=5 Duncan=5 Wahid=5 Daniela=5 Chris=5 Pete=5 Matt=5 Jeremy=5 Catalin=5.

regards,
Jeremy



--
Sent from the pit of despair

-----------------------------------------------------------
[log in to unmask]
HEP Group/Physics Dep
Imperial College
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/