On 02/06/14 17:26, Jeremy Coles wrote:
> Dear All,
>
> The T2 figures for May are available. Please could all site admins
> check that the results for their site are understandable/correct and
> let me know if a re-computation is being requested. In terms of sites
> missing the 90% targets:
>
> ALICE
> (http://sam-reports.web.cern.ch/sam-reports/2014/201405/wlcg/WLCG_All_Sites_ALICE_May2014.pdf):
> No follow up needed.
>
> ATLAS
> (http://sam-reports.web.cern.ch/sam-reports/2014/201405/wlcg/WLCG_All_Sites_ATLAS_May2014.pdf):
>
> QMUL (63%:63%)
I believe most of the downtime was caused by the ATLAS deletion service
failing, which resulted in DATADISK filling up. With DATADISK full, the
tests that write to it failed (as you'd expect), and the deletion tests
(which try to delete a file that wasn't there, because it hadn't been
written) failed too. GGUS ticket
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105872 covers this
and acknowledges a bug in the deletion service.
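To make the dependency explicit, here is a minimal sketch (entirely hypothetical function and variable names, not the actual SAM test code) of why the delete test necessarily fails whenever the write test does: the delete probe targets the very file the put probe was supposed to create.

```python
# Toy model of the SRM-VOPut / SRM-VODel dependency described above.
# Names and structure are illustrative only.

def run_put(storage, path, space_needed=1):
    """Simulate the write test: fails when the space token is full."""
    if storage["free"] < space_needed:
        return False
    storage["files"].add(path)
    storage["free"] -= space_needed
    return True

def run_del(storage, path):
    """Simulate the delete test: fails if the file was never written."""
    if path not in storage["files"]:
        return False
    storage["files"].remove(path)
    storage["free"] += 1
    return True

# DATADISK full: the put fails, so the dependent delete fails too.
full_datadisk = {"free": 0, "files": set()}
put_ok = run_put(full_datadisk, "sam-test-file")
del_ok = run_del(full_datadisk, "sam-test-file")
assert (put_ok, del_ok) == (False, False)
```

So a single underlying problem (a full space token) shows up as two failing metrics, which is why excluding only the write tests from a recomputation would not be enough on its own.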
There was also a small amount of downtime due to a StoRM bug in which
gridhttps requests caused /tmp to fill up (actually a bug in an
underlying library); we also turned gridhttps off for a while to try to
avoid the problem.
https://ggus.eu/index.php?mode=ticket_info&ticket_id=105361
Unfortunately, confusion between the two issues, together with the fact
that files in other space tokens were being deleted, meant we took
longer to diagnose the problem than we should have.
Comparison of availability:
http://dashb-atlas-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=test&time[]=individual&granularity[]=daily&starttime=2014-05-01+00%3A00%3A00&endtime=2014-06-02+00%3A00%3A00&profile=ATLAS_CRITICAL&group=ATLAS_Cloud_UK&site[]=UKI-LT2-QMUL&flavour[]=SRMv2&disabledFlavours=true&metric[]=All+Metrics&metric[]=org.atlas.SRM-VODel&metric[]=org.atlas.SRM-VOGet&metric[]=org.atlas.SRM-VOPut&metric[]=org.atlas.SRM-VODel&metric[]=org.atlas.SRM-VOGet&metric[]=org.atlas.SRM-VOPut&disabledMetrics=true&host[]=se03.esc.qmul.ac.uk
And disk space on datadisk:
http://bourricot.cern.ch/dq2/accounting/cloud_view/UKSITES/30/
show unavailability between 20th May and 1st June, exactly coinciding
with the excessive disk usage reported by SRM (it would seem that DQ2
assumes that files marked for deletion have actually been deleted).
I think we should request a recomputation excluding the tests that
involved writing to DATADISK, or that depended on tests writing to
DATADISK. More simply, that whole period could be excluded.
Chris
> UCL (82%:82%)
> Sheffield (71%:71%)
> Sussex (83%:87%).
>
> CMS
> (http://sam-reports.web.cern.ch/sam-reports/2014/201405/wlcg/WLCG_All_Sites_CMS_May2014.pdf):
>
> RALPP (85%:85%).
>
> LHCb
> (http://sam-reports.web.cern.ch/sam-reports/2014/201405/wlcg/WLCG_All_Sites_LHCb_May2014.pdf):
>
> Sheffield (70%:70%) - 44% unknown
> Birmingham (55%:55%) - 70% unknown
> EFDA-JET (0%:0%).
>
>
> As usual, if your site appears in this list then within the next 4
> days please could you let me know, or point me towards an already
> recorded list of, the main issues encountered. I suspect for some it
> will be the EMI-3 upgrades.
>
> Many thanks,
> Jeremy
>
>
>
>
> Begin forwarded message:
>
>> *From: *WLCG Office <[log in to unmask]>
>> *Subject: *T2 Reliability & Availability - May 2014
>> *Date: *2 June 2014 13:21:18 BST
>> *To: *"project-wlcg-cb (Members of the WLCG CB)" <[log in to unmask]>
>> *Cc: *"project-lcg-gdb (LCG - Grid Deployment Board)"
>> <[log in to unmask]>, "sam-support (SAM support)"
>> <[log in to unmask]>, "[log in to unmask]" <[log in to unmask]>,
>> "[log in to unmask]" <[log in to unmask]>
>>
>> Dear all,
>>
>> The draft T2 reliability & availability reports for May 2014 are now
>> available at:
>>
>> http://sam-reports.web.cern.ch/sam-reports/2014/201405/wlcg/ under
>> titles starting with "WLCG_All_Sites..."
>>
>> Please verify your data and send any comments to WLCG Office by Weds
>> 11 June. We have recalculated the availabilities for the ATLAS & CMS
>> US sites due to the CRL issue as agreed with ATLAS and CMS
>> representatives.
>>
>> Any requests for recomputation must be submitted via GGUS within the
>> next 10 calendar days; full details here:
>> http://tomtools.cern.ch/confluence/display/SAMDOC/Availability+Re-computation+Policy
>>
>> Kind regards,
>> Cath
>>
>>
>> -----------------------------------------------
>> WLCG Office
>> IT Dept - CERN
>> CH-1211 Genève, Switzerland
>> www.cern.ch/wlcg
>