Hi,
here is yesterday's GDB report. You can also find it on the wiki:
https://www.gridpp.ac.uk/wiki/GDB_8th_September_2010
cheers
alessandra
10:00 Introduction 30' (John Gordon)
- Pilot jobs and glexec: glexec has been installed and tested at the
T1, and SAM/Nagios tests have been set up. Some preliminary testing
has been done; the experiments should now be able to test on a larger
scale. More progress is expected next month. (A minimal invocation
sketch follows after this list.)
- CREAM CE: sites don't find it a problem, but ATLAS still seems to
have an issue. Condor has provided a new version that should solve
the problem, but it hasn't been tested yet.
- Virtualisation: there are some activities going on.
Experiments should be more involved now.
- Information system stability: there are top level BDIIs that
are too old and cannot publish the latest information.
- Installed capacity: there is a page on how sites should publish
their CPU and storage resources. T1s should go and check the
installed capacity published by the new GStat 2.0 page. For the UK
sites:
http://gstat-wlcg.cern.ch/apps/capacities/sites
Fair shares aren't included yet; the info is about raw capacity. The
information is dynamic at the moment but will become static, with a
DB behind it, in the future. The deadline is the end of September for
T1s and the end of October for T2s. (A sketch of how to query the
published numbers follows after this list.)
- SAM tests are being phased out and the regional Nagios instances
have now taken over, although there is still a central instance.
- The Director General has issued a technical plan that has to go
back to the funding agencies. See John's slides for it.
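
(My note on the glexec item above: this is only a rough sketch, under
my own assumptions, of how a pilot framework invokes glexec to switch
identity before running a payload. The proxy paths are invented and
the glexec install location can differ per site.)

    import os
    import subprocess

    env = os.environ.copy()
    # Proxy of the payload owner; glexec authenticates it and maps the
    # payload to its own local account (both paths are hypothetical).
    env["GLEXEC_CLIENT_CERT"] = "/tmp/payload_proxy.pem"
    env["GLEXEC_SOURCE_PROXY"] = "/tmp/payload_proxy.pem"

    # glexec runs the wrapped command under the mapped account;
    # /usr/sbin/glexec is a common install location but not guaranteed.
    result = subprocess.run(["/usr/sbin/glexec", "/usr/bin/id"],
                            env=env, capture_output=True, text=True)
    print(result.stdout)  # should show the payload uid, not the pilot's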
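
(And on the installed-capacity item: a hedged sketch of cross-checking
what sites publish, querying a top-level BDII with python-ldap. The
BDII host and the GLUE 1.3 attribute names are the commonly used
ones, but verify against your region's BDII before relying on this.)

    import ldap

    # Example top-level BDII; substitute your region's instance.
    conn = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")

    # Total/used online storage for every published SE (sizes in GB).
    for dn, attrs in conn.search_s(
            "o=grid", ldap.SCOPE_SUBTREE, "(objectClass=GlueSE)",
            ["GlueSEUniqueID", "GlueSETotalOnlineSize",
             "GlueSEUsedOnlineSize"]):
        se = attrs.get("GlueSEUniqueID", [b"?"])[0].decode()
        total = attrs.get("GlueSETotalOnlineSize", [b"0"])[0].decode()
        used = attrs.get("GlueSEUsedOnlineSize", [b"0"])[0].decode()
        print(f"{se}: {used}/{total} GB used/total online")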
10:30 Shared Software Areas 30' (Elisa Lanciotti, Ian Collier)
- Results obtained using CernVM-FS at PIC for software distribution.
The setup includes a squid cache to reduce the WAN latency and a
local cache on the WNs. On the testbed there was no dependence on the
number of jobs, and the job execution time is the same as, if not
better than, NFS.
- LHCb has more than 4M small files and they get touched very often;
this is also one of the reasons the NFS servers get brought down
(GridKa complains).
- RAL made some preliminary scalability tests and managed to run 800
concurrent jobs, but hasn't looked at the job efficiency in detail.
This could be achieved with any caching system (AFS, proposed by
Manchester since 2001 and currently used by CERN ;-)), but
CernVM-FS's main advantage is that when a file is in the catalogue it
doesn't get copied again even if it has multiple references (i.e.
belongs to more than one release); a toy illustration follows at the
end of this session's notes. The next step is to scale to thousands
of jobs. The main problem is how the main service at CERN will be
supported if sites move to CernVM-FS, and there is a skeleton of a
security framework currently being worked on (checksums and
validation done by the sgm account?). One of the things to look at is
how much cache is needed on the clients and how big the squid cache
has to be. Another big advantage is that this model eliminates the
need to publish tags in the info system.
- Most of the work gets done at CERN, where the experiments take care
of installing the master release; sites then just use caches. There
is still a validation problem under discussion.
- The general consensus is that this is promising; the request for
CernVM-FS to be supported should come from the experiments.
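
(The toy illustration promised above: this is not CernVM-FS itself,
just a small Python sketch of the content-addressed idea, i.e. files
are stored under a hash of their content, so a file referenced by
many releases is stored, and transferred, only once.)

    import hashlib
    import os

    STORE = "/tmp/cas-store"  # illustrative local object store
    os.makedirs(STORE, exist_ok=True)

    def add_file(path):
        """Store a file by content hash; duplicates are never copied twice."""
        with open(path, "rb") as f:
            data = f.read()
        digest = hashlib.sha1(data).hexdigest()
        target = os.path.join(STORE, digest)
        if not os.path.exists(target):  # new content: stored exactly once
            with open(target, "wb") as out:
                out.write(data)
        return digest  # the catalogue would map pathname -> digest

    # Two releases shipping the same library yield one stored object:
    # add_file("rel-1/libFoo.so") and add_file("rel-2/libFoo.so")
    # return the same digest if the contents are identical.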
11:00 Accounting 30' (Cristina Del Cano Novales)
- APEL is now monitored by Nagios.
- APEL has moved to ActiveMQ as its transport layer, and R-GMA needs
to be decommissioned. Only 30 sites are using the ActiveMQ publisher;
190 are still using R-GMA.
- MPI parser support will be introduced.
- Regionalization will have to be flexible: regions can either
publish to the central service or set up their own accounting service
and publish a summary to the central database, but they should do it
all through the same ActiveMQ interface. (A publishing sketch follows
at the end of this session's notes.)
- Nikhef didn't use the APEL parser but injected the info directly
into R-GMA. A way needs to be found for them to do the same here.
- The schema still supports only grid jobs. Is there anything on the
horizon that includes local jobs? The problem is that for APEL
everything is a local job unless it is joined with grid information.
- There is an accounting requirements collection going on in EGI.
User-level accounting was a VO requirement that no longer seems to be
needed, as the experiments have their own ways of keeping track of
who has submitted the jobs.
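
(The publishing sketch promised above, not the real APEL publisher:
it shows how a regional accounting service might push a summary
record to the central broker over STOMP with the stomp.py library.
The broker host, queue name and record fields are all my own
illustrative assumptions, not the actual APEL message schema.)

    import stomp

    # Hypothetical broker endpoint and credentials.
    conn = stomp.Connection([("msg-broker.example.org", 61613)])
    conn.connect("site-user", "secret", wait=True)

    # Invented summary record; the real APEL format differs.
    record = ("Site: UKI-EXAMPLE\n"
              "Month: 08\nYear: 2010\n"
              "WallDuration: 123456\n"
              "CpuDuration: 98765\n")
    conn.send(destination="/queue/accounting.summaries", body=record)
    conn.disconnect()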
11:30 Middleware 30' (Andrew Elwell)
- Batch systems will have best-effort support from a number of sites.
- There is no centralised group in EGI that can decide what is
considered critical and set dates for when things should be dropped.
In particular there is no date yet for dropping gLite 3.1 services.
We will have to rely on the WLCG baseline release mechanism to
understand what can be published.
- lcg-vomscerts needs to be updated by the 10th of September.
- A number of updates and fixes were listed.
- To keep up to date with the situation, use the work plan tracking
in Savannah: http://bit.ly/22we3i
12:00 Network Monitoring 20' (John Shade)
- The most important thing is that there is now a prototype dashboard
to monitor the OPNs, and they are also working on historical data.
http://sonar1.munich.cnm.dfn.de/lhcopn-dashboard/cgi-bin-auto/cnm-table.cgi
- It is not yet well supported: DANTE withdrew, and SARA and CERN
picked up the task.
14:00 - 16:25 Experiment Operations
14:00 Alice 30' (Latchezar Betev)
- Quite happy with the data collection.
- Waiting for RAL to upgrade CASTOR because the new version is
"rock solid".
- Tier2 storage is holding fine
- Analysis is 5% of grid resources, ~250 users.
- Machine is stable
14:30 LHCb 30' (Roberto Santinelli)
- Analysis running up to the computing model expectations.
- Running xrootd
- Using the CREAM CE in production and testing direct job submission.
- Running analysis at some T2s with DPM is under evaluation, but it
puts a strain on the central system.
- glexec is not in production, but there are Nagios tests in place.
- Distribution is per run rather than per file (1 run == 1
site)
- Prototyping HC tests
- Data taking: quite impressive integrated luminosity as well.
- List of GGUS tickets and analysis of the problems. The number of
tickets has doubled, partly because LHCb has started to use a number
of services at the Tier1s heavily.
- The shared SW area is one of the biggest "distributed" problems.
Lyon has had problems since June 8th and has looked at using AFS like
CERN; ATLAS jobs compiling there compete for the same resource. There
were complaints again about the number of small files that get
touched and bring NFS down at GridKa. A Site ID card was proposed, at
least for T1s.
- The CREAM CE has also had a number of issues since June. The
current release seems more stable.
- Frequent crashes of the RAL disk servers needed some memory tuning.
Storage is still the most vulnerable component.
15:10 CMS 30' (Ian Fisk)
- Sites are in pretty good shape
- Lots of work has been done to make T2-T2 transfers reliable, and
this is paying off by increasing data availability
- Quality of transfers from CERN remarkably high
- Detailed list of major issues with T1
- Data for analysis exceeded 2 PB and physics groups manage
their space at T2
- Analysis has slowed down after ICHEP and during August but is
ramping up again
- CREAM CE in use; the latest version solves a lot of problems
- Working on pilot factories based on Condor glide-ins
- Asked for a Savannah-GGUS bridge, which is working fine, although a
lot of added features have been requested.
15:40 ATLAS 30' (Simone Campana, Stephan Jezequel)
- Interesting slides about the discrepancies between DQ2 and the Info
Sys reported for storage. Most of the problems come from the fact
that the IS uses the equation used = total - unused, where total is
all the disk installed, whether functional or not. This often causes
negative numbers to be published when sites put data disks offline or
in a read-only state; a toy numerical example follows at the end of
these notes. A solution needs to be agreed. (For reference
https://gus.fzk.de/ws/ticket_info.php?ticket=54818)
- T1 availability is mostly OK apart from NIKHEF/SARA, which had
heavy problems with storage and Oracle.
- Few GGUS tickets for T1
- A number of iterations have been required to converge on a reliable
checksum computation for StoRM sites.
- Storage is still a bit flaky. There are big differences in
responsiveness and reliability between T1s; there should be a
detailed comparison in the future.
- RAL had prolonged problems with storage.
- SARA had prolonged problems with the LFC (generated by Oracle).
- Actions have been identified to solve or minimize both problems.
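
(The toy numerical example promised above, with invented figures. The
exact mechanism is site-dependent; the point is only that when the
two terms of the equation are taken from inconsistent views of the
storage, one seeing the offline disks and one not, the subtraction
can go below zero.)

    # All numbers are invented for illustration.
    total = 700    # TB reported as total by one view of the storage
    unused = 750   # TB reported as free by another, inconsistent view

    used = total - unused   # the IS equation from the talk
    print(used)             # -50: the negative capacity ATLAS observed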