Performance of the site is much improved with our new SE,
se03.esc.qmul.ac.uk, connected to a 1Gbit port on our external link. We
now see transfers that peak at over 90 MB/s according to FTS [1].
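That is a healthy fraction of the port's ceiling; a rough
back-of-the-envelope check (assuming ~1000 Mbit/s usable and ignoring
protocol overhead):

echo "90 / (1000 / 8) * 100" | bc -l  # 1Gbit/s = 125 MB/s ceiling, so 90 MB/s is ~72% of line rate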
Lots of data has been transferred:
[root@fe08 storm_3]# du --summarize -h * ; du --summarize -h atlas/* ; du --summarize -h
6.7T atlas
248G cms
8.0K dteam
4.0K lhcb
24K ops
4.0K vo.londongrid.ac.uk
2.7T atlas/atlasdatadisk
4.0K atlas/atlasgroupdisk
10M atlas/atlaslocalgroupdisk
3.9T atlas/atlasmcdisk
6.6G atlas/atlasproddisk
158G atlas/atlasscratchdisk
7.0T .
We have also resolved network congestion issues - several machines with
high internal network traffic (se03, se02, ce03, fe06 (DNS), fe07 (NAT))
shared a single 1Gbit link to the internal switch. Moving them all to
the central switch has eliminated packet loss to the DNS server.
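As a quick sanity check (a sketch; assumes fe06 resolves from the
workers and answers ICMP), the loss can be confirmed gone with
something like:

ping -c 100 -q fe06 | grep 'packet loss'  # expect "0% packet loss" now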
A full disk on one of the worker nodes also caused jobs to fail.
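To catch that earlier, something like the following could run from cron
on each node (a sketch; the 90% threshold is an arbitrary choice, not
our current monitoring):

df -P | awk 'NR>1 && int($5) > 90 {print $6 " at " $5}'  # flag filesystems over 90% full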
The cluster is now full of jobs - with 2000 CMS jobs running (though
only 8 ATLAS jobs).
Outstanding issues.
-------------------
Instead of using the "file" protocol and the 10*10Gbit links to the
storage, jobs are still using the gridftp and rfio protocols. This
means they talk to the storage via a single 1Gbit link. Worse, that
link is also used for incoming data transfers to the storage.
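To illustrate the difference (a sketch with a hypothetical mount point
and file name; assumes the storage filesystem is posix-mounted on the
workers), a file-protocol read stays on the 10Gbit links, while the
same file fetched with gridftp goes through the SE's single 1Gbit port:

dd if=/mnt/storm/atlas/atlasdatadisk/some.file of=/dev/null bs=1M  # "file" protocol: local posix read
globus-url-copy gsiftp://se03.esc.qmul.ac.uk/atlas/atlasdatadisk/some.file file:///tmp/some.file  # gridftp (needs a valid grid proxy)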
As jobs using rfio have been failing - presumably because its poor
performance leads to timeouts - I have turned it off.
Chris
[1]
http://ganglia.gridpp.rl.ac.uk/cgi-bin/ganglia-fts/fts-graph.pl?g=All&r=day&s=huge&t=bytes&f=UKILT2QMUL&v=All&h=Services_Grid/lcgfts0421.gridpp.rl.ac.uk