For context:
Duncan mentioned an interest in a "hotfiles" tool for the dpm toolkit,
so I was testing my implementation on the last two or so weeks of data
in the DPM (which includes the HC test).
The results are interesting:
Hot files for a period of 10 days:
PFN  Number of gets  Filesize
disk041.gla.scotgrid.ac.uk:/gridstore4/atlas/2009-07-28/user09.JohannesElmsheuser.ganga.voatlas49_1248773583.lib._000010.lib.tgz.15496071.0  13966  1.423223M
disk032.gla.scotgrid.ac.uk:/gridstore1/atlas/2008-12-14/DBRelease-6.3.1.tar.gz.5487017.0  8860  272.372005M
disk061.gla.scotgrid.ac.uk:/gridstore0/atlas/2009-07-06/DBRelease-7.1.1.tar.gz.15099765.0  8520  273.544275M
disk051.gla.scotgrid.ac.uk:/gridstore1/atlas/2009-07-24/EVNT.076420._000002.pool.root.2.15383493.0  8518  131.759203M
disk034.gla.scotgrid.ac.uk:/gridstore2/atlas/2009-07-21/user09.JohannesElmsheuser.ganga.voatlas49_1248165653.lib._000011.lib.tgz.15257454.0  8266  File no longer exists on DPM
disk038.gla.scotgrid.ac.uk:/gridstore2/atlas/2009-07-21/user09.JohannesElmsheuser.ganga.voatlas49_1248164115.lib._000002.lib.tgz.15257206.0  6202  File no longer exists on DPM
disk054.gla.scotgrid.ac.uk:/gridstore1/cms/2009-05-20/TenEvents.root.12500863.0  429  257.235K
disk039.gla.scotgrid.ac.uk:/gridstore0/atlas/2009-07-20/test.100000.PythiaWqqJet_Ptcut.evgen.EVNT._00001.pool.root.200709.608.15242588.0  412  96.586479M
disk033.gla.scotgrid.ac.uk:/gridstore4/atlas/2009-07-08/fileee8e3f7c-3568-4f95-b334-e252ce5df41f.15116045.0  322  File no longer exists on DPM
disk054.gla.scotgrid.ac.uk:/gridstore0/ops/2009-07-08/fileaaa7e9bf-2e1f-4517-bd4b-b07a0c040288.15116071.0  302  File no longer exists on DPM
The output should be self-explanatory - the pfn is the "real" location
of the file on the pool in question, not the SURL for it, the number
of gets is measured by number of logged requests by DPM, and the
filesize is pulled from the namespace - so files that no longer exist
can't be given a filesize.
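As a rough illustration of the tally described above (not the actual hotfiles implementation), the counting step could look something like the sketch below. The log line format `<timestamp> GET <pfn>` is a hypothetical simplification, not the real DPM log format; the `sizes` dict stands in for the namespace lookup; and reading "M" as 10^6 bytes is also an assumption.

```python
from collections import Counter


def count_gets(log_lines):
    """Count GET requests per PFN.

    Assumes a hypothetical, simplified line format: '<timestamp> GET <pfn>'.
    Real DPM logs look different and would need their own parser.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) == 3 and parts[1] == "GET":
            counts[parts[2]] += 1
    return counts


def hot_files(log_lines, sizes, top=10):
    """Return (pfn, gets, size) rows, hottest first.

    `sizes` stands in for the namespace lookup (pfn -> bytes).
    A PFN that is missing from it gets a size of None.
    """
    counts = count_gets(log_lines)
    return [(pfn, gets, sizes.get(pfn)) for pfn, gets in counts.most_common(top)]


def format_row(pfn, gets, size):
    """Render one output row; deleted files cannot be given a size."""
    if size is None:
        size_text = "File no longer exists on DPM"
    else:
        size_text = f"{size / 1e6:.6f}M"  # assumes M means 10^6 bytes
    return f"{pfn}  {gets}  {size_text}"
```

Keeping the size lookup separate from the counting means a file deleted after it was served still shows up in the ranking, just without a size.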
I note that the top 6 hot files are all massively hotter than any
other files on the DPM in the last 10 days, and are all relatively
small.
Indeed, if we assume that the other ganga.voatlas...lib.tgz... files
are the same size as the first one, then 3 of the top 6 hot files are
less than 2 MB in size!
(Note that disk041 suffered from load hotspots during the last HC, as
did 034 and 038 intermittently, so this has a genuine effect on the
system load.)
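One way to put numbers on "hot but small" is to multiply the number of gets by the file size, using the figures from the table for the four top files that still have a size. This is only a sketch: it assumes every get reads the whole file, which the logs do not confirm, so the result is an upper bound, and the division by 1000 to get "G" assumes decimal units.

```python
# (label, gets, size in M as reported in the hot files table)
TOP_FILES = [
    ("ganga lib.tgz (disk041)", 13966, 1.423223),
    ("DBRelease-6.3.1", 8860, 272.372005),
    ("DBRelease-7.1.1", 8520, 273.544275),
    ("EVNT.076420", 8518, 131.759203),
]


def volume_served(gets, size_m):
    """Upper bound on data served, in M: assumes each get reads the whole file."""
    return gets * size_m


for label, gets, size_m in TOP_FILES:
    volume_g = volume_served(gets, size_m) / 1000  # assumes 1G = 1000M
    print(f"{label}: {gets} gets, at most ~{volume_g:.1f}G served")
```

Under those assumptions the hottest file accounts for roughly 20G in total, against roughly 2.4T for DBRelease-6.3.1, so it is hot in request count rather than in bytes moved.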
Does anyone know what these files are (not the DBRelease or the EVNT,
but the ganga.voatlas lib ones)?
Sam
2009/7/30 Alastair Dewhurst:
> Hi all
>
> The Hammer Cloud test seems to have gone reasonably well. The overall
> efficiency was 89% (completed jobs / [completed jobs + failed jobs]).
> Further details can be found at:
> http://gangarobot.cern.ch/hc/540/test/
>
> Would all sites that took part please try to produce a plot of their
> throughput? (The throughput is a plot of the job efficiency vs the number
> of jobs running.) It is a useful metric of the capacity of your site to
> perform analysis. It can also be used to see if there is an optimal number
> of jobs to run to maximize the site's throughput. Example plots can be
> found in figures 12 and 13 of the Glasgow STEP09 wash up report
> http://tinyurl.com/lg62jq.
>
> Sites ranked in order of event rate (the average number of AOD events
> processed by each job per second):
> OX = 12.4
> CAM = 11.5
> RALPP = 10.5
> LIV = 10.5
> GLASGOW = 10.4
> BHAM = 10.3
> QMUL = 10.1
> RHUL = 8.6
> SHEF = 7.5
> MANC2 = 6.2
> MANC1 = 5.1
> LANC = 3.7
> An average event rate of over 10 is excellent, with above 8 being
> good.
>
> Lancaster have already commented that their low event rate was due to:
> "throttling of the number of job slots, usually MAXJOB=20 but we've played a
> little and things get worse above this. As I've said, the bottleneck is our
> LAN."
>
>
> Sites ranked in order of error rate (failed jobs / [failed jobs + completed
> jobs]):
> MANC1 = 83%
> SHEF = 27%
> BHAM = 20%
> MANC2 = 19%
> RHUL = 17%
> GLASGOW = 9%
> OX = 9%
> RALPP = 5%
> LANC = 2%
> QMUL = < 1%
> CAM = < 1%
> LIV = < 1%
> Would all sites please comment on their error rate, especially if it is
> larger than 5%.
>
> Glasgow have already commented that the vast majority of their errors
> occurred in a one-hour period when there was a DNS glitch.
>
>
> We aim to run another Hammer Cloud test next week and details will be
> emailed out later.
>
> Thanks.
>
> Alastair (with the help of Graeme Stewart)
>