Most of the RALPP failures occurred last night and I think were due to
our SE being unstable following a dCache update, which has now been
rolled back.
Cheers,
Rob
> -----Original Message-----
> From: Testbed Support for GridPP member institutes
> [mailto:[log in to unmask]] On Behalf Of Alastair Dewhurst
> Sent: Thursday, July 30, 2009 12:18 PM
> To: [log in to unmask]
> Subject: Hammer Cloud review 28th - 30th July
>
> Hi all
>
> The Hammer Cloud test seems to have gone reasonably well. The
> overall efficiency was 89% (completed jobs / [completed jobs
> + failed jobs]). Further details can be found at:
> http://gangarobot.cern.ch/hc/540/test/
>
> Would all sites that took part please try to produce a plot of their
> throughput? (The throughput is a plot of the job efficiency
> vs the number of jobs running.) It is a useful metric of the
> capacity of your site to perform analysis. It can also be
> used to see if there is an optimal number of jobs to run to
> maximize the site's throughput. Example plots can be found in
> figures 12 and 13 of the Glasgow STEP09 wash up report
> http://tinyurl.com/lg62jq.
>
> Sites ranked in order of event rate (the average number of
> AOD events processed by each job per second):
> OX = 12.4
> CAM = 11.5
> RALPP = 10.5
> LIV = 10.5
> GLASGOW = 10.4
> BHAM = 10.3
> QMUL = 10.1
> RHUL = 8.6
> SHEF = 7.5
> MANC2 = 6.2
> MANC1 = 5.1
> LANC = 3.7
> An average event rate of over 10 is excellent, and above 8
> is good.
>
> Lancaster have already commented that their low event rate was due
> to: "throttling of the number of job slots, usually MAXJOB=20
> but we've played a little and things get worse above this. As
> I've said, the bottleneck is our LAN."
>
>
> Sites ranked in order of error rate (failed jobs / [failed
> jobs + completed jobs]):
> MANC1 = 83%
> SHEF = 27%
> BHAM = 20%
> MANC2 = 19%
> RHUL = 17%
> GLASGOW = 9%
> OX = 9%
> RALPP = 5%
> LANC = 2%
> QMUL = < 1%
> CAM = < 1%
> LIV = < 1%
> Would all sites please comment on their error rate,
> especially if it is larger than 5%.
>
> Glasgow have already commented that the vast majority of
> their errors occurred in a 1-hour period when there was a
> DNS glitch.
>
>
> We aim to run another Hammer Cloud test next week and details
> will be emailed out later.
>
> Thanks.
>
> Alastair (with the help of Graeme Stewart)
>
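
For sites wanting to produce the throughput plot requested above, here is a
rough sketch of how the numbers could be computed. It uses the efficiency and
error rate definitions given in the review; everything else is an assumption
made for illustration: the function names, the bin width, and the idea that
each site can extract samples of the form (jobs running, completed, failed)
from its own batch logs. It is not how Hammer Cloud itself calculates these
figures.

```python
from collections import defaultdict


def efficiency(completed, failed):
    """Efficiency as defined in the review: completed / (completed + failed)."""
    total = completed + failed
    if total == 0:
        raise ValueError("no finished jobs to compute efficiency from")
    return completed / total


def error_rate(completed, failed):
    """Error rate as defined in the review: failed / (failed + completed)."""
    total = completed + failed
    if total == 0:
        raise ValueError("no finished jobs to compute error rate from")
    return failed / total


def throughput_curve(samples, bin_width=10):
    """Efficiency versus number of running jobs.

    `samples` is a hypothetical list of (running_jobs, completed, failed)
    tuples, e.g. one tuple per time interval taken from batch system logs.
    Samples are grouped into bins of `bin_width` running jobs, and the
    efficiency is computed per bin. Returns a list of
    (bin_start, efficiency) pairs sorted by bin_start.
    """
    totals = defaultdict(lambda: [0, 0])  # bin_start -> [completed, failed]
    for running, completed, failed in samples:
        bin_start = (running // bin_width) * bin_width
        totals[bin_start][0] += completed
        totals[bin_start][1] += failed
    return [
        (bin_start, efficiency(c, f))
        for bin_start, (c, f) in sorted(totals.items())
        if c + f > 0  # skip bins in which no jobs finished
    ]


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    demo = [(5, 9, 1), (8, 10, 0), (15, 6, 4), (25, 2, 8)]
    for bin_start, eff in throughput_curve(demo):
        print(f"{bin_start:3d}-{bin_start + 9:3d} running jobs: {eff:.0%}")
```

The (bin_start, efficiency) pairs can then be fed to whatever plotting tool
the site prefers. A drop in efficiency at higher bins would suggest the kind
of optimal job count mentioned in the review.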