Graeme Stewart wrote, On 25/11/08 15:19:
> Hi Greig
>
> I just looked and the jobs are very poor in CPU efficiency (15-25%).
> Yes, the jobs were reading directly using rfio.
>   

Hold on, I've been investigating a bit more by looking at the stdout 
of the jobs you are running. So you are using rfio, but only after 
first performing an lcg-gt on the SURL of each file. That is bound to 
make the jobs less efficient, because each job has to spend time 
querying the BDII for the information it needs to obtain the TURL. You 
could perhaps do a bulk query for this information, although I'm not 
sure whether lcg_utils supports that. There are also additional GSI 
authentication steps involved in each SRM interaction.
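Roughly, the per-file pattern I can see in your stdout amounts to 
something like this (a minimal Python sketch, not your actual job code; 
the hostname, SURL and use of subprocess are purely illustrative):

    import subprocess

    # Illustrative SURL only -- substitute a real file in your DPM namespace.
    surl = ("srm://dpm-head.example.ac.uk/dpm/example.ac.uk"
            "/home/atlas/AOD/some_file.pool.root")

    # lcg-gt looks up the endpoint in the BDII, does a GSI handshake with
    # the SRM on the head node, and only then hands back a TURL. All of
    # this happens once per file, before a single event is read.
    out = subprocess.check_output(["lcg-gt", surl, "rfio"]).decode()
    turl = out.split()[0]   # lcg-gt prints the TURL first, then a request id

    # The job then opens 'turl' with rfio and starts reading.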

At Edinburgh I just set things up so that I use rfio:/dpm/path/to/file 
to access the data, which bypasses SRM communication entirely. This is 
really only possible because I know how things are set up here.
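For comparison, the only per-file work in the direct approach is 
building the path (again illustrative; the namespace prefix depends on 
how your DPM is configured):

    # No BDII lookup, no SRM request, no extra GSI handshake with the head
    # node: the rfio path is built from the known DPM namespace and handed
    # straight to the application.
    filename = "some_file.pool.root"   # illustrative name
    turl = "rfio:/dpm/example.ac.uk/home/atlas/AOD/" + filename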

Performing the SRM communication and the extra GSI steps for every 
file would also explain why the DPM head node is so busy.

Cheers,
Greig

> Event/sec is one of the outputs you'll see in the final analysis.
>
> Although the DPM servers were cruising - low load, excellent data
> output rates - the headnode was suffering very high CPU load. This is
> surprising as the headnode should only be contacted for the open step
> and it hands off to the disk server.
>
> Puzzling...
>
> Graeme
>
> On Tue, Nov 25, 2008 at 3:33 PM, Greig A. Cowan <[log in to unmask]> wrote:
>   
>> Hi Graeme,
>>
>> Do you have numbers for how CPU efficient these analysis jobs were? It will
>> be interesting to see how IO-bound they were. You're using rfio, right?
>>
>> Putting things into a physics perspective, it would also be interesting to
>> know how many events were processed by each analysis job per unit time.
>>
>> Anyway, looks like it's been a good exercise so far and the DPM disk servers
>> don't look all that loaded going by your ganglia.
>>
>> Cheers,
>> Greig
>>
>> Graeme Stewart wrote, On 25/11/08 13:11:
>>     
>>> Brian,
>>>
>>> Glasgow is here:
>>>
>>>
>>> http://svr031.gla.scotgrid.ac.uk/ganglia/?c=DPM%20Storage&m=&r=hour&s=by%20hostname&hc=4
>>>
>>> Preliminary results:
>>>
>>> "We had 40 jobs (just 40!) running on the
>>> cluster before lunch sucking data out of our DPM at ~600MB/s, which is
>>> 15MB/s per job (75Hz for 200kB AOD????).
>>>
>>> Currently we're running 85 jobs and hitting 1GB/s from our storage,
>>> which is about the limit (9 servers x 1Gb).
>>>
>>> This means we have saturated our i/o capacity with a cluster which is
>>> 15% full of analysis jobs.
>>>
>>> I think we need more network cards and bigger switches. I am astonished.
>>>
>>> Graeme"
>>>
>>>
>>> On Tue, Nov 25, 2008 at 9:37 AM, Davies, BGE (Brian)
>>> <[log in to unmask]> wrote:
>>>
>>>       
>>>> I have been collecting ganglia endpoints for those sites which publish
>>>> so as to be able to look at loads.
>>>> I have found all but Liverpool and RHUL for today's tests.
>>>> Does anyone have a link to these? (If they are already in the gridpp
>>>> wiki then I cannot find them...)
>>>> Brian
>>>>
>>>> -----Original Message-----
>>>> From: Testbed Support for GridPP member institutes
>>>> [mailto:[log in to unmask]] On Behalf Of Graeme Stewart
>>>> Sent: 24 November 2008 21:27
>>>> To: [log in to unmask]
>>>> Subject: Re: Analysis challenge in the UK tomorrow
>>>>
>>>> On Mon, Nov 24, 2008 at 2:44 PM, Graeme Stewart
>>>> <[log in to unmask]> wrote:
>>>>
>>>>         
>>>>> Dear All
>>>>>
>>>>> We intend to start an ATLAS analysis challenge tomorrow at the
>>>>> following UK sites:
>>>>>
>>>>> UKI-LT2-RHUL
>>>>> UKI-NORTHGRID-LANCS-HEP
>>>>> UKI-NORTHGRID-LIV-HEP
>>>>> UKI-NORTHGRID-SHEF-HEP
>>>>> UKI-SCOTGRID-GLASGOW
>>>>> UKI-SOUTHGRID-OX-HEP
>>>>> UKI-SOUTHGRID-RALPP
>>>>>
>>>>> This will involve the submission of several hundred 'real' ATLAS
>>>>> analysis jobs via the WMS. We would kindly ask the sites to keep an
>>>>> eye on their systems during this test and report any problems they
>>>>> see. In particular we should like you to be alert for saturation of
>>>>> the network between your storage and the worker nodes. If you can grab
>>>>> any ganglia plots of activity or any other interesting metrics from
>>>>> your side we would be grateful.
>>>>>
>>>>> The jobs should be submitted in the morning (probably about 10am) but
>>>>> I will send another alert when this actually happens.
>>>>>
>>>>>           
>>>> Hi
>>>>
>>>> The jobs are set to go at 9am tomorrow (UK time), so gulp down that
>>>> coffee quickly :-)
>>>>
>>>> Dan has set up some trial monitoring here:
>>>>
>>>> http://gangarobot.cern.ch/st/
>>>>
>>>> where results will be posted as the jobs finish.
>>>>
>>>> Cheers
>>>>
>>>> Graeme
>>>>
>>>> --
>>>> Dr Graeme Stewart              http://www.physics.gla.ac.uk/~graeme/
>>>> Department of Physics and Astronomy, University of Glasgow, Scotland
>>>>
>>>>
>>>>         
>>>
>>>
>>>       
>
>
>
>   

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.