Begin forwarded message:
> From: Rolf Seuster <[log in to unmask]>
> Date: 3 April 2012 20:24:46 GMT+01:00
> To: I Ueda <[log in to unmask]>
> Cc: Elena Korolkova <[log in to unmask]>, "[log in to unmask]" <[log in to unmask]>, "[log in to unmask]" <[log in to unmask]>, "Andrej Filipcic" <[log in to unmask]>, "Testbed Support for GridPP member institutes" <[log in to unmask]>, Rolf Seuster <[log in to unmask]>, "atlas-mgt-adc-five (Management of ATLAS Distributed Computing)" <[log in to unmask]>, Alessandro De Salvo <[log in to unmask]>
> Subject: Re: Memory on Linux / Atlas memory survey.
>
> Dear all,
>
> Since I have now been put in cc, I feel I should add some comments. Please find them below.
>
> On 03/04/12 07:45 PM, I Ueda wrote:
>> I put more relevant people in CC
>>
>>
>>>> It sounds like these Atlas Reco jobs have a peak footprint of around 3.5 GB. The _important_ question is whether sites will kill jobs like that. (Glasgow won't).
>>>>
>> Yes, that is one of the questions we asked.
>> We used the word "vmem" because Maui/Torque/PBS can be configured with a limit on vmem,
>> and that should be understandable to many people.
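>>
>> For illustration (a sketch only - the queue name "atlas" and the 4gb value
>> here are made up, and exact enforcement depends on the batch system version),
>> such a limit can be set in Torque as a queue default or requested per job:
>>
>>     qmgr -c 'set queue atlas resources_default.vmem = 4gb'
>>     qsub -l vmem=4gb job.sh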
>>
>>
>>>> The next important question is if those jobs will kill everything on the box. We, as site admins, consider this an important point.
>>>>
>> Right. ("everything" including themselves)
>>
>> Our observation is that those ATLAS reco jobs may map >3.5GB of
>> address space for the whole or large parts of the job, but most of it is not
>> accessed for most of the time, only at the beginning and the end;
>> i.e. the jobs should not drive worker nodes into heavy swapping.
>> This was explained at the GDB; see the original slides.
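>>
>> As a rough illustration of mapped-but-untouched address space (a sketch in C,
>> not ATLAS code): a process can map gigabytes that consume no physical RAM and
>> no swap until the pages are first written, so top reports a large VIRT while
>> RES stays tiny.
>>
>>     /* sketch: map ~3.5 GB of anonymous memory without touching it (64-bit) */
>>     #define _GNU_SOURCE
>>     #include <stdio.h>
>>     #include <unistd.h>
>>     #include <sys/mman.h>
>>
>>     int main(void)
>>     {
>>         size_t len = 3584UL * 1024 * 1024;          /* ~3.5 GB */
>>         void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
>>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>         if (p == MAP_FAILED) { perror("mmap"); return 1; }
>>         /* top now shows ~3.5g VIRT for this pid, but RES stays small:
>>            no physical page is allocated until a page is written. */
>>         pause();                                    /* keep the mapping alive */
>>         return 0;
>>     }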
>>
>> Then, if the worker nodes at Glasgow have no problem running those jobs,
>> the answer from Glasgow to the survey, according to your reply, would be no limit (or >= 4GB) for
>> the ATLAS share of the capacity (in HS06, and optionally the job slots if applicable).
>>
>>
> I would like to add to this that a) I did a test, which was reported in the talk at the GDB:
> on a machine with 24GB of RAM and also 24GB of swap, I ran N reconstruction jobs
> in parallel. The machine has 8 real cores, with hyperthreading on. The event throughput
> for the whole machine was stable from 16 jobs up to 24 jobs; only beyond that did swapping slow
> down the throughput. This was for real data, and should be redone with the new data and MC we
> will be using in 2012, but the summary won't change much: ATLAS reco software
> uses only parts of its memory during the event loop, and much of it won't be touched for
> most of the job, except during start and finish (what we call initialize and finalize - roughly
> estimated, maybe the first 10 and the last 5 minutes of the job).
>
> b) the processing of heavy-ion data at Tier0 and at T1 had a memory footprint not too far off what
> we will use in 2012. At T0 we didn't see an increase in job failures due to memory problems.
> Machines at T0 have 2GB of RAM per core. I don't know how much swap they have, probably about the
> same as the physical RAM.
>
>>>> If you _need_ us to have so much swap, as is being suggested, then this is entirely the wrong approach, and _will not work_.
>>>>
>> If this comment refers to the sentence in the GDB slide "SWAP should be set to 2*RAM size",
>> that sentence has been removed from the ATLAS requirements, as it was not well prepared by
>> the time of the last GDB.
>> If you mean that jobs using 2GB of resident memory while keeping 4GB of address space mapped
>> would cause a problem at your site, then your answer to the survey would be different from the above.
>>
>>
> I don't know how much swap is now requested, but personally I regard an 8-core machine with
> less swap than physical RAM as not optimally configured. Unfortunately, we lack experience with machines with
> many more cores, e.g. 12 real / 24 hyperthreaded cores. Possibly one could get away with less swap
> space there, but, as I said, we lack experience with that.
>
>
>
>>
>>>> The whole process reads very much as if someone has assumed that 'VMem' = 'Physical RAM used + Swap space used' - which is false.
>>>>
>> Someone in CC may comment about the swap space.
>>
>> regards, ueda
>>
>>
>>
>>
>> On 3 Apr 2012, at 13:13, Elena Korolkova wrote:
>>
>>
>>> Can Atlas comment on this, please?
>>>
>>> many thanks
>>> Elena
>>>
>>> On 3 Apr 2012, at 11:39, Stuart Purdie wrote:
>>>
>>>
>>>> There are a number of different types of memory that we can discuss.
>>>>
>>>>
>>>> There is:
>>>>
>>>> Physical memory used
>>>> Physical memory available
>>>> Virtual memory used
>>>> Virtual memory available
>>>> Address space used
>>>> Address space available
>>>> Swap space used
>>>> Swap space available.
>>>>
>>>> _All_ of these numbers are different. Some of them are functions of the node, and some of them are per process values. To ask about certain parts of these, without understanding how they relate to each other, is going to end up with numbers that don't make sense.
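>>>>
>>>> For concreteness (a sketch; field names vary slightly between kernel versions), the node-wide numbers can be read from /proc/meminfo or free, and the per-process ones from /proc/<pid>/status:
>>>>
>>>>     free -m                                     # node: physical + swap, used vs free
>>>>     grep -E 'Vm(Size|RSS)' /proc/<pid>/status   # process: mapped address space vs resident set
>>>>     ulimit -v                                   # address space limit for this shell's processes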
>>>>
>>>> The term 'VMem', _as measured by top_, is the 'Address space used', where 'used' means 'mapped', in the mmap/malloc sense.
>>>>
>>>> Note that 'Virtual Memory' != 'Swap space', as the kernel has more facilities for juggling memory than just swap space: file-backed pages are backed by their files, and never-touched anonymous mappings need no backing at all. In particular, 'Virtual Memory' > 'Swap space', for all practical workloads.
>>>>
>>>> It is useful to have the concept of a 'working set' of memory - how much the job has to keep in memory at one point in time. Note that it is very common for a job to have a working set smaller than the total mapped Address Space.
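>>>>
>>>> On reasonably recent kernels the working set of a running job can even be estimated directly (a rough sketch; needs the Referenced field in smaps, kernel 2.6.22 or later, and sufficient permissions on clear_refs):
>>>>
>>>>     echo 1 > /proc/<pid>/clear_refs   # clear the per-page "referenced" bits
>>>>     sleep 60                          # let the job run on
>>>>     awk '/^Referenced:/ {s += $2} END {print s " kB touched"}' /proc/<pid>/smaps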
>>>>
>>>> --
>>>>
>>>> It sounds like these Atlas Reco jobs have a peak footprint of around 3.5 GB. The _important_ question is whether sites will kill jobs like that. (Glasgow won't).
>>>>
>>>> The next important question is if those jobs will kill everything on the box. We, as site admins, consider this an important point.
>>>>
>>>> If Atlas _really_ expect to drive worker nodes into heavy swapping, then that's going to kill _everything_ on the worker node. Once swapping starts, everything gets a lot slower. This means that the walltime limits of jobs will be hit long before the jobs are anywhere near complete.
>>>>
>>>> If Atlas expect these reco jobs to spend a minute or so with a working set of 3GB, then this is extremely unlikely to cause problems, and probably won't swap, even though the job is going to be using more than the usual 2GB per core.
>>>>
>>>> If you _need_ us to have so much swap, as is being suggested, then this is entirely the wrong approach, and _will not work_.
>>>>
>>>> --
>>>>
>>>> The whole process reads very much as if someone has assumed that 'VMem' = 'Physical RAM used + Swap space used' - which is false.
>>>>
>>>> This is not just a technical point (although it is frustrating to be asked questions that clearly demonstrate the askers don't understand what they are asking for) - it is that if we _need_ that much swap, then without special handling of those jobs they will kill everything on the worker node. We don't want that, hence having to drive into the midst of the issue in order to find out what is actually going to happen.
>>>>
>>> __________________________________________________
>>> Dr Elena Korolkova
>>> Email: [log in to unmask]
>>> Tel.: +44 (0)114 2223553
>>> Fax: +44 (0)114 2223555
>>> Department of Physics and Astronomy
>>> University of Sheffield
>>> Sheffield, S3 7RH, United Kingdom
>>
>
__________________________________________________
Dr Elena Korolkova
Email: [log in to unmask]
Tel.: +44 (0)114 2223553
Fax: +44 (0)114 2223555
Department of Physics and Astronomy
University of Sheffield
Sheffield, S3 7RH, United Kingdom