Hi Alessandra,
The problem we have may be site-specific. In general, the multi-core ATLAS jobs we see use all the CPU they request, and unlike Andrew with CMS below I don’t think we have any issue with CPU utilisation. What does happen is that we often have gaps that are too small to fit an 8-core 16GB payload but bigger than 1 core (for example, we have some 24-core nodes that can take 2x 8-core and 4x 1-core jobs based on production memory requirements).
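A minimal sketch of that packing arithmetic, for illustration only. The 8-core 16GB payload is as above; the 40GB of node memory and 2GB per single-core job are assumed figures, chosen so that the result matches the 2x 8-core plus 4x 1-core case, and are not measured values from our nodes.

```python
# (cores, memory in GB) per job type
MULTICORE = (8, 16)   # 8-core 16GB payload, as described above
SINGLE = (1, 2)       # assumption: 2GB per single-core job

def pack_node(cores, mem_gb):
    """Fill a node with 8-core jobs first, then single-core jobs.

    Returns (multicore jobs, single-core jobs, idle cores).
    """
    n_multi = min(cores // MULTICORE[0], mem_gb // MULTICORE[1])
    cores_left = cores - n_multi * MULTICORE[0]
    mem_left = mem_gb - n_multi * MULTICORE[1]
    n_single = min(cores_left // SINGLE[0], mem_left // SINGLE[1])
    idle = cores_left - n_single * SINGLE[0]
    return n_multi, n_single, idle

# Hypothetical 24-core node with 40GB of memory
print(pack_node(24, 40))  # (2, 4, 4)
```

With these assumed numbers memory runs out before cores do, so 4 of the 24 cores stay idle even when both job types are available.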
For instance, looking at Panda at the moment, we have 654 single-core and 289 multicore jobs running (approx 3K cores for production work); combined with the Analysis jobs, we're pretty full. The issue is that we have 589 activated multicore and 0 activated single-core jobs, so when the single-core jobs finish we’ll have idle CPUs because we can’t fit multicore jobs into the slots that are left. From your previous email I had believed the production mix should be 80% multicore and 20% single core, but this often doesn’t appear to be the case.
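A rough sketch of why those freed slots go idle, for illustration only. The per-node free-core counts are made up, not taken from our batch system, and memory is ignored to keep it short.

```python
def stranded_cores(free_per_node, job_cores=8):
    """Count how many job_cores-sized jobs fit into the free cores on
    each node, and how many free cores are left that no such job can use.

    Returns (jobs that fit, stranded cores).
    """
    jobs = sum(free // job_cores for free in free_per_node)
    stranded = sum(free % job_cores for free in free_per_node)
    return jobs, stranded

# Hypothetical free cores on four nodes after single-core jobs finish
free = [4, 6, 8, 3]
print(stranded_cores(free))  # (1, 13)
```

In this example 21 cores are free in total, but only one 8-core job can start; the other 13 cores sit idle unless single-core work is activated.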
In general, the concern for us is that CPU previously used by ATLAS is no longer being used because of multicore size mismatches (this may mean we can use it for other VOs, but ATLAS will get less CPU overall).
Thanks,
Gareth
> On 14 Aug 2015, at 09:29, Andrew Lahiff <[log in to unmask]> wrote:
>
> Hi Alessandra,
>
> For CMS the problem we see at RAL is that CMS frequently doesn't make use of the multi-core slots that our batch system has allocated to them - there are regular periods of many hours when most of the CMS cores are almost totally idle, which can be 1000-2000 cores. Note that since CMS generally just runs single-core jobs within multi-core pilots we don't have lots of empty pilots - the problem is that each pilot doesn't have enough single-core jobs running within it (e.g. we could have 1-2 jobs running per 8-core slot).
>
> Ideally what we should do is to detect allocated but under-utilised resources, and allow a special class of jobs to run on these, e.g. ATLAS event service jobs. If/when the real owner of the resources starts using these resources, the oversubscribed jobs are preempted.
>
> Regards,
> Andrew.
>
> ________________________________________
> From: Testbed Support for GridPP member institutes [[log in to unmask]] on behalf of Alessandra Forti [[log in to unmask]]
> Sent: Friday, August 14, 2015 7:35 AM
> To: [log in to unmask]
> Subject: Multicore inefficiencies
>
> Hi,
>
> at the Ops meeting there were some worries about inefficiencies due to
> multicore. Could sites that are seeing this let me know what
> their problems are?
>
> cheers
> alessandra
>
> --
> Respect is a rational process. \\//