For CMS, the problem we see at RAL is that CMS frequently doesn't make use of the multi-core slots that our batch system has allocated to it - there are regular periods of many hours when most of the CMS cores, which can be 1000-2000 cores, are almost totally idle. Note that since CMS generally just runs single-core jobs within multi-core pilots, we don't have lots of empty pilots - the problem is that each pilot doesn't have enough single-core jobs running within it (e.g. we could have 1-2 jobs running per 8-core slot).
Ideally what we should do is detect allocated but under-utilised resources and allow a special class of jobs to run on them, e.g. ATLAS event service jobs. If/when the real owner of the resources starts using them again, the oversubscribing jobs would be preempted.
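
To make the idea concrete, here is a rough sketch in Python of the bookkeeping involved. It is only an illustration under assumptions, not something we run: the Pilot record, the 50% utilisation threshold and the function names are all hypothetical, and it does not use any real batch system API.

```python
from dataclasses import dataclass


@dataclass
class Pilot:
    """Hypothetical record for one multi-core pilot."""
    slot_cores: int  # cores the batch system allocated to the pilot
    busy_cores: int  # cores actually running single-core jobs


def backfill_capacity(pilots, threshold=0.5):
    """Count idle cores in pilots whose utilisation is below the threshold.

    The threshold is an assumed tuning knob, not a value from any site.
    """
    total = 0
    for p in pilots:
        if p.slot_cores > 0 and p.busy_cores / p.slot_cores < threshold:
            total += p.slot_cores - p.busy_cores
    return total


def cores_to_preempt(pilot, backfill_cores, owner_demand):
    """Backfill cores to free so the owner can start owner_demand more jobs."""
    free = pilot.slot_cores - pilot.busy_cores - backfill_cores
    shortfall = owner_demand - free
    # Never preempt more than is currently backfilled, never a negative amount.
    return min(max(shortfall, 0), backfill_cores)
```

With 8-core slots running 1 and 2 jobs respectively, backfill_capacity reports 13 cores that could go to preemptible work; if the owner of the second slot then wants 3 more cores while 6 are backfilled, cores_to_preempt returns 3.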
From: Testbed Support for GridPP member institutes [[log in to unmask]] on behalf of Alessandra Forti [[log in to unmask]]
Sent: Friday, August 14, 2015 7:35 AM
To: [log in to unmask]
Subject: Multicore inefficiencies
at the Ops meeting there were some worries about inefficiencies due to
multicore. Could sites that are seeing this let me know what
their problems are?