JISCMail - CCP4BB Archives

Dear Veronica

The answer depends to a large extent on your own, or your group's, mode of working. Specifically, are you interested in the fastest turn-around of individual jobs, or are you interested in the highest throughput of many jobs running simultaneously, i.e. draining the batch queues as quickly as possible? The latter of course assumes that you and/or your group will normally be submitting a sufficient number of jobs to keep the batch queues filled.

To take an example, let's say you have a 4-node cluster with 8 CPUs per node. The fastest turn-around for a single job, assuming of course that the parallelised code is available, is probably to run it in parallel over as many CPUs as possible, at least on a single node. However, as Peter pointed out, the speed-up is unlikely to be linear for multi-threaded or MPI jobs, so for example you may see only a 4 times speed-up when run on an 8-CPU node, and using MPI over additional nodes may or may not improve on that, depending on the application. All the same, any speed-up is better than no speed-up if all you want is for individual jobs to finish in the shortest possible time.

Now let's say you submit 32 independent jobs to the batch queue. Assume for simplicity that they all take around the same time T when run individually on a single CPU: what's the best way to run them for the fastest throughput?

If you disable parallelisation and just run 1 job at a time per node, the first 4 jobs will finish at about time T, then the 2nd set of 4 jobs in a further time T, up to the 8th set of 4 jobs, so 8T in all. That's because you used only 1/8th of the available computing power (7 out of 8 CPUs per node were never used).

Now say you turn on parallelisation and again run 4 jobs at a time, one per node, with each job multi-threaded over 8 CPUs. Each job is unlikely to finish in time T/8 for the reason above: let's say it takes T/4. If your batch-queue manager is set up like mine you have to request a fixed number of CPUs to be allocated to the job and also you must tell the program that that is the maximum number of CPUs it can use. Those CPUs are allocated to you and won't be assigned to another job for the entire duration of your job. Of course you could dispense with the batch queues and just run everything in background, but a free-for-all is unlikely to be the most efficient or fairest way of working. So in this case the first 4 jobs finish around T/4, the 2nd set of 4 another T/4, up to the 8th set of 4, so 2T altogether (still a factor of 2 slower than the ideal, because the CPUs allocated to each job were in effect only half used).
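As a rough check on the arithmetic so far, here is a minimal Python sketch. The function name `makespan` and the idea of counting equal-sized "waves" of jobs are illustrative assumptions, not part of any batch-queue manager; the 4 times speed-up on 8 CPUs is just the example figure used above.

```python
import math

def makespan(n_jobs, jobs_at_once, time_per_job):
    """Wall-clock time to drain the queue when jobs run in equal-sized waves."""
    waves = math.ceil(n_jobs / jobs_at_once)
    return waves * time_per_job

T = 1.0  # run time of one job on a single CPU, in arbitrary units

# One single-threaded job per node: 4 jobs at a time, each taking T.
serial_per_node = makespan(32, 4, T)

# One 8-thread job per node: 4 jobs at a time, each taking T/4
# (assumed 4 times speed-up on 8 CPUs, as in the example above).
threaded_per_node = makespan(32, 4, T / 4)

print(serial_per_node, threaded_per_node)  # 8.0 2.0, i.e. 8T and 2T
```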

Now say you turn parallelisation back off and simply run all 32 jobs simultaneously, 8 per node across the 4 nodes. Each job is independent of the others so they will all finish at around time T. The main reason for non-linearity in a multi-threaded job is that the threads usually have to synchronise at certain points in the code, and anyway it may not be possible to run multi-threaded for some portions, so some threads are forced to wait for others to catch up, and this waiting wastes CPU power. Independent jobs don't have to wait for synchronisation (I'm assuming that the memory bandwidth is sufficient so that contention for shared memory is not significant and there's sufficient RAM per node to run one job on each CPU).
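One simplified way to picture the non-linearity is Amdahl's law, which is not mentioned in the discussion above and models only the portion of the code that cannot run multi-threaded (it ignores synchronisation waits, so treat it as an optimistic sketch). Under that assumption, the example figure of a 4 times speed-up on 8 CPUs corresponds to about one seventh of the single-CPU run time being serial. The function names here are hypothetical.

```python
def amdahl_speedup(serial_fraction, n_cpus):
    """Amdahl's law: speed-up when only (1 - serial_fraction) of the work scales."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

def serial_fraction_for(speedup, n_cpus):
    """Invert Amdahl's law: 1/S = f + (1 - f)/n  =>  f = (n/S - 1)/(n - 1)."""
    return (n_cpus / speedup - 1.0) / (n_cpus - 1.0)

# Serial fraction implied by the example: 4 times speed-up on 8 CPUs.
f = serial_fraction_for(4.0, 8)
print(f"implied serial fraction: {f:.3f}")

# With that fraction, adding CPUs gives diminishing returns.
for n in (1, 2, 4, 8, 16, 32):
    print(n, round(amdahl_speedup(f, n), 2))
```

In this model the speed-up can never exceed 1/f, however many CPUs are added, which is one way to see why spreading a single job over more nodes may not help much.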

The fastest throughput in the case of frequently-full job queues, if RAM is not an issue, is therefore obtained by disabling within-job parallelisation and simply running multiple jobs simultaneously over all available CPUs, with exactly one job per CPU (running more than one job per CPU is likely to be penalised by frequent context-switching in the OS). In that situation it's irrelevant whether a particular code is parallelised since you're better off without it! Indeed in that situation use of parallelisation in batch jobs could be regarded as anti-social since the batch-queue manager may have allocated 8 CPUs to your job but you are only effectively using 4 of them, so you may be preventing other jobs from making better use of the other 4. Of course the cluster may not be fully used all the time so in that case you may benefit from using the spare capacity by enabling multi-threading.
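The three strategies can be compared by how much of the cluster's capacity does useful work. This is a minimal sketch using the example figures from the discussion (4 nodes of 8 CPUs, 32 jobs, total times of 8T, 2T and T); the `utilisation` helper and the strategy labels are assumptions made for illustration.

```python
TOTAL_CPUS = 4 * 8   # 4 nodes with 8 CPUs each
N_JOBS = 32
T = 1.0              # single-CPU run time of one job, arbitrary units

def utilisation(total_time):
    """Fraction of available CPU-time spent on useful work."""
    useful_work = N_JOBS * T            # CPU-time the jobs need single-threaded
    available = TOTAL_CPUS * total_time  # CPU-time the cluster offered meanwhile
    return useful_work / available

# Total times taken from the worked example in the text.
strategies = {
    "1 single-threaded job per node": 8 * T,
    "1 eight-thread job per node": 2 * T,
    "1 single-threaded job per CPU": 1 * T,
}

results = {name: utilisation(t) for name, t in strategies.items()}
for name, u in results.items():
    print(f"{name}: total time {strategies[name]:.0f}T, utilisation {u:.1%}")
```

The figures of 12.5%, 50% and 100% match the "1/8th" and "half" estimates given for the first two cases, under the same assumptions that RAM and memory bandwidth are not limiting.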

Cheers

-- Ian