On Wed, Mar 15, 2017 at 10:09:13PM +0100, Bjoern Forsberg wrote:
> Any version after 2.0.4 should not display this stall. If you are
> sure you are running at least 2.0.4 and are observing that
>
> - relion stops at the end of an expectation step without
> reporting an error
> - one GPU is running at 100%, but all others are not doing anything
> - one CPU thread per MPI is running at 100%
>
> then please check the issue linked below and post your output there.
> If you do not see all the above symptoms, it is something else, and I'd
> be grateful if you create a new issue on the 3dem/relion GitHub page,
> so that we can sort it out.
>
I am running a git build from yesterday (commit d401f24).
Run with 1+8 MPI processes (mpirun -n 9) on a single GPU.
Slurm script:
[tru@visu1 slurm.d]$ more 9cpus-1gpu-run-nih-c057-201703015-d401f24-openmpi-1.10.6-libltdl-CC-61.sh
#!/bin/sh
#SBATCH -N 1
#SBATCH -n 9
#SBATCH --exclusive
#SBATCH --gres=gpu:4 --constraint=CC61
export OMP_NUM_THREADS=1
export TMPDIR=/local-storage/tru/relion2/tmpdir
module purge
module load slurm/16.05.9-hdf5-1.8-hwloc-1.8-munge-0.5.11
module load relion/git/201703015-d401f24-openmpi-1.10.6-libltdl-CC-61
export OMPI_MCA_btl="self,sm,tcp"
CPUS=9
cd /local-storage/tru/relion_benchmark || exit 1
#for GPUS in "0" "1" "2" "3"
for GPUS in "0" "1" "2" "3"
do
dropcache.sh
NAME=relion-git-201703015-d401f24-openmpi-1.10.6-libltdl-CC-61-cpu${CPUS}gpus$(echo ${GPUS}|tr ':' '.')-OMP_NUM_THREADS1
mkdir -p relion/release/${NAME}
(time mpirun -n ${CPUS} `which relion_refine_mpi` --o relion/release/${NAME} \
--i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 \
--dont_combine_weights_via_disc --scratch_dir ${TMPDIR} --pool 100 --ctf \
--ctf_corrected_ref --iter 25 --tau2_fudge 4 --particle_diameter 360 --K 6 --flatten_solvent \
--zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 \
--norm --scale --j 2 --random_seed 0 \
--gpu ${GPUS} ) > ${NAME}.out-`date +'%Y%m%d-%H%M'` 2>&1
done
Command line as launched for the first pass of the loop (GPUS=0):
mpirun -n 9 /c6/shared/relion/git-d401f24-openmpi-1.10.6-libltdl-CC-61/bin/relion_refine_mpi --o relion/release/relion-git-201703015-d401f24-openmpi-1.10.6-libltdl-CC-61-cpu9gpus0-OMP_NUM_THREADS1 --i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --dont_combine_weights_via_disc --scratch_dir /local-storage/tru/relion2/tmpdir --pool 100 --ctf --ctf_corrected_ref --iter 25 --tau2_fudge 4 --particle_diameter 360 --K 6 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --j 2 --random_seed 0 --gpu 0
output file attached: relion-git-201703015-d401f24-openmpi-1.10.6-libltdl-CC-61-cpu9gpus0-OMP_NUM_THREADS1.out-20170316-0907
[tru@c057 relion_benchmark]$ nvidia-smi
Thu Mar 16 10:26:04 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN X (Pascal)    Off  | 0000:02:00.0     Off |                  N/A |
| 23%   34C    P2    79W / 250W |   9542MiB / 12189MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 0000:03:00.0     Off |                  N/A |
| 23%   17C    P8     9W / 250W |      2MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN X (Pascal)    Off  | 0000:81:00.0     Off |                  N/A |
| 23%   19C    P8     8W / 250W |      2MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN X (Pascal)    Off  | 0000:82:00.0     Off |                  N/A |
| 23%   15C    P8     8W / 250W |      2MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0       997    C   ....10.6-libltdl-CC-61/bin/relion_refine_mpi  1187MiB |
|    0       998    C   ....10.6-libltdl-CC-61/bin/relion_refine_mpi  1223MiB |
|    0       999    C   ....10.6-libltdl-CC-61/bin/relion_refine_mpi  1187MiB |
|    0      1000    C   ....10.6-libltdl-CC-61/bin/relion_refine_mpi  1191MiB |
|    0      1001    C   ....10.6-libltdl-CC-61/bin/relion_refine_mpi  1189MiB |
|    0      1002    C   ....10.6-libltdl-CC-61/bin/relion_refine_mpi  1187MiB |
|    0      1003    C   ....10.6-libltdl-CC-61/bin/relion_refine_mpi  1187MiB |
|    0      1004    C   ....10.6-libltdl-CC-61/bin/relion_refine_mpi  1187MiB |
+-----------------------------------------------------------------------------+
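Side note for checking the GPU symptom from the list quoted above: a minimal sketch, assuming the CSV output of `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits`. The helper name gpu_busy_summary and the 90% threshold are illustrative assumptions, not part of relion or nvidia-smi.

```shell
#!/bin/sh
# Hypothetical helper: summarise GPU load from CSV lines "index, utilization"
# read on stdin. The 90% threshold is an arbitrary assumption.
gpu_busy_summary() {
    awk -F', *' -v thr=90 '
        NF >= 2 {
            # $2 is the utilization in percent (no unit with "nounits")
            if ($2 + 0 >= thr) busy++; else idle++
        }
        END { printf "busy=%d idle=%d\n", busy, idle }
    '
}

# Typical use on a GPU node:
#   nvidia-smi --query-gpu=index,utilization.gpu \
#              --format=csv,noheader,nounits | gpu_busy_summary
```

In the run above only GPU 0 was requested (--gpu 0), so one busy and three idle GPUs is expected here and does not by itself indicate the stall.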
[tru@c057 relion_benchmark]$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SOC     SOC     0-7
GPU1    PHB      X      SOC     SOC     0-7
GPU2    SOC     SOC      X      PHB     8-15
GPU3    SOC     SOC     PHB      X      8-15
Legend:
X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
top reports:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P TIME COMMAND
999 tru 20 0 40.8g 2.9g 110m S 100.0 1.2 83:34.96 13 83:34 /c6/shared/relion/git-d401f24-openmpi-1.10.6-libltdl-CC-61/bin/relion_refine_mpi --o relion/release/reli
1003 tru 20 0 40.8g 2.9g 110m S 99.0 1.2 83:34.96 15 83:34 /c6/shared/relion/git-d401f24-openmpi-1.10.6-libltdl-CC-61/bin/relion_refine_mpi --o relion/release/relio
1001 tru 20 0 40.8g 2.9g 110m S 99.0 1.2 83:32.09 10 83:32 /c6/shared/relion/git-d401f24-openmpi-1.10.6-libltdl-CC-61/bin/relion_refine_mpi --o relion/release/relio
997 tru 20 0 40.9g 2.9g 111m S 100.0 1.1 82:45.63 9 82:45 /c6/shared/relion/git-d401f24-openmpi-1.10.6-libltdl-CC-61/bin/relion_refine_mpi --o relion/release/reli
998 tru 20 0 40.9g 2.9g 110m S 100.0 1.1 82:21.05 1 82:21 /c6/shared/relion/git-d401f24-openmpi-1.10.6-libltdl-CC-61/bin/relion_refine_mpi --o relion/release/reli
1004 tru 20 0 40.8g 2.9g 110m S 100.0 1.2 82:18.99 0 82:18 /c6/shared/relion/git-d401f24-openmpi-1.10.6-libltdl-CC-61/bin/relion_refine_mpi --o relion/release/reli
1000 tru 20 0 41.0g 2.9g 110m S 100.0 1.2 82:18.11 3 82:18 /c6/shared/relion/git-d401f24-openmpi-1.10.6-libltdl-CC-61/bin/relion_refine_mpi --o relion/release/reli
1002 tru 20 0 40.9g 3.0g 110m S 100.0 1.2 82:17.71 2 82:17 /c6/shared/relion/git-d401f24-openmpi-1.10.6-libltdl-CC-61/bin/relion_refine_mpi --o relion/release/reli
996 tru 20 0 2979m 2.4g 10m R 100.0 1.0 76:50.42 6 76:50 /c6/shared/relion/git-d401f24-openmpi-1.10.6-libltdl-CC-61/bin/relion_refine_mpi --o relion/release/reli
1784 tru 20 0 27860 1816 1100 R 1.0 0.0 0:00.38 5 0:00 top
994 tru 20 0 149m 3556 2368 S 0.0 0.0 0:00.17 1 0:00 mpirun -n 9 /c6/shared/relion/git-d401f24-openmpi-1.10.6-libltdl-CC-61/bin/relion_refine_mpi --o relion/r
931 tru 20 0 111m 2052 1060 S 0.0 0.0 0:00.13 0 0:00 sshd: tru@pts/0
932 tru 20 0 11576 1964 1436 S 0.0 0.0 0:00.05 9 0:00 -bash
977 tru 20 0 9208 1352 1116 S 0.0 0.0 0:00.00 1 0:00 /bin/sh /var/run/slurm/slurmd.state/job15485628/slurm_script
991 tru 20 0 9208 728 488 S 0.0 0.0 0:00.00 0 0:00 /bin/sh /var/run/slurm/slurmd.state/job15485628/slurm_script
strace of the MPI process with the lowest accumulated CPU time (PID 996):
[tru@c057 relion_benchmark]$ strace -f -p 996
Process 996 attached with 2 threads
[pid 1012] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 996] poll([{fd=5, events=POLLIN}, {fd=11, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 996] poll([{fd=5, events=POLLIN}, {fd=11, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 996] poll([{fd=5, events=POLLIN}, {fd=11, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 996] poll([{fd=5, events=POLLIN}, {fd=11, events=POLLIN}], 2, 0) = 0 (Timeout)
[pid 996] poll([{fd=5, events=POLLIN}, {fd=11, events=POLLIN}], 2, 0) = 0 (Timeout)
...forever...
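In the trace above, PID 996 calls poll() with a timeout of 0 and gets 0 back each time, so it is spinning rather than sleeping, which matches the 100% CPU shown by top. A minimal sketch for counting such calls in a saved trace, assuming strace's default text format as shown above; the helper name count_busy_polls is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count poll() calls that used a zero timeout and
# returned 0 ("Timeout"), reading strace text output on stdin.
count_busy_polls() {
    awk '/poll\(.*, 0\) += 0 \(Timeout\)/ { n++ } END { print n + 0 }'
}

# Typical use (stop strace with Ctrl-C after a few seconds):
#   strace -f -p 996 -o trace.txt
#   count_busy_polls < trace.txt
```

A large count collected over a few seconds would be consistent with a rank busy-waiting on an event that never arrives; it does not say which event that is.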
Cheers
Tru
--
Dr Tru Huynh | http://www.pasteur.fr/research/bis
tel/fax +33 1 45 68 87 37/19
Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France