Hey all,
We've noticed on our InfiniBand cluster that occasionally (about one run in four) RELION fails to either mkdir a relion_volatile directory on the local scratch disk or chmod that directory:
ERROR: cannot execute: mkdir -m 0777 -p /lscratch/26309293/relion_volatile/
File: /usr/local/apps/RELION2.0/git/relion-2.0.1/src/exp_model.cpp line: 627
ERROR: cannot execute: chmod 0777 /lscratch/26316320/relion_volatile/Class2D_92.quick_ibfdr_run_lock223
File: /usr/local/apps/RELION2.0/git/relion-2.0.1/src/exp_model.cpp line: 635
If we either run on Ethernet nodes or restrict MPI to the TCP transport on the InfiniBand nodes (mpirun -mca btl self,sm,tcp), the errors never occur.
Has anyone seen issues with system calls over InfiniBand?
David Hoover
HPC @ NIH