Hi,
We had this problem some time ago on front-end user PCs.
Since moving to CentOS 7 it has been fine.
The kernel-nvidia-* display driver might be the problem.
We downloaded the NVIDIA display driver from the NVIDIA website, then rebuilt and installed it by hand, including a reboot of the PC.
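Before rebuilding, it can help to confirm that a node really has a kernel module and user-space library of different versions. The following is only a rough sketch in Python, not something we ran at the time. It assumes the loaded module reports itself in /proc/driver/nvidia/version with a "Kernel Module  <version>" line, and that the user-space libraries are installed as /usr/lib64/libnvidia-ml.so.<version>; both the path and the function names are assumptions and may differ on your nodes.

```python
import glob
import re
from typing import Iterable, Optional, Set

# Assumed locations; adjust for your distribution and install method.
PROC_VERSION_FILE = "/proc/driver/nvidia/version"
LIBRARY_GLOB = "/usr/lib64/libnvidia-ml.so.*"


def kernel_module_version(proc_text: str) -> Optional[str]:
    """Extract the version of the loaded kernel module from the proc file text."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def library_versions(paths: Iterable[str]) -> Set[str]:
    """Collect versions from file names such as libnvidia-ml.so.<version>.

    Symlinks like libnvidia-ml.so.1 carry no dotted version and are skipped.
    """
    versions = set()
    for path in paths:
        match = re.search(r"libnvidia-ml\.so\.(\d+(?:\.\d+)+)$", path)
        if match:
            versions.add(match.group(1))
    return versions


def is_mismatch(kernel: Optional[str], libs: Set[str]) -> bool:
    """True when both sides are known and the libraries are not exactly the kernel version."""
    return kernel is not None and bool(libs) and libs != {kernel}


if __name__ == "__main__":
    try:
        with open(PROC_VERSION_FILE) as handle:
            kernel = kernel_module_version(handle.read())
    except FileNotFoundError:
        kernel = None  # module not loaded, or a different proc layout
    libs = library_versions(glob.glob(LIBRARY_GLOB))
    print("kernel module:", kernel)
    print("libraries:", sorted(libs))
    print("mismatch:", is_mismatch(kernel, libs))
```

If this reports a mismatch, the running module and the installed libraries come from different driver versions, which would fit the kernel-nvidia-* suspicion above.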
Cheers,
Wolfgang
----- Original Message -----
From: "Yehuda Goldgur" <[log in to unmask]>
To: "Mailinglist CCPEM" <[log in to unmask]>
Sent: Friday, 19 July, 2019 17:48:05
Subject: [ccpem] NVIDIA driver/library mismatch
Hello,
I am running RELION 3.0.7 on a CentOS 7 cluster of nodes with NVIDIA 1080 GPUs. The CUDA version is 10.1, with OpenMPI 3.1.0 and the Slurm scheduler. If a job crashes, one or more nodes are shown as down in sinfo and report an NVIDIA driver/library mismatch. Reinstalling CUDA and then restarting slurmd brings them back to life, until the next instance. Please advise...
Thank you,
Yehuda
########################################################################
To unsubscribe from the CCPEM list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=CCPEM&A=1