Hi Claudiu,
The problem is not in mpi-start or Open MPI. The output and error files
show that you have 4 slots available for running the job and that they
are used correctly. The problem happens before that; I'm still not sure
what the cause is, but it looks like a submit_filter problem.
Can you try this on the CE (changing the submit_filter path if necessary)?
echo "#PBS -l nodes=64" | perl /var/spool/pbs/submit_filter.pl
The output should look like this:
#PBS -l nodes=4:ppn=16
If not, then there is a problem there, probably because pbsnodes -a
does not show the expected output.
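
For reference, the rewrite the filter is expected to perform can be
sketched in a few lines of Perl. This is only a minimal sketch, not the
actual submit_filter.pl: the real filter is more involved and derives
the per-node slot count from the WN configuration, whereas here a ppn
of 16 is simply assumed.

  #!/usr/bin/perl
  # Minimal sketch of a Torque submit filter: read the job script on
  # stdin, rewrite a flat "nodes=N" request into "nodes=X:ppn=Y", and
  # pass every other line through unchanged.
  use strict;
  use warnings;

  my $ppn = 16;  # assumed slots per WN (the np value in pbsnodes -a)

  while (my $line = <STDIN>) {
      if ($line =~ /^#PBS\s+-l\s+nodes=(\d+)\s*$/) {
          my $nodes = int(($1 + $ppn - 1) / $ppn);  # round up to whole WNs
          print "#PBS -l nodes=$nodes:ppn=$ppn\n";
      }
      else {
          print $line;
      }
  }

With $ppn = 16, piping "#PBS -l nodes=64" through this gives
"#PBS -l nodes=4:ppn=16". If your filter does not produce that, check
that pbsnodes -a reports the expected np value for every WN.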
Regards,
Enol.
On 11/26/2010 09:21 AM, Claudiu Demian wrote:
> Hi,
>
> So, I've set the correct time and timezone on all the nodes; they have
> all synchronized their clocks via NTP, and I ran the job again.
>
> The output files are here:
>
> http://ui01.mosigrid.utcluj.ro/~demi/demi_JMslU4dk9WUce0NHAJV5GA/mpi-start.err
> http://ui01.mosigrid.utcluj.ro/~demi/demi_JMslU4dk9WUce0NHAJV5GA/mpi-start.out
>
>
> This is the showq output on the CE during the run:
>
> # showq
> ACTIVE JOBS--------------------
> JOBNAME   USERNAME  STATE    PROC  REMAINING            STARTTIME
>
> 9508      see145    Running     1  2:04:32:56   Thu Nov 25 14:42:39
> 9225      ops164    Running     1  2:23:39:57   Fri Nov 26 09:49:40
> 9261      ops164    Running     1  2:23:39:57   Fri Nov 26 09:49:40
> 9581      ops053    Running    64  2:23:59:20   Fri Nov 26 10:09:03
>
> 4 Active Jobs    67 of 608 Processors Active (11.02%)
>                   5 of 38 Nodes Active (13.16%)
>
> IDLE JOBS----------------------
> JOBNAME   USERNAME  STATE    PROC  WCLIMIT   QUEUETIME
>
>
> 0 Idle Jobs
>
> BLOCKED JOBS----------------
> JOBNAME   USERNAME  STATE    PROC  WCLIMIT   QUEUETIME
>
>
> Total Jobs: 4 Active Jobs: 4 Idle Jobs: 0 Blocked Jobs: 0
>
>
> This is the qstat -f output for the job:
>
> # qstat -f 9581
> Job Id: 9581.ce01.mosigrid.utcluj.ro
> Job_Name = cream_262902103
> Job_Owner = [log in to unmask]
> job_state = R
> queue = ops
> server = ce01.mosigrid.utcluj.ro
> Checkpoint = u
> ctime = Fri Nov 26 10:09:02 2010
> Error_Path = ce01.mosigrid.utcluj.ro:/dev/null
> exec_host = wn60.mosigrid.utcluj.ro/3+wn60.mosigrid.utcluj.ro/2+
>     wn60.mosigrid.utcluj.ro/1+wn60.mosigrid.utcluj.ro/0
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = n
> mtime = Fri Nov 26 10:09:03 2010
> Output_Path = ce01.mosigrid.utcluj.ro:/dev/null
> Priority = 0
> qtime = Fri Nov 26 10:09:02 2010
> Rerunable = True
> Resource_List.cput = 48:00:00
> Resource_List.neednodes = 64
> Resource_List.nodect = 64
> Resource_List.nodes = 64
> Resource_List.walltime = 72:00:00
> session_id = 14635
> Shell_Path_List = /bin/bash
> stagein = [log in to unmask]:/opt/glite/var/cream_sandbox/ops/_DC_RO_DC_RomanianGRID_O_UTCluj_CN_Claudiu_Demian_ops_Role_NULL_Capability_NULL_ops053/26/CREAM262902103/CREAM262902103_jobWrapper.sh,
>     [log in to unmask]:/opt/glite/var/cream_sandbox/ops/_DC_RO_DC_RomanianGRID_O_UTCluj_CN_Claudiu_Demian_ops_Role_NULL_Capability_NULL_ops053/proxy/12907589352E845776wms2Eipb2Eac2Ers11487935922563
> stageout = [log in to unmask]:/opt/glite/var/cream_sandbox/ops/_DC_RO_DC_RomanianGRID_O_UTCluj_CN_Claudiu_Demian_ops_Role_NULL_Capability_NULL_ops053/26/CREAM262902103/StandardOutput,
>     [log in to unmask]:/opt/glite/var/cream_sandbox/ops/_DC_RO_DC_RomanianGRID_O_UTCluj_CN_Claudiu_Demian_ops_Role_NULL_Capability_NULL_ops053/26/CREAM262902103/StandardError
> substate = 42
> Variable_List = PBS_O_HOME=/home/ops053,PBS_O_LANG=en_US.UTF-8,
> PBS_O_LOGNAME=ops053,
> PBS_O_PATH=/opt/edg/bin:/opt/glite/bin:/opt/globus/bin:/opt/lcg/bin:
>     /usr/local/bin:/bin:/usr/bin:/home/ops053/bin,
> PBS_O_MAIL=/var/spool/mail/ops053,PBS_O_SHELL=/bin/sh,
> PBS_SERVER=ce01.mosigrid.utcluj.ro,
> PBS_O_HOST=ce01.mosigrid.utcluj.ro,PBS_O_WORKDIR=/opt/glite/var/tmp,
> PBS_O_QUEUE=ops
> euser = ops053
> egroup = ops
> hashname = 9581.ce01.mosigrid.utcluj.ro
> queue_rank = 951
> queue_type = E
> etime = Fri Nov 26 10:09:02 2010
> submit_args = /tmp/cream_262902103
> start_time = Fri Nov 26 10:09:03 2010
> start_count = 1
>
>
> I am waiting for the changes to propagate through the whole cluster;
> then I will reconfigure all the WNs (just to be on the safe side) and
> redo the test. If the results change, I will post them here.
>
> Cheers,
> Claudiu