On Mon, Sep 03, 2007 at 10:16:27PM +0200, Torsten Harenberg wrote:
> Dear all,
>
> this goes out to the Maui experts (after I failed to find a solution
> with Google & co.).
>
> We're running a small cluster of 25 machines with two dual-core CPUs
> each, so 100 CPU cores in total.
>
> The CE is running a standard set of torque packages
>
> [root@grid-ce root]# rpm -qa | grep torque
> torque-2.1.6-1cri_sl3_2st
> torque-docs-2.1.6-1cri_sl3_2st
> torque-mom-2.1.6-1cri_sl3_2st
> torque-client-2.1.6-1cri_sl3_2st
> torque-server-2.1.6-1cri_sl3_2st
> torque-devel-2.1.6-1cri_sl3_2st
> [root@grid-ce root]# rpm -qa | grep maui
> maui-server-3.2.6p17-1_sl3
> maui-client-3.2.6p17-1_sl3
> maui-3.2.6p17-1_sl3
> [root@grid-ce root]#
>
> However, we cannot get Maui to use all of them:
>
> [root@grid-ce root]# qstat -q
>
> server: grid-ce.physik.uni-wuppertal.de
>
> Queue Memory CPU Time Walltime Node Run Que Lm State
> ---------------- ------ -------- -------- ---- --- --- -- -----
> large -- 48:00:00 72:00:00 -- 0 0 6 E R
> medium -- 08:00:00 12:00:00 -- 0 0 6 E R
> short -- 02:00:00 03:00:00 -- 0 0 8 E R
> an_shrt -- -- -- -- 0 0 -- D S
> an_med -- -- -- -- 0 0 -- D S
> an_long -- -- -- -- 0 0 -- D S
> dg_long -- 48:00:00 72:00:00 -- 73 561 -- E R
> dg_med -- 08:00:00 12:00:00 -- 14 287 -- E R
> dg_short -- 02:00:00 03:00:00 -- 3 200 -- E R
> ----- -----
> 90 1048
>
> [root@grid-ce root]# diagnose -n
> diagnosing node table (5120 slots)
> Name State Procs Memory Disk Swap Speed Opsys Arch Par Load Res Classes Network Features
>
> grid-wn1.physik.uni- Drained 0:1 1003:1003 1:1 1918:1918 1.00 linux [NONE] DEF 0.00 000 [short_1:1][large_1:1][medium [DEFAULT] [lcgpro]
> grid-wn2.physik.uni- Busy 0:1 971:971 1:1 1874:1874 1.00 linux [NONE] DEF 0.99 001 [short_1:1][large_1:1][medium [DEFAULT] [lcgpro]
> grid-wn3.physik.uni- Busy 0:1 1003:1003 1:1 1903:1903 1.00 linux [NONE] DEF 1.00 001 [short_1:1][large_1:1][medium [DEFAULT] [lcgpro]
> grid-wn4.physik.uni- Down 0:1 1:1 1:1 10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_1:1][large_1:1][medium [DEFAULT] [lcgpro]
> grid-wn5.physik.uni- Down 0:1 1:1 1:1 10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_1:1][large_1:1][medium [DEFAULT] [lcgpro]
> grid-wn6.physik.uni- Idle 1:1 1003:1003 1:1 2921:2921 1.00 linux [NONE] DEF 0.00 000 [short_1:1][large_1:1][medium [DEFAULT] [lcgpro]
> grid-wn7.physik.uni- Down 0:1 1:1 1:1 10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_1:1][large_1:1][medium [DEFAULT] [lcgpro]
> grid-wn8.physik.uni- Down 0:1 1:1 1:1 10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_1:1][large_1:1][medium [DEFAULT] [lcgpro]
> dgrid-wn01.physik.un Busy 0:4 3946:3946 1:1 6366:6366 1.00 linux [NONE] DEF 1.99 003 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> WARNING: node 'dgrid-wn01.physik.uni-wuppertal.de' has more processors utilized than dedicated (4 > 3)
> dgrid-wn02.physik.un Busy 0:4 3946:3946 1:1 5778:5778 1.00 linux [NONE] DEF 3.00 003 [short_4:4][large_4:4][medium [DEFAULT] [dgrid][dgridEL4]
> WARNING: node 'dgrid-wn02.physik.uni-wuppertal.de' has more processors utilized than dedicated (4 > 3)
> dgrid-wn03.physik.un Busy 0:4 3946:3946 1:1 6386:6386 1.00 linux [NONE] DEF 2.06 003 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> WARNING: node 'dgrid-wn03.physik.uni-wuppertal.de' has more processors utilized than dedicated (4 > 3)
> dgrid-wn04.physik.un Busy 0:4 3946:3946 1:1 7600:7600 1.00 linux [NONE] DEF 0.00 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn05.physik.un Busy 0:4 3946:3946 1:1 5751:5751 1.00 linux [NONE] DEF 2.03 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn06.physik.un Busy 0:4 3946:3946 1:1 6358:6358 1.00 linux [NONE] DEF 2.14 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn07.physik.un Busy 0:4 3946:3946 1:1 7576:7576 1.00 linux [NONE] DEF 0.04 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn08.physik.un Busy 0:4 3946:3946 1:1 6359:6359 1.00 linux [NONE] DEF 2.10 002 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> WARNING: node 'dgrid-wn08.physik.uni-wuppertal.de' has more processors utilized than dedicated (4 > 2)
> dgrid-wn09.physik.un Busy 0:4 3946:3946 1:1 6440:6440 1.00 linux [NONE] DEF 2.00 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn10.physik.un Busy 0:4 3946:3946 1:1 6432:6432 1.00 linux [NONE] DEF 2.04 003 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> WARNING: node 'dgrid-wn10.physik.uni-wuppertal.de' has more processors utilized than dedicated (4 > 3)
> dgrid-wn11.physik.un Busy 0:4 3946:3946 1:1 6354:6354 1.00 linux [NONE] DEF 2.07 003 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> WARNING: node 'dgrid-wn11.physik.uni-wuppertal.de' has more processors utilized than dedicated (4 > 3)
> dgrid-wn12.physik.un Busy 0:4 3946:3946 1:1 6973:6973 1.00 linux [NONE] DEF 0.99 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn13.physik.un Busy 0:4 3946:3946 1:1 6346:6346 1.00 linux [NONE] DEF 2.02 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn14.physik.un Busy 0:4 3946:3946 1:1 7586:7586 1.00 linux [NONE] DEF 0.00 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn15.physik.un Busy 0:4 3946:3946 1:1 6147:6147 1.00 linux [NONE] DEF 2.02 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn16.physik.un Busy 0:4 3946:3946 1:1 6194:6194 1.00 linux [NONE] DEF 2.02 003 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> WARNING: node 'dgrid-wn16.physik.uni-wuppertal.de' has more processors utilized than dedicated (4 > 3)
> dgrid-wn17.physik.un Busy 0:4 3946:3946 1:1 7094:7094 1.00 linux [NONE] DEF 1.12 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn18.physik.un Busy 0:4 3946:3946 1:1 6390:6390 1.00 linux [NONE] DEF 2.10 002 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> WARNING: node 'dgrid-wn18.physik.uni-wuppertal.de' has more processors utilized than dedicated (4 > 2)
> dgrid-wn19.physik.un Busy 0:4 3946:3946 1:1 6360:6360 1.00 linux [NONE] DEF 2.12 003 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> WARNING: node 'dgrid-wn19.physik.uni-wuppertal.de' has more processors utilized than dedicated (4 > 2)
> dgrid-wn20.physik.un Busy 0:4 3946:3946 1:1 6924:6924 1.00 linux [NONE] DEF 1.06 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn21.physik.un Busy 0:4 3946:3946 1:1 5789:5789 1.00 linux [NONE] DEF 2.12 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn22.physik.un Busy 0:4 3946:3946 1:1 6952:6952 1.00 linux [NONE] DEF 1.11 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn23.physik.un Busy 0:4 3946:3946 1:1 6357:6357 1.00 linux [NONE] DEF 2.00 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn24.physik.un Busy 0:4 3946:3946 1:1 7558:7558 1.00 linux [NONE] DEF 0.05 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn25.physik.un Busy 0:4 3946:3946 1:1 7517:7517 1.00 linux [NONE] DEF 0.05 004 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn26.physik.un Down 0:4 1:1 1:1 10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn27.physik.un Down 0:4 1:1 1:1 10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn28.physik.un Down 0:4 1:1 1:1 10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> dgrid-wn29.physik.un Down 0:4 1:1 1:1 10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_4:4][large_4:4][medium [DEFAULT] [dgrid]
> ----- --- 1:124 102638:102638 37:37 174283:174283
>
> Total Nodes: 37 (Active: 27 Idle: 1 Down: 9)
>
> [root@grid-ce root]# checknode -v dgrid-wn08.physik.uni-wuppertal.de
>
>
> checking node dgrid-wn08.physik.uni-wuppertal.de
>
> State: Busy (in current state for 00:05:21)
> Configured Resources: PROCS: 4 MEM: 3946M SWAP: 6358M DISK: 1M
> Utilized Resources: PROCS: 4
> Dedicated Resources: PROCS: 2
> Opsys: linux Arch: [NONE]
> Speed: 1.00 Load: 2.000
> Location: Partition: DEFAULT Frame/Slot: 1/1
> Network: [DEFAULT]
> Features: [dgrid]
> Attributes: [Batch]
> Classes: [short 4:4][large 4:4][medium 4:4][dg_long 2:4][dg_med 4:4][dg_short 4:4][an_shrt 0:4][an_med 0:4][an_long 0:4]
>
> Total Time: INFINITY Up: INFINITY (98.41%) Active: 82:10:57:22 (45.86%)
>
> Reservations:
> Job '213000'(x1) -10:51:06 -> 2:13:08:54 (3:00:00:00)
> Job '213001'(x1) -10:42:40 -> 2:13:17:20 (3:00:00:00)
> JobList: 213000,213001
>
>
>
> See the problem: "Dedicated Resources: PROCS: 2" and "WARNING: node
> 'dgrid-wn08.physik.uni-wuppertal.de' has more processors utilized
> than dedicated (4 > 2)".
> I couldn't figure out *why* this value is set.
>
> I already tried overriding some parameters in maui.cfg, just to
> understand the mechanism:
>
> NODEAVAILABILITYPOLICY DEDICATED:PROCS
> NODELOADPOLICY ADJUSTPROCS
>
> NODECFG[DEFAULT] MAXLOAD=4.5
> NODECFG[DEFAULT] MAXPROC=4
>
> But still... 10 cores are left idle.
>
> Any hint is very much appreciated, as we have already spent hours on this...
>
> Best regards,
>
> Torsten
Does 'pbsnodes -a' show 'np = 4' for all nodes? What about the max_load
and ideal_load settings in /var/spool/pbs/mom_priv/config on all WNs?
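
A quick sketch of both checks, under stated assumptions: the sample
node data and the load values below are invented for illustration; only
the `pbsnodes -a` command itself, awk, and the `$ideal_load`/`$max_load`
mom config directives are taken as real.

```shell
# Hypothetical sample of `pbsnodes -a` output; on the CE you would pipe
# the real command into the awk filter below instead.
pbsnodes_sample='dgrid-wn08.physik.uni-wuppertal.de
     state = busy
     np = 2
dgrid-wn09.physik.uni-wuppertal.de
     state = busy
     np = 4'

# Report any node whose advertised np is below the expected core count (4).
low_np=$(echo "$pbsnodes_sample" | awk '
    /^[a-z]/             { node = $1 }
    $1 == "np" && $3 < 4 { print node " advertises only np=" $3 }')
echo "$low_np"

# The per-WN mom config lives in /var/spool/pbs/mom_priv/config; the
# values shown here are assumptions for a 4-core box, not recommendations:
#   $ideal_load 3.5
#   $max_load   4.0
```

If a node were to show np below 4 there, pbs_server (and therefore Maui)
would dedicate fewer processors on it than physically exist, which would
match the "utilized than dedicated" warnings in the diagnose -n output.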
--
Kyriakos Ginis