Dear all,
this goes out to the Maui experts (after I failed to find a solution
via Google & Co.).
We're running a small cluster of 25 machines with two dual-core CPUs
each - so 100 CPU cores in total.
The CE is running a standard set of Torque and Maui packages:
[root@grid-ce root]# rpm -qa | grep torque
torque-2.1.6-1cri_sl3_2st
torque-docs-2.1.6-1cri_sl3_2st
torque-mom-2.1.6-1cri_sl3_2st
torque-client-2.1.6-1cri_sl3_2st
torque-server-2.1.6-1cri_sl3_2st
torque-devel-2.1.6-1cri_sl3_2st
[root@grid-ce root]# rpm -qa | grep maui
maui-server-3.2.6p17-1_sl3
maui-client-3.2.6p17-1_sl3
maui-3.2.6p17-1_sl3
[root@grid-ce root]#
However, we cannot get Maui to make use of all of these cores:
[root@grid-ce root]# qstat -q
server: grid-ce.physik.uni-wuppertal.de
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
large -- 48:00:00 72:00:00 -- 0 0 6 E R
medium -- 08:00:00 12:00:00 -- 0 0 6 E R
short -- 02:00:00 03:00:00 -- 0 0 8 E R
an_shrt -- -- -- -- 0 0 -- D S
an_med -- -- -- -- 0 0 -- D S
an_long -- -- -- -- 0 0 -- D S
dg_long -- 48:00:00 72:00:00 -- 73 561 -- E R
dg_med -- 08:00:00 12:00:00 -- 14 287 -- E R
dg_short -- 02:00:00 03:00:00 -- 3 200 -- E R
----- -----
90 1048
[root@grid-ce root]# diagnose -n
diagnosing node table (5120 slots)
Name State Procs Memory Disk
Swap Speed Opsys Arch Par Load Res
Classes Network Features
grid-wn1.physik.uni- Drained 0:1 1003:1003 1:1
1918:1918 1.00 linux [NONE] DEF 0.00 000 [short_1:1][large_1:1]
[medium [DEFAULT] [lcgpro]
grid-wn2.physik.uni- Busy 0:1 971:971 1:1
1874:1874 1.00 linux [NONE] DEF 0.99 001 [short_1:1][large_1:1]
[medium [DEFAULT] [lcgpro]
grid-wn3.physik.uni- Busy 0:1 1003:1003 1:1
1903:1903 1.00 linux [NONE] DEF 1.00 001 [short_1:1][large_1:1]
[medium [DEFAULT] [lcgpro]
grid-wn4.physik.uni- Down 0:1 1:1 1:1
10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_1:1][large_1:1]
[medium [DEFAULT] [lcgpro]
grid-wn5.physik.uni- Down 0:1 1:1 1:1
10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_1:1][large_1:1]
[medium [DEFAULT] [lcgpro]
grid-wn6.physik.uni- Idle 1:1 1003:1003 1:1
2921:2921 1.00 linux [NONE] DEF 0.00 000 [short_1:1][large_1:1]
[medium [DEFAULT] [lcgpro]
grid-wn7.physik.uni- Down 0:1 1:1 1:1
10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_1:1][large_1:1]
[medium [DEFAULT] [lcgpro]
grid-wn8.physik.uni- Down 0:1 1:1 1:1
10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_1:1][large_1:1]
[medium [DEFAULT] [lcgpro]
dgrid-wn01.physik.un Busy 0:4 3946:3946 1:1
6366:6366 1.00 linux [NONE] DEF 1.99 003 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
WARNING: node 'dgrid-wn01.physik.uni-wuppertal.de' has more
processors utilized than dedicated (4 > 3)
dgrid-wn02.physik.un Busy 0:4 3946:3946 1:1
5778:5778 1.00 linux [NONE] DEF 3.00 003 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid][dgridEL4]
WARNING: node 'dgrid-wn02.physik.uni-wuppertal.de' has more
processors utilized than dedicated (4 > 3)
dgrid-wn03.physik.un Busy 0:4 3946:3946 1:1
6386:6386 1.00 linux [NONE] DEF 2.06 003 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
WARNING: node 'dgrid-wn03.physik.uni-wuppertal.de' has more
processors utilized than dedicated (4 > 3)
dgrid-wn04.physik.un Busy 0:4 3946:3946 1:1
7600:7600 1.00 linux [NONE] DEF 0.00 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn05.physik.un Busy 0:4 3946:3946 1:1
5751:5751 1.00 linux [NONE] DEF 2.03 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn06.physik.un Busy 0:4 3946:3946 1:1
6358:6358 1.00 linux [NONE] DEF 2.14 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn07.physik.un Busy 0:4 3946:3946 1:1
7576:7576 1.00 linux [NONE] DEF 0.04 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn08.physik.un Busy 0:4 3946:3946 1:1
6359:6359 1.00 linux [NONE] DEF 2.10 002 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
WARNING: node 'dgrid-wn08.physik.uni-wuppertal.de' has more
processors utilized than dedicated (4 > 2)
dgrid-wn09.physik.un Busy 0:4 3946:3946 1:1
6440:6440 1.00 linux [NONE] DEF 2.00 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn10.physik.un Busy 0:4 3946:3946 1:1
6432:6432 1.00 linux [NONE] DEF 2.04 003 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
WARNING: node 'dgrid-wn10.physik.uni-wuppertal.de' has more
processors utilized than dedicated (4 > 3)
dgrid-wn11.physik.un Busy 0:4 3946:3946 1:1
6354:6354 1.00 linux [NONE] DEF 2.07 003 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
WARNING: node 'dgrid-wn11.physik.uni-wuppertal.de' has more
processors utilized than dedicated (4 > 3)
dgrid-wn12.physik.un Busy 0:4 3946:3946 1:1
6973:6973 1.00 linux [NONE] DEF 0.99 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn13.physik.un Busy 0:4 3946:3946 1:1
6346:6346 1.00 linux [NONE] DEF 2.02 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn14.physik.un Busy 0:4 3946:3946 1:1
7586:7586 1.00 linux [NONE] DEF 0.00 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn15.physik.un Busy 0:4 3946:3946 1:1
6147:6147 1.00 linux [NONE] DEF 2.02 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn16.physik.un Busy 0:4 3946:3946 1:1
6194:6194 1.00 linux [NONE] DEF 2.02 003 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
WARNING: node 'dgrid-wn16.physik.uni-wuppertal.de' has more
processors utilized than dedicated (4 > 3)
dgrid-wn17.physik.un Busy 0:4 3946:3946 1:1
7094:7094 1.00 linux [NONE] DEF 1.12 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn18.physik.un Busy 0:4 3946:3946 1:1
6390:6390 1.00 linux [NONE] DEF 2.10 002 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
WARNING: node 'dgrid-wn18.physik.uni-wuppertal.de' has more
processors utilized than dedicated (4 > 2)
dgrid-wn19.physik.un Busy 0:4 3946:3946 1:1
6360:6360 1.00 linux [NONE] DEF 2.12 003 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
WARNING: node 'dgrid-wn19.physik.uni-wuppertal.de' has more
processors utilized than dedicated (4 > 2)
dgrid-wn20.physik.un Busy 0:4 3946:3946 1:1
6924:6924 1.00 linux [NONE] DEF 1.06 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn21.physik.un Busy 0:4 3946:3946 1:1
5789:5789 1.00 linux [NONE] DEF 2.12 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn22.physik.un Busy 0:4 3946:3946 1:1
6952:6952 1.00 linux [NONE] DEF 1.11 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn23.physik.un Busy 0:4 3946:3946 1:1
6357:6357 1.00 linux [NONE] DEF 2.00 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn24.physik.un Busy 0:4 3946:3946 1:1
7558:7558 1.00 linux [NONE] DEF 0.05 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn25.physik.un Busy 0:4 3946:3946 1:1
7517:7517 1.00 linux [NONE] DEF 0.05 004 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn26.physik.un Down 0:4 1:1 1:1
10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn27.physik.un Down 0:4 1:1 1:1
10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn28.physik.un Down 0:4 1:1 1:1
10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
dgrid-wn29.physik.un Down 0:4 1:1 1:1
10:10 1.00 DEFAUL [NONE] DEF 0.00 000 [short_4:4][large_4:4]
[medium [DEFAULT] [dgrid]
----- --- 1:124 102638:102638 37:37
174283:174283
Total Nodes: 37 (Active: 27 Idle: 1 Down: 9)
[root@grid-ce root]# checknode -v dgrid-wn08.physik.uni-wuppertal.de
checking node dgrid-wn08.physik.uni-wuppertal.de
State: Busy (in current state for 00:05:21)
Configured Resources: PROCS: 4 MEM: 3946M SWAP: 6358M DISK: 1M
Utilized Resources: PROCS: 4
Dedicated Resources: PROCS: 2
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 2.000
Location: Partition: DEFAULT Frame/Slot: 1/1
Network: [DEFAULT]
Features: [dgrid]
Attributes: [Batch]
Classes: [short 4:4][large 4:4][medium 4:4][dg_long 2:4][dg_med
4:4][dg_short 4:4][an_shrt 0:4][an_med 0:4][an_long 0:4]
Total Time: INFINITY Up: INFINITY (98.41%) Active: 82:10:57:22
(45.86%)
Reservations:
Job '213000'(x1) -10:51:06 -> 2:13:08:54 (3:00:00:00)
Job '213001'(x1) -10:42:40 -> 2:13:17:20 (3:00:00:00)
JobList: 213000,213001
You can see the problem: checknode reports "Dedicated Resources:
PROCS: 2" although all four processors are in use, which matches the
warning from diagnose -n: "node 'dgrid-wn08.physik.uni-wuppertal.de'
has more processors utilized than dedicated (4 > 2)".
--> I couldn't figure out *why* the dedicated processor count ends up
at 2 instead of 4.
I already tried to override some parameters in maui.cfg, just to
understand the mechanism:
NODEAVAILABILITYPOLICY DEDICATED:PROCS
NODELOADPOLICY ADJUSTPROCS
NODECFG[DEFAULT] MAXLOAD=4.5
NODECFG[DEFAULT] MAXPROC=4
But still... 10 cores are left idle.
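As far as I understand, Maui takes the per-node processor counts from
Torque, so the nodes file on the server seems like the natural place to
cross-check. For illustration, I would expect entries along these lines
(the path and np values here are just a sketch, not a dump of our
actual file):

[root@grid-ce root]# cat /var/spool/pbs/server_priv/nodes
grid-wn1.physik.uni-wuppertal.de np=1 lcgpro
dgrid-wn01.physik.uni-wuppertal.de np=4 dgrid

The np value is the number of job slots Torque offers the scheduler for
that node, so if np were set below the real core count somewhere, that
could perhaps explain slots going unused - but I have not confirmed
this is what happens here.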
Any hint is very much appreciated, as we have already spent hours on
this...
Best regards,
Torsten
--
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<> <>
<> Torsten Harenberg [log in to unmask] <>
<> Bergische Universitaet <>
<> FB C - Physik Tel.: +49 (0)202 439-3521 <>
<> Gaussstr. 20 Fax : +49 (0)202 439-2811 <>
<> 42097 Wuppertal <>
<> <>
<><><><><><><>< Of course it runs NetBSD http://www.netbsd.org ><>