-------- Original Message --------
Hi,
I have installed a mini test cluster with torque and maui. We
have used maui/torque for years on our grid cluster and are now
upgrading to torque 2.5.7 and maui 3.3-4. Unfortunately, with
this new combination maui doesn't seem to work correctly: when
I submit jobs, it behaves as if there weren't any free
resources. Even when I installed only torque and maui with a
bare minimum configuration I got the same behaviour, i.e.
1) When I submit jobs, they just remain queued:
[root@<server> maui]# qstat -an1
<server>:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
10.<server>          aforti   long     pbs-vm3.sh          --     --  --     --    -- Q    --    --
11.s<server>         aforti   long     pbs-vm3.sh          --     --  --     --    -- Q    --    --
2) If I run qrun <jobid>, the job runs, so I assume the
problem is not between the torque server and the torque mom.
3) On the old versions, showq displayed the WCLimit of the
default queue. Now it displays 0 at first and then changes by
itself to 100 days:
[root@<server> maui]# showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
     0 Active Jobs       0 of   16 Processors Active (0.00%)
                         0 of    1 Nodes Active      (0.00%)
IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
10                   aforti       Idle     1 99:23:59:59  Tue Oct 9 15:32:13
11                   aforti       Idle     1 99:23:59:59  Tue Oct 9 16:39:09
2 Idle Jobs
BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
Total Jobs: 2   Active Jobs: 0   Idle Jobs: 2   Blocked Jobs: 0
4) checkjob <jobid> just tells me the job cannot be run in the
default partition, without giving any particular reason:
[.....]
PE: 1.00 StartPriority: 120
cannot select job 10 for partition DEFAULT (Class)
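
The word in parentheses at the end of that line looks like the
reason maui attaches to the rejection. A minimal sketch for
pulling it out of saved checkjob output, assuming the line always
has the shape shown above (the function name rejection_reason is
hypothetical, and the pattern has not been checked against other
maui versions):

```python
import re

# Matches lines such as:
#   cannot select job 10 for partition DEFAULT (Class)
# The pattern is an assumption based on the single line shown above.
_REJECT = re.compile(
    r"cannot select job (?P<job>\S+) for partition (?P<partition>\S+) "
    r"\((?P<reason>[^)]*)\)"
)


def rejection_reason(checkjob_output):
    """Return (job, partition, reason) from checkjob text, or None if absent."""
    for line in checkjob_output.splitlines():
        match = _REJECT.search(line)
        if match:
            return (
                match.group("job"),
                match.group("partition"),
                match.group("reason"),
            )
    return None


sample = """PE: 1.00 StartPriority: 120
cannot select job 10 for partition DEFAULT (Class)"""
print(rejection_reason(sample))
```
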
5) checknode sees the node as free, in case that wasn't clear
from the other commands:
[root@<server> maui]# !checkno
checknode <node>
checking node <node>
State: Idle (in current state for 00:55:10)
Configured Resources: PROCS: 16 MEM: 23G SWAP: 31G DISK: 1M
Utilized Resources: SWAP: 202M
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.000
Network: [DEFAULT]
Features: [lcgpro]
Attributes: [Batch]
Classes: [DEFAULT 1:1]
Total Time: 3:06:35 Up: 3:06:24 (99.90%) Active: 00:00:10 (0.09%)
Reservations:
NOTE: no reservations on node
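
Since checkjob mentions the class and the jobs sit in queue long,
it may be worth comparing that against the Classes line above. A
minimal sketch that lists the class names from checknode output,
assuming the bracketed "name count:count" format shown above
(node_classes is a hypothetical helper, not part of maui):

```python
import re


def node_classes(checknode_output):
    """Return the class names from the 'Classes:' line of checknode text."""
    for line in checknode_output.splitlines():
        if line.strip().startswith("Classes:"):
            # Each bracketed entry looks like "[DEFAULT 1:1]"; keep the
            # first word of each entry and skip empty brackets.
            entries = re.findall(r"\[([^\]]+)\]", line)
            return [entry.split()[0] for entry in entries if entry.split()]
    return []


sample = "Attributes: [Batch]\nClasses: [DEFAULT 1:1]\n"
classes = node_classes(sample)
print(classes, "long" in classes)
```

With the output above this yields only DEFAULT, so the queue name
long does not appear among the node's classes. Whether that is
the actual cause is not established here.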
6) showbf -v, though, says my nodes are blocked by reservations,
despite checknode clearly telling me there are no reservations
on that node. In our local maui.cfg there is a reservation for
1 proc, but I'm not sure why it would block the whole node:
[root@<server2> server_logs]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct 9 17:08:59
3 procs available with no timelimit
node <node2> is blocked by reservation sft.0.0 in INFINITY
To be sure, I removed it. Even with the reservation removed and
maui.cfg reduced to the default version with nothing in it,
showbf tells me the node is blocked by
"reservation NONE in INFINITY":
[root@<server> maui]# showbf -v
backfill window (user: 'root' group: 'root' partition: ALL) Tue Oct 9 17:37:58
16 procs available with no timelimit
node <node> is blocked by reservation NONE in INFINITY
I'm not sure how to proceed because the log files
don't tell me anything and all the references I have
found to a similar problem have remained unanswered.
Thanks for any help. Here are the rpms I used:
maui-3.3-4.el5
maui-client-3.3-4.el5
maui-server-3.3-4.el5
torque-2.5.7-7.el5
torque-client-2.5.7-7.el5
torque-server-2.5.7-7.el5
libtorque-2.5.7-7.el5
The maui.cfg:
#
# MAUI configuration example
# @(#)maui.cfg David Groep 20031015.1
# for MAUI version 3.2.5
#
SERVERHOST <server>
ADMIN1 root
ADMINHOST <server>
RMTYPE[0] PBS
RMHOST[0] <server>
RMSERVER[0] <server>
SERVERPORT 40559
SERVERMODE NORMAL
# Set PBS server polling interval. Since we have many short jobs
# and want fast turn-around, set this to 10 seconds (default: 2 minutes)
RMPOLLINTERVAL 00:00:10
# a max. 10 MByte log file in a logical location
LOGFILE /var/log/maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
and the Torque config:
create queue long
set queue long queue_type = Execution
set queue long acl_hosts = localhost
set queue long acl_hosts += <server>
set queue long resources_max.cput = 48:00:00
set queue long resources_max.walltime = 72:00:00
set queue long acl_group_enable = True
set queue long acl_groups = aforti
set queue long enabled = True
set queue long started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = <server>
set server acl_hosts += localhost
set server default_queue = long
set server log_events = 511
set server mail_from = adm
set server next_job_number = 12
--
Facts aren't facts if they come from the wrong people. (Paul Krugman)