Hi experts,
I'm trying to tame a gLite 3.2 Torque/Maui server in order to migrate our site to it and split the LRMS part from the CE(s). In theory this should work, but...
First of all, I'm using Quattor and the QWG templates for this. I mention it only for completeness' sake, since, as I'll show, the problem has nothing to do with Quattor's config. I installed the following versions:
# rpm -qa | sort |egrep -e "torque|maui"
glite-yaim-torque-client-4.0.3-1
glite-yaim-torque-server-4.0.4-1
glite-yaim-torque-utils-4.0.4-1
lcg-info-dynamic-maui-2.0.3-2
maui-3.2.6p21-snap.1234905291.5.el5
maui-client-3.2.6p21-snap.1234905291.5.el5
maui-server-3.2.6p21-snap.1234905291.5.el5
ncm-maui-1.1.3-1
torque-2.3.6-2cri.el5
torque-client-2.3.6-2cri.el5
torque-drmaa-2.3.6-1cri.sl5
torque-drmaa-docs-2.3.6-1cri.sl5
torque-pam-2.3.6-1cri.sl5
torque-server-2.3.6-2cri.el5
My Torque config looks like this:
# qmgr -c 'p s'
#
# Create queues and set their attributes.
#
#
# Create and define queue vo1
#
create queue vo1
set queue vo1 queue_type = Execution
set queue vo1 max_queuable = 20
set queue vo1 max_running = 20
set queue vo1 resources_max.walltime = 80:00:00
set queue vo1 enabled = True
set queue vo1 started = True
#
# Create and define queue vo2
#
create queue vo2
set queue vo2 queue_type = Execution
set queue vo2 max_queuable = 200
set queue vo2 resources_max.walltime = 80:00:00
set queue vo2 enabled = True
set queue vo2 started = True
....
#
# Create and define queue vo20
#
create queue vo20
set queue vo20 queue_type = Execution
set queue vo20 max_queuable = 200
set queue vo20 resources_max.walltime = 80:00:00
set queue vo20 enabled = True
set queue vo20 started = True
#
# Create and define queue localqueue1
#
create queue localqueue1
set queue localqueue1 queue_type = Execution
set queue localqueue1 max_queuable = 200
set queue localqueue1 resources_max.walltime = 250:00:00
set queue localqueue1 resources_default.neednodes = localqueue1
set queue localqueue1 resources_default.pmem = 921600kb
set queue localqueue1 acl_group_enable = True
set queue localqueue1 acl_groups = localgroup1
set queue localqueue1 enabled = True
set queue localqueue1 started = True
#
# Create and define queue localqueue2
#
create queue localqueue2
set queue localqueue2 queue_type = Execution
set queue localqueue2 max_queuable = 200
set queue localqueue2 resources_max.walltime = 2400:00:00
set queue localqueue2 resources_default.neednodes = localqueue2
set queue localqueue2 resources_default.pmem = 921600kb
set queue localqueue2 acl_group_enable = True
set queue localqueue2 acl_groups = localgroup2
set queue localqueue2 enabled = True
set queue localqueue2 started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = localhost
set server acl_hosts += server.f.q.d.n
set server managers = [log in to unmask]
set server operators = [log in to unmask]
set server default_queue = undefined
set server log_events = 255
set server mail_from = adm
set server query_other_jobs = True
set server resources_available.nodect = 2048
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server default_node = lcgpro
set server node_pack = False
set server job_stat_rate = 300
set server poll_jobs = True
set server log_level = 0
set server mom_job_sync = True
set server mail_domain = never
set server log_file_max_size = 50000000
set server log_file_roll_depth = 100
set server next_job_number = 0
set server server_name = server.f.q.d.n
And then my Maui config:
# cat /var/spool/maui/maui.cfg
#Server main parameters
ADMIN1 root
ADMIN3 edguser
ADMIN_HOST server.f.q.d.n
DEFERCOUNT 12
DEFERTIME 00:10:00
ENABLEMULTIREQJOBS true
ENFORCERESOURCELIMITS ON
JOBAGGREGATIONTIME 00:00:10
JOBPRIOACCRUALPOLICY FULLPOLICY
LOGFILE /var/log/maui.log
LOGFILEMAXSIZE 100000000
LOGFILEROLLDEPTH 10
LOGLEVEL 0
NODEALLOCATIONPOLICY MAXBALANCE
NODEPOLLFREQUENCY 5
RMPOLLINTERVAL 00:01:00
SERVERHOST server.f.q.d.n
SERVERMODE NORMAL
SERVERPORT 40559
# Resource manager parameters
RMCFG[base] TYPE=PBS
# Job priority parameters
FSDECAY 0.95
FSDEPTH 28
FSGROUPWEIGHT 20
FSINTERVAL 24:00:00
FSPOLICY DEDICATEDPS
FSWEIGHT 1
QUEUETIMEWEIGHT 0
XFACTORWEIGHT 10
# Site specific parameters
# Node partitions are used to keep jobs confined to appropriate nodes.
# By default, allow access to NO partitions.
SYSCFG[base] PLIST=
# Define parameters and partitions for each VO (group).
GROUPCFG[DEFAULT] FSTARGET=1+ PLIST=DEFAULT
GROUPCFG[vo1] FSTARGET=1+ PLIST=DEFAULT
GROUPCFG[vo2] FSTARGET=1+ PLIST=DEFAULT
....
GROUPCFG[vo20] FSTARGET=1+ PLIST=DEFAULT
Starting pbs_server and the Maui server doesn't reveal any errors, BUT:
1) Running "showstats" throws a "Segmentation fault":
# showstats
Segmentation fault
2) Running "showconfig" doesn't print the configuration I set in maui.cfg (see above):
# showconfig
NODELOADPOLICY ADJUSTSTATE
JOBNODEMATCHPOLICY[1]
JOBMAXSTARTTIME[1] INFINITY
METAMAXTASKS[1] 0
NODESETPOLICY[1] [NONE]
NODESETATTRIBUTE[1] [NONE]
NODESETLIST[1]
NODESETDELAY[1] 00:00:00
NODESETPRIORITYTYPE[1] MINLOSS
NODESETTOLERANCE[1] 0.00
# Priority Weights
XFMINWCLIMIT[1] 00:00:00
RMAUTHTYPE[0] CHECKSUM
CLASSCFG[vo20] DEFAULT.FEATURES=[NONE]
QOSPRIORITY[0] 0
QOSQTWEIGHT[0] 0
QOSXFWEIGHT[0] 0
QOSTARGETXF[0] 0.00
QOSTARGETQT[0] 00:00:00
QOSFLAGS[0]
QOSPRIORITY[1] 0
QOSQTWEIGHT[1] 0
QOSXFWEIGHT[1] 0
QOSTARGETXF[1] 0.00
QOSTARGETQT[1] 00:00:00
QOSFLAGS[1]
RESDEPTH 24
SCHEDCFG[] MODE=NORMAL SERVER=server.f.q.d.n:40559
# RM MODULES: PBS SSS WIKI NATIVE
TYPE=PBS
SIMEXITITERATION -1
Note that Maui appears to find only ONE PBS queue (the last one that appears in the PBS config).
3) The checknode command likewise does not list all of the PBS classes (queues):
# checknode wnX
checking node wnX.f.q.d.n
State: Idle (in current state for 16:30:34)
Configured Resources: DISK: 140G
Utilized Resources: DISK: 5913M
Dedicated Resources: [NONE]
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.010
Network: [DEFAULT]
Features: [lcgpro]
Attributes: [Batch]
Classes: [vo20 4:4]
Total Time: 1:20:00:31 Up: 1:19:55:58 (99.83%) Active: 00:00:00 (0.00%)
Reservations:
NOTE: no reservations on node
4) Finally, Maui's SRs (standing reservations) don't appear as they normally should in the checknode output, i.e. for a user reservation I get:
Reservations:
User '.0.0'(x1) -INFINITY -> INFINITY ( INFINITY)
Blocked Resources@-00:13:16 Procs: 2/2 (100.00%)
Instead of:
Reservations:
User 'name_of_reservation.0.0'(x1) -INFINITY -> INFINITY ( INFINITY)
Blocked Resources@-00:13:16 Procs: 2/2 (100.00%)
5) The qstat -Q command lists all queues correctly.
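One way to pin down the mismatch between what Torque defines and what Maui reports is to compare the "create queue" lines from qmgr with the CLASSCFG entries from showconfig. What follows is only a minimal sketch, assuming both outputs have been saved to files first; the function names and file names are hypothetical and not part of Torque or Maui:

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not Torque/Maui tools.

# Print the queue names found in saved `qmgr -c 'p s'` output
# (lines of the form "create queue NAME").
torque_queues() {
    awk '$1 == "create" && $2 == "queue" { print $3 }' "$1" | sort -u
}

# Print the class names found in saved `showconfig` output
# (lines of the form "CLASSCFG[NAME] ...").
maui_classes() {
    sed -n 's/^[[:space:]]*CLASSCFG\[\([^]]*\)].*/\1/p' "$1" | sort -u
}

# Print the queues that Torque defines but Maui does not list as classes.
missing_classes() {
    _q=$(mktemp) || return 1
    _c=$(mktemp) || { rm -f "$_q"; return 1; }
    torque_queues "$1" > "$_q"
    maui_classes "$2" > "$_c"
    comm -23 "$_q" "$_c"
    rm -f "$_q" "$_c"
}

# Usage on the server (file names are arbitrary):
#   qmgr -c 'p s' > qmgr.out
#   showconfig    > showconfig.out
#   missing_classes qmgr.out showconfig.out
```

An empty result would mean Maui sees every Torque queue; any names printed are queues that exist in pbs_server but are missing from Maui's view.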
I tried a similar config at 2 sites and both suffer from the same issues. Both are hosted on Xen VMs (I don't think this is the cause). So far I've tried SL 5.4 and SL 5.5 (both x86_64). I also tried the i386 builds of Torque and Maui, with the same results.
Finally, I created a VMware VM on my laptop and installed 'glite-TORQUE_server':
Installing : torque 1/12
Installing : maui 2/12
Installing : maui-client 3/12
Installing : maui-server 4/12
Installing : torque-server 5/12
Installing : torque-client 6/12
Installing : glite-yaim-core 7/12
Installing : glite-yaim-torque-utils 8/12
Installing : glite-yaim-torque-server 9/12
Installing : glite-version 10/12
Installing : edg-pbs-utils 11/12
Installing : glite-TORQUE_server 12/12
# rpm -qa | sort |egrep -e "torque|maui"
glite-yaim-torque-server-4.0.4-1
glite-yaim-torque-utils-4.0.4-1
maui-3.2.6p21-snap.1234905291.5.el5
maui-client-3.2.6p21-snap.1234905291.5.el5
maui-server-3.2.6p21-snap.1234905291.5.el5
torque-2.3.6-2cri.el5
torque-client-2.3.6-2cri.el5
torque-server-2.3.6-2cri.el5
Without touching the default configs, I started Torque and Maui and... I get the same results, i.e.:
# showstats
Segmentation fault
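To check quickly which of the Maui client commands crash on a given host, the exit status can be inspected instead of eyeballing the output. A minimal sketch, assuming a Linux host and a shell such as bash or dash, where a process killed by SIGSEGV (signal 11) is reported with status 128 + 11 = 139; the helper name check_cmd is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: run a command quietly and classify its exit status.
# Assumption: Linux with bash/dash, where death by SIGSEGV gives status 139.
check_cmd() {
    "$@" >/dev/null 2>&1
    _status=$?
    case $_status in
        0)   echo "ok" ;;
        139) echo "segfault" ;;
        *)   echo "error $_status" ;;
    esac
}

# Usage on the server:
#   for cmd in showstats showconfig; do
#       printf '%s: %s\n' "$cmd" "$(check_cmd "$cmd")"
#   done
```

This only distinguishes a crash from a normal failure; it says nothing about why the command crashes.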
Has anyone else faced this? I didn't find any Savannah ticket, but this probably deserves one.
Regards,
Christos
PS: Sorry for the long mail.