Hi,
I'm trying to figure out why Maui can't schedule more than roughly 3500 jobs on our cluster right now.
[root@torque-v-1 out]# diagnose -t
DEFAULT [test 4968:4968]
So as you can see, Maui sees just under 5000 cores (all of which are configured for all queues).
[root@torque-v-1 out]# qstat -q
server: torque-v-1.local
Queue Memory CPU Time Walltime Node Run Que Lm State
---------------- ------ -------- -------- ---- --- --- -- -----
test -- 01:00:00 02:00:00 -- 0 0 -- E R
long -- 48:00:00 72:00:00 -- 3457 1049 -- E R
short -- 01:00:00 02:00:00 -- 11 0 -- E R
----- -----
3468 1049
Trying to see the free cores, I get:
[root@torque-v-1 out]# showbf
12 procs available for 2:23:21:34
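I'd like to cross-check that against Maui's per-node view as well, e.g.:
[root@torque-v-1 out]# diagnose -n
but most Maui commands are timing out on me right now (more on that below), so showbf is the best overall view I have.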
It seems Maui is only able to schedule onto slots that have just freed up. Going through the Maui log I see:
10/17 19:39:44 MPBSWorkloadQuery(base,JCount,SC)
10/17 19:39:57 INFO: active PBS job 844980 has been removed from the queue. assuming successful completion
10/17 19:39:57 INFO: active PBS job 844985 has been removed from the queue. assuming successful completion
10/17 19:39:57 INFO: active PBS job 845044 has been removed from the queue. assuming successful completion
10/17 19:39:57 INFO: active PBS job 845047 has been removed from the queue. assuming successful completion
10/17 19:39:57 INFO: active PBS job 845078 has been removed from the queue. assuming successful completion
10/17 19:39:57 INFO: 4506 PBS jobs detected on RM base
10/17 19:39:57 INFO: jobs detected: 4506
10/17 19:39:57 INFO: total jobs selected (ALL): 1036/4506 [State: 3469][Hold: 1]
10/17 19:39:57 INFO: total jobs selected (ALL): 1036/4506 [State: 3469][Hold: 1]
10/17 19:39:57 INFO: total jobs selected in partition ALL: 1036/1036
10/17 19:39:57 INFO: total jobs selected in partition ALL: 1036/1036
10/17 19:39:57 INFO: total jobs selected in partition DEFAULT: 1036/1036
10/17 19:39:57 MRMJobStart(843451,Msg,SC)
10/17 19:39:57 MPBSJobStart(843451,base,Msg,SC)
10/17 19:39:57 MPBSJobModify(843451,Resource_List,Resource,wn-v-6032.local)
10/17 19:39:57 MPBSJobModify(843451,Resource_List,Resource,1)
10/17 19:39:57 INFO: job '843451' successfully started
10/17 19:39:57 MRMJobStart(843452,Msg,SC)
10/17 19:39:57 MPBSJobStart(843452,base,Msg,SC)
10/17 19:39:57 MPBSJobModify(843452,Resource_List,Resource,wn-v-5636.local)
10/17 19:39:57 MPBSJobModify(843452,Resource_List,Resource,1)
10/17 19:39:57 INFO: job '843452' successfully started
10/17 19:39:57 MRMJobStart(843453,Msg,SC)
10/17 19:39:57 MPBSJobStart(843453,base,Msg,SC)
10/17 19:39:57 MPBSJobModify(843453,Resource_List,Resource,wn-v-5456.local)
10/17 19:39:57 MPBSJobModify(843453,Resource_List,Resource,1)
10/17 19:39:57 INFO: job '843453' successfully started
10/17 19:39:57 ERROR: cannot create reservation for job '843453'
10/17 19:39:57 ERROR: cannot start job '843453' in partition DEFAULT
10/17 19:39:57 MJobPReserve(843453,DEFAULT,ResCount,ResCountRej)
10/17 19:39:57 ALERT: cannot create reservation in MJobReserve
10/17 19:39:57 MJobPReserve(843456,DEFAULT,ResCount,ResCountRej)
10/17 19:39:57 ALERT: cannot create reservation in MJobReserve
10/17 19:39:57 MJobPReserve(843457,DEFAULT,ResCount,ResCountRej)
10/17 19:39:57 ALERT: cannot create reservation in MJobReserve
10/17 19:39:57 MJobPReserve(843458,DEFAULT,ResCount,ResCountRej)
10/17 19:39:57 ALERT: cannot create reservation in MJobReserve
As you can see, Maui frees up some slots from finished jobs and schedules new jobs onto them, but then hits an error saying it cannot create the reservation, and that pattern continues for quite a long time. On top of that, most Maui commands tend to fail with "lost communication with server", so it took me some time to gather all the outputs I'm showing here.
Those reservation messages make up the majority of the Maui log:
[root@torque-v-1 log]# wc -l maui.log
15577 maui.log
[root@torque-v-1 log]# grep " ALERT: cannot create reservation in MJobReserve" maui.log|wc -l
4115
You should multiply that by two, since each attempt logs both the MJobPReserve call and the ALERT and I only grepped for the latter; that's about 8230 of the 15577 lines, i.e. over half of all log entries.
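When the commands do respond, my plan for digging further is to pull out the affected job IDs and then ask Maui about one of them directly, along the lines of:
[root@torque-v-1 log]# grep "cannot create reservation for job" maui.log | awk -F"'" '{print $2}' | sort -u | head
[root@torque-v-1 out]# checkjob -v 843453
[root@torque-v-1 out]# diagnose -r
[root@torque-v-1 out]# showres
(843453 is one of the jobs from the log excerpt above; I'm hoping checkjob -v and the reservation diagnostics will say which node or policy the reservation is being rejected on.)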
Any ideas how to debug this further? We're using torque+maui from the EPEL repository, and a few weeks ago we saw Torque running as many as 4930 jobs, so it's probably not a limit in the pre-compiled packages; more likely some runtime condition on one of the MOMs is blocking things. I'm contemplating a parallel ssh that runs "service pbs_mom restart" on all nodes that aren't down or offline to see whether that clears the blockage, but I'd prefer some way to debug this and pinpoint the source of the problem first. I've checked that the highest load on the workers is 25, and our 32-core nodes are configured with an optimal load of 32 and a max load of 37, so with our load-based scheduling policy load shouldn't be the limiting factor.
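For reference, the parallel restart I have in mind would look roughly like this (the wn-v-[...] host range is just illustrative of our node naming; pbsnodes -l gives me the down/offline nodes to exclude):
[root@torque-v-1 ~]# pbsnodes -l
[root@torque-v-1 ~]# pdsh -w 'wn-v-[5000-6999].local' -x "$(pbsnodes -l | awk '{print $1}' | paste -sd, -)" 'service pbs_mom restart'
Before running that cluster-wide I'd double-check that the init script restarts pbs_mom with -p so running jobs are preserved.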
Mario Kadastik, PhD
Researcher
---
"Physics is like sex, sure it may have practical reasons, but that's not why we do it"
-- Richard P. Feynman