Anyone? I'm still stuck on this...

server: torque-v-1.local

Queue            Memory CPU Time Walltime Node   Run   Que Lm State
---------------- ------ -------- -------- ---- ----- ----- -- -----
test                 -- 01:00:00 02:00:00   --     0     0 --   E R
long                 -- 48:00:00 72:00:00   --  3695  2486 --   E R
short                -- 01:00:00 02:00:00   --     0     1 --   E R
                                               ----- -----
                                                3695  2487

10/24 14:28:45 MPBSJobModify(921883,Resource_List,Resource,1)
10/24 14:28:45 INFO: job '921883' successfully started
10/24 14:28:45 MRMJobStart(921884,Msg,SC)
10/24 14:28:45 MPBSJobStart(921884,base,Msg,SC)
10/24 14:28:45 MPBSJobModify(921884,Resource_List,Resource,wn-v-5456.local)
10/24 14:28:45 MPBSJobModify(921884,Resource_List,Resource,1)
10/24 14:28:45 INFO: job '921884' successfully started
10/24 14:28:45 ERROR: cannot create reservation for job '921884'
10/24 14:28:45 ERROR: cannot start job '921884' in partition DEFAULT
10/24 14:28:45 MJobPReserve(921884,DEFAULT,ResCount,ResCountRej)
10/24 14:28:45 ALERT: cannot create reservation in MJobReserve
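
To narrow down which jobs hit this, the pattern in the excerpt above (a job logged as successfully started and then failing its reservation) can be pulled out of maui.log with a small script. This is only a sketch in Python: the two regular expressions are modelled on the log lines shown here, the function name started_but_unreserved is made up for illustration, and other Maui versions or log levels may word these messages differently.

```python
import re

# Patterns modelled on the maui.log lines quoted in this thread (assumption:
# other Maui builds or log levels may format these messages differently).
STARTED = re.compile(r"INFO:\s+job '(\d+)' successfully started")
RES_FAILED = re.compile(r"ERROR:\s+cannot create reservation for job '(\d+)'")


def started_but_unreserved(lines):
    """Return sorted job IDs that are logged both as successfully started
    and as failing reservation creation."""
    started, failed = set(), set()
    for line in lines:
        match = STARTED.search(line)
        if match:
            started.add(match.group(1))
            continue
        match = RES_FAILED.search(line)
        if match:
            failed.add(match.group(1))
    return sorted(started & failed)


# Usage (hypothetical path):
#     with open("/var/log/maui.log") as fh:
#         print(started_but_unreserved(fh))
```

Run on the excerpt above, this would report only job 921884, since 921883 has no reservation error in the lines shown.
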
Help would really be appreciated...
On 17.10.2012, at 19:45, Mario Kadastik <[log in to unmask]> wrote:
> Hi,
>
> I'm trying to figure out why Maui isn't able to schedule more than about 3500 jobs in our cluster right now.
>
> [root@torque-v-1 out]# diagnose -t
> DEFAULT [test 4968:4968]
>
> As you can see, Maui sees just under 5000 cores (4968), and all of them are configured for all queues.
>
> [root@torque-v-1 out]# qstat -q
>
> server: torque-v-1.local
>
> Queue            Memory CPU Time Walltime Node   Run   Que Lm State
> ---------------- ------ -------- -------- ---- ----- ----- -- -----
> test                 -- 01:00:00 02:00:00   --     0     0 --   E R
> long                 -- 48:00:00 72:00:00   --  3457  1049 --   E R
> short                -- 01:00:00 02:00:00   --    11     0 --   E R
>                                                ----- -----
>                                                 3468  1049
>
> When I check for free cores, I get:
> [root@torque-v-1 out]# showbf
> 12 procs available for 2:23:21:34
>
> It seems Maui is only able to schedule onto slots that have just been freed. Going through the Maui log, I see:
>
> 10/17 19:39:44 MPBSWorkloadQuery(base,JCount,SC)
> 10/17 19:39:57 INFO: active PBS job 844980 has been removed from the queue. assuming successful completion
> 10/17 19:39:57 INFO: active PBS job 844985 has been removed from the queue. assuming successful completion
> 10/17 19:39:57 INFO: active PBS job 845044 has been removed from the queue. assuming successful completion
> 10/17 19:39:57 INFO: active PBS job 845047 has been removed from the queue. assuming successful completion
> 10/17 19:39:57 INFO: active PBS job 845078 has been removed from the queue. assuming successful completion
> 10/17 19:39:57 INFO: 4506 PBS jobs detected on RM base
> 10/17 19:39:57 INFO: jobs detected: 4506
> 10/17 19:39:57 INFO: total jobs selected (ALL): 1036/4506 [State: 3469][Hold: 1]
> 10/17 19:39:57 INFO: total jobs selected (ALL): 1036/4506 [State: 3469][Hold: 1]
> 10/17 19:39:57 INFO: total jobs selected in partition ALL: 1036/1036
> 10/17 19:39:57 INFO: total jobs selected in partition ALL: 1036/1036
> 10/17 19:39:57 INFO: total jobs selected in partition DEFAULT: 1036/1036
> 10/17 19:39:57 MRMJobStart(843451,Msg,SC)
> 10/17 19:39:57 MPBSJobStart(843451,base,Msg,SC)
> 10/17 19:39:57 MPBSJobModify(843451,Resource_List,Resource,wn-v-6032.local)
> 10/17 19:39:57 MPBSJobModify(843451,Resource_List,Resource,1)
> 10/17 19:39:57 INFO: job '843451' successfully started
> 10/17 19:39:57 MRMJobStart(843452,Msg,SC)
> 10/17 19:39:57 MPBSJobStart(843452,base,Msg,SC)
> 10/17 19:39:57 MPBSJobModify(843452,Resource_List,Resource,wn-v-5636.local)
> 10/17 19:39:57 MPBSJobModify(843452,Resource_List,Resource,1)
> 10/17 19:39:57 INFO: job '843452' successfully started
> 10/17 19:39:57 MRMJobStart(843453,Msg,SC)
> 10/17 19:39:57 MPBSJobStart(843453,base,Msg,SC)
> 10/17 19:39:57 MPBSJobModify(843453,Resource_List,Resource,wn-v-5456.local)
> 10/17 19:39:57 MPBSJobModify(843453,Resource_List,Resource,1)
> 10/17 19:39:57 INFO: job '843453' successfully started
> 10/17 19:39:57 ERROR: cannot create reservation for job '843453'
> 10/17 19:39:57 ERROR: cannot start job '843453' in partition DEFAULT
> 10/17 19:39:57 MJobPReserve(843453,DEFAULT,ResCount,ResCountRej)
> 10/17 19:39:57 ALERT: cannot create reservation in MJobReserve
> 10/17 19:39:57 MJobPReserve(843456,DEFAULT,ResCount,ResCountRej)
> 10/17 19:39:57 ALERT: cannot create reservation in MJobReserve
> 10/17 19:39:57 MJobPReserve(843457,DEFAULT,ResCount,ResCountRej)
> 10/17 19:39:57 ALERT: cannot create reservation in MJobReserve
> 10/17 19:39:57 MJobPReserve(843458,DEFAULT,ResCount,ResCountRej)
> 10/17 19:39:57 ALERT: cannot create reservation in MJobReserve
>
> As you can see, it frees up some slots from finished jobs and schedules new jobs, but then it gets an error that it can't create the reservation, and that continues for quite a long time. Also, most Maui commands tend to fail with a lost-communication-with-server error, so it took me some time to collect all the output I'm showing you.
>
> Those messages make up the majority of the Maui log:
> [root@torque-v-1 log]# wc -l maui.log
> 15577 maui.log
> [root@torque-v-1 log]# grep " ALERT: cannot create reservation in MJobReserve" maui.log|wc -l
> 4115
>
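> That count should be doubled, because each failure logs both the reservation attempt (MJobPReserve) and the ALERT, and I only grepped for the ALERT. So these entries account for over 50% of the log (2 x 4115 = 8230 of 15577 lines, about 53%).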
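
The arithmetic in the paragraph above can be repeated on a fresh log with a few lines of Python. This is a rough sketch: the function names are made up, and the factor of 2 is the assumption stated in the mail (one attempt line plus one ALERT line per failed reservation), not something read from Maui's documentation.

```python
ALERT = "ALERT: cannot create reservation in MJobReserve"


def share_from_counts(total_lines, alert_lines, lines_per_failure=2):
    """Fraction of the log taken up by reservation failures, assuming each
    failure produces `lines_per_failure` log lines (assumption from the mail)."""
    if total_lines == 0:
        return 0.0
    return lines_per_failure * alert_lines / total_lines


def share_from_log(lines, lines_per_failure=2):
    """Count total lines and ALERT lines in an iterable of log lines,
    then compute the same fraction."""
    total = 0
    alerts = 0
    for line in lines:
        total += 1
        if ALERT in line:
            alerts += 1
    return share_from_counts(total, alerts, lines_per_failure)


# With the counts quoted in the mail (15577 lines, 4115 ALERT lines):
#     share_from_counts(15577, 4115)  # about 0.53
```
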
>
> Any ideas on how to debug this further? We're using Torque and Maui from the EPEL repository, and a few weeks ago we saw Torque reach as many as 4930 running jobs, so it's probably not a compiled-in limit. More likely some runtime condition on one of the MOMs is blocking things.
>
> I'm contemplating a parallel ssh that runs "service pbs_mom restart" on all nodes that aren't down or offline, to see whether that clears the blockage, but I'd prefer some way to debug this and pinpoint the source of the problem.
>
> I've checked that the highest load on the servers is 25. Our nodes have 32 cores, with the optimal load configured as 32 and the maximum load as 37, so load shouldn't be the limiting factor even though we use a load-based scheduling policy.
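
For the restart idea above, the list of target nodes (everything that is not down or offline) could be built from the output of pbsnodes -a before handing it to a parallel ssh tool. The sketch below only builds the list and does not run anything. It assumes the usual layout of that output, with an unindented node name followed by indented "key = value" lines including a comma-separated "state" entry; please check this against your own Torque version. The function name restartable_nodes is hypothetical.

```python
def restartable_nodes(pbsnodes_text):
    """Return node names whose state includes neither 'down' nor 'offline'.

    Assumes `pbsnodes -a` style text: an unindented node name, then
    indented 'key = value' attribute lines (layout is an assumption).
    """
    skip_states = {"down", "offline"}
    nodes = []
    current = None
    for raw in pbsnodes_text.splitlines():
        if not raw.strip():
            current = None          # blank line ends a node record
            continue
        if not raw[0].isspace():
            current = raw.strip()   # unindented line starts a new node
            continue
        key, sep, value = raw.strip().partition(" = ")
        if sep and key == "state" and current is not None:
            states = {s.strip() for s in value.split(",")}
            if not states & skip_states:
                nodes.append(current)
    return nodes
```

The resulting names could then be written to a host file for whatever parallel ssh tool is in use.
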
>
> Mario Kadastik, PhD
> Researcher
>
> ---
> "Physics is like sex, sure it may have practical reasons, but that's not why we do it"
> -- Richard P. Feynman