Hi,
we've had a user contact us because their jobs were aborted on our cluster.
The problem appears to be that although the cluster was set up with a
maximum wall clock time of 72 hours, the limit on CPU time was 48 hours.
The accounting file shows the relevant jobs terminating with an
Error_Status of 271, a Resource_list.cput of 48:00:00 and a
Resource_used.cput a little higher.
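For what it's worth, a quick sketch of how I'm reading those numbers (the
values are taken from the accounting lines above; the interpretation of the
exit status follows the usual PBS convention that a status of 256 or more
means the job was killed by a signal):

```python
def decode_exit_status(status: int) -> str:
    """Interpret a PBS/Torque job exit status.

    Statuses >= 256 indicate the job was killed by a signal;
    the signal number is status - 256.
    """
    if status >= 256:
        return f"killed by signal {status - 256}"
    return f"exited normally with code {status}"

# 271 - 256 = 15, i.e. SIGTERM from the batch system enforcing the limit
print(decode_exit_status(271))  # killed by signal 15

# GlueCEPolicyMaxCPUTime is published in minutes; 2880 minutes is exactly
# the 48:00:00 cput limit seen in the accounting file.
minutes = 2880
print(f"{minutes // 60:02d}:{minutes % 60:02d}:00")  # 48:00:00
```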
Since the amount of CPU time available for each queue is advertised via:
GlueCEPolicyMaxCPUTime: 2880
This is presumably requestable, but is there any reason why it should be
so much lower than the wall clock time? These values appear to be hardcoded
in YAIM's config_torque_server, so presumably someone put some thought
into the choice.
As we are planning to upgrade to 2.6.0 next week, is there any reason
not to override the defaults when we upgrade and make the CPU and wall
clock time limits the same?
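Concretely, I was thinking of something along these lines after the upgrade
(a sketch only; the queue name "long" is a placeholder for whatever our
queues end up being called):

```shell
# Raise the per-queue CPU-time limit to match the 72-hour wall clock limit
qmgr -c "set queue long resources_max.cput = 72:00:00"
qmgr -c "set queue long resources_max.walltime = 72:00:00"

# Check the resulting queue configuration
qmgr -c "print queue long"
```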
William Hay, UCL-CCC System Administrator