Hello all,

I changed the MaxTotalJobs variable to 10 but there was no improvement.

Then I searched the logs under /var/spool/pbs on the CE and on the WNs to see
whether any problem was reported for those jobs, and I found that every job
that failed had run on one of our worker nodes, which was outdated for some reason.
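
For anyone repeating this check, here is a minimal sketch of the tallying step, assuming the job records have already been pulled out of the logs into (job id, worker node, exit status) tuples. The record layout, the function name and the sample values are all hypothetical; the real log format under /var/spool/pbs is not parsed here.

```python
from collections import Counter


def failures_by_node(records):
    """Count failed jobs per worker node.

    `records` is an iterable of (job_id, node, exit_status) tuples.
    This layout is an assumption for illustration, not the PBS log format.
    """
    counts = Counter()
    for job_id, node, exit_status in records:
        # Treat any non-zero exit status as a failure (assumption).
        if exit_status != 0:
            counts[node] += 1
    return counts


# Made-up sample records, for illustration only.
sample = [
    ("101", "wn01", 0),
    ("102", "wn07", 1),
    ("103", "wn07", 1),
    ("104", "wn02", 0),
]

# List nodes with failures, most failures first.
for node, n_failed in failures_by_node(sample).most_common():
    print(node, n_failed)
```

If all the failures land on a single node, as they did here, that node is the first place to look.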

After updating the WN, the problem is solved.

Thanks a lot for your help.
Cheers

On Jan 5, 2008 1:07 PM, Burke, S (Stephen) <[log in to unmask]> wrote:
LHC Computer Grid - Rollout [mailto:[log in to unmask]] On Behalf Of Maarten Litmaath said:
> So, you should increase GlueCEPolicyMaxTotalJobs to 10 or so,
> or remove the limit altogether.  That should solve the
> problem for "ops".

Of course it's the limit in the batch system that needs to be changed,
the glue attribute is just a reflection of it.

> Also for other VOs you should increase
> GlueCEPolicyMaxTotalJobs by a lot, otherwise they will
> experience the same problem sooner or later...

Some sites have experienced cases where a single user/VO floods the site
with a huge number of jobs, so a reasonable max jobs limit is useful.

Stephen

(I don't know about happy, but I think 2008 will at least be an
interesting new year :)