Hello all,
I changed the MaxTotalJobs variable to 10 but there was no improvement.
Then I searched the logs under /var/spool/pbs on the CE and on the WNs to see if
any problem was reported for those jobs, and I found that every job
that failed had run on one of our worker nodes, which was outdated for some reason.
After updating that WN the problem is solved.
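
For anyone repeating this check, here is a minimal sketch of tallying failed jobs per worker node. It assumes job-end ("E") records in the usual Torque/PBS accounting layout, with `exec_host=` and `Exit_status=` fields. The sample records, host names and the function name `failures_per_node` are made up for illustration and are not taken from our logs.

```python
import re
from collections import Counter

# Made-up sample records in the assumed accounting layout:
# "date time;record_type;job_id;key=value key=value ..."
SAMPLE_RECORDS = [
    "12/20/2007 10:01:02;E;101.ce.example.org;user=ops001 exec_host=wn01.example.org/0 Exit_status=0",
    "12/20/2007 10:02:10;E;102.ce.example.org;user=ops001 exec_host=wn02.example.org/0 Exit_status=0",
    "12/20/2007 10:03:30;E;103.ce.example.org;user=ops001 exec_host=wn03.example.org/0 Exit_status=271",
    "12/20/2007 10:04:45;S;104.ce.example.org;user=ops001 exec_host=wn03.example.org/0",
    "12/20/2007 10:05:00;E;104.ce.example.org;user=ops001 exec_host=wn03.example.org/0 Exit_status=1",
    "12/20/2007 10:06:15;E;105.ce.example.org;user=ops001 exec_host=wn01.example.org/1 Exit_status=0",
]

EXEC_HOST = re.compile(r"exec_host=([^/\s]+)")
EXIT_STATUS = re.compile(r"Exit_status=(-?\d+)")


def failures_per_node(records):
    """Return (failed, total) Counters keyed by node, using job-end records only."""
    failed = Counter()
    total = Counter()
    for line in records:
        fields = line.split(";", 3)
        # Skip anything that is not a complete job-end record.
        if len(fields) < 4 or fields[1] != "E":
            continue
        host = EXEC_HOST.search(fields[3])
        status = EXIT_STATUS.search(fields[3])
        if not host or not status:
            continue
        node = host.group(1)
        total[node] += 1
        if int(status.group(1)) != 0:
            failed[node] += 1
    return failed, total


if __name__ == "__main__":
    failed, total = failures_per_node(SAMPLE_RECORDS)
    for node in sorted(total):
        print(f"{node}: {failed[node]} failed of {total[node]} finished")
```

A node where nearly every finished job has a non-zero exit status, while the others are clean, is the pattern that pointed to the outdated WN here.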
Thanks a lot for your help.
Cheers
LHC Computer Grid - Rollout
> [mailto:[log in to unmask]] On Behalf Of Maarten Litmaath said:
> So, you should increase GlueCEPolicyMaxTotalJobs to 10 or so,
> or remove the limit altogether. That should solve the
> problem for "ops".

Of course it's the limit in the batch system that needs to be changed,
the glue attribute is just a reflection of it.

> Also for other VOs you should increase
> GlueCEPolicyMaxTotalJobs by a lot, otherwise they will
> experience the same problem sooner or later...

Some sites have experienced cases where a single user/VO floods the site
with a huge number of jobs, so a reasonable max jobs limit is useful.

Stephen
(I don't know about happy, but I think 2008 will at least be an
interesting new year :)