On Sat, 5 Jan 2008, Maarten Litmaath wrote:
> On Fri, 4 Jan 2008, Andoena Balla wrote:
>
> > In our site we get some strange random errors.
> > These errors arise randomly without any reason.
> >
> > You can see here an example:
> >
> > https://lcg-sam.cern.ch:8443/sam/sam.py?funct=TestResult&nodename=ce101.grid.ucy.ac.cy&vo=OPS&testname=CE-sft-job&testtimestamp=1199391197
> >
> > We have tried all the solutions that are provided here:
> > http://goc.grid.sinica.edu.tw/gocwiki/Unspecified_gridmanager_error
> > with no luck.
> >
> > I have seen that when this error happens, the job (coming from OPS vo)
> > gets stuck into the queue and cannot run.
> >
> > Does anyone have an idea of what could the problem be?
> > Maybe a maui/torque misconfiguration?
>
> Possible: do you have a non-trivial configuration? Do you have a special
> configuration for "ops"?

I found a special setting for "ops" indeed:

    GlueCEPolicyMaxRunningJobs: 3
    GlueCEPolicyMaxTotalJobs: 3

The value for GlueCEPolicyMaxTotalJobs is too low. A single SAM instance
usually will not send another job to a site while its previous job has not
finished yet, but there are at least 3 SAM instances in use, so several
"ops" jobs can be at the site at the same time and a limit of 3 is easily
reached.

So, you should increase GlueCEPolicyMaxTotalJobs to 10 or so, or remove
the limit altogether. That should solve the problem for "ops".

For the other VOs you should also increase GlueCEPolicyMaxTotalJobs by a
lot, otherwise they will experience the same problem sooner or later.
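
To spot such low limits quickly, something like the following could be run
over the published values. It is only a minimal sketch: it assumes the
attributes are available as plain "Name: value" lines, as quoted above, and
the function names and the threshold of 10 (taken from the "10 or so"
suggestion) are illustrative choices, not part of any grid tool.

```python
def parse_glue_attributes(text):
    """Parse 'Name: value' lines into a dict, keeping only integer values."""
    attrs = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Name: value' line
        value = value.strip()
        if value.isdigit():
            attrs[name.strip()] = int(value)
    return attrs


def check_max_total_jobs(attrs, min_total=10):
    """Return a warning if GlueCEPolicyMaxTotalJobs is below min_total.

    min_total=10 is an assumed threshold, not an official recommendation.
    Returns None when the limit is high enough or not present at all.
    """
    total = attrs.get("GlueCEPolicyMaxTotalJobs")
    if total is None:
        return None  # no limit found in the input
    if total < min_total:
        return f"GlueCEPolicyMaxTotalJobs={total} is below {min_total}"
    return None


# The two values found for "ops" at the site.
published = """\
GlueCEPolicyMaxRunningJobs: 3
GlueCEPolicyMaxTotalJobs: 3
"""

attrs = parse_glue_attributes(published)
warning = check_max_total_jobs(attrs)
print(warning)
```

With the values quoted above the check reports the limit as too low; after
raising GlueCEPolicyMaxTotalJobs to 10 or more it returns nothing.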