Hi Mario, Ilja,
> > Anyway, the exact details are available from this ggus ticket:
> > https://gus.fzk.de/pages/ticket_details.php?ticket=35655
> >
> > I have increased the maxproc settings of both "marshal"s as it
> > seemed to be somehow related to the error ( Globus error 94: the
> > jobmanager does not accept any new requests (shutting down)), will
> > see if it helps.
> >
> > Any other ideas are still very welcome!
It appears that the failing jobs were in fact successfully submitted
to Torque. For example, in /opt/edg/var/gatekeeper/grid-jobmap_20080505
(spaces replaced with newlines for clarity):
"localUser=11860"
"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak"
"userFQAN=/ops/Role=lcgadmin/Capability=NULL"
"userFQAN=/ops/Role=NULL/Capability=NULL"
"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ"
"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short"
"lrmsID=42444.oberon.hep.kbfi.ee"
"timestamp=2008-05-05 10:07:18"
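
If you want to pull the lrmsID for other failing jobs out of the same file, a minimal Python sketch along these lines might help. It assumes each record is a single line of double-quoted key=value fields separated by spaces, as in the excerpt above; the function name parse_jobmap_line is just illustrative and not part of any grid tool.

```python
import shlex
from collections import defaultdict


def parse_jobmap_line(line):
    """Split one grid-jobmap record into {key: [values]}.

    Assumption: fields look like "key=value" in double quotes, separated
    by spaces. Keys such as userFQAN can repeat, so every key maps to a
    list of values.
    """
    record = defaultdict(list)
    # shlex.split keeps quoted fields together, so values containing
    # spaces (the user DN, the timestamp) survive intact.
    for field in shlex.split(line):
        # Split on the first '=' only: DNs and FQANs contain '=' themselves.
        key, sep, value = field.partition("=")
        if sep:
            record[key].append(value)
    return dict(record)
```

For the record above, parse_jobmap_line(line)["lrmsID"] should give ['42444.oberon.hep.kbfi.ee'], which is the ID to look for on the Torque side.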
The job may then have been reported to the lcgpbs job manager in a
state that it treats as a failure; the 'W' state, for example, is
handled that way. In that case you would see a cancellation (qdel)
request in the Torque logs. Can you check what happened to job 42444?
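
In case it is useful for that check, here is a rough Python sketch that collects every log line mentioning a given job ID and flags the ones that look like a deletion. Matching on "delet" or "qdel" is only my guess at how a cancellation would show up, and the log location is site-specific, so treat both as assumptions; trace_job is a hypothetical helper name. Torque's own tracejob command is probably the quicker route if it is available on the server.

```python
import re


def trace_job(log_lines, job_id):
    """Return (lines mentioning job_id, subset that look like deletions).

    Assumption: a cancellation leaves a line containing "delet" or "qdel"
    (case-insensitive). Adjust the pattern to what your logs really show.
    """
    # Refuse a match when a digit precedes the ID, so that searching for
    # 42444.<host> does not also pick up 142444.<host>.
    id_pattern = re.compile(r"(?<!\d)" + re.escape(job_id))
    deletion_pattern = re.compile(r"delet|qdel", re.IGNORECASE)

    mentions = [ln.rstrip("\n") for ln in log_lines if id_pattern.search(ln)]
    deletions = [ln for ln in mentions if deletion_pattern.search(ln)]
    return mentions, deletions


# Example use (the path is an assumption, it differs between installations):
#   with open("/var/spool/pbs/server_logs/20080505") as fh:
#       mentions, deletions = trace_job(fh, "42444.oberon.hep.kbfi.ee")
```

If deletions comes back non-empty, the lines around those entries should show who requested the cancellation and when.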