Well, actually we may have figured out the problem. It seems two
worker nodes had problems with stageout, though not in a way one would
notice right away. We have isolated them and the SAM tests now seem to
be running fine (but we'll have to wait a bit longer to make sure this
was indeed the problem). We also ran a separate test job on one of
those worker nodes and the logging information came back with exactly
the known error, so we do hope we have isolated it now. We will know
in about 24h if all the SAM tests run through nicely.
Mario
On May 5, 2008, at 3:36 PM, <[log in to unmask]> <[log in to unmask]> wrote:
> Hi Mario, Ilja,
>
>>> Anyway, the exact details are available from this ggus ticket:
>>> https://gus.fzk.de/pages/ticket_details.php?ticket=35655
>>>
>>> I have increased the maxproc settings of both "marshal"s as it
>>> seemed to be somehow related to the error ( Globus error 94: the
>>> jobmanager does not accept any new requests (shutting down)), will
>>> see if it helps.
>>>
>>> Any other ideas are still very welcome!
>
> It appears that the failing jobs were in fact successfully submitted
> to Torque. For example, in
> /opt/edg/var/gatekeeper/grid-jobmap_20080505
> (spaces replaced with newlines for clarity):
>
> "localUser=11860"
> "userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak"
> "userFQAN=/ops/Role=lcgadmin/Capability=NULL"
> "userFQAN=/ops/Role=NULL/Capability=NULL"
> "jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ"
> "ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short"
> "lrmsID=42444.oberon.hep.kbfi.ee"
> "timestamp=2008-05-05 10:07:18"
>
> The job may then have been reported in such a way that the lcgpbs job
> manager considered it to have failed. For example, the 'W' state is
> treated like that. In that case you would see a cancellation (qdel)
> request in the Torque logs. Can you check what happened to job 42444?
>
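
For following up on a single job like 42444, the lookup described above
could be scripted roughly as below. This is only a minimal sketch: it
assumes each grid-jobmap record is one line of space-separated, double-quoted
key=value fields (inferred from the example above only), and that the Torque
log lines of interest contain the lrmsID as plain text. The function names
are made up for illustration; nothing here is part of the LCG or Torque
tooling.

```python
import re

# One grid-jobmap record as quoted above, joined back onto a single line.
# The layout (space-separated, double-quoted key=value fields) is an
# assumption based on that example alone.
RECORD = (
    '"localUser=11860" '
    '"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/'
    'CN=582979/CN=Judit Novak" '
    '"userFQAN=/ops/Role=lcgadmin/Capability=NULL" '
    '"userFQAN=/ops/Role=NULL/Capability=NULL" '
    '"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ" '
    '"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short" '
    '"lrmsID=42444.oberon.hep.kbfi.ee" '
    '"timestamp=2008-05-05 10:07:18"'
)


def parse_jobmap_record(line):
    """Return {key: [values]} for one record; keys like userFQAN may repeat."""
    fields = {}
    for item in re.findall(r'"([^"]*)"', line):
        # Split on the first '=' only, since values contain '=' themselves.
        key, _, value = item.partition("=")
        fields.setdefault(key, []).append(value)
    return fields


def lines_mentioning(job_id, log_lines):
    """Keep only the log lines that contain the given job ID as a substring."""
    return [ln.rstrip("\n") for ln in log_lines if job_id in ln]


record = parse_jobmap_record(RECORD)
lrms_id = record["lrmsID"][0]
print(lrms_id)
```

To use it, open whichever Torque log file covers the timestamp in the
record and pass the file object together with lrms_id to
lines_mentioning(); any cancellation request for the job should then show
up among the returned lines. Matching on the full lrmsID rather than the
bare number avoids picking up other jobs whose IDs merely contain 42444.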