Hi Mario

Could you detail what problem these two nodes were having?
We are experiencing a similar problem, except that it is not systematic.

Basically we are observing sporadic
"MOM rejected modify request, error: 15001"
messages. Sometimes the job gets started anyway;
other times the staging operation fails,
the job is sent back to the server and placed in the Q state, and
then Maui does not select it anymore.

We had two periods during last weekend when we did observe
Globus error 94, and we wonder whether the two things are correlated
with each other or not.

Cheers
Sergio :)

On 05, May 2008 02:57 PM, Mario Kadastik <[log in to unmask]> wrote:

>Well actually we may have figured out the problem. It seems two
>workernodes had problems with stageout, but not something one would
>notice immediately out of hand. We have isolated them and now SAM
>tests seem to be running fine (but we'll have to wait a bit longer to
>make sure this was the problem indeed). We also ran a separate test of
>a job on one of those workernodes and the logging information came back
>with exactly the known error so we do hope we have isolated it now. We
>will know in about 24h if all the SAM tests run through nicely.
>
>Mario
>
>On May 5, 2008, at 3:36 PM, <[log in to unmask]>
><[log in to unmask]> wrote:
>
>>Hi Mario, Ilja,
>>
>>>>Anyway, the exact details are available from this ggus ticket:
>>>>https://gus.fzk.de/pages/ticket_details.php?ticket=35655
>>>>
>>>>I have increased the maxproc settings of both "marshal"s as it
>>>>seemed to be somehow related to the error ( Globus error 94: the
>>>>jobmanager does not accept any new requests (shutting down)), will
>>>>see if it helps.
>>>>
>>>>Any other ideas are still very welcome!
>>
>>It appears that the failing jobs were in fact successfully submitted
>>to Torque. For example, in
>>/opt/edg/var/gatekeeper/grid-jobmap_20080505
>>(spaces replaced with newlines for clarity):
>>
>>"localUser=11860"
>>"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak"
>>"userFQAN=/ops/Role=lcgadmin/Capability=NULL"
>>"userFQAN=/ops/Role=NULL/Capability=NULL"
>>"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ"
>>"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short"
>>"lrmsID=42444.oberon.hep.kbfi.ee"
>>"timestamp=2008-05-05 10:07:18"
>>
>>The job then may have been reported in such a way that the lcgpbs job
>>manager considered the job as having failed. For example, the 'W' state
>>is treated like that. In that case you would see a cancellation (qdel)
>>request in the Torque logs. Can you check what happened to job 42444?
>>
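
In case it is useful for this kind of check, here is a minimal sketch of how one might pull the lrmsID out of a grid-jobmap record and then pick out the log lines that mention it. It assumes the record sits on one line as space-separated, double-quoted key=value fields (as the quoted example suggests); the function names are hypothetical, and nothing is assumed about the Torque log format beyond the job id appearing as plain text.

```python
import shlex
from collections import defaultdict


def parse_jobmap_record(line):
    """Parse one grid-jobmap line into {key: [values]}.

    Assumes space-separated, double-quoted key=value fields.
    Keys may repeat (userFQAN does), so values are kept in lists.
    """
    fields = defaultdict(list)
    for token in shlex.split(line):             # honours the double quotes
        key, sep, value = token.partition("=")  # split on the first '=' only
        if sep:                                 # ignore tokens without '='
            fields[key].append(value)
    return dict(fields)


def lines_for_job(lrms_id, log_lines):
    """Return the log lines that mention the given LRMS job id."""
    return [ln.rstrip("\n") for ln in log_lines if lrms_id in ln]


# The record quoted above, joined back onto a single line.
record = (
    '"localUser=11860" '
    '"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/'
    'CN=582979/CN=Judit Novak" '
    '"userFQAN=/ops/Role=lcgadmin/Capability=NULL" '
    '"userFQAN=/ops/Role=NULL/Capability=NULL" '
    '"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ" '
    '"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short" '
    '"lrmsID=42444.oberon.hep.kbfi.ee" '
    '"timestamp=2008-05-05 10:07:18"'
)

fields = parse_jobmap_record(record)
lrms_id = fields["lrmsID"][0]
print(lrms_id)
```

The second helper accepts any iterable of lines, so an open file handle on the Torque server log for that day can be passed in directly; the log location depends on the installation, so it is left out here.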

---------------------------------------------
  Dr. Sergio Maffioletti
 
  Grid Group
  CSCS, Swiss National Supercomputing Centre
  Via Cantonale
  CH-6928 Manno
  Tel: +41916108218
  Fax: +41916108282
  email: [log in to unmask]
---------------------------------------------