Hi Mario,

could you detail what problem these two nodes are having? We are experiencing a similar problem, except that it is not systematic: we are seeing sporadic "MOM rejected modify request, error: 15001" messages. Sometimes the job gets started anyway; other times it fails the staging operation, the job is sent back to the server and placed in Q state, and then Maui does not select it anymore.

During last weekend we had two periods when we observed the Globus error 94, and we wonder whether the two things are correlated.

Cheers
Sergio :)

On 05, May 2008 02:57 PM, Mario Kadastik <[log in to unmask]> wrote:

>Well actually we may have figured out the problem. It seems two
>workernodes had problems with stageout, but not something one would
>notice immediately out of hand. We have isolated them and now SAM
>tests seem to be running fine (but we'll have to wait a bit longer to
>make sure this was the problem indeed). We also ran a separate test of
>a job on one of that workernodes and the logging information came back
>with exactly the known error so we do hope we have isolated it now. We
>will know in about 24h if all the SAM tests run through nicely.
>
>Mario
>
>On May 5, 2008, at 3:36 PM, <[log in to unmask]>
><[log in to unmask]> wrote:
>
>>Hi Mario, Ilja,
>>
>>>>Anyway, the exact details are available from this ggus ticket:
>>>>https://gus.fzk.de/pages/ticket_details.php?ticket=35655
>>>>
>>>>I have increased the maxproc settings of both "marshal"s as it
>>>>seemed to be somehow related to the error ( Globus error 94: the
>>>>jobmanager does not accept any new requests (shutting down)), will
>>>>see if it helps.
>>>>
>>>>Any other ideas are still very welcome!
>>
>>It appears that the failing jobs were in fact successfully submitted
>>to Torque.
>>For example, in /opt/edg/var/gatekeeper/grid-jobmap_20080505
>>(spaces replaced with newlines for clarity):
>>
>>"localUser=11860"
>>"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak"
>>"userFQAN=/ops/Role=lcgadmin/Capability=NULL"
>>"userFQAN=/ops/Role=NULL/Capability=NULL"
>>"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ"
>>"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short"
>>"lrmsID=42444.oberon.hep.kbfi.ee"
>>"timestamp=2008-05-05 10:07:18"
>>
>>The job then may have been reported in such a way that the lcgpbs job
>>manager considered the job as having failed. For example, the 'W' state
>>is treated like that. In that case you would see a cancellation (qdel)
>>request in the Torque logs. Can you check what happened to job 42444?
>>

---------------------------------------------
Dr. Sergio Maffioletti
Grid Group
CSCS, Swiss National Supercomputing Centre
Via Cantonale
CH-6928 Manno
Tel: +41916108218
Fax: +41916108282
email: [log in to unmask]
---------------------------------------------
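
For anyone wanting to map a grid job to its Torque job in the same way, here is a minimal Python sketch of reading one grid-jobmap record like the one quoted above. It assumes each record is a single line of double-quoted key=value fields separated by spaces, as the quoted message describes, and that the userDN wrapped in the mail is really one field. The function name parse_jobmap_record is made up for illustration and is not part of any LCG tool.

```python
import shlex
from collections import defaultdict


def parse_jobmap_record(line):
    """Split one grid-jobmap record into {key: [values]}.

    Hypothetical helper: assumes double-quoted key=value fields
    separated by spaces. Keys may repeat (userFQAN does), so every
    key maps to a list of values.
    """
    fields = defaultdict(list)
    for token in shlex.split(line):
        # Split at the first "=" only; values such as the DN contain more.
        key, sep, value = token.partition("=")
        if sep:
            fields[key].append(value)
    return dict(fields)


# The record from the quoted message, joined back onto one line.
record = (
    '"localUser=11860" '
    '"userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/'
    'CN=582979/CN=Judit Novak" '
    '"userFQAN=/ops/Role=lcgadmin/Capability=NULL" '
    '"userFQAN=/ops/Role=NULL/Capability=NULL" '
    '"jobID=https://rb113.cern.ch:9000/X3pf3fHmWWZTWr9nY5VvKQ" '
    '"ceID=oberon.hep.kbfi.ee:2119/jobmanager-lcgpbs-short" '
    '"lrmsID=42444.oberon.hep.kbfi.ee" '
    '"timestamp=2008-05-05 10:07:18"'
)

parsed = parse_jobmap_record(record)
lrms_id = parsed["lrmsID"][0]
# The Torque job number is the part of lrmsID before the first dot.
job_number = lrms_id.split(".", 1)[0]
print(job_number)  # prints 42444
```

The resulting job number is what you would then look for in the Torque server logs to see whether a qdel request was logged for that job.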