Dear Alexander,
Thanks for your reply.
I too have noticed some of the behaviour you described, without knowing
the underlying reason.
Sorry for any inconvenience caused by the unwanted or oversized
attachments from my side.
-- Best Regards --
Adeel-ur-Rehman
-----Original Message-----
From: LHC Computer Grid - Rollout [mailto:[log in to unmask]]
On Behalf Of Alexander Novikov
Sent: Saturday, November 10, 2007 5:55 PM
To: [log in to unmask]
Subject: Re: [LCG-ROLLOUT] Jobs hanging in Running state
Hello Adeel-ur-Rehman,
Actually I have no advice, but maybe something additional to your
problem.
My site is now in an OK state, but errors still occur at random times.
The reason looks something like this (see the reason field):
"Event: Done
- exit_code = 1
- host = rb127.cern.ch
- level = SYSTEM
- priority = asynchronous
- reason = Cannot read JobWrapper output, both from
Condor and from Maradona.
- seqcode =
UI=000003:NS=0000000003:WM=000004:BH=0000000000:JSS=000003:LM=000007:LRMS=00
0000:APP=000000
- source = LogMonitor
- src_instance = unique
- status_code = FAILED
- timestamp = Fri Nov 9 14:11:52 2007
- user = /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak
"
I have also noticed that jobs in the R state (running for a while)
sometimes get stuck, as you said.
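(When a job hangs in the R state like that, the batch-server side can be inspected directly; a sketch assuming a standard Torque installation, with `<job_id>` and `<node>` as placeholders:)

```shell
# Sketch: inspect a job stuck in the R state on the Torque server.
# <job_id> and <node> are placeholders, not values from this thread.
qstat -f <job_id>       # full job attributes, incl. exec_host and used walltime
tracejob <job_id>       # server/MOM log records collected for this job
momctl -d 3 -h <node>   # pbs_mom diagnostics on the worker node running it
```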
Also there are "authentication failures" in sshd_log when poolusers
logged in "using hostbased"(the same counts). I`m not sure but some
time before(mb one-two months ago) there was no such errors or
"errors".
Also (though this may be unrelated), in the GridICE monitor some jobs
showed no machine on which they were to be executed (or on which they had
been executed). Why all this happens, I don't know either.
Have you seen something like this at your site?
I would also be very grateful if you did not send such attachments to the
whole list. A screenshot of console output can be sent as plain text, or
at least as something compressed rather than a ".bmp".
--
AuR> Dear Maarten and All,
AuR> Thanks for your reply,
AuR> We have re-installed the PBS and Torque RPMs on our batch server and
AuR> this time configured the node leaving the queues at their default
AuR> configuration. But the job execution behaviour seems to be the same.
AuR> An important note is that the same job we submit (same .jdl and .sh
AuR> files) sometimes gets stuck in the Running state, while at other times
AuR> it executes successfully.
AuR> Also, jobs sometimes get stuck right after entering the Running state,
AuR> while at other times they get stuck after spending some time in the
AuR> Running state.
AuR> A screenshot of two jobs stuck in the Running state from the start is
AuR> attached. If we observe such a situation, it remains the same even
AuR> after hours as far as these jobs are concerned. Other jobs may enter
AuR> and execute successfully, or they may also end up in the same
AuR> situation.
AuR> Any ideas or help would be appreciated.
AuR> -- Best Regards --
AuR> Adeel-ur-Rehman
AuR> -----Original Message-----
AuR> From: [log in to unmask] [mailto:[log in to unmask]]
AuR> Sent: Saturday, November 10, 2007 3:58 AM
AuR> To: Adeel-ur-Rehman
AuR> Cc: [log in to unmask]
AuR> Subject: Re: [LCG-ROLLOUT] Jobs hanging in Running state
AuR> On Fri, 9 Nov 2007, Maarten Litmaath, CERN wrote:
>> On Fri, 9 Nov 2007, Adeel-ur-Rehman wrote:
>>
>> > I haven't applied any special settings. I only configured the queues
>> > via the following commands:
>> >
>> > qmgr -c "set queue atlas max_running = 4"
>> > .... for all queues (of course, the value is not the same for all the
>> > queues)
>> >
>> >
>> > qmgr -c "set queue atlas Priority = 200"
>> > .... for all queues (of course, the value is not the same for all the
>> > queues)
>> >
>> > qmgr -c "set queue ops resources_max.walltime = 01:00:00"
>> > qmgr -c "set queue ops resources_max.cput = 00:30:00"
>> > .... for only dteam and ops
>>
>> Those settings look reasonable, but I am not a Torque expert. Anybody?
AuR> The Torque users list can be of help:
AuR> http://www.supercluster.org/mailman/listinfo/torqueusers
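(As a side note on the queue settings quoted above: the configuration actually in effect can be dumped for comparison against what was intended; a sketch, assuming standard Torque qmgr:)

```shell
# Sketch: print the current configuration of one queue, or of the whole
# server, to verify the applied max_running / Priority / resources_max limits.
qmgr -c "print queue atlas"
qmgr -c "print server"
```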
--
Best regards,
Alexander mailto:[log in to unmask]