Y.Lyublev wrote:
> A very strange situation since 29 April:
> jobs from the different VO groups are executed, except for the OPS group.
> [root@ceglite SSH]# qstat -q
> server: ceglite.itep.ru
>
> Queue            Memory CPU Time Walltime Node  Run Que Lm  State
> ---------------- ------ -------- -------- ----  --- --- --  -----
> photon             --   48:00:00 72:00:00   --    0   0 --    E R
> hone               --   48:00:00 72:00:00   --    0   0 --    E R
> atlas              --   120:00:0 140:00:0   --   16  37 --    E R
> cms                --   120:00:0 140:00:0   --    3 144 --    E R
> lhcb               --   120:00:0 140:00:0   --    7   0 --    E R
> ops                --   48:00:00 72:00:00   --    0   0 --    E R
> alice              --   120:00:0 140:00:0   --   44   9 --    E R
> dteam              --   48:00:00 72:00:00   --    0   0 --    E R
>                                               ----- -----
>                                                  70   190
>
> Event: Abort
> - host = rb115.cern.ch
> - level = SYSTEM
> - priority = asynchronous
> - reason = Job RetryCount (0) hit
> - seqcode = UI=000003:NS=0000000003:WM=000006:BH=0000000000:JSS=000003:LM=000006:LRMS=000000:APP=000000
> - source = WorkloadManager
> - src_instance = WM
> - timestamp = Wed Apr 30 04:15:21 2008
> - user = /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak
You should look at the error that is reported earlier in the logging info:
-----------------------------------------------------------------------------
Event: Done
- exit_code = 1
- host = rb115.cern.ch
- level = SYSTEM
- priority = asynchronous
- reason = Got a job held event, reason: Globus error 94:
the jobmanager does not accept any new requests (shutting down)
-----------------------------------------------------------------------------
That one has its own Wiki entry:
http://goc.grid.sinica.edu.tw/gocwiki/Globus_error_94%3A_the_jobmanager_does_not_accept_any_new_requests_%28shutting_down%29
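
If it helps, here is a rough sketch of how one might pull the earlier errors out of a saved logging-info dump instead of reading it by eye. It assumes only the text layout shown above ("Event: <name>" followed by "- key = value" lines, with long values wrapped onto the next line) and that the events are listed in chronological order. The function names and the sample text are made up for illustration; this is not part of any grid tool.

```python
import re

# Matches lines such as "- reason = Job RetryCount (0) hit"
FIELD = re.compile(r"-\s*(\w+)\s*=\s*(.*)$")


def parse_events(text):
    """Split a logging-info dump into one dict per 'Event:' block."""
    events, current, last_key = [], None, None
    for raw in text.splitlines():
        # Drop mail quoting ("> ") and surrounding whitespace.
        line = raw.lstrip("> ").rstrip()
        if line.startswith("Event:"):
            current = {"event": line.split(":", 1)[1].strip()}
            events.append(current)
            last_key = None
            continue
        if current is None or not line or set(line) <= {"-"}:
            continue  # preamble, blank line, or a "-----" separator
        match = FIELD.match(line)
        if match:
            last_key = match.group(1)
            current[last_key] = match.group(2).strip()
        elif last_key is not None:
            # Wrapped value: join it to the previous field with a space.
            current[last_key] = (current[last_key] + " " + line).strip()
    return events


def reasons_before_abort(events):
    """Return (event, reason) pairs for events that precede the first Abort."""
    found = []
    for ev in events:
        if ev["event"] == "Abort":
            break
        if ev.get("reason"):
            found.append((ev["event"], ev["reason"]))
    return found


# Hypothetical sample, cut down from the output quoted in this thread.
SAMPLE = """\
Event: Done
- exit_code = 1
- host = rb115.cern.ch
- reason = Got a job held event, reason: Globus error 94:
the jobmanager does not accept any new requests (shutting down)
---------------------------------------------
Event: Abort
- host = rb115.cern.ch
- reason = Job RetryCount (0) hit
"""

if __name__ == "__main__":
    for name, reason in reasons_before_abort(parse_events(SAMPLE)):
        print(name + ": " + reason)
```

The "RetryCount (0) hit" in the Abort event only says that the job was not resubmitted; the reason attached to the preceding Done event is the one that explains the failure.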
Anyway, as SAM jobs have been OK since 30-Apr-2008 09:11:09,
did you fix something?