Hi,
It seems fixed now for lcg-ce.usc.cesga.es, or at least it works from my UI.
The problem was that memory was very low. queue_submit() in Helper.pm of GRAM
checks the free memory and returns a NORESOURCES error if it is less than 2%
of the total. NORESOURCES is GRAM error 3, which is not necessarily an I/O problem.
The reason for that was that edg-wl-interlogd was using 717MB of RAM, so I
restarted it with:
/etc/init.d/edg-wl-locallogger restart
When I get back to USC I will investigate why it was using 717MB. Any
ideas?
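
A minimal sketch of that kind of check, for anyone who wants to see how close
a CE is to the limit. It is not the code from Helper.pm: the function names are
made up for illustration, and it assumes "free memory" means MemFree in
/proc/meminfo on Linux, which may not be how GRAM measures it.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from Helper.pm.

# free_pct TOTAL FREE: integer percentage of free memory (same units for both).
free_pct() {
    echo $(( $2 * 100 / $1 ))
}

# below_threshold TOTAL FREE PCT: succeeds when free memory is under PCT percent.
below_threshold() {
    [ "$(free_pct "$1" "$2")" -lt "$3" ]
}

# Assumption: Linux, with MemTotal and MemFree reported in kB in /proc/meminfo.
if [ -r /proc/meminfo ]; then
    mem_total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    mem_free=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
    if [ -n "$mem_total" ] && [ -n "$mem_free" ]; then
        if below_threshold "$mem_total" "$mem_free" 2; then
            echo "free memory below 2% of total"
        else
            echo "free memory: $(free_pct "$mem_total" "$mem_free")% of total"
        fi
    fi
fi
```

To see which process is holding the memory, something like
"ps -eo pid,rss,comm --sort=-rss | head" lists the largest resident sizes first
(procps on Linux).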
Best regards,
Manuel
Quoting Steve Traylen:
> Earlier, to see whether our RB was in shape after some suggestions that it
> was not, I sent a job to each of the queues.
>
> 57 queues in total, 42 were successful, which is pretty good.
>
> For the following that failed, I tried the simplest thing I could do to try
> and show if a fault was still there.
>
> 1) golias25.farm.particle.cz
> 2) hik-lcg-ce.fzk.de
> 3) lcg-ce.usc.cesga.es
> 4) lhc01.sinp.msu.ru
> 5) grid003.ft.uam.es
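
A quick probe like this can be looped over several CEs. The sketch below is
only an illustration: probe is a made-up helper name, and it assumes the
Globus client tools are installed and a valid proxy exists. The contact
strings in the usage comment are the ones quoted in this message.

```shell
#!/bin/sh
# probe RUNNER CONTACT...: run "RUNNER CONTACT /bin/pwd" for each contact
# and report whether the command exited successfully.
probe() {
    runner=$1
    shift
    for ce in "$@"; do
        if "$runner" "$ce" /bin/pwd >/dev/null 2>&1; then
            echo "OK   $ce"
        else
            echo "FAIL $ce"
        fi
    done
}

# Usage (not run here, needs a grid proxy):
# probe globus-job-run \
#     golias25.farm.particle.cz/jobmanager-lcgpbs \
#     hik-lcg-ce.fzk.de \
#     lcg-ce.usc.cesga.es:2119/jobmanager-lcgpbs
```

This only tells you whether the command exited cleanly. The error text still
has to be read by hand, as in the cases below.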
>
>
> ######################################################################
> 1) Destination: golias25.farm.particle.cz:2119/jobmanager-lcgpbs-short
> Status Reason: Job RetryCount (3) hit
>
> globus-job-run golias25.farm.particle.cz/jobmanager-lcgpbs /bin/pwd
>
> fails with no output. Best guess is that the unchallenged ssh from the WN
> back to the CE does not work.
>
>
>
> ######################################################################
> 2) Destination: hik-lcg-ce.fzk.de:2119/jobmanager-lcgpbs-infinite
> Status Reason: 7 authentication failed: GSS Major Status:
> Authentication Failed GSS Minor Status Error Chain: init.c:499:
> globus_gss_assist_init_sec_context_async: Error during context initialization
> init_sec_context
>
>
> globus-job-run hik-lcg-ce.fzk.de /bin/pwd , fails
>
> Had a quick look at the CRLs at
> http://grid.fzk.de/ca/gridka-crl.pem
> http://grid.fzk.de/ca/fzk-crl.pem
>
> Both CRLs look to have expired today.
>
> $ openssl crl -in gridka-crl.pem -noout -nextupdate
> nextUpdate=Sep 12 14:19:25 2003 GMT
>
> $ openssl crl -in fzk-crl.pem -noout -nextupdate
> nextUpdate=Sep 12 14:19:19 2003 GMT
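
The same check can be scripted instead of comparing the dates by eye. This is
a minimal sketch: crl_expired is a hypothetical helper name, and it assumes
GNU date (for the -d option) and a PEM-encoded CRL as in the commands above.

```shell
#!/bin/sh
# crl_expired "nextUpdate=<date>" [NOW_EPOCH]
# Succeeds (exit 0) when the nextUpdate time is at or before now.
# The optional second argument overrides "now" (seconds since the epoch).
crl_expired() {
    stamp=${1#nextUpdate=}                       # strip the openssl label
    next=$(date -u -d "$stamp" +%s) || return 2  # GNU date parses the timestamp
    now=${2:-$(date -u +%s)}
    [ "$next" -le "$now" ]
}

# Usage against a downloaded CRL file (not run here):
# crl_expired "$(openssl crl -in gridka-crl.pem -noout -nextupdate)" \
#     && echo "gridka-crl.pem has expired"
```

An exit status of 2 means the date could not be parsed, so it should not be
read as "still valid".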
>
>
>
> ######################################################################
>
> 3) Destination: lcg-ce.usc.cesga.es:2119/jobmanager-lcgpbs-long
> Status Reason: Got a job held event, reason: Globus error 3: an I/O
> operation failed
>
> which is a new one on me.
>
> Fork jobs are okay, but
>
> globus-job-run lcg-ce.usc.cesga.es:2119/jobmanager-lcgpbs /boot/pwd
> GRAM Job failed because an I/O operation failed (error code 3)
> Don't know.
>
>
> ######################################################################
>
> 4) Destination: lhc01.sinp.msu.ru:2119/jobmanager-lcgpbs-infinite
> Status Reason: Job RetryCount (3) hit
>
> Same as 3)
>
>
> ######################################################################
> 5) Destination: grid003.ft.uam.es:2119/jobmanager-lcgpbs-long
> Status Reason: Job RetryCount (3) hit
>
> Looks to be okay now.
>
>
>
>
>
> --
> Steve Traylen
> http://www.gridpp.ac.uk/
>