Hello Again,
There was an lhcb user who somehow had 50 job-manager processes that
were overloading our CE. We have banned that user and the CE load now
looks more or less normal. Our remaining problem is that when a job
finishes on the WN, the RB doesn't notice it. The job actually
finishes, but we get: Aborted: Job RetryCount (3) hit.
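In case it happens again, one way to spot which account is spawning all the job managers is to count processes per owner. The pipeline below is only a sketch: the here-document stands in for real `ps -eo user=,comm=` output, and the account names in it are made up.

```shell
# Count job-manager processes per owner. On the CE itself you would feed
# the filter straight from ps:
#   ps -eo user=,comm= | awk '$2 ~ /job-manager/ {print $1}' | sort | uniq -c | sort -rn
# The here-document below is an invented sample so the pipeline can be
# shown end to end (counts come first, largest on top).
awk '$2 ~ /job-manager/ {print $1}' <<'EOF' | sort | uniq -c | sort -rn
lhcb001  globus-job-manager
lhcb001  globus-job-manager
lhcb001  globus-job-manager
dteam002 globus-job-manager
root     edg-gatekeeper
EOF
```

The gatekeeper line is deliberately ignored by the `awk` filter, so only the per-account job-manager counts come out.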
The ssh keys are properly configured for scp between the WN and the
CE, and GridFTP is working fine on both the RB and the CE, so I presume
it is the communication between the WN and the RB that is not working.
I launched this command to test it, but I get this error:
[lxplus005] /afs/cern.ch/user/c/cborrego > globus-job-run
ce04.pic.es/jobmanager-lcgpbs /opt/globus/bin/globus-url-copy
file:///etc/group gsiftp://rb01.pic.es/tmp/junk
Creating /home/pbsWD_dteam002_239693.pbs01.pic.es
Removing /home/pbsWD_dteam002_239693.pbs01.pic.es
Job finished
--------------------------------------------
host: td135.pic.es
cpu time:
elapsed time:
memory:
virtual memory:
job submitted at: Wed Aug 30 19:02:58
job started at: Wed Aug 30 00:00:??
job ended at: Wed Aug 30 19:03:06
--------------------------------------------
submit-helper script running on host td135 gave error: cache_export_dir
(/home/dteam002/.lcgjm/globus-cache-export.pi3548) on gatekeeper did not
contain a cache_export_dir.tar archive
Is this the proper way to test it? What does "did not contain a
cache_export_dir.tar archive" mean? Is this related in some way to the
fact that we have plenty of jobs in Waiting state in PBS (job_state = W)?
These are jobs with an assigned WN that are not running, mostly from the
lhcb VO.
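To see who owns the jobs stuck in W, counting `qstat -a` lines by state and owner may help. The sketch below assumes the standard `qstat -a` column layout (username in column 2, state in column 10) and uses an invented sample listing in place of real scheduler output.

```shell
# Count waiting (state W) jobs per username from `qstat -a` style output.
# On the PBS server the real pipeline would be:
#   qstat -a | awk '$10 == "W" {print $2}' | sort | uniq -c | sort -rn
# The here-document is an invented sample in the usual qstat -a layout:
# Job ID, Username, Queue, Jobname, SessID, NDS, TSK, Memory, Time, S, Time.
awk '$10 == "W" {print $2}' <<'EOF' | sort | uniq -c | sort -rn
239693.pbs01 dteam002 dteam STDIN 12345 1 -- -- 48:00 R 00:01
239694.pbs01 lhcb001  lhcb  STDIN --    1 -- -- 48:00 W --
239695.pbs01 lhcb001  lhcb  STDIN --    1 -- -- 48:00 W --
EOF
```

Running jobs (state R) are filtered out, so the output is just a per-account count of the stuck jobs.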
Thanks a lot!
Carlos
Carlos Borrego Iglesias wrote:
> Hello,
> In our lcg-CE running gLite 3.0.2 we are having serious problems with
> its load. We find something like 20-30 gatekeeper processes
> (/opt/edg/sbin/edg-gatekeeper -conf
> /opt/globus/etc/globus-gatekeeper.conf), some of them belonging to root
> and some others to pool accounts. Is this normal?
>
> The CE has an enormous number of job-manager processes and its load
> reaches 20. Meanwhile, jobs arrive at our PBS server and stay in
> Waiting state, and we don't know if that is a cause or a consequence.
> The sshd daemon also hangs because of this load, so when jobs finish
> on the WN they are unable to send their output back to the CE.
>
> Any ideas of what's happening?
>
> Thanks a lot
> Carlos
>