Hello,
We have found the problem. We didn't reconfigure the WNs after the
upgrade to gLite 3.0.2, as described in the gLite update instructions
(http://glite.web.cern.ch/glite/packages/R3.0/updates.asp). The issue is
that the cron job that fetches the CRLs has changed and is now called
fetch-crl instead of edg-fetch-crl:
[root@td133 root]# cat /etc/cron.d/edg-fetch-crl
PATH=/sbin:/bin:/usr/sbin:/usr/bin
10 3,9,15,21 * * * root /opt/edg/etc/cron/edg-fetch-crl-cron >>
/var/log/edg-fetch-crl-cron.log 2>&1
[root@td133 root]# ls -la /opt/edg/etc/cron/edg-fetch-crl-cron
ls: /opt/edg/etc/cron/edg-fetch-crl-cron: No such file or directory
This script was shipped in the edg-utils-system RPM, which is not part
of the gLite 3.0.2 package list:
http://glite.web.cern.ch/glite/packages/R3.0/deployment/glite-WN/glite-WN.asp
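Stale entries like this one can also be spotted mechanically. Below is a minimal sketch (the check_cron_dir helper is ours, not part of gLite or YAIM) that scans a cron directory in /etc/cron.d format and reports entries whose command path no longer exists on disk:

```shell
# check_cron_dir DIR: report cron entries in DIR whose command is missing.
# Assumes /etc/cron.d style lines: min hour dom mon dow user command ...
check_cron_dir() {
    dir="$1"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Skip comments and VAR=value assignments; the command is field 7.
        awk '!/^#/ && !/^[A-Za-z_]+=/ && NF >= 7 { print $7 }' "$f" |
        while read -r cmd; do
            [ -e "$cmd" ] || printf 'stale: %s -> %s\n' "$f" "$cmd"
        done
    done
}
```

Running it against /etc/cron.d on the unconfigured WN would have flagged the edg-fetch-crl entry pointing at the removed edg-fetch-crl-cron script.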
We solved the issue by configuring the WN with the config_crl YAIM
function, which creates the new cron entry:
[root@td137 root]# /opt/glite/yaim/scripts/run_function
/opt/local/yaim/pic-site-info.def config_crl
Using hostname: td137.pic.es
Assuming the node types: WN
Configuring config_crl
Removing /etc/cron.d/edg-fetch-crl
Now updating the CRLs - this may take a few minutes...
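For anyone wanting to double-check the result on other WNs, here is a small sketch (our own helper, under the assumption that YAIM installs the new cron file as /etc/cron.d/fetch-crl and removes the old edg-fetch-crl one):

```shell
# verify_crl_cron DIR: return 0 only if DIR (normally /etc/cron.d)
# contains the new fetch-crl cron file and no leftover edg-fetch-crl.
# The file names are assumptions based on our site's layout.
verify_crl_cron() {
    dir="$1"
    status=0
    [ -f "$dir/fetch-crl" ] || { echo "missing: $dir/fetch-crl"; status=1; }
    [ ! -e "$dir/edg-fetch-crl" ] || { echo "leftover: $dir/edg-fetch-crl"; status=1; }
    return $status
}
```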
Cheers
Carlos
Carlos Borrego Iglesias wrote:
> Hello Again,
> There was an lhcb user who somehow had 50 job-manager processes that
> were overwhelming our CE. We have banned the user and the CE load now
> seems more or less normal. Our problem right now is that when a job
> finishes on the WN, the RB does not notice it: the job actually
> finishes, but we get "Aborted: Job RetryCount (3) hit".
>
> The ssh keys are properly configured to do an scp between the WN and
> the CE. GridFTP is working fine on both the RB and the CE, so I
> presume it is the communication between the WN and the RB that is not
> working properly.
>
> I have launched this command to test it, but I get this error:
>
> [lxplus005] /afs/cern.ch/user/c/cborrego > globus-job-run
> ce04.pic.es/jobmanager-lcgpbs /opt/globus/bin/globus-url-copy
> file:///etc/group gsiftp://rb01.pic.es/tmp/junk
> Creating /home/pbsWD_dteam002_239693.pbs01.pic.es
> Removing /home/pbsWD_dteam002_239693.pbs01.pic.es
>
> Job finished
> --------------------------------------------
>
> host: td135.pic.es
> cpu time:
> elapsed time:
> memory:
> virtual memory:
> job submitted at: Wed Aug 30 19:02:58
> job started at: Wed Aug 30 00:00:??
> job ended at: Wed Aug 30 19:03:06
>
> --------------------------------------------
> submit-helper script running on host td135 gave error:
> cache_export_dir (/home/dteam002/.lcgjm/globus-cache-export.pi3548) on
> gatekeeper did not contain a cache_export_dir.tar archive
>
> Is this the proper way to test it? What does "did not contain a
> cache_export_dir.tar archive" mean? Is this related in some way to the
> fact that we have plenty of jobs in Waiting state in PBS (job_state =
> W)? These are jobs with an assigned WN that are not running, mostly
> from the lhcb VO.
>
> Thanks a lot!
> Carlos
>
>
> Carlos Borrego Iglesias wrote:
>> Hello,
>> In our lcg-CE running gLite 3.0.2 we are having serious load
>> problems. We see some 20-30 gatekeeper processes
>> (/opt/edg/sbin/edg-gatekeeper -conf
>> /opt/globus/etc/globus-gatekeeper.conf), some belonging to root and
>> others to pool accounts. Is this normal?
>>
>> The CE has an enormous number of job-manager processes and its load
>> reaches 20. Meanwhile, jobs arrive at our PBS server and stay in
>> Waiting state; we don't know whether that is a cause or a
>> consequence. The sshd daemon also hangs under this load, so jobs that
>> finish on the WNs are unable to send their output back to the CE.
>>
>> Any ideas about what is happening?
>>
>> Thanks a lot
>> Carlos
>>
>