Hello Maarten,
We run /opt/lcg/sbin/cleanup-grid-accounts.sh on only one node of the
cluster (the CE's /opt/lcg/etc/cleanup-grid-accounts.conf lists all accounts).
As far as we can tell, GPFS is working fine.
We see in gram_job_mgr_<pid>.log for every globus-job-manager process these
entries:
7/16 08:32:04 JMI: poll: seeking: https://ce01.kallisto.hellasgrid.gr:20004/26569/1184563918/
7/16 08:32:04 JMI: poll_fast: ******** Failed to find https://ce01.kallisto.hellasgrid.gr/26569/1184563918/
7/16 08:32:04 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
7/16 08:32:04 JMI: cmd = poll
These entries repeat every 10 seconds. The globus-job-manager processes have
each been running for more than an hour.
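To see which job contacts keep failing the fast poll, the log entries quoted above can be tallied per URL. This is only a minimal sketch: it assumes each entry occupies a single line in the actual gram_job_mgr_<pid>.log file, and the function name count_poll_failures is made up for illustration.

```python
import re
from collections import Counter

# Matches entries such as:
#   7/16 08:32:04 JMI: poll_fast: ******** Failed to find https://host/26569/1184563918/
# Assumption: the whole entry is on one line in the real log file.
FAILED_POLL = re.compile(r"JMI: poll_fast: \*+ Failed to find\s+(\S+)")


def count_poll_failures(lines):
    """Return a Counter mapping job contact URL -> number of failed fast polls."""
    failures = Counter()
    for line in lines:
        match = FAILED_POLL.search(line)
        if match:
            failures[match.group(1)] += 1
    return failures
```

Passing an open log file to count_poll_failures() and printing the result of .most_common() shows the contacts that are polled in vain most often.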
Cheers,
George
-----Original Message-----
Sent: Wednesday, July 18, 2007 5:49 PM
To: Stathakopoulos George
Cc: LHC Computer Grid - Rollout
Subject: Re: [LCG-ROLLOUT] FW: [LCG-ROLLOUT] CE runs out of memory due to
many globus-job-manager processes
On Wed, 18 Jul 2007, Stathakopoulos George wrote:
> Hello,
>
> I'm reposting this because I didn't find anything that can help to
> solve this issue.
I did not see a reply to this message I sent earlier:
-----------------------------------------------------------------------------
On Fri, 13 Jul 2007, Stathakopoulos Giorgos wrote:
> Hello all,
>
> Our CE (ce01.kallisto.hellasgrid.gr) is overloaded by many
> globus-job-manager-related processes of the following kinds:
>
> 1) globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf
> -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> 2) /usr/bin/perl /opt/globus/libexec/globus-job-manager-script.pl -m
> pbs -f /tmp/gram_xxxxx -c remote_io_file_create
> 3) /opt/globus/libexec/globus-gass-cache-util -cleanup-tag -t
> https://ce01.kallisto.hellasgrid.gr:xxxxx/xxxxx/xxxxxxx
>
> The above processes start at a rate of about 50/hour and they stay
> running. After a few hours the CE runs out of memory and stops
> responding. We have to reboot it to get it back.
>
> We have the latest update of middleware installed.
>
> Any ideas?
In /var/log I noticed that /opt/lcg/sbin/cleanup-grid-accounts.sh last did
something on June 24:
---------------------------------------------------------------------------------------
-rw-r--r-- 1 root root    92 Jul 14 02:16 cleanup-grid-accounts.log
-rw-r--r-- 1 root root   107 Jul 13 02:16 cleanup-grid-accounts.log.1.gz
-rw-r--r-- 1 root root   107 Jul 12 03:14 cleanup-grid-accounts.log.2.gz
[...]
-rw-r--r-- 1 root root   107 Jun 25 02:10 cleanup-grid-accounts.log.18.gz
-rw-r--r-- 1 root root 18094 Jun 24 02:10 cleanup-grid-accounts.log.19.gz
-rw-r--r-- 1 root root 25510 Jun 22 05:03 cleanup-grid-accounts.log.20.gz
-rw-r--r-- 1 root root 21685 Jun 21 05:07 cleanup-grid-accounts.log.21.gz
---------------------------------------------------------------------------------------
This is because /opt/lcg/etc/cleanup-grid-accounts.conf ends like this:
---------------------------------------------------------------------------------------
# next lines added by YAIM on Thu Jul 12 13:19:27 EEST 2007
ACCOUNTS='
'
---------------------------------------------------------------------------------------
Any idea how that happened? Please try the following:
---------------------------------------------------------------------------------------
/opt/glite/yaim/bin/yaim -r -s your-site-info.def -f config_users
---------------------------------------------------------------------------------------
Then check if /opt/lcg/etc/cleanup-grid-accounts.conf lists all grid
accounts.
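A quick way to make that check is to read the ACCOUNTS assignment back out of the file. The sketch below is an illustration only: it assumes the file uses a shell-style single-quoted ACCOUNTS='...' assignment as in the excerpt above, that a populated list is whitespace-separated, and that the last assignment in the file wins. The function name configured_accounts is hypothetical.

```python
import re


def configured_accounts(conf_text):
    """Return the account names from the last ACCOUNTS='...' assignment.

    Assumptions: shell-style single quotes, names separated by whitespace
    (including newlines), and the last assignment overriding earlier ones.
    """
    assignments = re.findall(r"^ACCOUNTS='([^']*)'", conf_text, flags=re.MULTILINE)
    if not assignments:
        return []
    return assignments[-1].split()
```

For example, configured_accounts(open("/opt/lcg/etc/cleanup-grid-accounts.conf").read()) returns an empty list for a file that ends with an empty ACCOUNTS block, which would mean the cleanup script has nothing to work on.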
If the grid accounts do not get cleaned up regularly, that could slow things down a lot.
You use GPFS for the home directories: maybe it has problems with the large
numbers of hard links under the .globus/.gass_cache subdirectories?
Is GPFS in good shape? Any hardware errors?
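To get a feeling for how many hard links have piled up, one could walk a cache directory and count the files whose link count is greater than one. This is a rough sketch, not a GPFS diagnostic: the function name hard_link_stats is made up, and the directory to scan (for example one account's .globus/.gass_cache) has to be supplied by the caller.

```python
import os
import stat


def hard_link_stats(root):
    """Walk root and return (regular_files, files_with_extra_links, max_link_count)."""
    total = multi = max_links = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                info = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # ignore symlinks, sockets, and other non-regular files
            total += 1
            if info.st_nlink > 1:
                multi += 1
            max_links = max(max_links, info.st_nlink)
    return total, multi, max_links
```

Running it once per home directory and comparing the numbers between accounts should show whether a few accounts hold most of the links.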