Hello,
I'm reposting this because I haven't found anything that helps solve this
issue.
I ran "lsof | grep <pid of a globus-job-manager process>" and found two
things:
1) Most of these processes have a TCP connection to an RB (e.g.
rb127.cern.ch, rb105.cern.ch) in state CLOSE_WAIT (see the sketch after
the log excerpts below).
2) The gram_job_mgr_26569.log (for process 26569) repeats the following
entries every 10 seconds:
---1st entry---
Mon Jul 16 08:32:04 2007 JM_SCRIPT: New Perl JobManager created.
7/16 08:32:04 JMI: while return_buf = GRAM_SCRIPT_JOB_ID = 26900
7/16 08:32:04 JMI: while return_buf = GRAM_SCRIPT_JOB_STATE = 2
7/16 08:32:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_SUBMIT
7/16 08:32:04 JM: in globus_gram_job_manager_reporting_file_create()
7/16 08:32:04 JM: not reporting job information
7/16 08:32:04 JM: in globus_gram_job_manager_history_file_create()
7/16 08:32:04 JM: empty client callback list.
7/16 08:32:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
7/16 08:32:04 JMI: testing job manager scripts for type fork exist and permissions are ok.
7/16 08:32:04 JMI: completed script validation: job manager type is fork.
7/16 08:32:04 JMI: in globus_gram_job_manager_poll()
7/16 08:32:04 JMI: local stdout filename = /home/lhcb/lhcb053/.globus/.gass_cache/local/md5/25/3217c320d56f835bc04aa62606f270/md5/a5/f254d37548bd58377dcb593b574005/data.
7/16 08:32:04 JMI: local stderr filename = /dev/null.
7/16 08:32:04 JMI: poll: seeking: https://ce01.kallisto.hellasgrid.gr:20004/26569/1184563918/
7/16 08:32:04 JMI: poll_fast: ******** Failed to find https://ce01.kallisto.hellasgrid.gr/26569/1184563918/
7/16 08:32:04 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
7/16 08:32:04 JMI: cmd = poll
7/16 08:32:04 JMI: returning with success
---2nd entry---
Mon Jul 16 08:32:04 2007 JM_SCRIPT: New Perl JobManager created.
Mon Jul 16 08:32:04 2007 JM_SCRIPT: polling job 26900
7/16 08:32:04 JMI: while return_buf = GRAM_SCRIPT_JOB_STATE = 2
7/16 08:32:04 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1
7/16 08:32:14 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
7/16 08:32:14 JMI: testing job manager scripts for type fork exist and permissions are ok.
7/16 08:32:14 JMI: completed script validation: job manager type is fork.
7/16 08:32:14 JMI: in globus_gram_job_manager_poll()
7/16 08:32:14 JMI: local stdout filename = /home/lhcb/lhcb053/.globus/.gass_cache/local/md5/25/3217c320d56f835bc04aa62606f270/md5/a5/f254d37548bd58377dcb593b574005/data.
7/16 08:32:14 JMI: local stderr filename = /dev/null.
7/16 08:32:14 JMI: poll: seeking: https://ce01.kallisto.hellasgrid.gr:20004/26569/1184563918/
7/16 08:32:14 JMI: poll_fast: ******** Failed to find https://ce01.kallisto.hellasgrid.gr/26569/1184563918/
7/16 08:32:14 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
7/16 08:32:14 JMI: cmd = poll
7/16 08:32:14 JMI: returning with success
---3rd entry---
Same as above
...
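In case it is useful to reproduce this, here is roughly how the above can
be collected for all job managers at once (an untested sketch using
standard Linux pgrep/lsof; the log file name is taken from the excerpt
above, and its location is assumed, so adjust for your setup):

# Sketch: show each globus-job-manager's TCP peers stuck in CLOSE_WAIT
# (the peer column is the RB/WMS host the job manager was talking to).
for pid in $(pgrep -f globus-job-manager); do
    echo "=== PID $pid ==="
    lsof -Pn -a -p "$pid" -i TCP | grep CLOSE_WAIT
done

# Sketch: count how often the fast-poll failure recurs in one log
# (file name from the excerpt above; its location may vary).
grep -c 'poll_fast: .* Failed to find' gram_job_mgr_26569.log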
Can we do something about this? Is this a configuration issue?
Thanks again,
George
-----Original Message-----
From: [log in to unmask] [mailto:[log in to unmask]]
Sent: Saturday, July 14, 2007 1:48 AM
To: Kyriakos Ginis; Stathakopoulos Giorgos
Cc: [log in to unmask]
Subject: Re: [LCG-ROLLOUT] CE runs out of memory due to many globus-job-manager processes
On Sat, 14 Jul 2007, Kyriakos Ginis wrote:
> On Fri, Jul 13, 2007 at 05:21:19PM +0300, Stathakopoulos Giorgos wrote:
> > Hello all,
> >
> > Our CE (ce01.kallisto.hellasgrid.gr) is overloaded due to many
> > globus-job-manager and related processes:
> >
> > 1) globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > 2) /usr/bin/perl /opt/globus/libexec/globus-job-manager-script.pl -m pbs -f /tmp/gram_xxxxx -c remote_io_file_create
> > 3) /opt/globus/libexec/globus-gass-cache-util -cleanup-tag -t https://ce01.kallisto.hellasgrid.gr:xxxxx/xxxxx/xxxxxxx
> >
> > The above processes appear at a rate of about 50/hour and stay
> > running. After a few hours the CE runs out of memory and stops
> > responding; we have to reboot it to bring it back.
> >
> > We have the latest update of middleware installed.
> >
> > Any ideas?
>
>
> Hello,
>
> We have also observed a recent increase in the number of processes
> spawned through the fork jobmanager. Has anything changed recently in
> the way jobs are submitted and monitored by the RB/WMS?
No. Can you check whether those processes come from a particular RB or WMS?
Look e.g. with "netstat -a" (see the sketch below). Which accounts show
that problem?
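For example, something along these lines should show both at once (a rough
sketch, not tested here; it assumes Linux net-tools netstat run as root so
-p can show process names, which ps/netstat truncate, hence the shortened
match string):

# Sketch: group the job managers' TCP peers by remote host, to see
# whether one RB/WMS dominates the connections.
netstat -atnp | grep globus-job-man | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn

# Sketch: count running job managers per local account.
ps -eo user,comm | grep globus-job-man | awk '{print $1}' | sort | uniq -c | sort -rn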