On Sat, Jul 14, 2007 at 12:48:00AM +0200, Maarten Litmaath, CERN wrote:
> On Sat, 14 Jul 2007, Kyriakos Ginis wrote:
>
> > On Fri, Jul 13, 2007 at 05:21:19PM +0300, Stathakopoulos Giorgos wrote:
> > > Hello all,
> > >
> > > Our CE (ce01.kallisto.hellasgrid.gr) is overloaded due to many
> > > globus-job-manager processes of
> > >
> > > 1) globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf
> > > -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> > > 2) /usr/bin/perl /opt/globus/libexec/globus-job-manager-script.pl -m pbs
> > > -f /tmp/gram_xxxxx -c remote_io_file_create
> > > 3) /opt/globus/libexec/globus-gass-cache-util -cleanup-tag -t
> > > https://ce01.kallisto.hellasgrid.gr:xxxxx/xxxxx/xxxxxxx
> > >
> > > Above processes start with a ratio of about 50/hour and they stay
> > > running. After a few hours CE stops responding and it runs out of
> > > memory. We have to reboot it to get it back.
> > >
> > > We have the latest update of middleware installed.
> > >
> > > Any ideas?
> >
> >
> > Hello,
> >
> > We also have observed a recent increase in the processes spawned through
> > the fork jobmanager. Has anything been changed recently regarding the
> > way the jobs are submitted and monitored by the RB/WMS?
>
> No. Can you check if those processes are coming from a certain RB or WMS?
> Look e.g. with "netstat -a". Which accounts show that problem?
I temporarily disabled the queue for the SEE VO (the regional VO) a few
minutes after sending the mail, because the problem was caused by a
local user. Apparently this user runs a Monte Carlo simulation and
submits jobs in bulk, which at times creates a very high load on
the CEs.

For example, I got the following output from ps:
[root@ce01 root]# ps -fwwwu seeXXX
UID PID PPID C STIME TTY TIME CMD
seeXXX 1981 1 0 22:08 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
-machine-type unknown -publish-jobs
seeXXX 2098 1 0 22:08 ? 00:00:00 perl
/home/see/seeXXX/.globus/.gass_cache/local/md5/09/be70e36032517f5c8c83d0ae4d245b/md5/a9/a3659099e3a8aa1732ea6f22c82b80/data
--dest-url=https://wms01.egee-see.org:20003/tmp/condor_g_scratch.0x899d990.19489/grid-monitor.ce01.athena.hellasgrid.gr:2119
.2316/grid-monitor-job-status
seeXXX 2099 2098 0 22:08 ? 00:00:05 perl
/tmp/grid_manager_monitor_agent.seeXXX.2098.1000 --delete-self
--maxtime=3600s
seeXXX 24399 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 24400 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 24401 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 24411 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 24630 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 24631 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 24654 24399 0 22:25 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_cPOePN -c remove_scratchdir
seeXXX 24655 24401 0 22:25 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_dBTaxf -c remove_scratchdir
seeXXX 24656 24400 0 22:25 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_Ltw170 -c remove_scratchdir
seeXXX 24657 24411 0 22:25 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_dvGF5t -c remove_scratchdir
seeXXX 24693 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 24751 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 24932 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 24937 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 24989 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 25107 24631 0 22:25 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_CFFe0B -c remove_scratchdir
seeXXX 25108 24630 0 22:25 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_ACS5nl -c remove_scratchdir
seeXXX 25109 24693 0 22:25 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_lg1sl4 -c remove_scratchdir
seeXXX 25124 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 25130 24751 0 22:25 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_7WKedK -c remove_scratchdir
seeXXX 25276 1 0 22:26 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 25465 1 0 22:26 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 25497 24932 0 22:26 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_UK7epp -c remove_scratchdir
seeXXX 25503 24937 0 22:26 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_GLcfy9 -c remove_scratchdir
seeXXX 25504 24989 0 22:26 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_kqcmVh -c remove_scratchdir
seeXXX 25518 25124 0 22:26 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_Iq05K1 -c remove_scratchdir
seeXXX 25578 1 0 22:26 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 25585 1 0 22:26 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 25597 1 0 22:26 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 25605 25276 0 22:26 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_gGrLG4 -c remove_scratchdir
seeXXX 25762 25465 0 22:26 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_tYWDoa -c remove_scratchdir
seeXXX 26012 25597 0 22:26 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_kxANP0 -c remove_scratchdir
seeXXX 26013 25585 0 22:26 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_3YOIuE -c remove_scratchdir
seeXXX 26014 25578 0 22:26 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_X53VmY -c remove_scratchdir
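
In case it helps other sites triage the same symptom, here is a minimal
sketch that tallies a ps listing like the one above by job manager type
and by script command. It only parses text that has already been
captured (for example from "ps -fwwwu <account>"). The function name
summarize, the category labels, and the shortened sample are my own
choices for illustration and are not part of Globus or the middleware.

```python
import re
from collections import Counter

# Matches the job manager itself; the captured group is the -type value
# (fork, pbs, ...). The required whitespace after "globus-job-manager"
# keeps this from matching the .conf path or the -script.pl helper.
JOB_MANAGER = re.compile(r"globus-job-manager\s+-conf\s+\S+\s+-type\s+(\w+)")

# Matches the perl helper; captures the -m value and the -c command.
SCRIPT = re.compile(
    r"globus-job-manager-script\.pl\s+-m\s+(\w+)\s+-f\s+\S+\s+-c\s+(\w+)"
)


def summarize(ps_text):
    """Count job managers and helper scripts in captured ps output.

    Line wrapping (as introduced by a mail client) is tolerated because
    all whitespace is collapsed before matching.
    """
    flat = " ".join(ps_text.split())
    counts = Counter()
    for match in JOB_MANAGER.finditer(flat):
        counts["job-manager/" + match.group(1)] += 1
    for match in SCRIPT.finditer(flat):
        counts["script/%s/%s" % (match.group(1), match.group(2))] += 1
    return counts


# Shortened sample taken from the listing above (three processes).
sample = """\
seeXXX 1981 1 0 22:08 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
-machine-type unknown -publish-jobs
seeXXX 24399 1 0 22:25 ? 00:00:00 globus-job-manager -conf
/opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
-machine-type unknown -publish-jobs
seeXXX 24654 24399 0 22:25 ? 00:00:00 /usr/bin/perl
/opt/globus/libexec/globus-job-manager-script.pl -m pbs -f
/tmp/gram_cPOePN -c remove_scratchdir
"""

for label, number in sorted(summarize(sample).items()):
    print(label, number)
```

Running the same function over successive snapshots for one account
should show whether the pbs job managers and their remove_scratchdir
helpers keep accumulating or eventually drain.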
--
Kyriakos Ginis