On Thu, 19 Jul 2007 [log in to unmask] wrote:
> On Thu, 19 Jul 2007, Stathakopoulos George wrote:
>
> > We run /opt/lcg/sbin/cleanup-grid-accounts.sh only at one node of the
> > cluster (CE's /opt/lcg/etc/cleanup-grid-accounts.conf has all accounts). As
> > far as we can check, GPFS is working fine.
>
> OK.
I see you only re-enabled that cleanup on Saturday, and the CE has been up
for 6 days now...
> > We see in gram_job_mgr_<pid>.log for every globus-job-manager process these
> > entries:
> >
> > 7/16 08:32:04 JMI: poll: seeking:
> > https://ce01.kallisto.hellasgrid.gr:20004/26569/1184563918/
> > 7/16 08:32:04 JMI: poll_fast: ******** Failed to find
> > https://ce01.kallisto.hellasgrid.gr/26569/1184563918/
> > 7/16 08:32:04 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl
> > scripts)
> > 7/16 08:32:04 JMI: cmd = poll
> >
> > every 10 seconds. Globus-job-manager processes are running for more than an
> > hour each.
>
> A lot of those messages appear even if everything is working fine.
>
> You reported the problem on Fri. the 13th (sic): what changed in your cluster
> that day or one day before?
>
> Check for APT auto-update logs and similar things. Maybe something on the
> GPFS server? If it has some issue, it could affect the clients a lot.
I happened to capture "ps" output on Saturday while the CE was thrashing,
and found a lot of grid processes in the 'D' state (uninterruptible sleep,
typically waiting on disk I/O):
[...]
see014 10782 0.0 0.1 5404 3492 ? D 03:16 0:00 globus-job-manager
    -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
    -machine-type unknown -publish-jobs
dteam082 10853 0.0 0.1 5148 3064 ? D 03:16 0:00 globus-job-manager
    -conf /opt/globus/etc/globus-job-manager.conf -type pbs -rdn jobmanager-pbs
    -machine-type unknown -publish-jobs
atlas162 11252 0.0 0.1 4872 2856 ? D 03:16 0:00 globus-job-manager
    -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
    -machine-type unknown -publish-jobs
lhcb052 11542 0.0 0.1 5496 3636 ? D 03:16 0:00 globus-job-manager
    -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
    -machine-type unknown -publish-jobs
[...]
This would be consistent with GPFS having some issue, e.g. being really slow.
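
If you want to check for the same symptom the next time the CE gets slow, something
along these lines might help. It is only a rough sketch: it assumes "ps aux" style
output (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND, as in the listing
above), and the function names d_state_processes and summarize are made up for
illustration. It counts processes in the 'D' state per user:

```python
import subprocess
from collections import Counter


def d_state_processes(ps_text):
    """Return (user, pid, command) for each process whose STAT starts with 'D'.

    Assumes `ps aux` column order: the state is the 8th field and the
    command line is everything from the 11th field onwards.
    """
    found = []
    for line in ps_text.splitlines():
        fields = line.split(None, 10)
        # Skip the header line and anything that does not look like a process row.
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        user, pid, stat, command = fields[0], int(fields[1]), fields[7], fields[10]
        if stat.startswith("D"):
            found.append((user, pid, command))
    return found


def summarize(ps_text):
    """Count D-state processes per user."""
    return Counter(user for user, _pid, _cmd in d_state_processes(ps_text))


if __name__ == "__main__":
    # Take a snapshot of the current process table and print the per-user counts.
    out = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    ).stdout
    for user, count in summarize(out).most_common():
        print(f"{count:5d}  {user}")
```

A single snapshot can be misleading, since processes pass through 'D' briefly all
the time. Running it a few times, some seconds apart, and seeing the same grid
accounts pile up would be a stronger hint that the file system is the bottleneck.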