Hi Maarten,
Thanks for your explanations. After a quiet period, the problem seems to
have reappeared, so I am now able to send you a "ps" result:
> [root@cclcgceli02 ~]$ ps -elf | grep globus-job-manager | grep cms050
> | wc -l
> 295
> [root@cclcgceli02 ~]$ ps -elf | grep globus-job-manager | grep cms050
> | more
> 0 S cms050 32096 1 0 75 0 - 1355 schedu 11:19 ?
> 00:00:01 globus-job-manager -conf /opt/globus/etc/globus-job-manager.c
> onf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> 0 S cms050 8145 1 0 76 0 - 1354 schedu 11:37 ?
> 00:00:01 globus-job-manager -conf /opt/globus/etc/globus-job-manager.c
> onf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> 0 S cms050 9670 1 0 75 0 - 1291 schedu 11:38 ?
> 00:00:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.c
> onf -type bqs -rdn jobmanager-bqs -machine-type unknown -publish-jobs
> 0 S cms050 9677 1 0 75 0 - 1292 schedu 11:38 ?
> 00:00:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.c
> onf -type bqs -rdn jobmanager-bqs -machine-type unknown -publish-jobs
> 0 S cms050 9755 1 0 75 0 - 1252 schedu 11:38 ?
> 00:00:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.c
> onf -type bqs -rdn jobmanager-bqs -machine-type unknown -publish-jobs
> 0 S cms050 9761 1 0 75 0 - 1252 schedu 11:38 ?
> 00:00:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.c
> onf -type bqs -rdn jobmanager-bqs -machine-type unknown -publish-jobs
> 0 S cms050 9907 1 0 75 0 - 1251 schedu 11:38 ?
> 00:00:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.c
> onf -type bqs -rdn jobmanager-bqs -machine-type unknown -publish-jobs
> 0 S cms050 9912 1 0 75 0 - 1252 schedu 11:38 ?
> 00:00:00 globus-job-manager -conf /opt/globus/etc/globus-job-manager.c
> onf -type bqs -rdn jobmanager-bqs -machine-type unknown -publish-jobs
There are currently 3 different certificates mapped to cms050 (the
account we use to map the CMS production role).
On the cluster, cms050 currently has:
- 527 running jobs
- 1208 queued jobs
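To see how those lingering processes split between the fork and bqs
jobmanagers, something like the sketch below could be run on the CE. This
is only an illustration, not standard tooling: the here-document stands in
for the live "ps -elf | grep globus-job-manager | grep cms050" output, and
the helper name count_by_type is my own invention.

```shell
# Hypothetical helper: tally the word that follows each "-type" flag
# in the globus-job-manager command lines, then count occurrences.
count_by_type() {
    awk '{ for (i = 1; i <= NF; i++) if ($i == "-type") print $(i+1) }' \
        | sort | uniq -c
}

# Sample input standing in for the real ps pipe shown above:
count_by_type <<'EOF'
globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type bqs -rdn jobmanager-bqs
globus-job-manager -conf /opt/globus/etc/globus-job-manager.conf -type bqs -rdn jobmanager-bqs
EOF
```

On the real host, piping the ps output into count_by_type would show at a
glance whether the pile-up is dominated by fork or bqs jobmanagers.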
> On job submission, each RB has at most 5 concurrent globus-job-manager
> processes per user per CE (each should exit soon); when jobs are being
> cancelled or cleaned up, however, there is no such limit (a bug that
> has only been fixed in a very recent version of Condor).
Could it be a "misunderstanding" between this CE and an RB?
>
> So, either those 232 processes were cancelling or cleaning up jobs,
> or there was some other problem causing them to pile up: were there
> any complaints in /var/log/messages about something unusual, e.g.
> file system full or I/O errors?
I only found some occurrences of the errors below:
> Feb 1 09:48:38 cclcgceli02 edg-wl-interlogd[1365]: error reading
> server cert-rb-07.cnaf.infn.it reply: get_reply (header)
> Feb 1 09:55:07 cclcgceli02 edg-wl-logd[13990]:
> edg_wll_log_proto_server: edg_wll_ParseEvent error
> Feb 1 09:55:24 cclcgceli02 edg-wl-logd[14207]:
> edg_wll_log_proto_server: edg_wll_ParseEvent error
> Feb 1 09:55:28 cclcgceli02 edg-wl-logd[14218]:
> edg_wll_log_proto_server: edg_wll_ParseEvent error
For now, there is no "globus_duct_control" error, but I expect some will
appear if the number of globus-job-manager processes grows again.
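To catch the next pile-up as it happens, one option would be to log the
process count periodically and correlate it with the RB activity in the
logs. A minimal sketch, assuming nothing beyond standard ps and grep (the
helper name count_jm and the log path are my own):

```shell
# Hypothetical helper: count globus-job-manager processes owned by the
# given account. The [g] bracket trick stops grep from matching itself.
count_jm() {
    ps -elf | grep "[g]lobus-job-manager" | grep -c "$1"
}

# e.g. sampled every 5 minutes from cron:
# */5 * * * * root echo "$(date '+%b %e %T') $(count_jm cms050)" >> /var/log/jm-count.log
```

A timestamped count like this would make it easy to see when the growth
starts and which RB was submitting or cancelling jobs at that moment.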
Thanks for your help
Pierre
--
______________________
Pierre GIRARD
French ROC deputy (EGEE/LCG)
Grid Computing Team Member
IN2P3/CNRS Computing Centre - Lyon (FRANCE)
e-mail: [log in to unmask]
Tel. +33 4.72.69.52.89
http://cc.in2p3.fr
CCIN2P3 Tel. +33 4.78.93.08.80 | CCIN2P3 Fax. +33 4.72.69.41.70