Hi Maarten,
> no, we have a long time not touching the configurations of w-ce01 which is
> also an old CE box (i plan to replace with slc4 lcgCE version but yet have
> time to proceed further), the only error i can find from gatekeeper log is
> 'Generic verification error for VOMS (failure)!' which shall be ignore
> anyway and might be irrelevant to this issue as well.
>
> the other error related to the invalid proxy, that should also have
> limited impact to the stability of the CE box. though there are more than
> 16k entries referring to same error:
Just to paste you the current server load status after the reboot an hour
ago: we now have 537 globus-job-manager processes coming through the WMS at
wms017.cnaf.infn.it. Is this normal?
# top -bn1 |grep cms001| awk '{print $(NF)}' | sort | uniq -c
25 <defunct>
11 edg-gridftpd
1 globus-gass-cac
537 globus-job-mana
1 mktemp
71 perl
1 qstat
1 sh
11:04:13 up 1:56, 1 user, load average: 52.19, 49.83, 45.80
1656 processes: 1528 sleeping, 56 running, 59 zombie, 13 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 352.8% 0.0% 46.8% 0.0% 0.0% 0.0% 0.0%
cpu00 90.6% 0.0% 9.3% 0.0% 0.0% 0.0% 0.0%
cpu01 90.0% 0.0% 9.9% 0.0% 0.0% 0.0% 0.0%
cpu02 93.0% 0.0% 6.9% 0.0% 0.0% 0.0% 0.0%
cpu03 79.0% 0.0% 20.9% 0.0% 0.0% 0.0% 0.0%
Mem: 4087856k av, 4036180k used, 51676k free, 0k shrd, 19720k buff
3018968k actv, 577072k in_d, 78312k in_c
Swap: 4192956k av, 1357660k used, 2835296k free 210948k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
15648 cms001 16 0 11008 10M 1728 S 18.6 0.2 1:00 0 globus-job-mana
23137 cms001 17 0 11008 10M 1728 R 16.9 0.2 1:54 2 globus-job-mana
29637 cms001 15 0 11008 10M 1728 R 14.0 0.2 1:35 0 globus-job-mana
19118 cms001 17 0 11008 10M 1728 S 13.4 0.2 1:50 0 globus-job-mana
4950 cms001 16 0 11076 8716 1728 S 13.1 0.2 2:17 2 globus-job-mana
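
For reference, the per-command count shown earlier could also be taken from ps instead of top. This is only a minimal sketch: the helper name count_by_command is made up for illustration, and it assumes a procps-style ps that accepts -u and -o comm=. It tallies the last field of each line, the same way the awk step in the top pipeline does.

```shell
# Count processes per command name, most frequent first.
# Reads one process per line on stdin and tallies the last
# whitespace-separated field (blank lines are skipped).
count_by_command() {
    awk 'NF { print $NF }' | sort | uniq -c | sort -rn
}

# Hypothetical usage on the CE (user name taken from the output above):
#   ps -u cms001 -o comm= | count_by_command
```

Sorting by count puts the dominant process type on the first line, which makes it easier to compare snapshots taken at different times after a reboot.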
thanks
Br,
J