Dear all,
Since early this morning I have been observing extremely high system load
and rapidly growing memory usage, which make the system inaccessible. We
have had to force-reboot the server twice since this morning. After a
reboot the load drops back to normal, but it quickly ramps up again to
around 1k, with more than 3k gatekeeper processes:
$ cessh w-ce01 top -bn1 | grep edg-gatekeepe .tmp2 | wc -l
3745
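To see which accounts own these gatekeeper processes, a per-user count could be taken along the following lines. This is only a sketch: `count_by_user` is a hypothetical helper, not part of the CE tooling, and the usage line assumes a procps-style `ps` on the CE.

```shell
# Hypothetical helper: count processes per user whose command name starts
# with the given prefix. Reads "user command" pairs, one per line, on stdin
# and prints "count user", highest count first.
count_by_user() {
    awk -v prefix="$1" '
        index($2, prefix) == 1 { n[$1]++ }
        END { for (u in n) print n[u], u }
    ' | sort -rn
}

# Usage on the CE (assumption: procps ps; the trailing "=" suppresses headers):
#   ps -eo user=,comm= | count_by_user edg-gatekeepe
```

If a single account owns most of the processes, that would point to one submitter rather than a general gatekeeper problem.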
The CE tag we have is 3.0.14-0. Just wondering whether this is a known
issue?
*) Snapshot of the system top info:
05:27:10 up 2 min, 1 user, load average: 13.34, 4.09, 1.43
233 processes: 205 sleeping, 27 running, 1 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 350.4% 0.0% 48.4% 0.0% 0.8% 0.0% 0.0%
cpu00 85.1% 0.0% 14.0% 0.0% 0.8% 0.0% 0.0%
cpu01 91.8% 0.0% 8.1% 0.0% 0.0% 0.0% 0.0%
cpu02 83.4% 0.0% 16.5% 0.0% 0.0% 0.0% 0.0%
cpu03 90.0% 0.0% 9.9% 0.0% 0.0% 0.0% 0.0%
Mem: 4087856k av, 723084k used, 3364772k free, 0k shrd, 49376k buff
487080k active, 157320k inactive
Swap: 4192956k av, 0k used, 4192956k free 226844k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
5444 cmsprd 19 0 9900 9900 1828 R 27.2 0.2 0:11 0 perl
8366 cms001 21 0 7048 7048 1716 R 20.6 0.1 0:00 0 globus-job-mana
8262 cms001 21 0 9972 9972 1712 R 16.5 0.2 0:00 3 globus-job-mana
5296 cms001 17 0 71972 70M 1808 S 15.6 1.7 0:06 1 perl
7831 cms001 16 0 7052 7052 1716 S 15.6 0.1 0:02 2 globus-job-mana
The system load over the past day can be seen at
http://idv.sinica.edu.tw/hlshih/asgc_wce01_load_06022008_day.jpg and the
snapshot taken just before we rebooted the box 5 minutes ago is at
http://idv.sinica.edu.tw/hlshih/asgc_wce01_load_06022008.jpg
Thanks,
Br,
J
--
-----
Jason Shih, ASGC/OPS
Tel: +886-2-2788-0058 x1005 or 1006
Fax: +886-2-2789-6793