Hi Maarten,
> > Since earlier this morning, I have observed extremely high system load and
> > memory usage ramping up quickly, which makes the system inaccessible. We
> > have force-rebooted the server twice since this morning. After a reboot, the
> > load returns to normal, but it quickly ramps up to 1k with more than
> > 3k gatekeeper processes:
> >
> > $ cessh w-ce01 top -bn1 | grep edg-gatekeepe .tmp2 | wc -l
> > 3745
>
> I do not remember having seen that one yet. Was anything changed lately?
> Any complaints in /var/log/messages, e.g. about a full file system or so?
>
> Can you check the output of netstat: maybe all those processes are connected
> to the same host? If so, block that host in the node/site firewall.
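For the netstat check, a minimal sketch of how the connections could be tallied per remote host. The function name count_peers is only illustrative, and it assumes the usual Linux `netstat -tn` column layout (foreign address in column 5, state in column 6):

```shell
# Tally established TCP connections per remote host.
# Reads `netstat -tn` style output on stdin (assumed layout:
# Proto Recv-Q Send-Q Local-Address Foreign-Address State).
count_peers() {
    awk '$1 ~ /^tcp/ && $6 == "ESTABLISHED" {
             sub(/:[0-9]+$/, "", $5)   # strip the trailing :port
             print $5
         }' | sort | uniq -c | sort -rn
}

# Usage on the CE:  netstat -tn | count_peers | head
```

If one address dominates the top of that list, that is the host to block in the firewall.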
Thanks.
No, we have not touched the configuration of w-ce01 for a long time. It is
also an old CE box (I plan to replace it with the SLC4 lcgCE version but have
not yet had time to proceed). The only error I can find in the gatekeeper log
is 'Generic verification error for VOMS (failure)!', which can be ignored
anyway and is probably irrelevant to this issue.
The other error relates to an invalid proxy; that should also have
limited impact on the stability of the CE box, although there are more than
16k entries referring to the same error:
--
JMA 2008/06/01 08:41:57 GATEKEEPER_JM_ID 2008-06-01.08:41:46.0000016353.0000086583 for /DC=org/DC=doegrids/OU=People/CN=Nurcan Ozturk 18551 on 130.199.54.53
JMA 2008/06/01 08:41:57 GATEKEEPER_JM_ID 2008-06-01.08:41:46.0000016353.0000086583 mapped to atlasprd (41000, 1307)
JMA 2008/06/01 08:41:57 GATEKEEPER_JM_ID 2008-06-01.08:41:46.0000016353.0000086583 has GRAM_SCRIPT_JOB_ID 1212309717:lcgpbs:internal_45700877:23072.1212309712 manager type lcgpbs
JMA 2008/06/01 08:42:01 GATEKEEPER_JM_ID 2008-06-01.08:41:46.0000016353.0000086583 JM exiting
ERROR: Couldn't find a valid proxy.
Use -debug for further information.
--
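To see whether those entries come from a single submit host, a rough sketch that counts job-manager entries per remote address. The function name count_submit_hosts is hypothetical, and it assumes every submission line has the same shape as the excerpt above ("... for <DN> on <IP>"):

```shell
# Count gatekeeper job-manager entries per submitting host.
# Reads the gatekeeper log on stdin and keeps only lines of the
# form "... GATEKEEPER_JM_ID <id> for <DN> on <IP>".
count_submit_hosts() {
    awk '/GATEKEEPER_JM_ID/ && $(NF-1) == "on" { print $NF }' \
        | sort | uniq -c | sort -rn
}

# Usage:  count_submit_hosts < GK_LOGFILE | head
# (GK_LOGFILE is a placeholder for the actual gatekeeper log path)
```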
To relieve the load, we have stopped the GridICE agent to avoid parsing the
GK logfile so often (although the load contributed by the GridICE daemon
should be limited as long as the GK logfile stays below 1 GB).
Any ideas?
Br,
J