Hi Gonzalo,
I have observed this as well; it happens when the globus-gatekeeper
log file that gridice greps becomes very large, so you could consider
rotating it more often. The Oxford guys also noticed that the version
of ssh in SL304 is very slow; upgrading to the SL305 version will
improve timing by one order of magnitude (see `time grep ...'). Are
you using SL304?
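For reference, rotating "more often" could be as simple as a logrotate
stanza along these lines (the log path and the intervals below are
assumptions; adjust them to your installation):

```
# /etc/logrotate.d/globus-gatekeeper -- path is an assumption
/var/log/globus-gatekeeper.log {
    # rotate every day instead of the default weekly
    daily
    # keep a week of old logs
    rotate 7
    compress
    missingok
    notifempty
}
```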
Yves
On Wed, 22 Mar 2006, Gonzalo Merino wrote:
> Hello,
>
> For quite a long time we have been seeing that our CE (which also
> hosts the Torque and Maui servers) is very heavily loaded. Sometimes
> this seems to make the Torque server itself hang.
> Looking at the Torque server log files we see lots (and lots) of
> "status queries" against the Torque server. They appear in
> /var/spool/pbs/server_logs/YYYYMMDD, for example:
>
> 03/22/2006 00:00:05;0100;PBS_Server;Req;;Type jobscript request received
> from [log in to unmask], sock=10
>
> 03/22/2006 00:00:01;0100;PBS_Server;Req;;Type statusqueue request
> received from [log in to unmask], sock=10
>
> 03/22/2006 00:00:05;0100;PBS_Server;Req;;Type statusjob request received
> from [log in to unmask], sock=9
>
> ...
>
> In our Torque server we typically see tens of them per minute.
> However, there is a curious pattern: every 20 minutes we see a peak
> of "statusjob request" queries, which can number in the hundreds in
> a single burst (this seems to correspond to the number of jobs in
> the system at that time).
>
> We think we can correlate the Torque/Maui hangs with these
> "statusjob query storms", so we wanted to understand what was
> causing them.
>
> The last thing I tried was to stop the edg-fmon-agent on the CE,
> and the storms seem to have stopped. I believe, then, that some
> fmon (gridice?) sensor was causing them.
>
> Does somebody know of a way to make these sensors use cached qstat
> output, or something similar, to reduce the load they cause?
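
One approach (only a sketch; the function name, cache file and TTL
below are my own choices, nothing that ships with gridice) is to put
a small caching wrapper around qstat and have the sensors call that
instead of hitting the Torque server directly:

```shell
# cached_run TTL CACHE CMD...: run CMD and cache its stdout in CACHE;
# reuse the cached snapshot while it is younger than TTL seconds.
cached_run() {
    ttl=$1; cache=$2; shift 2
    now=$(date +%s)
    if [ -f "$cache" ]; then
        # age of the cache file in seconds (GNU stat)
        age=$(( now - $(stat -c %Y "$cache") ))
    else
        # no cache yet: force a refresh
        age=$(( ttl + 1 ))
    fi
    if [ "$age" -gt "$ttl" ]; then
        # refresh atomically so concurrent readers never see a partial file
        "$@" > "$cache.tmp" 2>/dev/null && mv "$cache.tmp" "$cache"
    fi
    cat "$cache"
}

# A sensor would then call, e.g.:
#   cached_run 60 /tmp/qstat.cache qstat -f
```

Any sensor polling within the TTL then reads the snapshot instead of
querying pbs_server, which should flatten the 20-minute storms.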
>
> thanks a lot,
> gonzalo
>