Hello,
For quite a long time we have been seeing that our CE (which also hosts
the Torque and Maui servers) is very heavily loaded. Sometimes this seems
to make the Torque server itself hang.
Looking at the Torque server log files, we see lots (and lots) of
"status queries" against the Torque server. These appear in
/var/spool/pbs/server_logs/YYYYMMDD as, for example:
03/22/2006 00:00:05;0100;PBS_Server;Req;;Type jobscript request received
from [log in to unmask], sock=10
03/22/2006 00:00:01;0100;PBS_Server;Req;;Type statusqueue request
received from [log in to unmask], sock=10
03/22/2006 00:00:05;0100;PBS_Server;Req;;Type statusjob request received
from [log in to unmask], sock=9
...
On our Torque server we typically see tens of them per minute. However,
there is a curious pattern: every 20 minutes we see a peak of "statusjob
request" queries, which can number in the hundreds in a single burst (the
count seems to correspond to the number of jobs in the system at that
time).
We think we can correlate the Torque/Maui "hang" events with these
"statusjob query storms", so we wanted to understand what was causing
them.
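
For anyone who wants to check for the same pattern in their own logs,
here is a rough sketch of how the per-minute counts could be extracted.
It assumes each entry sits on a single line in the server log (the
examples above are wrapped by the mail client) and that the fields are
laid out as in those examples. The function names and the threshold of
100 are only illustrative choices.

```python
import re
from collections import Counter

# Start of a server log entry, as in the examples above, e.g.
# 03/22/2006 00:00:05;0100;PBS_Server;Req;;Type statusjob request received from ...
# Group 1 is the timestamp truncated to the minute, group 2 the request type.
ENTRY = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}):\d{2};[^;]*;PBS_Server;Req;;Type (\S+) request"
)


def requests_per_minute(lines, req_type="statusjob"):
    """Count requests of one type per minute, keyed by 'MM/DD/YYYY HH:MM'."""
    counts = Counter()
    for line in lines:
        match = ENTRY.match(line)
        if match and match.group(2) == req_type:
            counts[match.group(1)] += 1
    return counts


def storms(counts, threshold=100):
    """Return the minutes with at least `threshold` requests.

    Sorting the keys as strings is chronological only within one day,
    which is fine for a single YYYYMMDD log file.
    """
    return sorted(minute for minute, n in counts.items() if n >= threshold)
```

Passing an open log file to requests_per_minute() and the result to
storms() should list the minutes in which a burst occurred, which can
then be compared with the times of the hangs.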
The last thing I tried was stopping the edg-fmon-agent on the CE, and the
storms seem to have stopped. I therefore believe that some fmon
(GridICE?) sensor was causing them.
Does anybody know whether there is a way to make these sensors use cached
qstat output or something similar, to reduce the load they cause?
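
To make clear what is meant by "cached" here, this is a minimal sketch of
the idea, not something the fmon sensors are known to support: a
hypothetical helper that re-runs a command only when its saved output is
older than a given age. The name cached_output, the cache path, and the
60-second default are all assumptions made for illustration.

```python
import os
import subprocess
import time


def cached_output(cmd, cache_path, max_age=60, runner=None):
    """Return the output of `cmd`, re-running it only if the cache is stale.

    `runner` is a hypothetical hook (used for testing); by default the
    command is executed with subprocess.run.
    """
    if runner is None:
        def runner(c):
            return subprocess.run(
                c, capture_output=True, text=True, check=True
            ).stdout

    try:
        age = time.time() - os.path.getmtime(cache_path)
    except OSError:
        age = None  # no cache file yet

    if age is not None and age < max_age:
        with open(cache_path) as f:
            return f.read()

    output = runner(cmd)
    # Write to a temporary file and rename it, so that a concurrent
    # reader never sees a half-written cache.
    tmp_path = f"{cache_path}.{os.getpid()}.tmp"
    with open(tmp_path, "w") as f:
        f.write(output)
    os.replace(tmp_path, cache_path)
    return output
```

With something like cached_output(["qstat", "-f"], "/tmp/qstat.cache"),
repeated calls within a minute would hit the Torque server only once.
This only helps if every sensor reads through the same cache, which is
exactly the part I do not know how to arrange.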
thanks a lot,
gonzalo