Hi Gonzalo,
I assume that you are running LCG2.7.0.
First of all, check the status of the GridICE daemons:
service gridice_daemons status
and, if necessary, restart them:
service gridice_daemons restart
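To check whether the "statusjob" storms are still there after the restart, the requests can be counted per minute in the Torque server log. The following is only a minimal sketch: the function name and the sample lines are illustrative, the line format is assumed from the excerpt quoted below, and "user@host" is a placeholder for the addresses that are masked in that excerpt.

```python
import re
from collections import Counter

# Assumed line format (taken from the quoted log excerpt):
# MM/DD/YYYY HH:MM:SS;code;PBS_Server;Req;;Type <name> request received from ...
REQUEST_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}):\d{2};.*;Type (\w+) request received"
)

def count_requests_per_minute(lines, request_type="statusjob"):
    """Count requests of one type, keyed by 'MM/DD/YYYY HH:MM'."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.match(line)
        if match and match.group(2) == request_type:
            counts[match.group(1)] += 1
    return counts

# Illustrative sample; "user@host" is a placeholder, not a real address.
sample = [
    "03/22/2006 00:00:05;0100;PBS_Server;Req;;Type jobscript request received from user@host, sock=10",
    "03/22/2006 00:00:01;0100;PBS_Server;Req;;Type statusqueue request received from user@host, sock=10",
    "03/22/2006 00:00:05;0100;PBS_Server;Req;;Type statusjob request received from user@host, sock=9",
    "03/22/2006 00:20:02;0100;PBS_Server;Req;;Type statusjob request received from user@host, sock=9",
    "03/22/2006 00:20:03;0100;PBS_Server;Req;;Type statusjob request received from user@host, sock=11",
]

if __name__ == "__main__":
    # Print the busiest minutes first.
    for minute, n in count_requests_per_minute(sample).most_common():
        print(minute, n)
```

On a real system the lines would come from a file under /var/spool/pbs/server_logs/ instead of the sample list. Minutes with counts in the hundreds, repeating every 20 minutes, would match the pattern described in the quoted message.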
The problem with large log message files was related to the old version of the
sensors (before LCG2.7.0). Now all the needed information is obtained by
*listening* to the log files.
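As an illustration of what "listening" to a log file can mean in general, the sketch below reads only the lines appended since the last call, instead of querying the batch server again. It is not the GridICE sensor code; the function name and the offset-based approach are assumptions made for this example.

```python
def read_new_lines(path, offset=0):
    """Return (complete lines appended since `offset`, new offset).

    The file is read in binary mode so that `offset` is a plain byte count.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Keep only complete lines; a partial last line is left for the next call.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

A caller would store the returned offset and pass it back on the next poll, so each log line is processed once and the load does not grow with the number of jobs in the system.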
It may also help to look at the ticket that Marcin Radecki opened on GGUS for
this problem:
https://gus.fzk.de/ws/overview.php?ticket=6889
After testing under production conditions, a new sensor RPM (v1.6.0-25) is
available from our site:
http://infnforge.cnaf.infn.it/project/showfiles.php?group_id=8
I would like to thank Francesco Gregoretti again, who built the IA64 version.
We are in the process of publishing everything to LCG-savannah for the next LCG
patch release.
Let us know if there are other problems.
Cheers,
Sergio
Gonzalo Merino wrote:
> Hello,
>
> For quite a long time we have been seeing that our CE (which also hosts the
> Torque and Maui servers) is very heavily loaded. Sometimes this seems to
> make the Torque server itself hang.
> Looking at the Torque server log files we see there are lots (and lots)
> of "status queries" against the Torque server. These appear in
> /var/spool/pbs/server_logs/YYYYMMDD as, for example:
>
> 03/22/2006 00:00:05;0100;PBS_Server;Req;;Type jobscript request received from [log in to unmask], sock=10
>
> 03/22/2006 00:00:01;0100;PBS_Server;Req;;Type statusqueue request received from [log in to unmask], sock=10
>
> 03/22/2006 00:00:05;0100;PBS_Server;Req;;Type statusjob request received from [log in to unmask], sock=9
>
> ...
>
> On our Torque server we typically see tens of them per minute. However,
> there is a curious pattern: every 20 minutes we see a peak of "statusjob
> request" queries, which can reach hundreds in one go (it seems to
> correspond to the number of jobs in the system at that time).
>
> We think we can correlate the Torque/Maui "hang" events with these
> "statusjob query storms", so we wanted to understand what was causing
> them.
>
> The last thing I tried was to stop the edg-fmon-agent on the CE, and it
> seems the storms have stopped. I therefore believe that some fmon (gridice?)
> sensor was causing them.
>
> Does anybody know if there is a way to make these sensors use cached
> qstat info or something similar, to reduce the load they cause?
>
> thanks a lot,
> gonzalo
>
--
---------------------------------------------------------------------
Sergio Fantinel EGEE Project
---------------------------------------------------------------------
INFN - Lab. Naz. di Legnaro phone: +39 049 8068 489
viale dell'Università n. 2,
35020 Legnaro (PD) ITALY [log in to unmask]
---------------------------------------------------------------------