Hi Matt,
Thanks for this. Unfortunately my glite_cream_load_monitor.conf has
exactly the same settings as yours: I guess the more sensible defaults
made it into the rpm. Time for a bit more digging.
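
In case it helps with the digging, here is a minimal sketch of how one might check which of the thresholds quoted below would currently trip on a CE. It is not how glite_cream_load_monitor itself is implemented: the function names are hypothetical, only the three load-average parameters are checked, and the "strictly greater than" comparison is an assumption rather than something taken from the man page.

```python
import os


def parse_thresholds(text):
    """Parse 'Name = value' lines, skipping blank lines and '#' comments."""
    thresholds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        thresholds[name.strip()] = int(value.strip())
    return thresholds


def exceeded(thresholds, current):
    """Return the names whose current value is above its threshold.

    A threshold of -1 means "no limit", as stated in the conf file header.
    Assumption: a limit trips when the value is strictly greater than it.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if limit != -1 and name in current and current[name] > limit
    )


def current_load():
    """Map the system load averages onto the Load1/Load5/Load15 names."""
    one, five, fifteen = os.getloadavg()
    return {"Load1": one, "Load5": five, "Load15": fifteen}


if __name__ == "__main__":
    path = "/etc/glite-ce-cream-utils/glite_cream_load_monitor.conf"
    with open(path) as fh:
        limits = parse_thresholds(fh.read())
    print("Thresholds currently exceeded:", exceeded(limits, current_load()))
```

Running something like this from cron and logging the output might show whether the load averages line up with the times submission gets disabled.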
John
On 21/10/2014 13:07, Matt Doidge wrote:
> Hi John,
>
>> The Cambridge ones at least appear to be because job submission is being
>> disabled from time to time at the CE, presumably by ATLAS (I'm certainly
>> not doing it!). If it's not that, suggestions welcome.
>
> I had a look at our cream CEs to refresh my memory. I think what you
> need to tweak are the variables in:
> /etc/glite-ce-cream-utils/glite_cream_load_monitor.conf [1]
>
> Which is called by /usr/bin/glite_cream_load_monitor [2]
>
> If any of these thresholds is exceeded then job submission is disabled
> for a time whilst the cream waits for things to calm down.
>
> You might need to tweak these as IIRC the defaults are quite low.
> glite_cream_load_monitor actually has a working man page which might be
> useful.
>
> Hope that helps!
> Matt
>
> [1] On one of our CEs we have:
> # cat /etc/glite-ce-cream-utils/glite_cream_load_monitor.conf
> # Thresholds for glite_cream_load_monitor
> # -1 means no limit
> #
> Load1 = 40
> Load5 = 40
> Load15 = 20
> MemUsage = 95
> SwapUsage = 95
> FDNum = 500
> DiskUsage = 95
> FTPConn = 300
> FDTomcatNum = 800
> ActiveJobs = -1
> PendingCmds = -1
>
> [2] The load balancer used is defined in the cream-config.xml:
> <parameter name="JOB_SUBMISSION_MANAGER_SCRIPT_PATH"
> value="/usr/bin/glite_cream_load_monitor
> /etc/glite-ce-cream-utils/glite_cream_load_monitor.conf" />
>
>> John
>>
>>>
>>> We're up to 30 Open UK Tickets this week. Here are the highlights:
>>>
>>> TIER 1
>>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=109276 (11/10)
>>> Submissions to the FTS3 REST interface were failing for some, probably
>>> after the certs or crls got stale. Andrew L suggested implementing an
>>> httpd restart which Maarten suggested was overkill - but anyhoo the
>>> submitter has come back to say that he hasn't seen a problem all week,
>>> so this ticket can likely be closed. In progress (20/10)
>>>
>>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=108845 (27/9)
>>> Just a heads up that this atlas transfer failure ticket has been
>>> reopened. Reopened (18/10)
>>>
>>> RALPP
>>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=109360 (15/10)
>>> This SNO+ ticket, about failing nagios tests at RALPP, hasn't been
>>> noticed yet. Assigned (15/10)
>>>
>>> SHEFFIELD
>>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=109207 (8/10)
>>> SNO+ would like the VO_SW_DIR environmental variable to point to cvmfs -
>>> I know Elena has looked at this, any progress? In progress (9/10)
>>>
>>> Similar with another SNO+ ticket at Sheffield:
>>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=109223 (9/10)
>>>
>>> BRUNEL
>>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=109379 (16/10)
>>> SRM Nagios test failures. It looks like Brunel's SE is in a dodgy state
>>> - too many ftp connection failures have been seen in the gridftp logs,
>>> httpd causing heavy load, possible SELinux problems after DB move. I'm
>>> sure if anyone has any input on this it would be appreciated. In
>>> progress (17/10)
>>>
>>> IMPERIAL/DIRAC
>>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=108723 (23/9)
>>> I think this ticket from Chris W, containing questions for the DIRAC
>>> team, can be closed in favour of the new line of communication Daniela
>>> set up (https://mailman.ic.ac.uk/mailman/listinfo/gridpp-dirac-users).
>>> Waiting for reply (7/10)
>>>
>>> ECDF AND GLASGOW
>>> Two very similar LHCB cvmfs tickets at these sites, any chance of a
>>> link? Or perhaps just a coincidence?
>>> ECDF: https://ggus.eu/index.php?mode=ticket_info&ticket_id=109440
>>> GLASGOW: https://ggus.eu/index.php?mode=ticket_info&ticket_id=109439
>>>
>>> I think that's all, at least as far as I can see.
>>>
>>> Cheers!
>>> Matt