Hi John,
> The Cambridge ones at least appear to be because job submission is being
> disabled from time to time at the CE, presumably by ATLAS (I'm certainly
> not doing it!). If it's not that, suggestions welcome.
I had a look at our CREAM CEs to refresh my memory. I think what you
need to tweak are the variables in:
/etc/glite-ce-cream-utils/glite_cream_load_monitor.conf [1]
Which is called by /usr/bin/glite_cream_load_monitor [2]
If any of these thresholds is hit then job submission is disabled
for a time whilst CREAM waits for things to calm down.
You might need to tweak these as IIRC the defaults are quite low.
glite_cream_load_monitor actually has a working man page which might be
useful.
Hope that helps!
Matt
[1] On one of our CEs we have:
# cat /etc/glite-ce-cream-utils/glite_cream_load_monitor.conf
# Thresholds for glite_cream_load_monitor
# -1 means no limit
#
Load1 = 40
Load5 = 40
Load15 = 20
MemUsage = 95
SwapUsage = 95
FDNum = 500
DiskUsage = 95
FTPConn = 300
FDTomcatNum = 800
ActiveJobs = -1
PendingCmds = -1
[2] The load monitor script used is defined in the cream-config.xml:
<parameter name="JOB_SUBMISSION_MANAGER_SCRIPT_PATH"
  value="/usr/bin/glite_cream_load_monitor /etc/glite-ce-cream-utils/glite_cream_load_monitor.conf" />
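
If it helps to see which of the thresholds in [1] a CE is currently close
to, here is a rough Python sketch. It is not how glite_cream_load_monitor
itself works; it only parses the "Name = value" lines shown in [1] and
compares the three load-average entries against the host's current load.
The function names are made up for illustration, and treating "above the
limit" as the trigger condition is an assumption (check the man page for
the real semantics). The "-1 means no limit" rule comes from the config
comment.

```python
import os


def parse_thresholds(text):
    """Parse 'Name = value' lines, skipping blank lines and '#' comments."""
    thresholds = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        thresholds[name.strip()] = int(value.strip())
    return thresholds


def exceeded(thresholds, current):
    """Return the sorted names whose current value is above its limit.

    A limit of -1 means "no limit" (per the config file comment).
    Using a strict '>' comparison is an assumption of this sketch.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if limit != -1 and name in current and current[name] > limit
    )


def current_load():
    """Map the 1, 5 and 15 minute load averages to the config's key names."""
    one, five, fifteen = os.getloadavg()
    return {"Load1": one, "Load5": five, "Load15": fifteen}


def check_file(path):
    """Read a threshold file and report which load limits are exceeded now."""
    with open(path) as handle:
        return exceeded(parse_thresholds(handle.read()), current_load())
```

Calling check_file("/etc/glite-ce-cream-utils/glite_cream_load_monitor.conf")
on the CE returns a list such as the names of any load thresholds currently
exceeded, or an empty list if none are. Only the Load1/Load5/Load15 entries
are checked; the other values (memory, FDs, FTP connections and so on)
would need their own measurements.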
> John
>
>>
>> We're up to 30 Open UK Tickets this week. Here are the highlights:
>>
>> TIER 1
>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=109276 (11/10)
>> Submissions to the FTS3 REST interface were failing for some, probably
>> after the certs or CRLs got stale. Andrew L suggested implementing an
>> httpd restart which Maarten suggested was overkill - but anyhoo the
>> submitter has come back to say that he hasn't seen a problem all week,
>> so this ticket can likely be closed. In progress (20/10)
>>
>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=108845 (27/9)
>> Just a heads up that this atlas transfer failure ticket has been
>> reopened. Reopened (18/10)
>>
>> RALPP
>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=109360 (15/10)
>> This SNO+ ticket, about failing nagios tests at RALPP, hasn't been
>> noticed yet. Assigned (15/10)
>>
>> SHEFFIELD
>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=109207 (8/10)
>> SNO+ would like the VO_SW_DIR environment variable to point to cvmfs -
>> I know Elena has looked at this, any progress? In progress (9/10)
>>
>> Similar with another SNO+ ticket at Sheffield:
>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=109223 (9/10)
>>
>> BRUNEL
>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=109379 (16/10)
>> SRM Nagios test failures. It looks like Brunel's SE is in a dodgy state
>> - too many ftp connection failures have been seen in the gridftp logs,
>> httpd causing heavy load, possible SELinux problems after DB move. I'm
>> sure if anyone has any input on this it would be appreciated. In
>> progress (17/10)
>>
>> IMPERIAL/DIRAC
>> https://ggus.eu/index.php?mode=ticket_info&ticket_id=108723 (23/9)
>> I think this ticket from Chris W, containing questions for the DIRAC
>> team, can be closed in favour of the new line of communication Daniela
>> set up (https://mailman.ic.ac.uk/mailman/listinfo/gridpp-dirac-users).
>> Waiting for reply (7/10)
>>
>> ECDF AND GLASGOW
>> Two very similar LHCb cvmfs tickets at these sites, any chance of a
>> link? Or perhaps just a coincidence?
>> ECDF: https://ggus.eu/index.php?mode=ticket_info&ticket_id=109440
>> GLASGOW: https://ggus.eu/index.php?mode=ticket_info&ticket_id=109439
>>
>> I think that's all, at least as far as I can see.
>>
>> Cheers!
>> Matt