Here is the Atlas report for this week.
Site problems:
==============
RAL T1:
* Castor upgrade completed at RAL; the cloud was back online at the
beginning of the week.
* Intermittent network problems with the LFC. Still under investigation.
RalPP:
* Problems with the dCache database affecting transfers were solved within
a few hours.
Manchester:
* Bad weekend: the BDII stopped working on Saturday and the site dropped
out of panda, which had a development version, so even when the BDII was
fixed the site didn't come back until the production version was restored.
Once back, a flood of jobs caused SE timeouts. Now back to normal, but the
SE timeouts after a quiet period are something to look at.
ECDF:
* Failed transfers from/to RAL; it seems the SRM is down. Ticket opened
this morning.
FT Transfers:
=============
This is a separate category: these are functional tests and don't affect
production, but in the new data distribution model they need to be
followed up.
* Transfers from the UK (and not only the UK) to both Taiwan sites are
quite bad. A request was made to increase the timeout in FTS for both
channels. At the moment the timeout has been increased for only one of
the two sites.
* Lancaster: timeouts with TOKYO. They are working on tuning their
server parameters.
* RalPP: timeouts with Australia. Brian is working on tuning the FTS at
RAL, relaxing the timeout and increasing the number of threads.
Problems caused by Atlas:
=========================
* Manchester:
- Glasgow factory had to be fixed to get pilots to both clusters. Fixed.
* Birmingham:
- Had a problem with the DB installation. Fixed by the atlas sw team.
- Second factory at CERN wasn't submitting to the Central cluster. Fixed.
Factory configurations were mostly taken care of by Graeme, who was very
quick to pick up problems and fix them. This will now become a cloud squad
task, so if there are any problems or configuration changes, such as
new/obsolete queues, sites should write to [log in to unmask]