Here is the Atlas report for this week.
Site problems:
=========
RAL:
* Failed transfers to PIC due to corrupted files in CASTOR, dating from before checksums were
enabled. Files declared lost in DQ2; the system should take care of the rest.
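For context, a minimal sketch of the kind of checksum verification that catches such corruptions (adler32, the checksum ATLAS DDM uses; the function names are illustrative, not the actual CASTOR/DQ2 code):

```python
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Compute the adler32 checksum of a file, streaming in 1 MB chunks."""
    value = 1  # adler32 starts from 1
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return "%08x" % (value & 0xFFFFFFFF)


def verify_transfer(path, catalogue_checksum):
    """Compare a transferred file's checksum against the catalogue value."""
    return adler32_of_file(path) == catalogue_checksum.lower()
```

Files written before such a check was enabled carry no trusted checksum, which is why corruption only surfaces on a later transfer attempt.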
QMUL:
* Finished the air-conditioning upgrade last Tuesday. A storage server crashed on Saturday;
Chris declared a downtime and upgraded to StoRM 1.6, which solves a number of problems,
including incorrectly published information, checksums and file permissions. Site is back online.
Lancaster:
* Job failures due to storage problems. Restarted an rfiod and fixed a configuration problem.
Still pending.
* Missing release problems. Installation jobs couldn't install as they hit the max retry limit.
The release now seems to be there. Possibly an NFS problem. To be followed up.
Manchester:
* Network capacity was reduced to fix the sonar tests, which caused a number of failures in more
demanding jobs such as the reco jobs redirected from the T1. Still need to look at the bonding;
will do this week.
Durham:
* Transfer failures with RAL. Cause not yet understood, but it seems to have been a spike and
the failure rate has now dropped to zero.
* Additional space requested for PRODDISK.
Birmingham:
* Blacklisted in DDM due to lack of space in PRODDISK. Additional space requested for PRODDISK.
Added 1 TB. Waiting to be unblacklisted.
* Shared cluster now in downtime.
Oxford:
* Problems with transfers from WN to SE caused by a mismatched package on DPM. Fixed.
* Additional space requested for PRODDISK.
RalPP:
* In downtime since Friday for upstream electrical work. No problems after the downtime.
* Blacklisted in DDM due to lack of space in PRODDISK. To be followed up.
Glasgow:
* Transfer failures due to problems with the campus gateway. Solved.
ECDF:
* Brief downtime for reconfiguration of the directories. Solved.
UCL:
* Job failures due to missing release. Involved atlas-sw-team. Site is being tested.
FT Transfers:
=========
These are a category apart: as functional tests they don't affect
production, but in the new data distribution model
they need to be followed up.
- Nothing to report
Problems caused by Atlas:
================
* RAL: A missing release caused reco jobs to go to T2s. A number of sites had to increase their
space in PRODDISK. Sites get blacklisted in DDM automatically if the space is completely filled.
* The problem is with the panda schedconfig loading the list of releases from the BDII.
Releases were tagged manually by Graeme and Rod this morning. Rod is trying to fix the
loader as well. The schedconfig maintainer, currently offline, has been contacted. Reco jobs
should go back to the T1 now.
* The PRODDISK size policy still needs reviewing to cope with these situations while waiting for
a more permanent fix. The largest sites have already increased from a few TB to 15-20 TB or more.
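As a rough illustration of the automatic blacklisting behaviour described above (the threshold value and function names are assumptions for the sketch, not the actual DDM code):

```python
def should_blacklist(free_bytes, min_free_bytes=100 * 1024**3):
    """DDM-style check: blacklist a space token once its free space drops
    below a minimum threshold (the 100 GB default here is illustrative)."""
    return free_bytes < min_free_bytes


def proddisk_blacklist(free_by_site):
    """Return the sites whose PRODDISK token would be blacklisted.

    free_by_site maps site name -> free bytes left in PRODDISK.
    """
    return {site for site, free in free_by_site.items()
            if should_blacklist(free)}
```

The point of a free-space threshold rather than a strict "completely full" test is that a token can fill up between two monitoring cycles, so sites are taken out of the distribution slightly before exhaustion.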