Apologies, but I may be late to today's meeting - hardware issue to
resolve (non-GRID though :-) ).

John

On 03/12/2012 15:56, Matt Doidge wrote:
> Hello all,
> Here's this week's ticket review.
> Cheers,
> Matt
>
> 32 Open UK tickets this week. It's the start of the month, so all
> tickets, great or small, will get reviewed.
>
> NGI/VOMS
> https://ggus.eu/ws/ticket_info.php?ticket=88546 (16/11)
> Creation of epic.vo.gridpp.ac.uk. Name has been settled on, deployed on
> the master VOMS instance and rolled out to the backups, ready for
> whatever the next step will be. In progress (30/11)
>
> https://ggus.eu/ws/ticket_info.php?ticket=87813 (25/10)
> Migration of the vo.helio-vo.eu to the UK. At last word everything was
> done on the VOMS side, and testing on grid resources still needed to be
> done. In progress (15/11)
>
> TIER 1
> https://ggus.eu/ws/ticket_info.php?ticket=89141 (3/12)
> RAL are seeing a high atlas production job failure rate, and a possibly
> related high FTS failure rate. In Progress (3/12)
>
> https://ggus.eu/ws/ticket_info.php?ticket=89081 (30/11)
> Failed biomed SAM tests, tracked to a missing / in a .lsc file. Should
> be fixed, waiting for confirmation (but don't wait too long). Waiting
> for reply (3/12)
>
> https://ggus.eu/ws/ticket_info.php?ticket=89063 (30/11)
> The atlas frontier squids at RAL weren't working, fixed (networking
> problem) but ticket reopened and placed on hold as the monitoring for
> these boxes needs updating. On hold (30/11)
>
> https://ggus.eu/ws/ticket_info.php?ticket=88596 (19/11)
> t2k.org jobs weren't being delegated to RAL. After some effort this has
> been fixed, the ticket can be closed. In progress (1/12)
>
> https://ggus.eu/ws/ticket_info.php?ticket=86690 (3/10)
> "JPKEKCRC02 missing from FTS ganglia metrics" for t2k. This has been a
> pain to fix, at last word RAL were waiting on their ganglia expert to
> come back, but that was a while ago (however I suspect they had bigger
> fish to fry in November). In progress (6/11)
>
> https://ggus.eu/ws/ticket_info.php?ticket=86152 (17/9)
> Correlated packet loss on the RAL perfsonar. On hold pending a wider
> scale investigation. On hold (31/10)
>
> UCL
> https://ggus.eu/ws/ticket_info.php?ticket=87468 (17/10)
> The last Unsupported gLite software ticket (until the next batch). Ben
> has put the remaining out of date CE into downtime after updating
> another. In progress (29/11)
>
> BIRMINGHAM
> https://ggus.eu/ws/ticket_info.php?ticket=89129 (3/12)
> High atlas production failure rate, likely to be due to the migration to
> EMI. It could be a problem with the software area, Mark has involved
> Alessandro De Salvo. Waiting for reply (3/12)
>
> https://ggus.eu/ws/ticket_info.php?ticket=86105 (14/9)
> Low atlas sonar rates to BNL from Birmingham. atlas tag removed from
> ticket to lower noise. On hold (30/11)
>
> IMPERIAL
> https://ggus.eu/ws/ticket_info.php?ticket=89105 (1/12)
> t2k.org jobs failing on I.C. WMSs due to proxy expiry. Daniela thinks
> that it may be a problem with myproxy (the cern myproxy servers are
> having dns alias trouble by the looks of it). In progress (3/12)
>
> SHEFFIELD
> https://ggus.eu/ws/ticket_info.php?ticket=89096 (30/11)
> lhcb jobs to Sheffield that go through the WMS are seeing "BrokerHelper:
> no compatible resources" errors, possibly due to the published values
> for GlueCEStateFreeCPUs & GlueCEStateFreeJobSlots being 0. In progress
> (3/12)
>
> LANCASTER
> https://ggus.eu/ws/ticket_info.php?ticket=89066 (30/11)
> biomed nagios tests failing on the Lancaster SE. "problem listing
> Storage Path(s)", which suggests to me that we have a publishing
> problem. Couldn't find any obvious bugbears though, keeping on digging.
> In progress (30/11)
>
> https://ggus.eu/ws/ticket_info.php?ticket=89084 (30/11)
> The problem in 89066 is also affecting the biomed CE tests. On hold (30/11)
>
> https://ggus.eu/ws/ticket_info.php?ticket=88628 (20/11)
> Getting t2k working on our clusters. Had some problems with building root
> on one cluster, and even just submitting jobs to the other. In progress
> (30/11)
>
> https://ggus.eu/ws/ticket_info.php?ticket=88772 (22/11)
> One of Lancaster's clusters is reporting default values for
> "GlueCEPolicyMaxCPUTime", mucking up lhcb's job scheduling. Tracked to a
> problem in the scripts
> (https://ggus.eu/ws/ticket_info.php?ticket=88904), the fix will be out
> in January so I've on-holded this until then. On hold (3/12)
>
> https://ggus.eu/ws/ticket_info.php?ticket=85367 (20/8)
> ilc jobs always fail on a Lancaster CE, possibly due to the CE's poor
> performance. For the third time in a row I've had to put this work off
> for a month. On hold (3/12)
>
> https://ggus.eu/ws/ticket_info.php?ticket=84461 (23/7)
> t2k transfer failures to Lancaster. Having trouble getting a routing
> change put through with the RAL networking team, probably due to them
> having a lot on their plate over the past month. In Progress (3/12)
>
> LIVERPOOL
> https://ggus.eu/ws/ticket_info.php?ticket=88761 (22/11)
> Technically a ticket from Liverpool to lhcb. A complaint over the
> bandwidth used by lhcb jobs, probably due to a spike in lhcb jobs
> running during an atlas quiet period. Are all sides satisfied about the
> cause of this problem and the steps taken to prevent this happening
> again? In progress (23/11)
>
> SUSSEX
> https://ggus.eu/ws/ticket_info.php?ticket=88631 (20/11)
> Looks like Emyr has fixed Sussex's not-publishing-UserDNs APEL problem,
> so this ticket can be closed. In Progress (26/11)
>
> QMUL
> https://ggus.eu/ws/ticket_info.php?ticket=88822 (23/11)
> A similar ticket to 88772 at Lancaster. It could be that the SGE scripts
> need updating too. In progress (26/11)
>
> https://ggus.eu/ws/ticket_info.php?ticket=88987 (28/11)
> t2k jobs are failing on ce05. In progress (30/11)
>
> https://ggus.eu/ws/ticket_info.php?ticket=88887 (26/11)
> lhcb pilots are also failing on ce05. In progress (28/11)
>
> https://ggus.eu/ws/ticket_info.php?ticket=88878 (26/11)
> hone are also having troubles on ce05... In progress (26/11)
>
> https://ggus.eu/ws/ticket_info.php?ticket=86306 (22/9)
> LHCB redundant, hard-to-kill pilots at QMUL. Chris opened a ticket to
> the cream developers
> (https://ggus.eu/tech/ticket_show.php?ticket=87891). But still the
> requests to purge lists come in from lhcb. In progress (21/11).
>
> GLASGOW
> https://ggus.eu/ws/ticket_info.php?ticket=88376 (8/11)
> Biomed authorisation errors on CE svr026. Sam asked if this was the only
> CE that has seen this problem on the 9th. No reply since, I added in the
> biomed e-mail address explicitly to the cc list to try and coax a
> response. Waiting for reply (9/11)
>
> ECDF
> https://ggus.eu/ws/ticket_info.php?ticket=86334 (24/9)
> Low atlas sonar rates to BNL. Apparently things went from bad to worse
> on the 23rd/24th of October. Duncan has removed the atlas VO tag on the
> ticket to lower the noise on the atlas daily summary. On hold (30/11)
>
> EFDA-JET
> https://ggus.eu/ws/ticket_info.php?ticket=88227 (6/11)
> biomed complaining about 444444 waiting jobs & no running jobs being
> published by jet. The guys there have had a go at fixing the problem
> (probably caused by their update to EMI2), but are likely out of ideas.
> I had a brain wave regarding user access in maui.cfg but if that's not
> the solution I'm sure they'll appreciate ideas. In progress (3/12).
>
> OXFORD
> https://ggus.eu/ws/ticket_info.php?ticket=86106 (14/9)
> Poor atlas sonar rates from Oxford to BNL. On hold due to running out of
> fixes to try, and the fact that they get good rates elsewhere. VO tag
> removed to reduce noise. On hold (30/11)
>
> DURHAM
> https://ggus.eu/ws/ticket_info.php?ticket=84123 (11/7)
> atlas production failures at Durham. Site still in "quarantine". On hold
> (20/11).
>
> https://ggus.eu/ws/ticket_info.php?ticket=75488 (19/10/11)
> compchem authentication failures. As this ticket has been on hold at a
> low priority since January, it would seem worthwhile to contact the
> ticket originators to see what they want to do. On hold (8/10)