There is no general problem with file transfers either, but there are failures such as these:
2018-12-31 12:40:31 user.mixie:user.mixie.16590016._000001.myOutput.root #316 TRANSFER_FAILED
TOOL ID rucio-conveyor
SRC SITE UKI-LT2-QMUL_SCRATCHDISK
SRC URL srm://se03.esc.qmul.ac.uk:8444/srm/managerv2?SFN=/atlas/atlasscratchdisk/rucio/user/mixie/72/ec/user.mixie.16590016._000001.myOutput.root
DST SITE UKI-LT2-RHUL_SCRATCHDISK
DST URL srm://se2.ppgrid1.rhul.ac.uk:8446/srm/managerv2?SFN=/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasscratchdisk/rucio/user/mixie/72/ec/user.mixie.16590016._000001.myOutput.root
TRANSFER ID 0cee1fc7-c12d-5435-85f8-dcb2549f5737
TRANSFER ENDPOINT https://lcgfts3.gridpp.rl.ac.uk:8446
ERROR MSG TRANSFER [2] DESTINATION SRM_PUTDONE Error on the surl srm://se2.ppgrid1.rhul.ac.uk/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasscratchdisk/rucio/user/mixie/72/ec/user.mixie.16590016._000001.myOutput.root while putdone : [SE][PutDone][SRM_INVALID_PATH] No such file or directory (error 2 on storage050.ppgrid1.rhul.ac.uk)
ACTIVITY User Subscriptions
FILE SIZE 6542955 bytes
DURATION 4 s
2018-12-31 13:15:18 user.flegger:NTUP_SMWZ.00994158._000002.UKI-LT2-RHUL.root.1 #96 TRANSFER_FAILED
TOOL ID rucio-conveyor
SRC SITE UKI-LT2-RHUL_DATADISK
SRC URL srm://se2.ppgrid1.rhul.ac.uk:8446/srm/managerv2?SFN=/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/user/flegger/ee/fd/NTUP_SMWZ.00994158._000002.UKI-LT2-RHUL.root.1
DST SITE NET2_DATADISK
DST URL gsiftp://atlas-gridftp.bu.edu:2811/gpfs1/atlasdatadisk/rucio/user/flegger/ee/fd/NTUP_SMWZ.00994158._000002.UKI-LT2-RHUL.root.1
TRANSFER ID e6df1818-0baf-5450-aa39-099346406718
TRANSFER ENDPOINT https://fts.usatlas.bnl.gov:8446
ERROR MSG SOURCE [2] Error reported from srm_ifce : 2 [SE][Ls][SRM_INVALID_PATH] No such file or directory
ACTIVITY Data rebalancing
FILE SIZE 2990351558 bytes
DURATION 0 s
Perhaps this helps.
Elena
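
A minimal sketch, in Python, of how the two transfer-failure records above could be broken into fields to see which side of each transfer reported the error. The function names (`parse_records`, `failing_site`, `srm_code`) and the `counter` key are hypothetical, the record layout is assumed from these two examples only, and the meaning of the `#316` / `#96` number is not stated in the email, so it is simply stored. The sample keeps a subset of the fields, copied verbatim from the records above.

```python
import re

# Header line shape assumed from the two records in the email:
# "<date> <time> <scope:name> #<n> <STATE>". What "#<n>" counts is not stated.
HEADER = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\S+) #(\d+) (\S+)$")
FIELDS = ("TOOL ID", "SRC SITE", "SRC URL", "DST SITE", "DST URL", "TRANSFER ID",
          "TRANSFER ENDPOINT", "ERROR MSG", "ACTIVITY", "FILE SIZE", "DURATION")
SIDE = re.compile(r"\b(SOURCE|DESTINATION)\b")
SRM_CODE = re.compile(r"\[(SRM_[A-Z_]+)\]")

def parse_records(text):
    """Split pasted monitoring text into one dict per transfer record."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        m = HEADER.match(line)
        if m:
            records.append({"time": m.group(1), "did": m.group(2),
                            "counter": int(m.group(3)), "state": m.group(4)})
            continue
        if not records:
            continue  # ignore anything before the first header line
        for field in FIELDS:
            if line.startswith(field + " "):
                records[-1][field] = line[len(field) + 1:]
                break
    return records

def failing_site(record):
    """Return the site named on the side (SOURCE/DESTINATION) the error mentions."""
    m = SIDE.search(record.get("ERROR MSG", ""))
    if m is None:
        return None
    return record.get("SRC SITE" if m.group(1) == "SOURCE" else "DST SITE")

def srm_code(record):
    """Return the bracketed SRM status code from the error message, if any."""
    m = SRM_CODE.search(record.get("ERROR MSG", ""))
    return m.group(1) if m else None

# Subset of fields copied verbatim from the two records quoted in the email.
SAMPLE = """\
2018-12-31 12:40:31 user.mixie:user.mixie.16590016._000001.myOutput.root #316 TRANSFER_FAILED
SRC SITE UKI-LT2-QMUL_SCRATCHDISK
DST SITE UKI-LT2-RHUL_SCRATCHDISK
ERROR MSG TRANSFER [2] DESTINATION SRM_PUTDONE Error on the surl srm://se2.ppgrid1.rhul.ac.uk/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasscratchdisk/rucio/user/mixie/72/ec/user.mixie.16590016._000001.myOutput.root while putdone : [SE][PutDone][SRM_INVALID_PATH] No such file or directory (error 2 on storage050.ppgrid1.rhul.ac.uk)
2018-12-31 13:15:18 user.flegger:NTUP_SMWZ.00994158._000002.UKI-LT2-RHUL.root.1 #96 TRANSFER_FAILED
SRC SITE UKI-LT2-RHUL_DATADISK
DST SITE NET2_DATADISK
ERROR MSG SOURCE [2] Error reported from srm_ifce : 2 [SE][Ls][SRM_INVALID_PATH] No such file or directory
"""

if __name__ == "__main__":
    for rec in parse_records(SAMPLE):
        print(rec["time"], failing_site(rec), srm_code(rec))
```

On these two records the error is attributed to the RHUL end in both cases (once as destination, once as source), each time with SRM_INVALID_PATH.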
On 31 Dec 2018, at 12:38, George, Simon wrote:
> Thanks Elena.
>
>
> [root@se2 ~]# dpns-ls -l /dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
> -rw-rw-r-- 1 1261 108 371314975 Jul 06 2015 /dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
>
>
> I found the file on pool node
> storage045: /raid/dpmfs/atlas/2015-07-06/AOD.05536542._000001.pool.root.1.163198364.0
>
> [root@storage045(ppgrid1) ~]# ls -l /raid/dpmfs/atlas/2015-07-06/AOD.05536542._000001.pool.root.1.163198364.0
> -rw-rw-r-- 1 dpmmgr dpmmgr 371314975 Jul 6 2015 /raid/dpmfs/atlas/2015-07-06/AOD.05536542._000001.pool.root.1.163198364.0
>
> /var/log/xrootd/disk/xrootd.log has many of these:
>
> 181231 11:52:49 16324 ofs_open: platl017.100067:25@[::10.141.0.125] Unable to open /dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1; permission denied
>
> More context of xrootd log:
> 181231 03:30:16 16324 XrootdXeq: platl017.32294:25@node134 pvt IPv4 login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo2/CN=531497/CN=Robot: ATLAS Pilot2
> 181231 03:30:16 16324 platl017.32294:25@node134 ofs_open: 0-600 fn=/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
> 181231 03:30:16 16324 dpmdiskacc_Access: Disk server hostname storage045.ppgrid1.rhul.ac.uk not matched to this host.
> 181231 03:30:16 16324 ofs_open: platl017.32294:25@node134 Unable to open /dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1; permission denied
> 181231 03:30:16 16324 platl017.32294:25@node134 ofs_close: use=0 fn=dummy
> 181231 03:30:17 16324 XrootdXeq: platl017.32294:25@node134 disc 0:00:01
>
> (my highlighting)
>
> But it is storage045!
>
> [root@storage045(ppgrid1) ~]# hostname
> storage045.ppgrid1.rhul.ac.uk
> [root@storage045(ppgrid1) ~]# grep $(hostname) /etc/hosts
> 134.219.225.232 storage045.ppgrid1.rhul.ac.uk storage045.ppgrid1
> [root@storage045(ppgrid1) ~]# host $(hostname)
> storage045.ppgrid1.rhul.ac.uk has address 134.219.225.232
> [root@storage045(ppgrid1) ~]# host 134.219.225.232
> 232.225.219.134.in-addr.arpa domain name pointer storage045.ppgrid1.rhul.ac.uk.
>
> [root@storage045(ppgrid1) ~]# ifconfig p2p1.2
> p2p1.2 Link encap:Ethernet HWaddr A0:36:9F:2C:8F:7C
> inet addr:134.219.225.232 Bcast:134.219.225.255 Mask:255.255.255.0
> inet6 addr: fe80::a236:9fff:fe2c:8f7c/64 Scope:Link
> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
> RX packets:380221630 errors:0 dropped:0 overruns:0 frame:0
> TX packets:266028510 errors:0 dropped:0 overruns:0 carrier:0
> collisions:0 txqueuelen:0
> RX bytes:3500942446699 (3.1 TiB) TX bytes:399922990727 (372.4 GiB)
>
> Anyone know what this means?
>
> From: Testbed Support for GridPP member institutes on behalf of Elena Korolkova
> Sent: 30 December 2018 22:23
> To: [log in to unmask]
> Subject: Re: HC failures, DPM problem?
>
> Hi Simon,
>
> There were 68 failed and 25 finished jobs during the last 12 h.
>
> There is a problem opening a file:
>
> https://aipanda167.cern.ch/media/filebrowser/a84d70c5-5eb3-4dc4-a879-545f6a656734/user.gangarbt/tarball_PandaJob_4196806388_ANALY_RHUL_SL6/athena_stdout.txt
> ERROR [ERROR] Server responded with an error: [3010] Unable to open /dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1; permission denied.
>
> Could you please check the permissions for other files as well?
>
> Elena
>
> On 30 Dec 2018, at 18:54, Jeremy Coles wrote:
>
> > Hi Simon,
> >
> > No immediate technical suggestions from me, but (since limited numbers of people are online presently and it could be a HC issue) you could also seek help from:
> >
> >> ATLAS UK Cloud Support
> >
> > or
> >
> >> "atlas-adc-hammercloud-support (ATLAS ADC HammerCloud Support)"
> >
> > Best regards,
> > Jeremy
> >
> >> On 30 Dec 2018, at 16:53, George, Simon wrote:
> >>
> >> If anyone is working at the moment, I'd appreciate some help.
> >> I found that ANALY_RHUL_SL6 is blacklisted due to HC test failures:
> >> http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UKI-LT2-RHUL&startTime=2018-12-17&endTime=2018-12-31&templateType=isGolden
> >> The site was down for a few days before xmas due to a PDU problem taking down a critical network switch which put several key servers offline (various lessons being learned...)
> >> Since coming back up, everything looks fine from my side, but HC says otherwise. It looks like a problem with DPM, but I cannot find what is wrong: all the services are running, and only some actions fail. When I tried by hand to retrieve, via WebDAV, a file that an HC test had failed to access, I could download it without any problem.
> >> Any suggestions please, what would be worth checking?
> >> Thanks,
> >> Simon
> >>
> >> To unsubscribe from the TB-SUPPORT list, click the following link:
> >> https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=TB-SUPPORT&A=1
> >
> >
>
########################################################################
To unsubscribe from the TB-SUPPORT list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=TB-SUPPORT&A=1
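
A rough Python sketch of how the xrootd disk log quoted earlier in the thread could be scanned, so that each "permission denied" open is paired with a "not matched to this host" message logged in the same second, if there is one. The function name `find_denials` is hypothetical, the line formats are assumed only from the excerpts quoted in the thread, and the pairing rule (same timestamp, mismatch line first) is an illustration rather than documented xrootd or DPM behaviour. The sample lines are copied verbatim from the quoted log.

```python
import re
from datetime import datetime

# Patterns assumed from the log excerpts quoted in the thread.
TS = re.compile(r"^(\d{6} \d{2}:\d{2}:\d{2}) ")
MISMATCH = re.compile(
    r"dpmdiskacc_Access: Disk server hostname (\S+) not matched to this host")
DENIED = re.compile(r"ofs_open: \S+ Unable to open (\S+); permission denied")

def find_denials(log_text):
    """Return (timestamp, mismatched_hostname_or_None, path) per denied open."""
    results = []
    pending = None  # (timestamp, hostname) from the most recent mismatch line
    for raw in log_text.splitlines():
        line = raw.strip()
        ts_match = TS.match(line)
        if ts_match is None:
            continue  # not a timestamped log line
        ts = datetime.strptime(ts_match.group(1), "%y%m%d %H:%M:%S")
        mismatch = MISMATCH.search(line)
        if mismatch:
            pending = (ts, mismatch.group(1))
            continue
        denied = DENIED.search(line)
        if denied:
            # Pair only when the mismatch was logged in the same second.
            host = pending[1] if pending and pending[0] == ts else None
            results.append((ts, host, denied.group(1)))
            pending = None
    return results

# Lines copied verbatim from the xrootd log quoted in the thread.
SAMPLE_LOG = """\
181231 03:30:16 16324 dpmdiskacc_Access: Disk server hostname storage045.ppgrid1.rhul.ac.uk not matched to this host.
181231 03:30:16 16324 ofs_open: platl017.32294:25@node134 Unable to open /dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1; permission denied
181231 11:52:49 16324 ofs_open: platl017.100067:25@[::10.141.0.125] Unable to open /dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1; permission denied
"""

if __name__ == "__main__":
    for ts, host, path in find_denials(SAMPLE_LOG):
        print(ts.isoformat(), host, path)
```

Run over the full log rather than this three-line sample, a scan like this would show whether every denied open is preceded by the hostname mismatch message or only some of them.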