Hi Alessandra,

thanks, good idea. It looks ok to me:


[root@storage045(ppgrid1) ~]# openssl x509 -noout -subject -in /etc/grid-security/hostcert.pem 
subject= /C=UK/O=eScience/OU=RoyalHollowayLondon/L=Physics/CN=storage045.ppgrid1.rhul.ac.uk
[root@storage045(ppgrid1) ~]# openssl x509 -noout -subject -in /etc/grid-security/dpmmgr/dpmcert.pem
subject= /C=UK/O=eScience/OU=RoyalHollowayLondon/L=Physics/CN=storage045.ppgrid1.rhul.ac.uk
[root@storage045(ppgrid1) ~]# hostname
storage045.ppgrid1.rhul.ac.uk
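
The comparison above could also be scripted so it is easy to repeat on other disk servers. This is only a minimal sketch: it assumes the legacy one-line subject format printed above (newer openssl releases print "CN = ..." with spaces, which this pattern would not match), and the helper names cn_from_subject and cert_matches_host are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: extract the CN from an "openssl x509 -noout -subject"
# line of the legacy form "subject= /C=UK/.../CN=host.example".
cn_from_subject() {
    printf '%s\n' "$1" | sed -n 's/.*CN=\([^/]*\).*/\1/p'
}

# Hypothetical helper: succeed (exit 0) only if the CN equals the given host name.
#   $1 = subject line, $2 = expected host name
cert_matches_host() {
    [ "$(cn_from_subject "$1")" = "$2" ]
}

# Example usage on the disk server (certificate path as in the output above):
#   subj=$(openssl x509 -noout -subject -in /etc/grid-security/hostcert.pem)
#   cert_matches_host "$subj" "$(hostname)" && echo match || echo MISMATCH
```

The same call can be pointed at /etc/grid-security/dpmmgr/dpmcert.pem to check the DPM copy of the certificate.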

I now see the HC tests have started working all by themselves just before the end of 2018.

Analysis jobs are still failing at too high a rate, but I am not sure why.

The particular file which was breaking the HC tests is now accessible again and the logs no longer show errors.

I can fetch it using gfal-utils.


> gfal-ls -alH srm://se2.ppgrid1.rhul.ac.uk/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1 
-rw-rw-r--   1 48    47    355M Jul  6  2015 srm://se2.ppgrid1.rhul.ac.uk/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1

> gfal-copy  srm://se2.ppgrid1.rhul.ac.uk/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1 /tmp/data1
[...]
Copying srm://se2.ppgrid1.rhul.ac.uk/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1   [DONE]  after 3s





From: Testbed Support for GridPP member institutes <[log in to unmask]> on behalf of Alessandra Forti <[log in to unmask]>
Sent: 02 January 2019 19:07
To: [log in to unmask]
Subject: Re: HC failures, DPM problem?
 
Hi Simon,
181231 03:30:16 16324 platl017.32294:25@node134 ofs_open: 0-600 fn=/dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1
181231 03:30:16 16324 dpmdiskacc_Access: Disk server hostname storage045.ppgrid1.rhul.ac.uk not matched to this host.
181231 03:30:16 16324 ofs_open: platl017.32294:25@node134 Unable to open /dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1; permission denied
181231 03:30:16 16324 platl017.32294:25@node134 ofs_close: use=0 fn=dummy
181231 03:30:17 16324 XrootdXeq: platl017.32294:25@node134 disc 0:00:01

(my highlighting)

But it is storage045!
What about the host certificate? Does that correspond to storage045? Have you also tried other tools like gfal-copy/gfal-ls in verbose mode?

cheers
alessandra

[root@storage045(ppgrid1) ~]# hostname        
storage045.ppgrid1.rhul.ac.uk
[root@storage045(ppgrid1) ~]# grep $(hostname) /etc/hosts
134.219.225.232         storage045.ppgrid1.rhul.ac.uk storage045.ppgrid1
[root@storage045(ppgrid1) ~]# host $(hostname)
storage045.ppgrid1.rhul.ac.uk has address 134.219.225.232
[root@storage045(ppgrid1) ~]# host 134.219.225.232
232.225.219.134.in-addr.arpa domain name pointer storage045.ppgrid1.rhul.ac.uk.
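
The forward and reverse lookups above can be compared automatically. A rough sketch, assuming the `host` output format shown above; the helper names addr_from_host_output and name_from_ptr_output are hypothetical, and only the first matching line of each output is used.

```shell
#!/bin/sh
# Hypothetical helper: take the address from a "NAME has address A.B.C.D" line.
addr_from_host_output() {
    printf '%s\n' "$1" | awk '/ has address / { print $NF; exit }'
}

# Hypothetical helper: take the name from a "... domain name pointer NAME." line,
# dropping the trailing dot so it can be compared with the output of hostname.
name_from_ptr_output() {
    printf '%s\n' "$1" | awk '/ domain name pointer / { sub(/\.$/, "", $NF); print $NF; exit }'
}

# Example usage on the disk server:
#   ip=$(addr_from_host_output "$(host "$(hostname)")")
#   name=$(name_from_ptr_output "$(host "$ip")")
#   [ "$name" = "$(hostname)" ] && echo "DNS consistent" || echo "DNS MISMATCH"
```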

[root@storage045(ppgrid1) ~]# ifconfig p2p1.2
p2p1.2    Link encap:Ethernet  HWaddr A0:36:9F:2C:8F:7C  
          inet addr:134.219.225.232  Bcast:134.219.225.255  Mask:255.255.255.0
          inet6 addr: fe80::a236:9fff:fe2c:8f7c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:380221630 errors:0 dropped:0 overruns:0 frame:0
          TX packets:266028510 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:3500942446699 (3.1 TiB)  TX bytes:399922990727 (372.4 GiB)

Anyone know what this means?







From: Testbed Support for GridPP member institutes <[log in to unmask]> on behalf of Elena Korolkova <[log in to unmask]>
Sent: 30 December 2018 22:23
To: [log in to unmask]
Subject: Re: HC failures, DPM problem?
 
Hi Simon,

There were 68 failed and 25 finished jobs during the last 12 h.

There is a problem opening a file:

https://aipanda167.cern.ch/media/filebrowser/a84d70c5-5eb3-4dc4-a879-545f6a656734/user.gangarbt/tarball_PandaJob_4196806388_ANALY_RHUL_SL6/athena_stdout.txt
 ERROR   [ERROR] Server responded with an error: [3010] Unable to open /dpm/ppgrid1.rhul.ac.uk/home/atlas/atlasdatadisk/rucio/mc15_13TeV/ed/68/AOD.05536542._000001.pool.root.1; permission denied.

Could you please check permissions for other files as well?
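
One possible way to see which other files are affected is to scan the xrootd log on the disk server for the same error text. This is a sketch only: the helper name denied_paths is made up, and the log location in the usage comment is an assumption that should be adjusted to the local setup.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print each distinct path
# that appears in an "Unable to open <path>; permission denied" message.
denied_paths() {
    sed -n 's/.*Unable to open \(.*\); permission denied.*/\1/p' | sort -u
}

# Example usage (log path is an assumption, not taken from this thread):
#   denied_paths < /var/log/xrootd/disk/xrootd.log
```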

Elena







On 30 Dec 2018, at 18:54, Jeremy Coles <[log in to unmask]> wrote:

> Hi Simon,
>
> No immediate technical suggestions from me, but (since limited numbers of people are online presently and it could be a HC issue) you could also seek help from:
>
>> ATLAS UK Cloud Support <[log in to unmask]>
>
> or
>
>> "atlas-adc-hammercloud-support (ATLAS ADC HammerCloud Support)" <[log in to unmask]>
>
> Best regards,
> Jeremy
>
>
>
>
>> On 30 Dec 2018, at 16:53, George, Simon <[log in to unmask]> wrote:
>>
>> If anyone is working at the moment, I'd appreciate some help.
>> I found that ANALY_RHUL_SL6 is blacklisted due to HC test failures:
>> http://hammercloud.cern.ch/hc/app/atlas/siteoverview/?site=UKI-LT2-RHUL&startTime=2018-12-17&endTime=2018-12-31&templateType=isGolden
>> The site was down for a few days before xmas due to a PDU problem taking down a critical network switch which put several key servers offline (various lessons being learned...)
>> Since coming back up everything looks to be up from my side but HC says otherwise. It looks like a problem with DPM but I cannot find what is wrong, all the services are running, just some actions fail. When I try by hand to retrieve a file via webdav that an HC test failed to access, I could download it no problem.
>> Any suggestions please, what would be worth checking?
>> Thanks,
>> Simon
>>
>> To unsubscribe from the TB-SUPPORT list, click the following link:
>> https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=TB-SUPPORT&A=1



-- 
Respect is a rational process. \\//
For Ur-Fascism, disagreement is treason. (U. Eco)

