It is inefficient and also misleading in the log files. Is AGIS queried every time?

For SSB I'm waiting for the next iteration to be sure there is still a problem. I will send config/logs if it persists.


On 10/03/2014 13:35, Wahid Bhimji wrote:
Hi

Yeah - that is probably OK, in that it looks in all the top-level directories/spacetokens for the file.
If hotdisk is retired then it shouldn't (needn't) look there - I guess it does because it is still in AGIS:
http://atlas-agis.cern.ch/agis/ddm_endpoint/detail/UKI-NORTHGRID-MAN-HEP_HOTDISK/full/
But if it's not actually causing a problem then it's not a problem (though a bit inefficient). I can imagine it's the same for everyone.

However you are actually red for fax in SSB. 
140310 13:26:19 18446 Xrd: GetAccessToSrv: HandShake failed with server [bohr3226.tier2.hep.manchester.ac.uk:11000]
Maybe that was just at the time you were restarting - but if not perhaps send the logs / config again...

Wahid 

On 10 Mar 2014, at 13:15, Alessandra Forti <[log in to unmask]> wrote:

Actually the file name always looks the same, so it might be some loop over directories.

On 10/03/2014 13:13, Alessandra Forti wrote:
Hi,

it looks healthier but not perfect

140310 13:05:53 9187 XrootdXeq: atlasplt.4371:[log in to unmask] disc 0:00:01
140310 13:05:53 9185 XrootdXeq: atlpilot.21187:[log in to unmask] login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlasfr1/CN=445173/CN=Robot: ATLAS Fr Factory 1
140310 13:05:53 0x8e50700 XRD-N2N: lookup /atlas/rucio/user/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140310 13:05:53 0x8e50700 XRD-N2N: cache hit, return /dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasdatadisk/rucio/user/ivukotic/5f/05/user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140310 13:05:53 9182 XrootdXeq: ivukotic.23906:[log in to unmask] disc 0:00:01
140310 13:05:53 9184 XrootdXeq: atlas200.4017:[log in to unmask] login as /C=CA/O=Grid/OU=triumf.ca/CN=Asoka De Silva GC1
140310 13:05:53 9195 XrootdXeq: atlpilot.14967:[log in to unmask] login as /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlasfr1/CN=445173/CN=Robot: ATLAS Fr Factory 1
140310 13:05:53 0x8748700 XRD-N2N: lookup /atlas/rucio/user/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140310 13:05:53 0x8748700 XRD-N2N: cache hit, return /dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasdatadisk/rucio/user/ivukotic/5f/05/user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140310 13:05:53 0x8f51700 XRD-N2N: lookup /atlas/rucio/user/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140310 13:05:53 0x8f51700 XRD-N2N: cache hit, return /dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasdatadisk/rucio/user/ivukotic/5f/05/user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140310 13:05:55 9117 XrootdXeq: atlpil23.704:[log in to unmask] disc 0:00:04
140310 13:05:55 9118 XrootdXeq: atlas097.103420:[log in to unmask] disc 0:00:03
140310 13:05:57 9186 XrootdXeq: pilatlas.18025:[log in to unmask] disc 0:00:05
140310 13:05:57 9190 XrootdXeq: atlpilot.14967:[log in to unmask] disc 0:00:05
140310 13:06:00 9196 XrootdXeq: atlpilot.21187:[log in to unmask] disc 0:00:08
140310 13:06:01 9180 cms_Finder: Connected to cmsd via /var/spool/xrootd/fedredir_atlas/.olb/olbd.admin
140310 13:06:02 9187 XrootdXeq: usatlas1.24147:[log in to unmask] disc 0:00:10
140310 13:06:07 9182 XrootdXeq: atlas200.4017:[log in to unmask] disc 0:00:14
140310 13:07:02 9117 XrootdXeq: patlas11.104816:[log in to unmask] disc 0:01:10
140310 13:07:48 0x8849700 XRD-N2N: lookup /atlas/rucio/mc12_8TeV:NTUP_TOP.01216672._000094.root.1
140310 13:07:48 9193 Xrd: CheckErrorStatus: Server [rn2n8@localhost] declared: Unable to locate /dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasgroupdisk/soft-test/rucio/mc12_8TeV/61/b1/NTUP_TOP.01216672._000094.root.1; no such file or directory(error code: 3011)
140310 13:07:48 9193 Xrd: CheckErrorStatus: Server [rn2n9@localhost] declared: Unable to locate /dpm/tier2.hep.manchester.ac.uk/home/atlas/atlaslocalgroupdisk/rucio/mc12_8TeV/61/b1/NTUP_TOP.01216672._000094.root.1; no such file or directory(error code: 3011)
140310 13:07:48 9193 Xrd: CheckErrorStatus: Server [rn2n10@localhost] declared: Unable to locate /dpm/tier2.hep.manchester.ac.uk/home/atlas/atlashotdisk/rucio/mc12_8TeV/61/b1/NTUP_TOP.01216672._000094.root.1; no such file or directory(error code: 3011)
140310 13:07:48 9193 Xrd: CheckErrorStatus: Server [rn2n11@localhost] declared: Unable to locate /dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasdatadisk/rucio/mc12_8TeV/61/b1/NTUP_TOP.01216672._000094.root.1; no such file or directory(error code: 3011)
140310 13:07:48 9193 Xrd: CheckErrorStatus: Server [rn2n12@localhost] declared: Unable to locate /dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasscratchdisk/rucio/mc12_8TeV/61/b1/NTUP_TOP.01216672._000094.root.1; no such file or directory(error code: 3011)
140310 13:07:48 9193 Xrd: CheckErrorStatus: Server [rn2n13@localhost] declared: Unable to locate /dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasproddisk/rucio/mc12_8TeV/61/b1/NTUP_TOP.01216672._000094.root.1; no such file or directory(error code: 3011)
140310 13:07:48 0x8849700 XRD-N2N: no valid replica for /atlas/rucio/mc12_8TeV:NTUP_TOP.01216672._000094.root.1
140310 13:07:48 9193 dpmfinder_Locate: [#01.000002] N2N error
140310 13:07:56 0x894a700 XRD-N2N: lookup /atlas/rucio/data12_8TeV:NTUP_SMWZ.01122120._000008.root.1
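
For a failed lookup like the NTUP_TOP one above, the directories that N2N tried can be tallied straight from the log. This is only a sketch: the function name `list_n2n_misses` is made up for illustration, and it assumes the "Unable to locate /dpm/<domain>/home/atlas/<dir>/..." message format seen in these log lines.

```shell
# list_n2n_misses: read an xrootd.log on stdin and print "<dir> <count>"
# for each top-level directory under /dpm/<domain>/home/atlas/ that shows
# up in an "Unable to locate" error.
# (Hypothetical helper for illustration, not part of xrootd or DPM.)
list_n2n_misses() {
    sed -n 's|.*Unable to locate /dpm/[^/]*/home/atlas/\([^/]*\)/.*|\1|p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example, using the log path mentioned in this thread:
#   list_n2n_misses < /var/log/xrootd/fedredir_atlas/xrootd.log
```

A retired spacetoken such as atlashotdisk appearing in the output would suggest that the prefix list N2N works from still contains it.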


there are various errors, perhaps due to DB inconsistencies, since some are about the hotdisk subdir which I eliminated a few months ago

 dpns-ls /dpm/tier2.hep.manchester.ac.uk/home/atlas/atlashotdisk
/dpm/tier2.hep.manchester.ac.uk/home/atlas/atlashotdisk: No such file or directory

let me know.

cheers
alessandra

On 10/03/2014 12:24, Wahid Bhimji wrote:
Hi

Looks fine to me. 
I still have the LFC_HOST lines, but I guess the point is that with the rucio-only name2name they are not needed any more, so indeed I would leave them out, as the new documentation says.

let me know if it doesn't work out and I will take a look
cheers
Wahid

On 10 Mar 2014, at 12:09, Alessandra Forti <[log in to unmask]> wrote:

Hi,

so I'm going to change YAIM now. I've updated to the latest dpm-yaim rpm. Comparing my YAIM config with the link you sent, I have these differences which I'd like you to confirm

Replace

OUT: DPM_XROOTD_FED_ATLAS_NAMELIB="XrdOucName2NameLFC.so root=/dpm/${MY_DOMAIN}/home/atlas match=bohr3226.tier2.hep.manchester.ac.uk"
   IN: DPM_XROOTD_FED_ATLAS_NAMELIB="XrdOucName2NameLFC.so pssorigin=localhost sitename=ATLAS_SITENAME"

where ATLAS_SITENAME is the ATLAS site name, UKI-NORTHGRID-MAN-HEP. And

OUT: DPM_XROOTD_REDIR_MISC="$DPM_XROOTD_DISK_MISC"
   IN: DPM_XROOTD_REDIR_MISC="$DPM_XROOTD_DISK_MISC dpm.mmreqhost localhost"


OUT: DPM_XROOTD_DISK_MISC="xrootd.monitor all rbuff 32k auth flush 30s window 5s dest files info user io redir atl-prod05.slac.stanford.edu:9930
        if exec xrootd
            xrd.report atl-prod05.slac.stanford.edu:9931 every 60s all -buff -poll sync
        fi"

IN: DPM_XROOTD_DISK_MISC="xrootd.monitor all auth flush 30s fstat 60 lfn ops xfr 5 window 5s dest fstat info user redir atl-prod05.slac.stanford.edu:9930
        if exec xrootd
            xrd.report atl-prod05.slac.stanford.edu:9931 every 60s all -buff -poll sync
        fi"

The variables below are not mentioned in the documentation link; should I keep them?

DPM_XROOTD_FED_ATLAS_SETENV="LFC_HOST=prod-lfc-atlas-ro.cern.ch LFC_CONRETRY=0 GLOBUS_THREAD_MODEL=pthread CSEC_MECH=ID"
DPM_XROOTD_FED_ATLAS_MISC="$DPM_XROOTD_DISK_MISC"

thanks

cheers
alessandra


On 07/03/2014 08:43, Wahid Bhimji wrote:
Hi Alessandra

Thanks for the config. 
First the n2n-rpm-list file is empty. But anyway you should have at least xrootd-server-atlas-n2n-plugin-2.0-0.x86_64 from the WLCG repo.

Assuming you have that version, then the arguments for dpm.namelib should have
pssorigin=localhost sitename=UKI-NORTHGRID-MAN-HEP 
at the end 

This changed a while back - but you probably have the old config. The newest yaim variables are at 

Without the sitename option it will not get the correct set of "rucioprefixes", which might be why it's searching in all those crazy places for each file.

hopefully that's it - I didn't check the rest..
Wahid


On 6 Mar 2014, at 19:21, Alessandra Forti <[log in to unmask]> wrote:

Number of jobs in transferring state shouldn't be that high. There might be some other reason but the first impression was that it was due to this, considering the staggering number of CLOSE_WAIT connections today.

I put everything here http://ks.tier2.hep.manchester.ac.uk/T2/tmp/xrootd-debug-20140306.tgz

thanks

cheers
alessandra

On 06/03/2014 19:11, Wahid Bhimji wrote:
Hi

Some of those messages are "normal" when looking for files that don't exist.
But the number and odd search paths (and the fact that fedredir doesn't start properly) make me think something is wrong with the Manchester config.

can you send me the conf files from /etc/xrootd and versions
rpm -qa | grep xrootd 
rpm -qa | grep -i n2n 

and also full logs from 
/var/log/xrootd/fedredir_atlas/xrootd.log
/var/log/xrootd/dpmredir/xrootd.log

and I'll see if I spot something. 

(PS - what makes you say it is "clogging" transfers at Glasgow (or Manchester)?)

Wahid

On 6 Mar 2014, at 18:50, Alessandra Forti <[log in to unmask]> wrote:

Last one of the day

The /var/log/xrootd/fedredir_atlas/xrootd.log is now full of these messages since 18:23

140306 18:38:16 0x4bfff700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.annovi/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-1M
140306 18:38:16 0x4bfff700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.anventur/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-1M
140306 18:38:16 0x4bfff700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.aolszewski/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-1M
140306 18:38:16 0x4bfff700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.aoun/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-1M
140306 18:38:17 0x4bfff700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.apenson/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-1M
140306 18:38:17 0x4bfff700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.apereira/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-1M
140306 18:38:17 0x5aa0a700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.arobic/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140306 18:38:17 0x5aa0a700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.aroe/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140306 18:38:17 0x5a606700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.arnaudferrari/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140306 18:38:17 0x588e9700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.JoshuaMiloKunkle/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140306 18:38:17 0x587e8700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.GemmaHollyWooden/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140306 18:38:17 0x587e8700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.GiulioUsai/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140306 18:38:18 0x587e8700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.Gordon/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140306 18:38:18 0x599fa700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.Kerim/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140306 18:38:18 0x589ea700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.GiulioUsai/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
140306 18:38:18 0x59afb700 XRD-LFC No such file or directory /grid/atlas/users/pathena/user.arnaez/ivukotic:user.ivukotic.xrootd.uki-northgrid-man-hep-100M
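
How many of these misses each requested file generates can be counted from the log. A rough sketch, assuming the format above where the LFC path is the last field on the line; `count_lfc_misses` is a hypothetical name used only here.

```shell
# count_lfc_misses: read an xrootd.log on stdin and print "<count> <name>"
# for every file name (last path component) that appears in an
# "XRD-LFC No such file or directory" message.
# (Illustrative helper; output order is not guaranteed.)
count_lfc_misses() {
    awk '/XRD-LFC No such file or directory/ {
             n = split($NF, parts, "/")   # $NF is the full LFC path
             misses[parts[n]]++
         }
         END { for (name in misses) print misses[name], name }'
}

# Example, using the log path mentioned in this thread:
#   count_lfc_misses < /var/log/xrootd/fedredir_atlas/xrootd.log
```

A handful of file names with very large counts would be consistent with one lookup being repeated across every user directory, as the excerpt suggests.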


On 06/03/2014 18:25, Alessandra Forti wrote:
Also fedredir keeps failing to restart properly

[root@bohr3226 ~]# service xrootd restart
Shutting down xrootd (xrootd, redir):                      [  OK  ]
Shutting down xrootd (xrootd, disk):                       [  OK  ]
Shutting down xrootd (xrootd, fedredir_atlas):             [FAILED]
Starting xrootd (xrootd, redir):                           [  OK  ]
Starting xrootd (xrootd, disk):                            [  OK  ]
Starting xrootd (xrootd, fedredir_atlas):                  [  OK  ]


I've applied xrd.timeout idle 1800 in the fedredir configuration file outside of any if statement and restarted xrootd.
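
Whether the directive really sits outside every if ... fi block can be checked mechanically. A minimal sketch, assuming the config puts `if` and `fi` at the start of their own lines, as in the xrd.report blocks quoted elsewhere in this thread; `timeout_outside_if` is an illustrative name, not an xrootd tool.

```shell
# timeout_outside_if: read an xrootd config on stdin, print any xrd.timeout
# line found outside if ... fi blocks, and return 0 only if one was found.
# (Hypothetical helper; assumes "if" / "fi" are the first word on their line.)
timeout_outside_if() {
    awk '
        $1 == "if" { depth++; next }
        $1 == "fi" { if (depth > 0) depth--; next }
        depth == 0 && $1 == "xrd.timeout" { found = 1; print }
        END { exit(found ? 0 : 1) }
    '
}

# Example, using the file name listed further down this thread:
#   timeout_outside_if < /etc/xrootd/xrootd-dpmfedredir_atlas.cfg
```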

let's hope it helps.

cheers
alessandra

On 06/03/2014 18:14, Alessandra Forti wrote:
I haven't applied the recipe yet but

+10k CLOSE_WAIT connections

Thu Mar  6 18:10:01 GMT 2014
  11962 CLOSE_WAIT
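
The thread doesn't say how this count was produced. One way to get a comparable number is sketched below, assuming `netstat -tan`-style output in which the connection state is the sixth column; the function name `count_tcp_state` is just for illustration.

```shell
# count_tcp_state STATE: read `netstat -tan`-style lines on stdin and print
# how many TCP connections are in STATE (for example CLOSE_WAIT).
# Assumes the state is the 6th whitespace-separated field of each tcp line.
count_tcp_state() {
    awk -v state="$1" '$1 ~ /^tcp/ && $6 == state { n++ } END { print n + 0 }'
}

# Example on the head node, e.g. from cron to watch the trend:
#   date; netstat -tan | count_tcp_state CLOSE_WAIT
```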

if this is caused by FAX this is not going to work.

It is also clogging the transfers from jobs at Manchester and, by the looks of it, at Glasgow too

http://panda.cern.ch/server/pandamon/query?dash=prod

cheers
alessandra

On 05/03/2014 16:53, Sam Skipsey wrote:
If it's the federated redirector that's having problems, then
xrootd-dpmfedredir_atlas.cfg.

Otherwise, if it's the local redirector (and if local jobs were
breaking, then I guess it was?), then xrootd-dpmredir.cfg

(Or try changing both?)

Sam

On 5 March 2014 16:38, Alessandra Forti <[log in to unmask]> wrote:
Which of these files should I change?

[root@bohr3226 xrootd]# ls
dpmxrd-sharedkey.dat  xrootd-dpmdisk.cfg
xrootd-dpmfedredir_atlas.cfg         xrootd-dpmredir.cfg
xrootd-standalone.cfg
xrootd-clustered.cfg  xrootd-dpmdisk.cfg.rpmnew
xrootd-dpmfedredir_atlas.cfg.rpmnew  xrootd-dpmredir.cfg.rpmnew


On 05/03/2014 16:00, Alessandra Forti wrote:

Andy Hanushevsky suggested one could use the idle option
"The particular directive can be found at:
http://xrootd.org/doc/prod/xrd_config.htm#_Toc310725348 specifically the
idle option."

However I am not sure who has tried that - nor am I sure it is the way
forward...

I can try this although I'm not sure a connection in CLOSE_WAIT can be
considered "Idle".







The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



