Dear All,
Bristol has lately been seeing many hanging lcg-cp processes, on both the
HPC & PP clusters, where the user has set a timeout but the timeout does
not take effect. So far the problem has always involved transfers to the
DPM SE polgrid4.in2p3.fr.
cms031 27264 27263 0 00:43 ? 00:00:00 lcg-cp --verbose --vo=cms -b
-D srmv2 -t 2400 --verbose
file:///home/cms031/globus-tmp.bse05.22867.0/https_3a_2f_2flb007.cnaf.infn.it_3a9000_2fjNyOXBbN6ry4m1teix8bcg/CMSSW_2_2_3/AH_4lMET_10TeV_GEN_SIM_DIGI_L1_DIGI2RAW_HLT_270.root
srm://polgrid4.in2p3.fr:8446/srm/managerv2?SFN=/dpm/in2p3.fr/home/cms/trivcat/store/user/dimatteo/H_Chi2Chi2_4l_Tan5_Mzero120_Mhalf205_10TeV_GEN_HLT/H_Chi2Chi2_4l_Tan5_Mzero120_Mhalf205_10TeV_GEN_HLT/b866583716456d984ece7f7aab74e8bf/AH_4lMET_10TeV_GEN_SIM_DIGI_L1_DIGI2RAW_HLT_270.root
That has made a 0-length file at the remote end & the process is hanging.
66877.lcgce01.phy.br cms031 cms STDIN 8383 1 -- -- 48:00 R 07:18 bse06
qstat -f 66877 | egrep 'cput|wallt'
resources_used.cput = 07:18:48
resources_used.walltime = 19:08:56
I can log onto the WN, become cms031, initialize the environment & proxy,
& run the lcg-cp command by hand. Sometimes it succeeds completely &
swiftly, sometimes it hangs; when it hangs in my test, the lcg-cp timeout
does work.
[cms031@bse06]$ time lcg-cp -t 120 -v
CMSSW_2_2_3/AH_4lMET_10TeV_GEN_SIM_DIGI_L1_DIGI2RAW_HLT_499.root
srm://polgrid4.in2p3.fr:8446/srm/managerv2?SFN=/dpm/in2p3.fr/home/cms/trivcat/store/user/dimatteo/H_Chi2Chi2_4l_Tan5_Mzero120_Mhalf205_10TeV_GEN_HLT/H_Chi2Chi2_4l_Tan5_Mzero120_Mhalf205_10TeV_GEN_HLT/b866583716456d984ece7f7aab74e8bf/AH_4lMET_10TeV_GEN_SIM_DIGI_L1_DIGI2RAW_HLT_499-8.root
Destination SE type: SRMv2
Destination SRM Request Token: 309157d6-a248-4d1e-8457-08a0d06057b0
Source URL:
file:/home/cms031/globus-tmp.bse06.9072.0/https_3a_2f_2flb001.cnaf.infn.it_3a9000_2fZcQVnpxbSDm1CBz2Hd9NUQ/CMSSW_2_2_3/AH_4lMET_10TeV_GEN_SIM_DIGI_L1_DIGI2RAW_HLT_499.root
File size: 298343664
Source URL for copy:
file:/home/cms031/globus-tmp.bse06.9072.0/https_3a_2f_2flb001.cnaf.infn.it_3a9000_2fZcQVnpxbSDm1CBz2Hd9NUQ/CMSSW_2_2_3/AH_4lMET_10TeV_GEN_SIM_DIGI_L1_DIGI2RAW_HLT_499.root
Destination URL:
gsiftp://polgrid64.in2p3.fr/polgrid64.in2p3.fr:/data6/cms/2009-03-04/AH_4lMET_10TeV_GEN_SIM_DIGI_L1_DIGI2RAW_HLT_499-8.root.1267317.0
# streams: 1
# set timeout to 120 (seconds)
1048576 bytes 1553.03 KB/sec avg 1553.03 KB/sec inst [ stops - hung ]
real 2m2.339s [ it exited, so I suppose the timeout worked ]
user 0m0.153s
sys 0m0.015s
Why isn't the lcg-cp timeout working for the user's grid job? Or is this a
bug?
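Until the cause is known, one possible stopgap is to put an external
deadline around the transfer so that a hung lcg-cp cannot hold a job slot
for hours. The sketch below is only an illustration: the function name
run_with_deadline and the 5-second grace period are made up for this
example, and it assumes lcg-cp exits on SIGTERM (SIGKILL is sent as a
fallback if it does not).

```shell
# run_with_deadline SECONDS CMD [ARGS...]
# Hypothetical helper: run CMD in the background, send SIGTERM once SECONDS
# have elapsed, and SIGKILL after a further grace period (assumed: 5 s).
# Returns CMD's exit status (typically 143 after TERM, 137 after KILL).
run_with_deadline() {
    local deadline=$1 grace=5 waited=0 rc=0
    shift
    "$@" &
    local cmd_pid=$!
    # Watchdog: polls once per second, so it ends within about a second
    # of the command itself ending.
    (
        while kill -0 "$cmd_pid" 2>/dev/null; do
            if [ "$waited" -eq "$deadline" ]; then
                kill -TERM "$cmd_pid" 2>/dev/null || true
            elif [ "$waited" -ge $((deadline + grace)) ]; then
                kill -KILL "$cmd_pid" 2>/dev/null || true
            fi
            sleep 1
            waited=$((waited + 1))
        done
    ) >/dev/null 2>&1 &
    local dog_pid=$!
    wait "$cmd_pid" || rc=$?
    wait "$dog_pid" 2>/dev/null || true
    return "$rc"
}
```

In a job wrapper this might be called as
run_with_deadline 2700 lcg-cp --vo=cms -b -D srmv2 -t 2400 --verbose <source> <destination>
where 2700 is an arbitrary value chosen to sit a little above the user's
own -t 2400. It does not explain why the built-in timeout is ignored, and
it does not clean up the 0-length file left at the remote end.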
I ran a number of tests using globus-url-copy to that DPM SE with various
file sizes; success or failure was independent of file size. It looks like
something on the network path intermittently stops & starts working. Is
there a way to debug this?
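To see whether the failures cluster in time, the same transfer could be
repeated and each attempt logged with a timestamp, exit status and
duration. This is a rough sketch, not a diagnosis tool: probe_cmd and the
log format are invented for this example, and the probed command is
assumed to carry its own timeout so that one hang does not stall the loop.

```shell
# probe_cmd ATTEMPTS LOGFILE CMD [ARGS...]
# Hypothetical helper: run CMD ATTEMPTS times and append one line per
# attempt (UTC timestamp, attempt number, exit status, seconds taken).
probe_cmd() {
    local attempts=$1 logfile=$2 i=1 start end rc
    shift 2
    while [ "$i" -le "$attempts" ]; do
        start=$(date +%s)
        rc=0
        # Discard the command's own output; only the outcome is logged.
        "$@" >/dev/null 2>&1 || rc=$?
        end=$(date +%s)
        printf '%s attempt=%d rc=%d seconds=%d\n' \
            "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$i" "$rc" "$((end - start))" \
            >> "$logfile"
        i=$((i + 1))
    done
}
```

For example, probe_cmd 50 probe.log lcg-cp -t 120 -v <source> <destination>
reuses the -t 120 from the manual test above. Comparing the timestamps of
the failed attempts with logs at the remote end, or rerunning a failing
case by hand with globus-url-copy -dbg, might help narrow down where the
transfer stalls.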
I've been in contact with an admin at the remote end, who at first saw no
problem with lcg-cp to that SE from elsewhere, but later reported
"we are facing hanging jobs problems also".
Grateful for Advice,
Winnie L