Dear all,

Our CE (pcncp04.ncp.edu.pk), running SLC 4.6, has been behaving strangely since this morning. All incoming jobs start in status "R", but then the status suddenly changes from "R" to "E". The SAM test complains about the JobWrapper, as shown below:

 
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://wms209.cern.ch:9000/eu799lmLzB8ktB9Ykx-HjQ
Current Status:     Aborted 
Logged Reason(s):
    - File not available.Cannot read JobWrapper output, both from Condor and from Maradona.
Status Reason:      hit job shallow retry count (1)
Destination:        pcncp04.ncp.edu.pk:2119/jobmanager-lcgpbs-ops
Submitted:          Mon Aug 24 12:14:04 2009 CEST

***********************************************************************

 

 

Meanwhile, the CE's logs show the following for one particular job:

*********************************************************************

Aug 24 11:30:09 pcncp04 sshd[24884]: Accepted hostbased for prdcms35 from 172.16.14.54 port 59869 ssh2

Aug 24 17:30:09 pcncp04 sshd[24883]: Accepted hostbased for prdcms35 from 172.16.14.54 port 59869 ssh2

Aug 24 17:30:09 pcncp04 sshd(pam_unix)[24885]: session opened for user prdcms35 by (uid=0)

Aug 24 17:30:09 pcncp04 sshd[24885]: User prdcms35 attempting to execute command 'scp -r -p -f /home/prdcms35/.lcgjm/globus-cache-export.r24829/globus-cache-export.r24829.gpg' on command line

Aug 24 17:30:09 pcncp04 sshd(pam_unix)[24885]: session closed for user prdcms35

  

**********************************************************************

 


 


On the other hand, when I submitted a job from the cic-samadmin portal, SAM reported a CE-sft-lcg-rm-cr failure, as follows:

*********************************************************************


Checking lcg-cr command

Netork timeout on LFC: LFC_CONNTIMEOUT=10 LFC_CONRETRY=1 LFC_CONRETRYINT=2
Network and search timeouts on BDII set for lcg-utils: LCG_GFAL_BDII_TIMEOUT=20
SE timeouts in sec: connect 10, send/receive 120, SRM 180

Using lcg-utils version: 

+ lcg-cp --version
lcg_util-1.7.4-1
GFAL-client-1.11.6-2
+ set +x

Create a local file: sft-lcg-rm-cr.txt 

Move the file to the default SE (pcncp22.ncp.edu.pk) and register it with the LFN: sft-lcg-rm-cr-wn46.ncp.edu.pk.090824075522.936461 

++ pwd
+ lcg-cr --connect-timeout 10 --sendreceive-timeout 120 --bdii-timeout 20 --srm-timeout 180 -v --vo ops -d pcncp22.ncp.edu.pk -l lfn:sft-lcg-rm-cr-wn46.ncp.edu.pk.090824075522.936461 file:///home/sgmops03/globus-tmp.wn46.20102.0/https_3a_2f_2fglite-rb-01.cnaf.infn.it_3a9000_2fD1EMPLZtjdd1KJMz7MrN-g/work/testjob/nodes/pcncp04.ncp.edu.pk/sft-lcg-rm-cr.txt
Using grid catalog type: lfc
Using grid catalog : prod-lfc-shared-central.cern.ch
Checksum type: None
SE type: SRMv2
Destination SURL : srm://pcncp22.ncp.edu.pk/dpm/ncp.edu.pk/home/ops/generated/2009-08-24/file81329b5d-d306-4957-b79a-9225d881d615
Source SRM Request Token: 8cf31d51-c6bc-44c3-800e-af7c983b600b
Source URL: file:/home/sgmops03/globus-tmp.wn46.20102.0/https_3a_2f_2fglite-rb-01.cnaf.infn.it_3a9000_2fD1EMPLZtjdd1KJMz7MrN-g/work/testjob/nodes/pcncp04.ncp.edu.pk/sft-lcg-rm-cr.txt
File size: 228
VO name: ops
Destination specified: pcncp22.ncp.edu.pk
Destination URL for copy: gsiftp://pcncp22.ncp.edu.pk/pcncp22.ncp.edu.pk:/storage1/ops/2009-08-24/file81329b5d-d306-4957-b79a-9225d881d615.135061.0
# streams: 1
          228 bytes      1.12 KB/sec avg      1.12 KB/sec inst
Transfer took 1000 ms
send2nsd: NS002 - send error : Bad credentials
send2nsd: NS002 - send error : Bad credentials
[LFC][lfc_statg][] prod-lfc-shared-central.cern.ch: lfn:/grid/ops/SAM/sft-lcg-rm-cr-wn46.ncp.edu.pk.090824075522.936461: Bad credentials
send2nsd: NS002 - send error : Bad credentials
srm://pcncp22.ncp.edu.pk/dpm/ncp.edu.pk/home/ops/generated/2009-08-24/file81329b5d-d306-4957-b79a-9225d881d615: Registration failed, please register it by hand, when the problem will be solved
guid:43a028fd-c6a3-457d-ac5c-8093def0c6bb
lcg_cr: Communication error on send
+ result=1
+ set +x

List the replicas:

+ lcg-lr --vo ops lfn:sft-lcg-rm-cr-wn46.ncp.edu.pk.090824075522.936461
send2nsd: NS002 - send error : Bad credentials
[LFC][lfc_getreplica][] prod-lfc-shared-central.cern.ch: /grid/ops/SAM/sft-lcg-rm-cr-wn46.ncp.edu.pk.090824075522.936461: Bad credentials
lcg_lr: Communication error on send
+ set +x

************************************************************************

 

 

Any idea what might be causing this issue?

Thanks in advance.
 
Regards,
FAWAD SAEED
Scientific Officer Computing
National Centre for Physics
Islamabad
Tel: +92 - 51 260 1018
Fax: +92 - 51 920 5753