On Mon, Aug 24, 2009 at 4:26 PM, Fawad Saeed<[log in to unmask]> wrote:
> Dear all
>
> Our CE (pcncp04.ncp.edu.pk), running SLC 4.6, has been behaving strangely
> since this morning. All incoming jobs start with status "R", but then the
> status suddenly changes from "R" to "E". The SAM test is complaining about
> JobWrapper, as below:
Hi Fawad,
Start by looking at
http://goc.grid.sinica.edu.tw/gocwiki/Cannot_read_JobWrapper_output...
and in particular
http://goc.grid.sinica.edu.tw/gocwiki/ssh_problem_from_WN_to_CE
Look in the pbs_mom logs of the WN where the job ran. With luck
there will be an error message there.
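
A minimal sketch of those two checks, to be run on the WN as the pool
account the job was mapped to. The function names are hypothetical, and
the pbs_mom log location (/var/spool/pbs/mom_logs/YYYYMMDD) is an
assumption based on a default Torque layout; adjust it to your install.

```shell
#!/bin/sh
# Hypothetical helpers; names and paths are illustrative assumptions.

# Print lines in a pbs_mom log that mention errors, failures or denials.
# $1: path to a pbs_mom log file (assumed: /var/spool/pbs/mom_logs/YYYYMMDD)
mom_log_errors() {
    grep -i -E 'error|fail|denied' "$1"
}

# Check that non-interactive ssh from this WN back to the CE works.
# BatchMode=yes makes ssh fail rather than prompt for a password, which
# is how a batch job would experience a broken host-based setup.
# $1: CE host name
check_ssh_to_ce() {
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$1" hostname
}
```

For example: check_ssh_to_ce pcncp04.ncp.edu.pk, then
mom_log_errors /var/spool/pbs/mom_logs/20090824. If the ssh call fails or
asks for a password when run without BatchMode, the second wiki page
above is the place to look.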
Steve
>
>
>
> *************************************************************
>
> BOOKKEEPING INFORMATION:
>
> Status info for the Job : https://wms209.cern.ch:9000/eu799lmLzB8ktB9Ykx-HjQ
>
> Current Status: Aborted
>
> Logged Reason(s):
>
> - File not available.Cannot read JobWrapper output, both from Condor and
> from Maradona.
>
> Status Reason: hit job shallow retry count (1)
>
> Destination: pcncp04.ncp.edu.pk:2119/jobmanager-lcgpbs-ops
>
> Submitted: Mon Aug 24 12:14:04 2009 CEST
>
> ***********************************************************************
>
>
>
>
>
> Meanwhile, the CE's logs show the following for a particular job:
>
>
>
>
>
>
>
> *********************************************************************
>
> Aug 24 11:30:09 pcncp04 sshd[24884]: Accepted hostbased for prdcms35 from
> 172.16.14.54 port 59869 ssh2
>
> Aug 24 17:30:09 pcncp04 sshd[24883]: Accepted hostbased for prdcms35 from
> 172.16.14.54 port 59869 ssh2
>
> Aug 24 17:30:09 pcncp04 sshd(pam_unix)[24885]: session opened for user
> prdcms35 by (uid=0)
>
> Aug 24 17:30:09 pcncp04 sshd[24885]: User prdcms35 attempting to execute
> command 'scp -r -p -f
> /home/prdcms35/.lcgjm/globus-cache-export.r24829/globus-cache-export.r24829.gpg'
> on command line
>
> Aug 24 17:30:09 pcncp04 sshd(pam_unix)[24885]: session closed for user
> prdcms35
>
>
>
> **********************************************************************
>
>
>
>
>
> On the other hand, when I submit a job from the cic-samadmin portal, SAM
> shows a CE-sft-lcg-rm-cr failure:
>
>
>
> *********************************************************************
>
> Checking lcg-cr command
>
> Netork timeout on LFC: LFC_CONNTIMEOUT=10 LFC_CONRETRY=1 LFC_CONRETRYINT=2
>
> Network and search timeouts on BDII set for lcg-utils:
> LCG_GFAL_BDII_TIMEOUT=20
>
> SE timeouts in sec: connect 10, send/receive 120, SRM 180
>
> Using lcg-utils version:
>
> + lcg-cp --version
>
> lcg_util-1.7.4-1
>
> GFAL-client-1.11.6-2
>
> + set +x
>
> Create a local file: sft-lcg-rm-cr.txt
>
> Move the file to the default SE (pcncp22.ncp.edu.pk) and register it with
> the LFN: sft-lcg-rm-cr-wn46.ncp.edu.pk.090824075522.936461
>
> ++ pwd
>
> + lcg-cr --connect-timeout 10 --sendreceive-timeout 120 --bdii-timeout 20
> --srm-timeout 180 -v --vo ops -d pcncp22.ncp.edu.pk -l
> lfn:sft-lcg-rm-cr-wn46.ncp.edu.pk.090824075522.936461
> file:///home/sgmops03/globus-tmp.wn46.20102.0/https_3a_2f_2fglite-rb-01.cnaf.infn.it_3a9000_2fD1EMPLZtjdd1KJMz7MrN-g/work/testjob/nodes/pcncp04.ncp.edu.pk/sft-lcg-rm-cr.txt
>
> Using grid catalog type: lfc
>
> Using grid catalog : prod-lfc-shared-central.cern.ch
>
> Checksum type: None
>
> SE type: SRMv2
>
> Destination SURL :
> srm://pcncp22.ncp.edu.pk/dpm/ncp.edu.pk/home/ops/generated/2009-08-24/file81329b5d-d306-4957-b79a-9225d881d615
>
> Source SRM Request Token: 8cf31d51-c6bc-44c3-800e-af7c983b600b
>
> Source URL:
> file:/home/sgmops03/globus-tmp.wn46.20102.0/https_3a_2f_2fglite-rb-01.cnaf.infn.it_3a9000_2fD1EMPLZtjdd1KJMz7MrN-g/work/testjob/nodes/pcncp04.ncp.edu.pk/sft-lcg-rm-cr.txt
>
> File size: 228
>
> VO name: ops
>
> Destination specified: pcncp22.ncp.edu.pk
>
> Destination URL for copy:
> gsiftp://pcncp22.ncp.edu.pk/pcncp22.ncp.edu.pk:/storage1/ops/2009-08-24/file81329b5d-d306-4957-b79a-9225d881d615.135061.0
>
> # streams: 1
>
> 228 bytes 1.12 KB/sec avg 1.12 KB/sec inst
>
> Transfer took 1000 ms
>
> send2nsd: NS002 - send error : Bad credentials
>
> send2nsd: NS002 - send error : Bad credentials
>
> [LFC][lfc_statg][] prod-lfc-shared-central.cern.ch:
> lfn:/grid/ops/SAM/sft-lcg-rm-cr-wn46.ncp.edu.pk.090824075522.936461: Bad
> credentials
>
> send2nsd: NS002 - send error : Bad credentials
>
> srm://pcncp22.ncp.edu.pk/dpm/ncp.edu.pk/home/ops/generated/2009-08-24/file81329b5d-d306-4957-b79a-9225d881d615:
> Registration failed, please register it by hand, when the problem will be
> solved
>
> guid:43a028fd-c6a3-457d-ac5c-8093def0c6bb
>
> lcg_cr: Communication error on send
>
> + result=1
>
> + set +x
>
> List the replicas:
>
> + lcg-lr --vo ops lfn:sft-lcg-rm-cr-wn46.ncp.edu.pk.090824075522.936461
>
> send2nsd: NS002 - send error : Bad credentials
>
> [LFC][lfc_getreplica][] prod-lfc-shared-central.cern.ch:
> /grid/ops/SAM/sft-lcg-rm-cr-wn46.ncp.edu.pk.090824075522.936461: Bad
> credentials
>
> lcg_lr: Communication error on send
>
> + set +x
>
> ************************************************************************
>
>
>
>
>
> Any idea what is causing this?
>
> Thanks in advance.
>
>
>
>
> Regards,
> FAWAD SAEED
> Scientific Officer Computing
> National Centre for Physics
> Islamabad
> Tel: +92 - 51 260 1018
> Fax: +92 - 51 920 5753
> Email: [log in to unmask]
--
Steve Traylen