Hi again,
After sending a lot of SFT jobs to xg009.inp.demokritos.gr (from
https://monitoring.egee.man.poznan.pl/) I found that the SFTs succeed from
time to time. I got 6 successful SFT jobs in a row and then the same JS
error again (all on the same WN). I realised that before the successful
SFT jobs I had changed max_running for the dteam queue
from 1 to 6 (qmgr -c "set queue dteam max_running=6").
Then I changed it back to 1 and got the JS error again, even though
no other dteam job was running and there was a free WN.
A couple of hours later I ran qmgr -c "set queue dteam
max_running=10" and got a successful SFT job.
Strange ...
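To know which value is actually in effect at the time of each test, the setting can be read back from the CE after every change. This is only a sketch: it assumes that qmgr -c "print queue dteam" prints lines of the form "set queue dteam max_running = 6", and the function name queue_max_running is made up for this example.

```shell
# Read `qmgr -c "print queue NAME"` style text on stdin and print the
# max_running value of the given queue. Returns non-zero if none is set.
# Assumed input line format: "set queue dteam max_running = 6"
queue_max_running() {
    awk -v q="$1" '
        $1 == "set" && $2 == "queue" && $3 == q && $4 == "max_running" {
            print $NF
            found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Usage on the CE (assumes qmgr is on PATH):
#   qmgr -c "print queue dteam" | queue_max_running dteam
```

Logging this value next to each SFT result would show whether the failures really follow the max_running setting or just happen to coincide with it.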
I tried the following, but I don't see anything that needs changing:
"In /etc/grid-security/gridmapdir/ there are hard links
(with strange names like
%2fc%3dch%2fo%3dcern%2fou%3dgrid%2fcn%3dpiotr%20nyczyk%209654) to each
pool account that is taken. They have the same inode number (ls -li
<filename>) as the pool account file they point to. If there's no pool
account file left free, run
/opt/edg/sbin/lcg-expiregridmapdir.pl"
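A rough sketch of that check, based only on the description quoted above: a pool account file with a link count of 1 has no hard link pointing at it and is therefore free, while a leased one shares its inode with a DN-named link. The function names are hypothetical, and -maxdepth assumes GNU or BSD find.

```shell
# Print pool account files that nobody holds (hard-link count of 1).
# DN-named entries (starting with '%') are skipped.
list_free_pool_accounts() {
    gm_dir=${1:-/etc/grid-security/gridmapdir}
    find "$gm_dir" -maxdepth 1 -type f -links 1 ! -name '%*' \
        -exec basename {} \; | sort
}

# Print "inode name" for every entry with more than one hard link,
# sorted by inode so a pool account and its DN link end up adjacent.
list_leased_inodes() {
    gm_dir=${1:-/etc/grid-security/gridmapdir}
    ls -li "$gm_dir" | awk 'NR > 1 && $3 > 1 { print $1, $NF }' | sort -n
}

# Usage on the CE:
#   list_free_pool_accounts
#   list_leased_inodes
```

If list_free_pool_accounts prints no dteam accounts at all, that would point to the "no pool account file left free" case from the quoted text.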
regards
xristos
Hi to all,
It seems we have the same problem here at Demokritos (CE:
xg009.inp.demokritos.gr).
Two days ago everything was OK (right now I have 15 jobs running, but when
a new job is submitted it ends up with an Aborted status).
The job is submitted to the CE,
the CE sends the job to a free WN and the status is "Running";
after a while the status turns to "Waiting" and finally "Aborted".
I got:
Job submission failed with error message
7 authentication failed [...]
I went to the following wiki page:
http://goc.grid.sinica.edu.tw/gocwiki/7_authentication_failed
The solution says to check the DNS and /etc/hosts. After my check
things seem to be OK (I don't believe this can be the cause,
because 2 days ago everything was OK and I haven't changed anything).
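For the /etc/hosts part of that check, a small sketch that prints every address a given host name maps to in a hosts file, so a missing or duplicated entry for the CE stands out. The function name hosts_addresses is hypothetical, and the comparison is case-insensitive on the assumption that host names should be treated that way.

```shell
# Print each address that NAME is mapped to in a hosts-format file.
# Comments are ignored; the match is case-insensitive.
hosts_addresses() {
    h_name=$1
    h_file=${2:-/etc/hosts}
    awk -v n="$h_name" '
        { sub(/#.*/, "") }          # drop comments before looking at fields
        {
            for (i = 2; i <= NF; i++)
                if (tolower($i) == tolower(n)) print $1
        }
    ' "$h_file"
}

# Usage on the CE and on each WN:
#   hosts_addresses xg009.inp.demokritos.gr
```

No output means the name is not in the file at all; more than one line means conflicting entries. Either way the result should agree with what DNS returns for the same name.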
After that I also tried the following two wiki pages, but they don't seem
to describe the problem either:
http://goc.grid.sinica.edu.tw/gocwiki/submit-helper_script_%2e%2e%2e_gave_error%3a_cache_export_dir_%2e%2e%2e
http://goc.grid.sinica.edu.tw/gocwiki/ssh_problem_from_WN_to_CE
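To test the WN-to-CE ssh path by hand, a non-interactive login can be tried from a WN as one of the pool accounts. The sketch below assumes OpenSSH on the WN (BatchMode and ConnectTimeout are OpenSSH client options); the function name and the SSH_BIN override are made up for this example. Batch mode matters because the scp -Br calls in the mom log quoted further down also run without any password prompt.

```shell
# ssh client to use; overridable so the function can be exercised without a network.
SSH_BIN=${SSH_BIN:-ssh}

# Try a login that must succeed without any password or passphrase prompt.
check_batch_ssh() {
    target=$1
    if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=10 "$target" /bin/hostname \
        >/dev/null 2>&1; then
        echo "OK: non-interactive ssh to $target works"
    else
        echo "FAIL: non-interactive ssh to $target failed"
        return 1
    fi
}

# Usage on a WN, as a pool account such as dteam001:
#   check_batch_ssh xg009.inp.demokritos.gr
```

A FAIL here, while an interactive login with a password still works, would suggest a host-based or key authentication problem rather than DNS.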
any ideas?
regards
xristos
> Hi,
>
> Our findings were the same as yours, and we had already gone through
> all the remedies proposed on the GOC wiki, but the problem is not fixed.
>
> Our site was running without any problem and its status was 'OK'. The
> problem started just after we applied the open-ssh-GSSI patch.
>
> Could you please propose a remedy that solves our problem?
>
> Cheers,
> Asif Osman
>
>
>
>
> -----Original Message-----
> From: LHC Computer Grid - Rollout on behalf of Alessandro Paolini
> Sent: Thu 11/24/2005 10:14 AM
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] JS error on CE.pakgrid.org.pk
>
> Hi,
> I've just tried to submit a simple job but it ended up aborted :-(
> From the logging-info:
>
> ---
> Event: Done
> - exit_code = 1
> - host = egee-rb-01.cnaf.infn.it
> - level = SYSTEM
> - priority = asynchronous
> - reason = Got a job held event, reason: Unspecified gridmanager error
> - seqcode = UI=000003:NS=0000000003:WM=000016:BH=0000000000:JSS=000012:LM=000031:LRMS=000000:APP=000000
> - source = LogMonitor
> - src_instance = unique
> - status_code = FAILED
> - timestamp = Thu Nov 24 09:02:21 2005
> - user = /C=IT/O=INFN/OU=Personal
> Certificate/L=CNAF/CN=alessandro
> [log in to unmask]
> ---
> Event: Done
> - exit_code = 1
> - host = egee-rb-01.cnaf.infn.it
> - level = SYSTEM
> - priority = asynchronous
> - reason = Job got an error while in the CondorG queue.
> - seqcode = UI=000003:NS=0000000003:WM=000016:BH=0000000000:JSS=000012:LM=000033:LRMS=000000:APP=000000
> - source = LogMonitor
> - src_instance = unique
> - status_code = FAILED
> - timestamp = Thu Nov 24 09:02:32 2005
> - user = /C=IT/O=INFN/OU=Personal
> Certificate/L=CNAF/CN=alessandro
> [log in to unmask]
> ---
>
> So it seems to be an ssh problem between the CE and the WNs, but it could
> also be some CRLs that are out of date.
>
> When I launched these commands:
> [paolini@lcg-ui paolini]$ globus-job-run CE.pakgrid.org.pk/jobmanager-lcgpbs -queue dteam /bin/hostname
> [paolini@lcg-ui paolini]$ globus-job-run CE.pakgrid.org.pk/jobmanager-lcgpbs /bin/hostname
>
> no answer was returned.
>
> The CE identifies me as dteam006.
>
> Take a look at these FAQs:
>
> http://goc.grid.sinica.edu.tw/gocwiki/submit-helper_script_%2e%2e%2e_gave_error%3a_cache_export_dir_%2e%2e%2e
>
> http://goc.grid.sinica.edu.tw/gocwiki/ssh_problem_from_WN_to_CE
>
> Cheers,
> Alessandro
>
> Asif Osman wrote:
>
>>Hi,
>>
>>We are getting a JS error in the GOC database.
>>The reason seems to be an authentication failure in WN-to-CE
>>communication. The sequence of events is as follows:
>>1) A job is submitted from the UI
>>2) It lands on CE.pakgrid.org.pk
>>3) The CE submits the job to a WN, where it is executed without any problem
>>4) When the job finishes, the WN tries to copy the stderr and stdout files to the CE
>>
>>At this stage, the authentication fails and the job is aborted.
>>
>>The following error messages are logged in the file:
>>/var/spool/pbs/mom_logs/20051124
>>
>>===========================================================================================================
>>11/24/2005 10:17:04;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp
>> -Br
>> [log in to unmask]:/home/dteam001/.lcgjm/globus-cache-export.k22183/globus-cache-export.k22183.gpg
globus-cache-export.k22183.gpg status=1 (copy request failed), try=3
>>11/24/2005 10:17:35;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/dteam001/.lcgjm/globus-cache-export.k22183/globus-cache-export.k22183.gpg
globus-cache-export.k22183.gpg status=1 (copy request failed), try=4
>>11/24/2005 10:17:42;0004;
>> pbs_mom;Fil;globus-cache-export.k22183.gpg;Unable to copy file
>> globus-cache-export.k22183.gpg from
>> CE.pakgrid.org.pk:/home/dteam001/.lcgjm/globus-cache-export.k22183/globus-cache-export.k22183.gpg
>>11/24/2005 10:17:42;0004;
>> pbs_mom;Fil;globus-cache-export.k22183.gpg;CE.pakgrid.org.pk:
Connection refused
>>11/24/2005 10:17:42;0008; pbs_mom;Req;del_files;cannot stat
>> globus-cache-export.k22183.gpg
>>11/24/2005 10:17:46;0080; pbs_mom;Req;req_reject;Reject reply
>> code=15001, aux=0, type=11, from [log in to unmask]
10:17:46;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
[log in to unmask]:/home/lhcb001/.lcgjm/globus-cache-export.U22611/globus-cache-export.U22611.gpg
globus-cache-export.U22611.gpg status=1 (copy request failed), try=1
>>11/24/2005 10:18:17;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/lhcb001/.lcgjm/globus-cache-export.U22611/globus-cache-export.U22611.gpg
globus-cache-export.U22611.gpg status=1 (copy request failed), try=2
>>11/24/2005 10:18:21;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp
>> -Br
>> [log in to unmask]:/home/lhcb001/.lcgjm/globus-cache-export.U22611/globus-cache-export.U22611.gpg
globus-cache-export.U22611.gpg status=1 (copy request failed), try=3
>>11/24/2005 10:18:52;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/lhcb001/.lcgjm/globus-cache-export.U22611/globus-cache-export.U22611.gpg
globus-cache-export.U22611.gpg status=1 (copy request failed), try=4
>>11/24/2005 10:18:59;0004;
>> pbs_mom;Fil;globus-cache-export.U22611.gpg;Unable to copy file
>> globus-cache-export.U22611.gpg from
>> CE.pakgrid.org.pk:/home/lhcb001/.lcgjm/globus-cache-export.U22611/globus-cache-export.U22611.gpg
>>11/24/2005 10:18:59;0004;
>> pbs_mom;Fil;globus-cache-export.U22611.gpg;CE.pakgrid.org.pk:
Connection refused
>>11/24/2005 10:18:59;0008; pbs_mom;Req;del_files;cannot stat
>> globus-cache-export.U22611.gpg
>>11/24/2005 10:22:48;0080; pbs_mom;Req;req_reject;Reject reply
>> code=15001, aux=0, type=11, from [log in to unmask]
10:22:48;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
[log in to unmask]:/home/lhcb001/.lcgjm/globus-cache-export.Q24347/globus-cache-export.Q24347.gpg
globus-cache-export.Q24347.gpg status=1 (copy request failed), try=1
>>11/24/2005 10:23:19;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/lhcb001/.lcgjm/globus-cache-export.Q24347/globus-cache-export.Q24347.gpg
globus-cache-export.Q24347.gpg status=1 (copy request failed), try=2
>>11/24/2005 10:23:23;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp
>> -Br
>> [log in to unmask]:/home/lhcb001/.lcgjm/globus-cache-export.Q24347/globus-cache-export.Q24347.gpg
globus-cache-export.Q24347.gpg status=1 (copy request failed), try=3
>>11/24/2005 10:23:54;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/lhcb001/.lcgjm/globus-cache-export.Q24347/globus-cache-export.Q24347.gpg
globus-cache-export.Q24347.gpg status=1 (copy request failed), try=4
>>11/24/2005 10:24:01;0004;
>> pbs_mom;Fil;globus-cache-export.Q24347.gpg;Unable to copy file
>> globus-cache-export.Q24347.gpg from
>> CE.pakgrid.org.pk:/home/lhcb001/.lcgjm/globus-cache-export.Q24347/globus-cache-export.Q24347.gpg
>>11/24/2005 10:24:01;0004;
>> pbs_mom;Fil;globus-cache-export.Q24347.gpg;CE.pakgrid.org.pk:
Connection refused
>>11/24/2005 10:24:01;0008; pbs_mom;Req;del_files;cannot stat
>> globus-cache-export.Q24347.gpg
>>11/24/2005 11:04:18;0080; pbs_mom;Req;req_reject;Reject reply
>> code=15001, aux=0, type=11, from [log in to unmask]
11:04:18;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
[log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.gi1902/globus-cache-export.gi1902.gpg
globus-cache-export.gi1902.gpg status=1 (copy request failed), try=1
>>11/24/2005 11:04:49;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.gi1902/globus-cache-export.gi1902.gpg
globus-cache-export.gi1902.gpg status=1 (copy request failed), try=2
>>11/24/2005 11:04:53;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/bin/scp -Br
>> [log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.gi1902/globus-cache-export.gi1902.gpg
globus-cache-export.gi1902.gpg status=1 (copy request failed), try=3
>>11/24/2005 11:05:24;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.gi1902/globus-cache-export.gi1902.gpg
globus-cache-export.gi1902.gpg status=1 (copy request failed), try=4
>>11/24/2005 11:05:31;0004;
>> pbs_mom;Fil;globus-cache-export.gi1902.gpg;Unable to copy file
>> globus-cache-export.gi1902.gpg from
>> CE.pakgrid.org.pk:/home/cms001/.lcgjm/globus-cache-export.gi1902/globus-cache-export.gi1902.gpg
>>11/24/2005 11:05:31;0004;
>> pbs_mom;Fil;globus-cache-export.gi1902.gpg;CE.pakgrid.org.pk:
Connection refused
>>11/24/2005 11:05:31;0004;
>> pbs_mom;Fil;globus-cache-export.gi1902.gpg;lure
>>11/24/2005 11:05:31;0004; pbs_mom;Fil;globus-cache-export.gi1902.gpg;No
>> such file or directory
>>11/24/2005 11:05:31;0004; pbs_mom;Fil;globus-cache-export.gi1902.gpg;
11/24/2005 11:05:31;0004;
>> pbs_mom;Fil;globus-cache-export.gi1902.gpg;Permission denied
>> (external-keyx,gssapi,publickey,password,keyboard-interactive).
>>11/24/2005 11:05:31;0008; pbs_mom;Req;del_files;cannot stat
>> globus-cache-export.gi1902.gpg
>>11/24/2005 11:09:22;0080; pbs_mom;Req;req_reject;Reject reply
>> code=15001, aux=0, type=11, from [log in to unmask]
11:09:22;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
[log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.xG3306/globus-cache-export.xG3306.gpg
globus-cache-export.xG3306.gpg status=1 (copy request failed), try=1
>>11/24/2005 11:09:53;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.xG3306/globus-cache-export.xG3306.gpg
globus-cache-export.xG3306.gpg status=1 (copy request failed), try=2
>>11/24/2005 11:09:57;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/bin/scp -Br
>> [log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.xG3306/globus-cache-export.xG3306.gpg
globus-cache-export.xG3306.gpg status=1 (copy request failed), try=3
>>11/24/2005 11:10:29;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.xG3306/globus-cache-export.xG3306.gpg
globus-cache-export.xG3306.gpg status=1 (copy request failed), try=4
>>11/24/2005 11:10:36;0004;
>> pbs_mom;Fil;globus-cache-export.xG3306.gpg;Unable to copy file
>> globus-cache-export.xG3306.gpg from
>> CE.pakgrid.org.pk:/home/cms001/.lcgjm/globus-cache-export.xG3306/globus-cache-export.xG3306.gpg
>>11/24/2005 11:10:36;0004;
>> pbs_mom;Fil;globus-cache-export.xG3306.gpg;CE.pakgrid.org.pk:
Connection refused
>>11/24/2005 11:10:36;0004;
>> pbs_mom;Fil;globus-cache-export.xG3306.gpg;lure
>>11/24/2005 11:10:36;0004; pbs_mom;Fil;globus-cache-export.xG3306.gpg;No
>> such file or directory
>>11/24/2005 11:10:36;0004; pbs_mom;Fil;globus-cache-export.xG3306.gpg;
11/24/2005 11:10:36;0004;
>> pbs_mom;Fil;globus-cache-export.xG3306.gpg;Permission denied
>> (external-keyx,gssapi,publickey,password,keyboard-interactive).
>>11/24/2005 11:10:36;0008; pbs_mom;Req;del_files;cannot stat
>> globus-cache-export.xG3306.gpg
>>11/24/2005 11:14:19;0080; pbs_mom;Req;req_reject;Reject reply
>> code=15001, aux=0, type=11, from [log in to unmask]
11:14:19;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
[log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.VZ4905/globus-cache-export.VZ4905.gpg
globus-cache-export.VZ4905.gpg status=1 (copy request failed), try=1
>>11/24/2005 11:14:50;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.VZ4905/globus-cache-export.VZ4905.gpg
globus-cache-export.VZ4905.gpg status=1 (copy request failed), try=2
>>11/24/2005 11:14:54;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/bin/scp -Br
>> [log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.VZ4905/globus-cache-export.VZ4905.gpg
globus-cache-export.VZ4905.gpg status=1 (copy request failed), try=3
>>11/24/2005 11:15:26;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.VZ4905/globus-cache-export.VZ4905.gpg
globus-cache-export.VZ4905.gpg status=1 (copy request failed), try=4
>>11/24/2005 11:15:33;0004;
>> pbs_mom;Fil;globus-cache-export.VZ4905.gpg;Unable to copy file
>> globus-cache-export.VZ4905.gpg from
>> CE.pakgrid.org.pk:/home/cms001/.lcgjm/globus-cache-export.VZ4905/globus-cache-export.VZ4905.gpg
>>11/24/2005 11:15:33;0004;
>> pbs_mom;Fil;globus-cache-export.VZ4905.gpg;CE.pakgrid.org.pk:
Connection refused
>>11/24/2005 11:15:33;0004;
>> pbs_mom;Fil;globus-cache-export.VZ4905.gpg;lure
>>11/24/2005 11:15:33;0004; pbs_mom;Fil;globus-cache-export.VZ4905.gpg;No
>> such file or directory
>>11/24/2005 11:15:33;0004; pbs_mom;Fil;globus-cache-export.VZ4905.gpg;
11/24/2005 11:15:33;0004;
>> pbs_mom;Fil;globus-cache-export.VZ4905.gpg;Permission denied
>> (external-keyx,gssapi,publickey,password,keyboard-interactive).
>>11/24/2005 11:15:33;0008; pbs_mom;Req;del_files;cannot stat
>> globus-cache-export.VZ4905.gpg
>>11/24/2005 11:19:21;0080; pbs_mom;Req;req_reject;Reject reply
>> code=15001, aux=0, type=11, from [log in to unmask]
11:19:21;0080; pbs_mom;Fil;sys_copy;command: /usr/bin/scp -Br
[log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.JC6269/globus-cache-export.JC6269.gpg
globus-cache-export.JC6269.gpg status=1 (copy request failed), try=1
>>11/24/2005 11:19:52;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.JC6269/globus-cache-export.JC6269.gpg
globus-cache-export.JC6269.gpg status=1 (copy request failed), try=2
>>11/24/2005 11:19:56;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/bin/scp -Br
>> [log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.JC6269/globus-cache-export.JC6269.gpg
globus-cache-export.JC6269.gpg status=1 (copy request failed), try=3
>>11/24/2005 11:20:27;0080; pbs_mom;Fil;sys_copy;command:
>> /usr/sbin/pbs_rcp -r
>> [log in to unmask]:/home/cms001/.lcgjm/globus-cache-export.JC6269/globus-cache-export.JC6269.gpg
globus-cache-export.JC6269.gpg status=1 (copy request failed), try=4
>>11/24/2005 11:20:34;0004;
>> pbs_mom;Fil;globus-cache-export.JC6269.gpg;Unable to copy file
>> globus-cache-export.JC6269.gpg from
>> CE.pakgrid.org.pk:/home/cms001/.lcgjm/globus-cache-export.JC6269/globus-cache-export.JC6269.gpg
>>11/24/2005 11:20:34;0004;
>> pbs_mom;Fil;globus-cache-export.JC6269.gpg;CE.pakgrid.org.pk:
Connection refused
>>11/24/2005 11:20:34;0004;
>> pbs_mom;Fil;globus-cache-export.JC6269.gpg;lure
>>11/24/2005 11:20:34;0004; pbs_mom;Fil;globus-cache-export.JC6269.gpg;No
>> such file or directory
>>11/24/2005 11:20:34;0004; pbs_mom;Fil;globus-cache-export.JC6269.gpg;
11/24/2005 11:20:34;0004;
>> pbs_mom;Fil;globus-cache-export.JC6269.gpg;Permission denied
>> (external-keyx,gssapi,publickey,password,keyboard-interactive).
>>11/24/2005 11:20:34;0008; pbs_mom;Req;del_files;cannot stat
>> globus-cache-export.JC6269.gpg
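A log like the one above gets long quickly, so a short summary per job can help when comparing WNs. The sketch below counts the "copy request failed" entries per globus-cache-export ID. It assumes it is run on the original mom log file (for example /var/spool/pbs/mom_logs/20051124), where each entry is a single line rather than wrapped as in this mail; the function name is made up for this example.

```shell
# Count "copy request failed" entries per globus-cache-export ID.
# Reads the given log files, or stdin if none are given.
summarize_copy_failures() {
    awk '
        /copy request failed/ {
            # The first match is the cache-export directory name in the path.
            if (match($0, /globus-cache-export\.[A-Za-z0-9]+/)) {
                id = substr($0, RSTART, RLENGTH)
                count[id]++
            }
        }
        END { for (id in count) print id, count[id] }
    ' "$@" | LC_ALL=C sort
}

# Usage on a WN:
#   summarize_copy_failures /var/spool/pbs/mom_logs/20051124
```

Each output line is an ID followed by the number of failed attempts, which makes it easy to see whether every job on a WN fails or only some of them.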
>>
>>Any solution?
>>Cheers,
>>Asif Osman
>>
>>
>
>
> --
> Dr. Alessandro Paolini
> INFN - CNAF
> Viale Berti Pichat 6/2
> 40127 Bologna
> Italy
> tel: +39 051 6092723
> fax: +39 051 6092746
> ICQ: 192172027
>
Christos Filippidis
NCSR DEMOKRITOS
Institute of Nuclear Physics
office block 6(ktirion 6)
Gr-15310 Agia Paraskevi
GREECE
Tel:2106503425
http://consult.cern.ch/xwho/people/117002
http://www.inp.demokritos.gr/~filippidisx/
----------------------------------------------
"Institute of Nuclear Physics NCSR Demokritos"
http://www.inp.demokritos.gr/