Hi Roberto,
thanks a lot for your analysis.
I'm still digging through the SGE submit scripts, but let me quickly go through your "checklist":
On 25.10.2012 at 10:57, Roberto Rosende Dopazo <[log in to unmask]> wrote:
> $ grep 780235605 glite-ce-cream.log.2
> 25 Oct 2012 08:46:27,074 org.glite.ce.cream.jobmanagement.db.table.JobTable - Job inserted. JobId = CREAM780235605
> 25 Oct 2012 08:46:27,896 org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor - JOB CREAM780235605 STATUS CHANGED: -- => REGISTERED [localUser=auger007] [gridJobId=https://lb01.ncg.ingrid.pt:9000/EMJNdf69pIJE02JkFNzdMw] [delegationId=13510740812E634829wms012Encg2Eingrid2Ept]
> 25 Oct 2012 08:46:29,390 org.glite.ce.cream.cmdmanagement.CommandManager - new command [NAME="JOB_START"; PRIORITY_LEVEL=1; IS_ASYNCHRONOUS=true; STATUS=ACCEPTED; CATEGORY="JOB_MANAGEMENT"; EXECUTOR_NAME="BLAHExecutor"; USER_ID="CN_Julio_Lozano_Bahilo_O_ugr_DC_irisgrid_DC_es_auger_Role_Production_Capability_NULL"; CREATION_TIME="Thu Oct 25 08:46:29 CEST 2012"; REMOTE_REQUEST_ADDRESS="193.136.75.1"; JOB_ID_LIST={ CREAM780235605 }; IS_ADMIN="false"; USER_FQAN={ /auger/Role=Production/Capability=NULL; /auger/Role=NULL/Capability=NULL }; USER_DN="CN=Julio.Lozano.Bahilo,O=ugr,DC=irisgrid,DC=es"]
> 25 Oct 2012 08:46:30,141 org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor - JOB CREAM780235605 STATUS CHANGED: REGISTERED => PENDING [localUser=auger007] [gridJobId=https://lb01.ncg.ingrid.pt:9000/EMJNdf69pIJE02JkFNzdMw] [delegationId=13510740812E634829wms012Encg2Eingrid2Ept]
> 25 Oct 2012 08:46:34,851 org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor - JOB CREAM780235605 STATUS CHANGED: PENDING => IDLE [localUser=auger007] [gridJobId=https://lb01.ncg.ingrid.pt:9000/EMJNdf69pIJE02JkFNzdMw] [delegationId=13510740812E634829wms012Encg2Eingrid2Ept]
> 25 Oct 2012 08:46:34,855 org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor - ID=9857; NAME="JOB_START"; PRIORITY_LEVEL=1; IS_ASYNCHRONOUS=true; STATUS=EXECUTING; CATEGORY="JOB_MANAGEMENT"; USER_ID="CN_Julio_Lozano_Bahilo_O_ugr_DC_irisgrid_DC_es_auger_Role_Production_Capability_NULL"; CREATION_TIME="Thu Oct 25 08:46:29 CEST 2012"; START_PROCESSING_TIME="Thu Oct 25 08:46:29 CEST 2012"; JOB_ID_LIST="CREAM780235605"; IS_ADMIN="false"; REMOTE_REQUEST_ADDRESS="193.136.75.1"; USER_DN="CN=Julio.Lozano.Bahilo,O=ugr,DC=irisgrid,DC=es"; USER_FQAN={ /auger/Role=Production/Capability=NULL; /auger/Role=NULL/Capability=NULL } lrmsAbsJobId=sge/20121025084632/1331321;
> 25 Oct 2012 08:46:48,259 org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor - JOB CREAM780235605 STATUS CHANGED: IDLE => RUNNING [localUser=auger007] [gridJobId=https://lb01.ncg.ingrid.pt:9000/EMJNdf69pIJE02JkFNzdMw] [delegationId=13510740812E634829wms012Encg2Eingrid2Ept]
> 25 Oct 2012 08:47:03,742 org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor - JOB CREAM780235605 STATUS CHANGED: RUNNING => DONE-FAILED [failureReason=Cannot find gridftp remove application] [exitCode=W] [localUser=auger007] [gridJobId=https://lb01.ncg.ingrid.pt:9000/EMJNdf69pIJE02JkFNzdMw][workerNode=wn139] [delegationId=13510740812E634829wms012Encg2Eingrid2Ept]
> 25 Oct 2012 08:48:04,554 org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor - JOB CREAM780235605 STATUS UPDATED: DONE-FAILED
> 25 Oct 2012 08:48:17,345 org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor - JOB CREAM780235605 STATUS UPDATED: DONE-FAILED
This is yet another case where these "which" commands simply fail, and I have no idea why. As I wrote before, we have suffered from this for quite some time. Since it is not reproducible and rather looks like a race condition, I will simply "fix" the path in the submit script for this case as well. It should not happen anymore from now on.
[auger007@wn139 ~]$ ls -l *780235605*
-rw-r--r-- 1 auger007 auger 0 Oct 25 08:46 cream_780235605.e1331321
-rw-r--r-- 1 auger007 auger 0 Oct 25 08:46 cream_780235605.o1331321
-rw-r--r-- 1 auger007 auger 39 Oct 25 08:47 err_cream_780235605_StandardError
-rw-r--r-- 1 auger007 auger 143 Oct 25 08:47 out_cream_780235605_StandardOutput
[auger007@wn139 ~]$ cat err_cream_780235605_StandardError
Cannot find gridftp remove application
[auger007@wn139 ~]$ cat out_cream_780235605_StandardOutput
lcg-jobwrapper-hook.sh not readable or not present
LM_log_done_begin Cannot find gridftp remove application LM_log_done_end
jw exit status = 1
[auger007@wn139 ~]$ which uberftp
/opt/globus/bin/uberftp
[auger007@wn139 ~]$
The error message comes from this code:
gridftp_rm_command=`which uberftp 2>/dev/null`
if [ -x "$gridftp_rm_command" ]; then
  majorver=`$gridftp_rm_command -version | perl -nle 'print $1 if /(\d+)+(\.)*/'`
  gridftp_option=
  if [ $majorver = 1 ]; then
    gridftp_option="-a gsi"
  fi
  gridftp_rm_cmdline="${gridftp_rm_command} ${__token_hostname} $gridftp_option \"quote dele ${__token_fullpath}\""
fi
if [ ! -n "${gridftp_rm_cmdline}" ]; then
  for gridftp_rm_command in ${GLITE_LOCATION:-/opt/glite}/bin/glite-gridftp-rm \
      `which glite-gridftp-rm 2>/dev/null` \
      /usr/bin/glite-gridftp-rm ; do
    if [ -x "$gridftp_rm_command" ]; then
      gridftp_rm_cmdline="${gridftp_rm_command} ${__token_file}"
      break;
    fi
  done
fi
if [ ! -n "${gridftp_rm_cmdline}" ]; then
  fatal_error "Cannot find gridftp remove application"
fi
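A minimal sketch of what a fixed-path fallback for this lookup could look like, assuming the PATH lookup intermittently returns nothing. The helper name find_tool is hypothetical and not part of the wrapper; /opt/globus/bin/uberftp is simply the location reported on wn139 above. This is only an illustration, not the actual change to the submit script:

```shell
#!/bin/sh
# Hypothetical helper (not part of the original wrapper): resolve a tool
# via PATH first, then fall back to fixed candidate locations.
find_tool() {
    tool_name=$1
    shift
    # command -v is the POSIX shell lookup; unlike `which` it does not
    # depend on an external program being found and executed.
    found=$(command -v "$tool_name" 2>/dev/null)
    if [ -n "$found" ] && [ -x "$found" ]; then
        printf '%s\n' "$found"
        return 0
    fi
    # PATH lookup gave nothing usable: try the fixed paths in order.
    for candidate in "$@"; do
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Assumed fallback location, taken from the `which uberftp` output on wn139.
gridftp_rm_command=$(find_tool uberftp /opt/globus/bin/uberftp) || gridftp_rm_command=
```

With something like this, the existing `[ -x "$gridftp_rm_command" ]` test would still decide whether uberftp is used, but an empty result from the PATH lookup alone would no longer lead to the fatal error.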
> Could you test gridftp?
> https://wiki.italiangrid.it/twiki/bin/view/CREAM/TroubleshootingGuide#1_4_Test_gridftp
Run from the pilot submission machine, so it uses the same proxy that Condor uses:
[pilot@pilot ~]$ export X509_USER_PROXY=/tmp/prodProxy
[pilot@pilot ~]$ uberftp cream-ce.physik.uni-wuppertal.de
220 cream-ce.physik.uni-wuppertal.de GridFTP Server 6.14 (gcc64, 1342551528-83) [Globus Toolkit 5.2.1] ready.
230 User atlasprd018 logged in.
uberftp> ls /etc
-rw------- 1 root root 0 Oct 19 13:29 .pwd.lock
-rw-r--r-- 1 root root 4439 Apr 17 15:03 DIR_COLORS
-rw-r--r-- 1 root root 5139 Apr 17 15:03 DIR_COLORS.256color
-rw-r--r-- 1 root root 4113 Apr 17 15:03 DIR_COLORS.lightbgcolor
[...]
> Maybe you can check also open ports?
> https://wiki.italiangrid.it/twiki/bin/view/CREAM/ServiceReferenceCard#Open_ports
I disabled all firewalls when the problems occurred:
[root@cream-ce cream]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
[root@cream-ce cream]#
I think both issues would also prevent Auger from running, and they already have >1000 jobs running...
Cheers
Torsten
--
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<> <>
<> Dr. Torsten Harenberg [log in to unmask] <>
<> Bergische Universitaet <>
<> FB C - Physik Tel.: +49 (0)202 439-3521 <>
<> Gaussstr. 20 Fax : +49 (0)202 439-2811 <>
<> 42097 Wuppertal <>
<> <>
<><><><><><><>< Of course it runs NetBSD http://www.netbsd.org ><>