Hi Maarten,
On 25.10.2012 at 16:43, Maarten Litmaath <[log in to unmask]> wrote:
> A race with what? If commands like which, grid-proxy-info and globus-url-copy
> cannot be found, it would "normally" mean the PATH has been screwed up:
> you need to debug that, it may well explain various issues you are seeing
> and who knows what more trouble it can bring!
No harm meant, but imagine what I have been doing for three days now, from about 7:30 a.m. until 11 p.m. It's not easy when you have to understand the full set of Perl, shell scripts and Java code inside CREAM. And I just did a fresh CE install with YAIM and am using worker nodes that used to run fine (we were one of the most stable sites in the German ATLAS cloud), so I didn't expect this much trouble.
I tried to take the bull by the horns and am seeking advice on this list about where to look.
Fact is:
Auger jobs run
ATLAS Analysis jobs run
ATLAS production jobs don't (sometimes).
The failures are distributed evenly over all worker nodes (not a single black hole).
I could boil it down to this: PATH is sometimes modified, at times heavily, while all other variables from grid-env.sh stay intact:
---- JOB Environment ----
APFCID=8418744.11
APFFID=voatlas213
APFMON=http://apfmon.lancs.ac.uk/mon/
ARC=lx24-amd64
ATLAS_LOCAL_AREA=/gridsoft/atlas-cvmfs/local
_=/bin/env
CE_ID=cream-ce.physik.uni-wuppertal.de:8443/cream-sge-atlasprd.q
__copy_proxy_min_retry_wait=60
CREAM_JOBID=https://cream-ce.physik.uni-wuppertal.de:8443/CREAM779428483
__delegationProxyCertSandboxPath=gsiftp://cream-ce.physik.uni-wuppertal.de/var/cream_sandbox/atlasprd/CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL_atlasprd005/proxy/1351163632_573523_14052430460278
__delegationProxyCertSandboxPathTmp=/tmp/1351163632_573523_14052430460278779428483
__delegationTimeSlot=3600
EDG_LOCATION=/opt/edg
ENVIRONMENT=BATCH
FACTORYQUEUE=wuppertalprod_cream-ce
G_BROKEN_FILENAMES=1
GLITE_ENV_SET=TRUE
GLITE_LOCATION=/opt/glite
GLITE_LOCATION_VAR=/opt/glite/var
GLITE_WMS_JOBID=N/A
GLITE_WMS_LOCATION=/opt/glite
GLITE_WMS_LOG_DESTINATION=cream-ce.physik.uni-wuppertal.de
GLITE_WMS_SEQUENCE_CODE=
GLOBUS_LOCATION=/opt/globus
GRID_ENV_LOCATION=/opt/glite/etc/profile.d
GRID_JOBID=N/A
GTAG=http://voatlas213.cern.ch/pilots/2012-10-25/wuppertalprod_cream-ce/8418744.11.out
GT_PROXY_MODE=old
HISTSIZE=1000
HOME=/home/atlasprd005/home_cream_779428483
HOSTNAME=wn004
INPUTRC=/etc/inputrc
JOB_ID=1334000
JOB_NAME=cream_779428483
JOB_SCRIPT=/sge-root/default/spool/wn004/job_scripts/1334000
LANG=de_DE.utf-8
LC_COLLATE=de_DE.utf-8
LC_CTYPE=de_DE.utf-8
LCG_GFAL_INFOSYS=bdii-fzk.gridka.de:2170
LCG_LOCATION=/opt/lcg
LC_MESSAGES=de_DE.utf-8
LC_MONETARY=de_DE.utf-8
LC_NUMERIC=de_DE.utf-8
LC_TIME=de_DE.utf-8
LD_LIBRARY_PATH=/opt/d-cache/dcap/lib:/opt/d-cache/dcap/lib64:/opt/glite/lib:/opt/glite/lib64:/opt/globus/lib:/opt/lcg/lib:/opt/lcg/lib64:/opt/classads/lib64/:/opt/c-ares/lib/
LESSOPEN=|/usr/bin/lesspipe.sh %s
LOGNAME=atlasprd005
LS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:
MAIL=/var/spool/mail/atlasprd005
MANPATH=/opt/edg/share/man:/opt/glite/share/man:/opt/glite/yaim/man:/opt/globus/man:/opt/lcg/man:/opt/lcg/share/man::::::
MYPROXY_SERVER=grid-px0.desy.de
NHOSTS=1
NQUEUES=1
NSLOTS=1
OLDPWD=/tmp/1334000.1.atlasprd.q/condorg_ipS30711
PANDA_JSID=voatlas213
PATH=/usr/kerberos/bin:/tmp/1334000.1.atlasprd.q:/usr/local/bin:/bin:/usr/bin:/home/atlasprd005/bin
PERL5LIB=/opt/lcg/lib64/perl:/opt/gpt/lib/perl
PWD=/tmp/1334000.1.atlasprd.q/condorg_ipS30711/pilot3
PYTHONPATH=/opt/glite/lib64/python2.4/site-packages:/opt/glite/lib/python:/opt/lcg/lib64/python2.4/site-packages:/opt/lcg/lib64/python
QUEUE=atlasprd.q
REQNAME=cream_779428483
REQUEST=cream_779428483
RESTARTED=0
RUCIO_ACCOUNT=pilot
SGE_ACCOUNT=sge
SGE_ARCH=lx24-amd64
SGE_BINARY_PATH=/sge-root/bin/lx24-amd64
SGE_CELL=default
SGE_CWD_PATH=/home/atlasprd005
SGE_JOB_SPOOL_DIR=/sge-root/default/spool/wn004/active_jobs/1334000.1
SGE_O_HOME=/home/atlasprd005
SGE_O_HOST=cream-ce
SGE_O_LOGNAME=atlasprd005
SGE_O_MAIL=/var/spool/mail/tomcat
SGE_O_PATH=/sge-root/bin/lx24-amd64:/sbin:/bin:/usr/sbin:/usr/bin
SGE_O_SHELL=/sbin/nologin
SGE_O_WORKDIR=/var/tmp
SGE_ROOT=/sge-root
[log in to unmask]:/var/cream_sandbox/atlasprd/CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL_atlasprd005/77/CREAM779428483/CREAM779428483_jobWrapper.sh@@@[log in to unmask]:/var/cream_sandbox/atlasprd/CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL_atlasprd005/proxy/1351163632_573523_14052430460278
[log in to unmask]:/var/cream_sandbox/atlasprd/CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL_atlasprd005/77/CREAM779428483/StandardOutput@@@[log in to unmask]:/var/cream_sandbox/atlasprd/CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL_atlasprd005/77/CREAM779428483/StandardError
SGE_STDERR_PATH=/home/atlasprd005/cream_779428483.e1334000
SGE_STDIN_PATH=/dev/null
SGE_STDOUT_PATH=/home/atlasprd005/cream_779428483.o1334000
SGE_TASK_FIRST=undefined
SGE_TASK_ID=undefined
SGE_TASK_LAST=undefined
SGE_TASK_STEPSIZE=undefined
SHELL=/bin/bash
SHLVL=6
SITE_GIIS_URL=grid-bdii.physik.uni-wuppertal.de
SITE_NAME=wuppertalprod
SRM_PATH=/opt/d-cache/srm
SSH_CLIENT=132.195.125.4 33279 22
SSH_CONNECTION=132.195.125.4 33279 132.195.125.14 22
SSH_TTY=/dev/pts/0
TERM=xterm
TMPDIR=/tmp/1334000.1.atlasprd.q
TMP=/tmp/1334000.1.atlasprd.q
USER=atlasprd005
VO_ATLAS_DEFAULT_SE=grid-se.physik.uni-wuppertal.de
VO_ATLAS_SW_DIR=/cvmfs/atlas.cern.ch/repo/sw
VO_AUGER_DEFAULT_SE=grid-se.physik.uni-wuppertal.de
VO_AUGER_SW_DIR=/gridsoft/auger
VO_DECH_DEFAULT_SE=scaise-2.scai.fraunhofer.de
VO_DECH_SW_DIR=/gridsoft/dech
VO_DTEAM_DEFAULT_SE=grid-se.physik.uni-wuppertal.de
VO_DTEAM_SW_DIR=/gridsoft/dteam
VO_GHEP_DEFAULT_SE=grid-se.physik.uni-wuppertal.de
VO_GHEP_SW_DIR=/gridsoft/ghep
VO_ICECUBE_DEFAULT_SE=grid-se.physik.uni-wuppertal.de
VO_ICECUBE_SW_DIR=/gridsoft/icecube
VO_OPS_DEFAULT_SE=grid-se.physik.uni-wuppertal.de
VO_OPS_SW_DIR=/gridsoft/ops
X509_USER_PROXY=/home/atlasprd005/home_cream_779428483/cream_779428483.proxy
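For what it's worth, here is a sketch of the kind of defensive prologue one could put at the top of the job wrapper to log and repair the PATH before anything else runs. The /opt/glite and /opt/globus paths are taken from the environment above; the log file name and the overall approach are just an assumption for illustration, not the actual CREAM wrapper code:

```shell
# Hypothetical diagnostic prologue (an illustration, not the real CREAM
# wrapper): record the PATH the batch system hands the job, then make
# sure the grid bin directories are present before the payload starts.

LOG="/tmp/path-debug.${JOB_ID:-unknown}.log"   # JOB_ID is set by SGE
echo "PATH at job start: $PATH" >> "$LOG"

# Re-source the glite profile if it is readable, so that which,
# grid-proxy-info and globus-url-copy can be found again.
if [ -r /opt/glite/etc/profile.d/grid-env.sh ]; then
    . /opt/glite/etc/profile.d/grid-env.sh
fi

# If the grid bin directories are still missing, prepend them explicitly.
case ":$PATH:" in
    *:/opt/glite/bin:*) ;;   # already present, nothing to do
    *) PATH="/opt/glite/bin:/opt/globus/bin:$PATH" ;;
esac
export PATH
```

Comparing the logged PATH against SGE_O_PATH for failed versus successful jobs should at least show at which point the modification happens.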
I'm in contact with LMU Munich now, who also run an EMI-2 CE with SGE. They had similar trouble and made some changes to their submit scripts to work around it; I will try that now.
Regards,
Torsten
--
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<> <>
<> Dr. Torsten Harenberg [log in to unmask] <>
<> Bergische Universitaet <>
<> FB C - Physik Tel.: +49 (0)202 439-3521 <>
<> Gaussstr. 20 Fax : +49 (0)202 439-2811 <>
<> 42097 Wuppertal <>
<> <>
<><><><><><><>< Of course it runs NetBSD http://www.netbsd.org ><>