Hi Maarten,
hi list,
I just want to wrap up this thread, as I think I have found the source of the problem.
I wrote a workaround that extends the PATH so that jobs succeed in any case, which gives me many more debugging possibilities. I captured wrapper scripts and job output throughout the night, and I now think the problem is entirely "SGE related".
Digging into the proc entry of one sge_shepherd:
23902 ? S 0:00 sge_shepherd-1346114 -bg
24077 ? SNs 0:00 -bash /sge-root/default/spool/wn160/job_scripts/1346114
[root@wn160 23902]# cat environ
MANPATH=/opt/edg/share/man:/opt/glite/share/man:/opt/glite/yaim/man:/opt/globus/man:/opt/lcg/man:/opt/lcg/share/man::::::LC_MONETARY=de_DE.utf-8HOSTNAME=wn160SHELL=/bin/bashTERM=xtermGRID_ENV_LOCATION=/opt/glite/etc/profile.dHISTSIZE=1000SSH_CLIENT=132.195.125.4 33861 22GLOBUS_LOCATION=/opt/globusPERL5LIB=/opt/lcg/lib64/perl:/opt/gpt/lib/perlVO_OPS_DEFAULT_SE=grid-se.physik.uni-wuppertal.deSGE_CELL=defaultGT_PROXY_MODE=oldLC_NUMERIC=de_DE.utf-8SSH_TTY=/dev/pts/0USER=rootLS_COLORS=no=00:fi=00:di=00;34:ln=00;36:pi=40;33:so=00;35:bd=40;33;01:cd=40;33;01:or=01;05;37;41:mi=01;05;37;41:ex=00;32:*.cmd=00;32:*.exe=00;32:*.com=00;32:*.btm=00;32:*.bat=00;32:*.sh=00;32:*.csh=00;32:*.tar=00;31:*.tgz=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.zip=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.bz=00;31:*.tz=00;31:*.rpm=00;31:*.cpio=00;31:*.jpg=00;35:*.gif=00;35:*.bmp=00;35:*.xbm=00;35:*.xpm=00;35:*.png=00;35:*.tif=00;35:LD_LIBRARY_PATH=/opt/d-cache/dcap/lib:/opt/d-cache/dcap/lib64:/opt/glite/lib:/opt/glite/lib64:/opt/globus/lib:/opt/lcg/lib:/opt/lcg/lib64:/opt/classads/lib64/:/opt/c-ares/lib/VO_GHEP_SW_DIR=/gridsoft/ghepVO_DTEAM_SW_DIR=/gridsoft/dteamLCG_LOCATION=/opt/lcgATLAS_LOCAL_AREA=/gridsoft/atlas-cvmfs/localVO_OPS_SW_DIR=/gridsoft/opsVO_ATLAS_DEFAULT_SE=grid-se.physik.uni-wuppertal.dePATH=/bin:/usr/bin:/sbin:/usr/sbinMAIL=/var/spool/mail/rootLC_MESSAGES=de_DE.utf-8LC_COLLATE=de_DE.utf-8VO_DTEAM_DEFAULT_SE=grid-se.physik.uni-wuppertal.deEDG_LOCATION=/opt/edgPWD=/rootINPUTRC=/etc/inputrcVO_AUGER_DEFAULT_SE=grid-se.physik.uni-wuppertal.deSITE_GIIS_URL=grid-bdii.physik.uni-wuppertal.deLANG=de_DE.utf-8VO_DECH_DEFAULT_SE=scaise-2.scai.fraunhofer.deSGE_ROOT=/sge-rootMYPROXY_SERVER=grid-px0.desy.deHOME=/rootSHLVL=2GLITE_LOCATION_VAR=/opt/glite/varVO_AUGER_SW_DIR=/gridsoft/augerGLITE_ENV_SET=TRUELOGNAME=rootPYTHONPATH=/opt/glite/lib64/python2.4/site-packages:/opt/glite/lib/python:/opt/lcg/lib64/python2.4/site-packages:/opt/lcg/lib64/pythonLCG_GFAL_INFOSYS=bdii-fzk.gridka.de:21
70LC_CTYPE=de_DE.utf-8SSH_CONNECTION=132.195.125.4 33861 132.195.125.170 22VO_GHEP_DEFAULT_SE=grid-se.physik.uni-wuppertal.deLESSOPEN=|/usr/bin/lesspipe.sh %sVO_ATLAS_SW_DIR=/cvmfs/atlas.cern.ch/repo/swVO_ICECUBE_SW_DIR=/gridsoft/icecubeGLITE_LOCATION=/opt/gliteLC_TIME=de_DE.utf-8VO_DECH_SW_DIR=/gridsoft/dechSITE_NAME=wuppertalprodG_BROKEN_FILENAMES=1SRM_PATH=/opt/d-cache/srmVO_ICECUBE_DEFAULT_SE=grid-se.physik.uni-wuppertal.de_=/sge-root/bin/lx24-amd64/sge_execd
As you can see, it contains everything set by /etc/profile.d/grid-env.sh.
Here's a node which doesn't have the problem:
30232 ? S 0:00 sge_shepherd-1329961 -bg
[root@wn158 30232]# cat environ
SELINUX_INIT=YESCONSOLE=/dev/consoleTERM=linuxSGE_CELL=defaultINIT_VERSION=sysvinit-2.86PATH=/bin:/usr/bin:/sbin:/usr/sbinRUNLEVEL=3runlevel=3PWD=/LANG=en_US.UTF-8SGE_ROOT=/sge-rootPREVLEVEL=Nprevious=NHOME=/SHLVL=2_=/sge-root/b
You see: no grid-related stuff, and in particular no GLITE_ENV_SET.
At the top of /etc/profile.d/grid-env.sh we have:
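For anyone who wants to repeat this check: the entries in /proc/<pid>/environ are separated by NUL bytes, which is why the dumps above run together. A minimal sketch for making them readable follows; the function names show_environ and has_env_var are my own and not part of SGE or gLite.

```shell
#!/bin/bash
# Print the environment of a process, one variable per line.
# /proc/<pid>/environ separates entries with NUL bytes, so translate
# each NUL into a newline.
show_environ() {
    tr '\0' '\n' < "/proc/$1/environ"
}

# Succeed if the variable named in $2 is present in the environment
# of the process with PID $1.
has_env_var() {
    show_environ "$1" | grep -q "^$2="
}
```

With the shepherd from the first listing, something like `has_env_var 23902 GLITE_ENV_SET && echo "grid environment inherited"` should flag an affected node.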
if [ "X${GLITE_ENV_SET+X}" = "X" ]; then
. /opt/glite/etc/profile.d/grid-env-funcs.sh
So if GLITE_ENV_SET is already set, the script will not define gridpath_prepend, which is used later on to set the PATH correctly:
gridpath_prepend "PATH" "/opt/lcg/bin"
gridpath_prepend "PATH" "/opt/globus/bin"
gridpath_prepend "PATH" "/opt/glite/bin"
gridpath_prepend "PATH" "/opt/edg/bin"
gridpath_prepend "PATH" "/opt/d-cache/srm/bin:/opt/d-cache/dcap/bin"
So PATH is left as PATH=/bin:/usr/bin:/sbin:/usr/sbin, plus whatever other scripts add.
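The effect can be reproduced without a grid installation. The following is only a sketch: fake_grid_env is a hypothetical stand-in for the excerpt quoted above, and its gridpath_prepend is a simplified replacement for whatever grid-env-funcs.sh really defines.

```shell
#!/bin/bash
# Hypothetical stand-in for the quoted part of grid-env.sh.
fake_grid_env() {
    if [ "X${GLITE_ENV_SET+X}" = "X" ]; then
        # Stand-in for sourcing grid-env-funcs.sh (simplified helper).
        gridpath_prepend() { eval "$1=\"$2:\${$1}\""; }
    fi
    # If the helper was never defined, this call fails and PATH is untouched.
    gridpath_prepend "PATH" "/opt/glite/bin" 2>/dev/null || true
    GLITE_ENV_SET=TRUE
}

# $1 = "clean" (started by init) or "inherited" (GLITE_ENV_SET already set).
path_after_profile() {
    (
        PATH=/bin:/usr/bin:/sbin:/usr/sbin
        unset -f gridpath_prepend
        if [ "$1" = "inherited" ]; then
            export GLITE_ENV_SET=TRUE
        else
            unset GLITE_ENV_SET
        fi
        fake_grid_env
        printf '%s\n' "$PATH"
    )
}

path_after_profile clean      # prints /opt/glite/bin:/bin:/usr/bin:/sbin:/usr/sbin
path_after_profile inherited  # prints /bin:/usr/bin:/sbin:/usr/sbin
```

The second call matches what we see in the failing jobs: GLITE_ENV_SET is there, but PATH never receives the grid directories.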
That means: if SGE is started by init, /etc/profile.d/grid-env.sh has not been sourced beforehand and everything is okay. If you need to restart sge_execd later for whatever reason (for example from a root login shell, where grid-env.sh has already been sourced), the jobs end up with an "all-but-PATH" environment.
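One way to avoid this on a manual restart might be to start the daemon with an init-like environment. This is only a sketch and not something I have put into production: run_init_like is a name I made up, and the three variables are simply the ones visible in the wn158 dump above.

```shell
#!/bin/bash
# Run a command with an emptied environment plus a few init-like
# variables, so that nothing from a login shell (such as GLITE_ENV_SET)
# leaks into the started process.
run_init_like() {
    env -i \
        PATH=/bin:/usr/bin:/sbin:/usr/sbin \
        SGE_ROOT=/sge-root \
        SGE_CELL=default \
        "$@"
}
```

For example, `run_init_like env` shows what the started process would see. Whether your sge_execd start-up needs further variables depends on the local setup.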
I will now add another layer on top of /etc/profile.d/grid-env.sh which prevents it from being executed when called as root.
I hope this is helpful for other SGE sites, too.
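Roughly, such a guard could look like the sketch below. The details are assumptions: should_source_grid_env is a made-up name, and REAL_GRID_ENV is a hypothetical location for the original script after moving it out of /etc/profile.d.

```shell
#!/bin/bash
# Succeed only for non-root UIDs; $1 is the numeric UID to check.
should_source_grid_env() {
    [ "$1" -ne 0 ]
}

# Hypothetical location of the original grid environment script.
REAL_GRID_ENV=/opt/glite/etc/grid-env.real.sh

# Source the grid environment only for non-root users, and only if
# the file is actually readable.
if should_source_grid_env "$(id -u)" && [ -r "$REAL_GRID_ENV" ]; then
    . "$REAL_GRID_ENV"
fi
```

A root shell then never gets GLITE_ENV_SET, so a later restart of sge_execd from that shell cannot pass it on to the jobs.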
Best regards and have a nice weekend
Torsten
--
<><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><><>
<> <>
<> Dr. Torsten Harenberg <>
<> Bergische Universitaet <>
<> FB C - Physik Tel.: +49 (0)202 439-3521 <>
<> Gaussstr. 20 Fax : +49 (0)202 439-2811 <>
<> 42097 Wuppertal <>
<> <>
<><><><><><><>< Of course it runs NetBSD http://www.netbsd.org ><>