On Fri, Feb 09, 2007 at 10:45:55AM -0000, Jensen, J (Jens) wrote:
> I was speculating that connecting to the gatekeeper would be ok
> but something would fail to accept the delegated credential.
>
> Just an option. As usual we have to use a process of
> elimination.
Well, to make things clearer (I hope):
It only happens occasionally (I have no idea what the ratio is, or whether it happens only at Imperial).
Here is a GRAM log from one failure. To me it suggests a problem connecting back to
the client, since the job fails in stage_in.
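
One way to test the "cannot connect back to the client" theory is to try a plain TCP connection from the CE to the callback and GASS ports named in the log. The sketch below is only that: the function name can_connect is made up for illustration, and it checks TCP reachability only, not the GSI handshake or the delegated credential. The Condor-G ports appear to change between contacts (57650/57651 early in the log, 41614/41615 later), so it would have to be run while the job is still alive.

```python
import socket
from urllib.parse import urlsplit


def can_connect(url, timeout=5.0):
    """Try a plain TCP connect to the host:port found in url.

    Returns (ok, detail). This is a hypothetical helper, not part of
    Globus or Condor-G; it says nothing about whether GSI would succeed.
    """
    parts = urlsplit(url)
    host, port = parts.hostname, parts.port
    if host is None or port is None:
        return False, "URL has no explicit host:port"
    try:
        # create_connection resolves the name and opens the socket.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers refused connections, timeouts and DNS failures.
        return False, str(exc)
```

Calling can_connect("https://condorg.triumf.ca:57651") from the CE during a job would show whether the GASS port is reachable at all. A success here would point back towards the credential side rather than the network.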
Kostas Georgiou
2/7 20:03:12 JM: Security context imported
2/7 20:03:12 JM: Adding new callback contact (url=https://condorg.triumf.ca:57650/, mask=1048575)
2/7 20:03:12 JM: Added successfully
2/7 20:03:12 Pre-parsed RSL string: &(rsl_substitution=(GRIDMANAGER_GASS_URL https://condorg.triumf.ca:57651))(executable=$(GRIDMANAGER_GASS_URL)#'/home/atlas/ProdSys/JOBS/4636/2281262/2/condor_wrap.sh')(scratchdir='')(directory=$(SCRATCH_DIRECTORY))(arguments='')(stdout=$(GLOBUS_CACHED_STDOUT))(stderr=$(GLOBUS_CACHED_STDERR))(file_stage_out=($(GLOBUS_CACHED_STDOUT) $(GRIDMANAGER_GASS_URL)#'/home/atlas/ProdSys/JOBS/4636/2281262/2/stdout')($(GLOBUS_CACHED_STDERR) $(GRIDMANAGER_GASS_URL)#'/home/atlas/ProdSys/JOBS/4636/2281262/2/stderr'))(proxy_timeout=240)(save_state=yes)(two_phase=3600)(remote_io_url=$(GRIDMANAGER_GASS_URL))(queue=72hr)(jobtype=single)(maxWallTime=443)
2/7 20:03:12
<<<<<Job Request RSL
&("rsl_substitution" = ("GRIDMANAGER_GASS_URL" "https://condorg.triumf.ca:57651" ) )("executable" = $("GRIDMANAGER_GASS_URL") # "/home/atlas/ProdSys/JOBS/4636/2281262/2/condor_wrap.sh" )("scratchdir" = "" )("directory" = $("SCRATCH_DIRECTORY") )("arguments" = "" )("stdout" = $("GLOBUS_CACHED_STDOUT") )("stderr" = $("GLOBUS_CACHED_STDERR") )("file_stage_out" = ($("GLOBUS_CACHED_STDOUT") $("GRIDMANAGER_GASS_URL") # "/home/atlas/ProdSys/JOBS/4636/2281262/2/stdout" ) ($("GLOBUS_CACHED_STDERR") $("GRIDMANAGER_GASS_URL") # "/home/atlas/ProdSys/JOBS/4636/2281262/2/stderr" ) )("proxy_timeout" = "240" )("save_state" = "yes" )("two_phase" = "3600" )("remote_io_url" = $("GRIDMANAGER_GASS_URL") )("queue" = "72hr" )("jobtype" = "single" )("maxWallTime" = "443" )
>>>>>Job Request RSL
2/7 20:03:12
<<<<<Job Request RSL (canonical)
&("rslsubstitution" = ("GRIDMANAGER_GASS_URL" "https://condorg.triumf.ca:57651" ) )("executable" = $("GRIDMANAGER_GASS_URL") # "/home/atlas/ProdSys/JOBS/4636/2281262/2/condor_wrap.sh" )("scratchdir" = "" )("directory" = $("SCRATCH_DIRECTORY") )("arguments" = "" )("stdout" = $("GLOBUS_CACHED_STDOUT") )("stderr" = $("GLOBUS_CACHED_STDERR") )("filestageout" = ($("GLOBUS_CACHED_STDOUT") $("GRIDMANAGER_GASS_URL") # "/home/atlas/ProdSys/JOBS/4636/2281262/2/stdout" ) ($("GLOBUS_CACHED_STDERR") $("GRIDMANAGER_GASS_URL") # "/home/atlas/ProdSys/JOBS/4636/2281262/2/stderr" ) )("proxytimeout" = "240" )("savestate" = "yes" )("twophase" = "3600" )("remoteiourl" = $("GRIDMANAGER_GASS_URL") )("queue" = "72hr" )("jobtype" = "single" )("maxwalltime" = "443" )
>>>>>Job Request RSL (canonical)
2/7 20:03:12 JM: Evaluating RSL Value
2/7 20:03:12 JM: Evaluated RSL Value to GRIDMANAGER_GASS_URL
2/7 20:03:12 JM: Evaluating RSL Value
2/7 20:03:12 JM: Evaluated RSL Value to https://condorg.triumf.ca:57651
2/7 20:03:13 Evaluating scratch directory RSL
2/7 20:03:13 Scratch Directory RSL ->
2/7 20:03:13 JMI: testing job manager scripts for type sge exist and permissions are ok.
2/7 20:03:13 JMI: completed script validation: job manager type is sge.
2/7 20:03:13 JMI: in globus_gram_job_manager_script_make_scratchdir()
2/7 20:03:13 JMI: cmd = make_scratchdir
2/7 20:03:13 JMI: returning with success
Wed Feb 7 20:03:14 2007 JM_SCRIPT: New Perl JobManager created.
Wed Feb 7 20:03:14 2007 JM_SCRIPT: Entering Job Manager default implementation of make_scratchdir
Wed Feb 7 20:03:14 2007 JM_SCRIPT: Trying to create directory named /home/grid/lt2-atlasprd//gram_scratch_YqeZEiPi3d
Wed Feb 7 20:03:14 2007 JM_SCRIPT: Sent NFS sync for /home/grid/lt2-atlasprd//gram_scratch_YqeZEiPi3d
Wed Feb 7 20:03:15 2007 JM_SCRIPT: I think it was made.... verifying
Wed Feb 7 20:03:15 2007 JM_SCRIPT: Using /home/grid/lt2-atlasprd//gram_scratch_YqeZEiPi3d as the scratch directory for this job.
2/7 20:03:15 JMI: while return_buf = GRAM_SCRIPT_SCRATCH_DIR = /home/grid/lt2-atlasprd//gram_scratch_YqeZEiPi3d
2/7 20:03:22 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR
2/7 20:03:22 Adding scratch dir to symbol table and env: /home/grid/lt2-atlasprd//gram_scratch_YqeZEiPi3d
2/7 20:03:22
<<<<<Job RSL
&("environment" = ("SCRATCH_DIRECTORY" "/home/grid/lt2-atlasprd//gram_scratch_YqeZEiPi3d" ) ("HOME" "/home/grid/lt2-atlasprd" ) ("LOGNAME" "lt2-atlasprd" ) )("rslsubstitution" = ("GRIDMANAGER_GASS_URL" "https://condorg.triumf.ca:57651" ) )("executable" = $("GRIDMANAGER_GASS_URL") # "/home/atlas/ProdSys/JOBS/4636/2281262/2/condor_wrap.sh" )("scratchdir" = "" )("directory" = $("SCRATCH_DIRECTORY") )("arguments" = "" )("stdout" = $("GLOBUS_CACHED_STDOUT") )("stderr" = $("GLOBUS_CACHED_STDERR") )("filestageout" = ($("GLOBUS_CACHED_STDOUT") $("GRIDMANAGER_GASS_URL") # "/home/atlas/ProdSys/JOBS/4636/2281262/2/stdout" ) ($("GLOBUS_CACHED_STDERR") $("GRIDMANAGER_GASS_URL") # "/home/atlas/ProdSys/JOBS/4636/2281262/2/stderr" ) )("proxytimeout" = "240" )("savestate" = "yes" )("twophase" = "3600" )("remoteiourl" = $("GRIDMANAGER_GASS_URL") )("queue" = "72hr" )("jobtype" = "single" )("maxwalltime" = "443" )
>>>>>Job RSL
2/7 20:03:22
<<<<<Job RSL (post-eval)
&("environment" = ("SCRATCH_DIRECTORY" "/home/grid/lt2-atlasprd//gram_scratch_YqeZEiPi3d" ) ("HOME" "/home/grid/lt2-atlasprd" ) ("LOGNAME" "lt2-atlasprd" ) )("rslsubstitution" = ("GRIDMANAGER_GASS_URL" "https://condorg.triumf.ca:57651" ) )("executable" = "https://condorg.triumf.ca:57651/home/atlas/ProdSys/JOBS/4636/2281262/2/condor_wrap.sh" )("scratchdir" = "" )("directory" = "/home/grid/lt2-atlasprd//gram_scratch_YqeZEiPi3d" )("arguments" = "" )("stdout" = "x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stdout" )("stderr" = "x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stderr" )("filestageout" = ("x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stdout" "https://condorg.triumf.ca:57651/home/atlas/ProdSys/JOBS/4636/2281262/2/stdout" ) ("x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stderr" "https://condorg.triumf.ca:57651/home/atlas/ProdSys/JOBS/4636/2281262/2/stderr" ) )("proxytimeout" = "240" )("savestate" = "yes" )("twophase" = "3600" )("remoteiourl" = "https://condorg.triumf.ca:57651" )("queue" = "72hr" )("jobtype" = "single" )("maxwalltime" = "443" )
>>>>>Job RSL (post-eval)
2/7 20:03:22
<<<<<Job RSL (post-validation)
&("stdin" = "/dev/null" )("count" = "1" )("gram_my_job" = "collective" )("dry_run" = "no" )("environment" = ("SCRATCH_DIRECTORY" "/home/grid/lt2-atlasprd//gram_scratch_YqeZEiPi3d" ) ("HOME" "/home/grid/lt2-atlasprd" ) ("LOGNAME" "lt2-atlasprd" ) )("rslsubstitution" = ("GRIDMANAGER_GASS_URL" "https://condorg.triumf.ca:57651" ) )("executable" = "https://condorg.triumf.ca:57651/home/atlas/ProdSys/JOBS/4636/2281262/2/condor_wrap.sh" )("scratchdir" = "" )("directory" = "/home/grid/lt2-atlasprd//gram_scratch_YqeZEiPi3d" )("arguments" = "" )("stdout" = "x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stdout" )("stderr" = "x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stderr" )("filestageout" = ("x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stdout" "https://condorg.triumf.ca:57651/home/atlas/ProdSys/JOBS/4636/2281262/2/stdout" ) ("x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stderr" "https://condorg.triumf.ca:57651/home/atlas/ProdSys/JOBS/4636/2281262/2/stderr" ) )("proxytimeout" = "240" )("savestate" = "yes" )("twophase" = "3600" )("remoteiourl" = "https://condorg.triumf.ca:57651" )("queue" = "72hr" )("jobtype" = "single" )("maxwalltime" = "443" )
>>>>>Job RSL (post-validation)
2/7 20:03:22
<<<<<Job RSL (post-validation-eval)
&("stdin" = "/dev/null" )("count" = "1" )("gram_my_job" = "collective" )("dry_run" = "no" )("environment" = ("SCRATCH_DIRECTORY" "/home/grid/lt2-atlasprd//gram_scratch_YqeZEiPi3d" ) ("HOME" "/home/grid/lt2-atlasprd" ) ("LOGNAME" "lt2-atlasprd" ) )("rslsubstitution" = ("GRIDMANAGER_GASS_URL" "https://condorg.triumf.ca:57651" ) )("executable" = "https://condorg.triumf.ca:57651/home/atlas/ProdSys/JOBS/4636/2281262/2/condor_wrap.sh" )("scratchdir" = "" )("directory" = "/home/grid/lt2-atlasprd//gram_scratch_YqeZEiPi3d" )("arguments" = "" )("stdout" = "x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stdout" )("stderr" = "x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stderr" )("filestageout" = ("x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stdout" "https://condorg.triumf.ca:57651/home/atlas/ProdSys/JOBS/4636/2281262/2/stdout" ) ("x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stderr" "https://condorg.triumf.ca:57651/home/atlas/ProdSys/JOBS/4636/2281262/2/stderr" ) )("proxytimeout" = "240" )("savestate" = "yes" )("twophase" = "3600" )("remoteiourl" = "https://condorg.triumf.ca:57651" )("queue" = "72hr" )("jobtype" = "single" )("maxwalltime" = "443" )
>>>>>Job RSL (post-validation-eval)
2/7 20:03:22 JMI: Getting RSL output value
2/7 20:03:22 JMI: Processing output positions
2/7 20:03:22 JMI: Getting RSL output value
2/7 20:03:22 JMI: Processing output positions
2/7 20:03:22 JM: Evaluating RSL Value
2/7 20:03:22 JM: Evaluated RSL Value to x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stdout
2/7 20:03:22 JM: Evaluating RSL Value
2/7 20:03:22 JM: Evaluated RSL Value to https://condorg.triumf.ca:57651/home/atlas/ProdSys/JOBS/4636/2281262/2/stdout
2/7 20:03:22 JM: Evaluating RSL Value
2/7 20:03:22 JM: Evaluated RSL Value to x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stderr
2/7 20:03:22 JM: Evaluating RSL Value
2/7 20:03:22 JM: Evaluated RSL Value to https://condorg.triumf.ca:57651/home/atlas/ProdSys/JOBS/4636/2281262/2/stderr
2/7 20:03:22 JMI: testing job manager scripts for type sge exist and permissions are ok.
2/7 20:03:22 JMI: completed script validation: job manager type is sge.
2/7 20:03:22 JMI: in globus_gram_job_manager_script_remote_io_file_create()
2/7 20:03:22 JMI: cmd = remote_io_file_create
Wed Feb 7 20:03:23 2007 JM_SCRIPT: New Perl JobManager created.
Wed Feb 7 20:03:23 2007 JM_SCRIPT: remote_io_file_create(enter)
Wed Feb 7 20:03:27 2007 JM_SCRIPT: remote_io_file_create(exit)
2/7 20:03:27 JMI: while return_buf = GRAM_SCRIPT_REMOTE_IO_FILE = /home/grid/lt2-atlasprd/.globus/.gass_cache/local/md5/ae/8c36bd9a4e94bbc5bddcf0839d42f9/md5/21/f1aecf347892f34c9e707f16eff43c/data
2/7 20:03:27 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_REMOTE_IO_FILE_CREATE
2/7 20:03:31 JM: Opening output destinations
2/7 20:03:31 JM: stdout goes to x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stdout
2/7 20:03:31 JM: stderr goes to x-gass-cache://ce00.hep.ph.ic.ac.uk/18901.1170878592/dev/stderr
2/7 20:03:31 stdout or stderr is being used, starting to poll
2/7 20:03:31 no opens in progress, registering state machine callback
2/7 20:03:31 JM: Finished opening output destinations
2/7 20:03:31 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_OPEN_OUTPUT
2/7 20:03:31 JM: GSSAPI type is GSI.. relocating proxy
2/7 20:03:33 JMI: testing job manager scripts for type sge exist and permissions are ok.
2/7 20:03:33 JMI: completed script validation: job manager type is sge.
2/7 20:03:33 JMI: in globus_gram_job_manager_script_proxy_relocate()
2/7 20:03:33 JMI: cmd = proxy_relocate
Wed Feb 7 20:03:35 2007 JM_SCRIPT: New Perl JobManager created.
Wed Feb 7 20:03:35 2007 JM_SCRIPT: proxy_relocate(enter)
2/7 20:03:35 JMI: while return_buf = GRAM_SCRIPT_X509_USER_PROXY = /home/grid/lt2-atlasprd/.globus/.gass_cache/local/md5/ae/8c36bd9a4e94bbc5bddcf0839d42f9/md5/79/9885245d2a4b29ba153ca58b74fed0/data
2/7 20:03:37 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_PROXY_RELOCATE
2/7 20:03:37 JM: Relocated Proxy to /home/grid/lt2-atlasprd/.globus/.gass_cache/local/md5/ae/8c36bd9a4e94bbc5bddcf0839d42f9/md5/79/9885245d2a4b29ba153ca58b74fed0/data
2/7 20:03:37 JM: Creating and locking state lock file
2/7 20:03:37 JM: Writing state file
2/7 20:03:37 JM: before sending to client: rc=0 (Success)
2/7 20:03:37 Job Manager State Machine (exiting): GLOBUS_GRAM_JOB_MANAGER_STATE_TWO_PHASE
2/7 20:03:40 globus_gram_job_manager_query_callback() not a literal URI match
2/7 20:03:40 JM : in globus_l_gram_job_manager_query_callback, query=signal 5
2/7 20:03:40 JM : reply: (status=32 failure code=0 (Success))
2/7 20:03:40 JM : sending reply:
protocol-version: 2
status: 32
failure-code: 0
job-failure-code: 0
2/7 20:03:40 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_TWO_PHASE_COMMITTED
2/7 20:03:40 JM: NOT empty client callback list.
2/7 20:03:40 JM: sending callback of status 64 (failure code 0) to https://condorg.triumf.ca:57650/.
2/7 20:03:40 JMI: testing job manager scripts for type sge exist and permissions are ok.
2/7 20:03:40 JMI: completed script validation: job manager type is sge.
2/7 20:03:40 JMI: in globus_gram_job_manager_script_stage_in()
2/7 20:03:40 JMI: cmd = stage_in
2/7 20:03:40 JMI: returning with success
Wed Feb 7 20:03:41 2007 JM_SCRIPT: New Perl JobManager created.
Wed Feb 7 20:03:41 2007 JM_SCRIPT: stage_in(enter)
Wed Feb 7 20:03:50 2007 JM_SCRIPT: executable staging failed with
2/7 20:03:50 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE =
2/7 20:03:50 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_SOURCE = https://condorg.triumf.ca:57651/home/atlas/ProdSys/JOBS/4636/2281262/2/condor_wrap.sh
2/7 20:03:50 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_TYPE = executable
2/7 20:03:50 JMI: while return_buf = GRAM_SCRIPT_ERROR = 43
2/7 20:03:50 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_STAGE_IN
2/7 20:03:50 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED
2/7 20:03:50 JM: Writing state file
2/7 20:03:50 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CLOSE_OUTPUT
2/7 20:03:50 JM: in globus_gram_job_manager_history_file_create()
2/7 20:03:50 JM: NOT empty client callback list.
2/7 20:03:50 JM: sending callback of status 4 (failure code 43) to https://condorg.triumf.ca:57650/.
2/7 20:08:07 globus_gram_job_manager_query_callback() not a literal URI match
2/7 20:08:07 JM : in globus_l_gram_job_manager_query_callback, query=register 1048575 https://condorg.triumf.ca:41614/
2/7 20:08:07 JM: job manager request handling is not done yet, request will be processed
2/7 20:08:07 JM: Adding new callback contact (url=https://condorg.triumf.ca:41614/, mask=1048575)
2/7 20:08:07 JM: Added successfully
2/7 20:08:07 JM : reply: (status=4 failure code=0 (Success))
2/7 20:08:07 JM : sending reply:
protocol-version: 2
status: 4
failure-code: 0
job-failure-code: 43
2/7 20:08:13 globus_gram_job_manager_query_callback() not a literal URI match
2/7 20:08:16 JM : in globus_l_gram_job_manager_query_callback, query=signal 7 &(remote_io_url=https://condorg.triumf.ca:41615)(invalid=bad)
2/7 20:08:16 JM : reply: (status=4 failure code=94 (the jobmanager does not accept any new requests (shutting down)))
2/7 20:08:16 JM : sending reply:
protocol-version: 2
status: 4
failure-code: 94
job-failure-code: 0
2/7 20:08:23 globus_gram_job_manager_query_callback() not a literal URI match
2/7 20:08:23 JM : in globus_l_gram_job_manager_query_callback, query=signal 10
2/7 20:08:23 JM : reply: (status=4 failure code=0 (Success))
2/7 20:08:23 JM : sending reply:
protocol-version: 2
status: 4
failure-code: 0
job-failure-code: 43
2/7 20:08:23 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE_COMMITTED
2/7 20:08:23 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_FILE_CLEAN_UP
2/7 20:08:23 JMI: testing job manager scripts for type sge exist and permissions are ok.
2/7 20:08:23 JMI: completed script validation: job manager type is sge.
2/7 20:08:23 JMI: in globus_gram_job_manager_rm_scratchdir()
2/7 20:08:23 JMI: cmd = remove_scratchdir
2/7 20:08:23 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_SCRATCH_CLEAN_UP
2/7 20:08:23 JMI: testing job manager scripts for type sge exist and permissions are ok.
2/7 20:08:23 JMI: completed script validation: job manager type is sge.
2/7 20:08:23 JMI: cmd = cache_cleanup
Wed Feb 7 20:08:24 2007 JM_SCRIPT: New Perl JobManager created.
Wed Feb 7 20:08:24 2007 JM_SCRIPT: cache_cleanup(enter)
Wed Feb 7 20:08:27 2007 JM_SCRIPT: cache_cleanup(exit)
2/7 20:08:27 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CACHE_CLEAN_UP
2/7 20:08:27 JM: in globus_gram_job_manager_reporting_file_remove()
2/7 20:08:27 JM: exiting globus_gram_job_manager.
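
Since we do not know how often this happens, it may help to pull the same fields out of many job manager logs and count them. The following is a rough sketch that assumes the logs look like the one above; the function name summarize and the two regular expressions are illustrative and not part of any Globus tool. It collects the GRAM_SCRIPT_* values reported by the scripts and the sequence of state machine states.

```python
import re

# Patterns modelled on the lines in the log above (an assumption about
# the general format, checked only against this one log).
RETURN_BUF = re.compile(r"return_buf = (GRAM_SCRIPT_\w+) =\s*(.*)$")
STATE = re.compile(
    r"State Machine \((?:entering|exiting)\): "
    r"(GLOBUS_GRAM_JOB_MANAGER_STATE_\w+)"
)


def summarize(lines):
    """Return (values, states) extracted from job manager log lines."""
    values = {}
    states = []
    for line in lines:
        line = line.rstrip("\n")
        m = RETURN_BUF.search(line)
        if m:
            # Later occurrences of the same key overwrite earlier ones.
            values[m.group(1)] = m.group(2).strip()
            continue
        m = STATE.search(line)
        if m:
            states.append(m.group(1))
    return values, states


# A few lines copied from the log above, used as sample input.
SAMPLE = [
    "2/7 20:03:50 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE =",
    "2/7 20:03:50 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_TYPE = executable",
    "2/7 20:03:50 JMI: while return_buf = GRAM_SCRIPT_ERROR = 43",
    "2/7 20:03:50 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_STAGE_IN",
    "2/7 20:03:50 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED",
]

values, states = summarize(SAMPLE)
print(values.get("GRAM_SCRIPT_ERROR"), values.get("GRAM_SCRIPT_GT3_FAILURE_TYPE"))
print(states)
```

Run over a directory of logs, the GRAM_SCRIPT_ERROR and GRAM_SCRIPT_GT3_FAILURE_SOURCE values would give a first idea of the failure ratio and whether the failures always involve the same submit host.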