Hi,
After the upgrade to 2.4 at TAU the site ran for a few days, and then
trouble started. The problem manifests in two ways:
(1) Jobs that finish their run are moved to the WAIT queue and then
run again (in a loop!).
(2) After creating a proxy (grid-proxy-init) and issuing:
globus-job-run lcfgng /bin/ls
the system reports:
GRAM Job submission failed because the job manager failed to
open stderr (error code 74)
We tried different things:
1) Running:
/opt/globus/bin/globusrun -a -r lcfgng
while watching /var/log/globus-gatekeeper.log. The answer is:
Notice: 6: Got connection 132.67.130.65 at Sun May 22 17:04:21 2005
Notice: 5: Trying to use original user proxy ...
Notice: 5: Authenticated globus user:
/C=IL/O=IUCC/OU=TAU/CN=benhammou/SN=44
gss_assist_get_unwrap failure:
GSS Major Status: General failure
GSS Minor Status Error Chain:
unwrap.c:273: gss_unwrap: internal problem with SSL BIO: SSL_read rc=-1
OpenSSL Error: by_file.c:229: in library: x509 certificate routines,
function X509_load_crl_file: missing asn1 eos
OpenSSL Error: pem_lib.c:669: in library: PEM routines, function
PEM_read_bio: no start line
OpenSSL Error: by_file.c:229: in library: x509 certificate routines,
function X509_load_crl_file: missing asn1 eos
OpenSSL Error: pem_lib.c:669: in library: PEM routines, function
PEM_read_bio: no start line
Failure: Reading incoming message GSS failed Major:000d0000
Minor:00000001 Token:00000000
Failure: Reading incoming message GSS failed Major:000d0000
Minor:00000001 Token:00000000
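As a hedged sketch (an assumption based on the log, not a definitive diagnosis): the "X509_load_crl_file: missing asn1 eos" and "PEM_read_bio: no start line" errors above are the classic symptom of a zero-length or truncated CRL (*.r0) file in the trusted-CA directory. Something like the following could scan for unparsable CRLs; the demo directory and the scan_crls helper name are ours, and on the real host you would point it at /etc/grid-security/certificates:

```shell
# Report every CRL file in a directory that openssl cannot parse.
# An empty or truncated *.r0 file reproduces exactly the "no start line" /
# "missing asn1 eos" errors seen in the gatekeeper log.
scan_crls() {
    dir=$1
    for crl in "$dir"/*.r0; do
        [ -e "$crl" ] || continue
        # openssl exits non-zero when the CRL is empty or corrupt
        if ! openssl crl -in "$crl" -noout 2>/dev/null; then
            echo "BAD CRL: $crl"
        fi
    done
}

# Demo against a throwaway directory containing one zero-length CRL;
# on a production host run: scan_crls /etc/grid-security/certificates
demo=$(mktemp -d)
: > "$demo/broken.r0"
scan_crls "$demo"
rm -rf "$demo"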
2) In the home directory, the file gram_job_mgr_4906.log contains:
5/22 16:59:47 JM: Security context imported
5/22 16:59:47 JM: Adding new callback contact
(url=https://lcfgng.cs.tau.ac.il:20002/, mask=1048575)
5/22 16:59:47 JM: Added successfully
5/22 16:59:47 Pre-parsed RSL string: &("rsl_substitution" =
("GLOBUSRUN_GASS_URL" "https://lcfgng.cs.tau.ac.il:20001" ) )("stderr" =
$("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" =
$("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/ls" )
5/22 16:59:47
<<<<<Job Request RSL
&("rsl_substitution" = ("GLOBUSRUN_GASS_URL"
"https://lcfgng.cs.tau.ac.il:20001" ) )("stderr" =
$("GLOBUSRUN_GASS_URL") #
"/dev/stderr" )("stdout" = $("GLOBUSRUN_GASS_URL") # "/dev/stdout"
)("executable" = "/bin/ls" )
Job Request RSL
5/22 16:59:47
<<<<<Job Request RSL (canonical)
&("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
"https://lcfgng.cs.tau.ac.il:20001" ) )("stderr" =
$("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" =
$("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/ls" )
Job Request RSL (canonical)
5/22 16:59:47 JM: Evaluating RSL Value
5/22 16:59:47 JM: Evaluated RSL Value to GLOBUSRUN_GASS_URL
5/22 16:59:47 JM: Evaluating RSL Value
5/22 16:59:47 JM: Evaluated RSL Value to https://lcfgng.cs.tau.ac.il:20001
5/22 16:59:47 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_MAKE_SCRATCHDIR
5/22 16:59:47
<<<<<Job RSL
&("environment" = ("HOME" "/home/dteam002" ) ("LOGNAME" "dteam002" )
)("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
"https://lcfgng.cs.tau.ac.il:20001" ) )("stderr" =
$("GLOBUSRUN_GASS_URL") # "/dev/stderr" )("stdout" =
$("GLOBUSRUN_GASS_URL") # "/dev/stdout" )("executable" = "/bin/ls" )
Job RSL
5/22 16:59:47
<<<<<Job RSL (post-eval)
&("environment" = ("HOME" "/home/dteam002" ) ("LOGNAME" "dteam002" )
)("rslsubstitution" = ("GLOBUSRUN_GASS_URL"
"https://lcfgng.cs.tau.ac.il:20001" ) )("stderr" =
"https://lcfgng.cs.tau.ac.il:20001/dev/stderr" )("stdout" =
"https://lcfgng.cs.tau.ac.il:20001/dev/stdout" )("executable" = "/bin/ls" )
Job RSL (post-eval)
5/22 16:59:47
<<<<<Job RSL (post-validation)
&("directory" = $("HOME") )("stdin" = "/dev/null" )("count" = "1"
)("job_type" = "multiple" )("gram_my_job" = "collective"
)("dry_run" = "no" )("environment" = ("HOME" "/home/dteam002" )
("LOGNAME" "dteam002" ) )("rslsubstitution" =
("GLOBUSRUN_GASS_URL" "https://lcfgng.cs.tau.ac.il:20001" ) )("stderr" =
"https://lcfgng.cs.tau.ac.il:20001/dev/stderr" )("stdout" =
"https://lcfgng.cs.tau.ac.il:20001/dev/stdout" )("executable" =
"/bin/ls" )
Job RSL (post-validation)
5/22 16:59:47
<<<<<Job RSL (post-validation-eval)
&("directory" = "/home/dteam002" )("stdin" = "/dev/null" )("count" =
"1" )("job_type" = "multiple" )("gram_my_job" = "collective"
)("dry_run" = "no" )("environment" = ("HOME" "/home/dteam002" )
("LOGNAME" "dteam002" ) )("rslsubstitution" =
("GLOBUSRUN_GASS_URL" "https://lcfgng.cs.tau.ac.il:20001" ) )("stderr" =
"https://lcfgng.cs.tau.ac.il:20001/dev/stderr" )("stdout" =
"https://lcfgng.cs.tau.ac.il:20001/dev/stdout" )("executable" =
"/bin/ls" )
Job RSL (post-validation-eval)
5/22 16:59:47 JMI: Getting RSL output value
5/22 16:59:47 JMI: Processing output positions
5/22 16:59:47 JMI: Getting RSL output value
5/22 16:59:47 JMI: Processing output positions
5/22 16:59:47 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_REMOTE_IO_FILE_CREATE
5/22 16:59:47 JM: Opening output destinations
5/22 16:59:47 JM: stdout goes to
x-gass-cache://lcfgng.cs.tau.ac.il/4906.1116770387/dev/stdout
5/22 16:59:47 JM: stderr goes to
x-gass-cache://lcfgng.cs.tau.ac.il/4906.1116770387/dev/stderr
5/22 16:59:47 JM: Opening https://lcfgng.cs.tau.ac.il:20001/dev/stdout
5/22 16:59:47 JM: Opened GASS handle 1.
5/22 16:59:47 JM: exiting
globus_l_gram_job_manager_output_destination_open()
5/22 16:59:47 JM: Opening https://lcfgng.cs.tau.ac.il:20001/dev/stderr
5/22 16:59:47 JM: Opened GASS handle 2.
5/22 16:59:47 JM: exiting
globus_l_gram_job_manager_output_destination_open()
5/22 16:59:47 stdout or stderr is being used, starting to poll
5/22 16:59:47 JM: Finished opening output destinations
5/22 16:59:47 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED
5/22 16:59:47 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CLOSE_OUTPUT
5/22 16:59:47 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_PRE_FILE_CLEAN_UP
5/22 16:59:47 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_FILE_CLEAN_UP
5/22 16:59:47 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_SCRATCH_CLEAN_UP
5/22 16:59:47 JMI: testing job manager scripts for type fork exist and
permissions are ok.
5/22 16:59:47 JMI: completed script validation: job manager type is
fork.
5/22 16:59:47 JMI: cmd = cache_cleanup
Sun May 22 16:59:47 2005 JM_SCRIPT: New Perl JobManager created.
Sun May 22 16:59:47 2005 JM_SCRIPT: cache_cleanup(enter)
Sun May 22 16:59:47 2005 JM_SCRIPT: cache_cleanup(exit)
5/22 16:59:47 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_CACHE_CLEAN_UP
5/22 16:59:47 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_EARLY_FAILED_RESPONSE
5/22 16:59:47 JM: before sending to client: rc=0 (Success)
5/22 16:59:47 Job Manager State Machine (exiting):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
5/22 16:59:47 JM: in globus_gram_job_manager_reporting_file_remove()
5/22 16:59:47 Job Manager State Machine (entering):
GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_DONE
5/22 16:59:47 JM: in globus_gram_job_manager_reporting_file_remove()
5/22 16:59:47 JM: exiting globus_gram_job_manager
From googling the problem it looks like it has something to do with
the signing policy. We changed ours to the format described in
http://www.grid-support.ac.uk/downloads/pdf/6300_Signing_Policy_02.pdf
but it did not solve the problem. (We changed the signing policy only
locally, for debugging.)
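In case it helps to compare: a *.signing_policy file in the EACL format that the Globus OpenSSL code reads looks roughly like the sketch below. The CA DN and namespace here are placeholders modelled on our DN structure, not the real IUCC CA values:

```
# /etc/grid-security/certificates/<hash>.signing_policy (illustrative only)
access_id_CA   X509   '/C=IL/O=IUCC/CN=Example CA'
pos_rights     globus CA:sign
cond_subjects  globus '"/C=IL/O=IUCC/*"'
```

The cond_subjects pattern must cover the user DN that the gatekeeper authenticated (here /C=IL/O=IUCC/OU=TAU/CN=benhammou), otherwise authorization fails even with a valid proxy.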
Having tried all the ideas we had, we would appreciate some help...
Thanks,
Yan & Eddie