Following up on the problem I reported weeks ago of the large number
of ATLAS jobs ending up in "W" state at UCL-HEP,
I've dug some extra information out of the gram_job_manager logs. These
show that the delivery of stdout fails with errors of the type
GRAM_SCRIPT_GT3_FAILURE_TYPE = filestageout
8/14 22:26:02 JMI: while return_buf = GRAM_SCRIPT_ERROR = 155
This will probably result in the job showing up as Aborted on the user side, but it also leaves jobs cluttering the PBS queue, bouncing forever between states "W", "Q" and "R" until they are finally cleared with a "qdel".
Does anyone have any expertise on how to:
1) prevent the stageout error (see the detailed section of the gram log attached below for clues)
2) prevent the cluttering of the PBS queue and the consequent cluttering of the pool accounts' home dirs?
As background info, the PBS, MAUI and gatekeeper logs offer no additional clues, and this problem affects only LCG jobs, not those of local users.
Many thanks,
gianfranco
=====
8/13 23:34:01 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_PRE_CLOSE_OUTPUT
8/13 23:34:01 JM: Writing state file
8/13 23:34:01 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_CLOSE_OUTPUT
8/13 23:34:01 JMI: testing job manager scripts for type lcgpbs exist and permissions are ok.
8/13 23:34:01 JMI: completed script validation: job manager type is lcgpbs.
8/13 23:34:01 JMI: in globus_gram_job_manager_script_stage_out()
8/13 23:34:01 JMI: cmd = stage_out
8/13 23:34:01 JMI: returning with success
Sun Aug 13 23:34:01 2006 JM_SCRIPT: New Perl JobManager created.
Sun Aug 13 23:34:01 2006 JM_SCRIPT: Entering Job Manager submit-helper implementation of stage_out
Sun Aug 13 23:34:01 2006 JM_SCRIPT: stage_out(enter)
Sun Aug 13 23:34:02 2006 JM_SCRIPT: Sent NFS sync for /grid/home/atlas037/.globus/.gass_cache/local/md5/6d/155ed872d289541f383af141ec59d1/md5/36/fe8be5d42805976a95fbc5fce7eaf5/data
Sun Aug 13 23:37:11 2006 JM_SCRIPT: filestageout staging failed with error: [globus_l_gass_copy_gass_setup_callback]: url: https://pcuwgrid2.cern.ch:62223/home/xxx/LCG/w5jets_scale2/w5j_15022_49882.out request was DENIED, for reason: 400, Bad Request
Sun Aug 13 23:37:11 2006 JM_SCRIPT: Leaving Job Manager submit-helper implementation of stage_out
8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE = error: [globus_l_gass_copy_gass_setup_callback]: url: https://pcuwgrid2.cern.ch:62223/home/xxx/LCG/w5jets_scale2/w5j_15022_49882.out request was DENIED, for reason: 400, Bad Request
8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_DESTINATION = https://pcuwgrid2.cern.ch:62223/home/xxx/LCG/w5jets_scale2/w5j_15022_49882.out
8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_SOURCE = x-gass-cache://pc90.hep.ucl.ac.uk/28249.1155505558/dev/stdout
8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_TYPE = filestageout
8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_ERROR = 155
8/13 23:37:11 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_STAGE_OUT
8/13 23:37:11 JM: in globus_gram_job_manager_history_file_create()
8/13 23:37:11 JM: NOT empty client callback list.
8/13 23:37:11 JM: sending callback of status 4 (failure code 155) to https://pcuwgrid2.cern.ch:62222/.
8/13 23:40:07 globus_gram_job_manager_query_callback() not a literal URI match
8/13 23:40:07 JM : in globus_l_gram_job_manager_query_callback, query=status
8/13 23:40:07 JM : reply: (status=4 failure code=0 (Success))
8/13 23:40:07 JM : sending reply:
protocol-version: 2
status: 4
failure-code: 0
job-failure-code: 155
^@8/13 23:40:07 -------------------
8/13 23:46:05 globus_gram_job_manager_query_callback() not a literal URI match
8/13 23:46:05 JM : in globus_l_gram_job_manager_query_callback, query=signal 10
8/13 23:46:05 JM : reply: (status=4 failure code=0 (Success))
8/13 23:46:05 JM : sending reply:
protocol-version: 2
status: 4
failure-code: 0
job-failure-code: 155
^@8/13 23:46:05 -------------------
8/13 23:46:05 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE_COMMITTED
8/13 23:46:05 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_FILE_CLEAN_UP
8/13 23:46:05 JMI: testing job manager scripts for type lcgpbs exist and permissions are ok.
8/13 23:46:05 JMI: completed script validation: job manager type is lcgpbs.
8/13 23:46:05 JMI: in globus_gram_job_manager_rm_scratchdir()
8/13 23:46:05 JMI: cmd = remove_scratchdir
8/13 23:46:06 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_SCRATCH_CLEAN_UP
8/13 23:46:06 JMI: testing job manager scripts for type lcgpbs exist and permissions are ok.
8/13 23:46:06 JMI: completed script validation: job manager type is lcgpbs.
8/13 23:46:06 JMI: cmd = cache_cleanup
Sun Aug 13 23:46:06 2006 JM_SCRIPT: New Perl JobManager created.
Sun Aug 13 23:46:06 2006 JM_SCRIPT: Entering Job Manager submit-helper implementation of cache_cleanup
8/13 23:46:07 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CACHE_CLEAN_UP
8/13 23:46:07 JM: in globus_gram_job_manager_reporting_file_remove()
8/13 23:46:07 JM: exiting globus_gram_job_manager.
=====
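
For anyone wanting to check their own gram_job_manager logs for the same pattern, here is a minimal Python sketch that collects the "return_buf" lines into one record per failure and counts them by failure type and error code. It assumes the line layout seen in the excerpt above, and that GRAM_SCRIPT_ERROR is always the last return_buf line of a failure (true for this excerpt, not verified in general). The names parse_failures, summarize and SAMPLE are only illustrative and not part of any Globus tool.

```python
import re
from collections import Counter

# Matches lines of the form seen in the excerpt, e.g.
#   8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_ERROR = 155
RETURN_BUF = re.compile(
    r"^(?P<stamp>\d+/\d+ \d+:\d+:\d+) JMI: while return_buf = "
    r"(?P<key>GRAM_SCRIPT_\w+) = (?P<value>.*)$"
)


def parse_failures(lines):
    """Group return_buf key/value pairs into one dict per failure.

    Assumption: GRAM_SCRIPT_ERROR closes each group of return_buf lines.
    """
    failures, current = [], {}
    for line in lines:
        match = RETURN_BUF.match(line.strip())
        if not match:
            continue  # not a return_buf line, ignore it
        current[match.group("key")] = match.group("value").strip()
        if match.group("key") == "GRAM_SCRIPT_ERROR":
            current["timestamp"] = match.group("stamp")
            failures.append(current)
            current = {}
    return failures


def summarize(failures):
    """Count failures by (failure type, error code)."""
    return Counter(
        (f.get("GRAM_SCRIPT_GT3_FAILURE_TYPE"), f.get("GRAM_SCRIPT_ERROR"))
        for f in failures
    )


# Three lines copied from the excerpt above, used as a small self-check.
SAMPLE = """\
8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_SOURCE = x-gass-cache://pc90.hep.ucl.ac.uk/28249.1155505558/dev/stdout
8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_TYPE = filestageout
8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_ERROR = 155
""".splitlines()

for (failure_type, code), count in summarize(parse_failures(SAMPLE)).items():
    print(f"type={failure_type} error={code} count={count}")
```

To run it over a real log, pass the open file to parse_failures() in place of SAMPLE. The GRAM_SCRIPT_GT3_FAILURE_DESTINATION field of each record then shows which client-side URLs are refusing the stdout delivery.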
--
Dr. Gianfranco Sciacca Tel: +44 (0)20 7679 3044
Dept of Physics and Astronomy Internal: 33044
University College London D15 - Physics Building
London WC1E 6BT