Gianfranco Sciacca wrote:
> Following up on the problem I reported weeks ago of the large number
> of atlas jobs ending up in "W" state at UCL-HEP,
> I've dug some extra information out of the gram_job_manager logs. These
> show that the delivery of stdout fails with errors of the type
>
> GRAM_SCRIPT_GT3_FAILURE_TYPE = filestageout
> 8/14 22:26:02 JMI: while return_buf = GRAM_SCRIPT_ERROR = 155
Which RB are you using? The stage-out goes to the GRAM client, i.e. the RB.
Error 155 usually means there is a network problem, e.g. with the firewall or
the GLOBUS_TCP_PORT_RANGE settings on either end. An intermediate router or
firewall might not be happy with the same data port (say 20000) getting
rapidly reused for different connections; that would be a bug in the
router itself, which you may not be able to get fixed any time soon.
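To rule out the simpler network causes first, something like the following rough sketch could be run on the CE. It only checks whether a plain TCP connection can be opened to a given host and port, and parses a port range string. The helper names parse_port_range and can_connect are made up for this example, and the "min,max" (or space-separated) form of GLOBUS_TCP_PORT_RANGE is an assumption to verify against your own site configuration.

```python
import socket


def parse_port_range(value):
    """Parse a 'min,max' (or 'min max') port range string into two ints.

    Assumption: GLOBUS_TCP_PORT_RANGE is set in one of these two forms.
    """
    parts = value.replace(",", " ").split()
    if len(parts) != 2:
        raise ValueError(f"expected two ports, got {value!r}")
    low, high = int(parts[0]), int(parts[1])
    if not (0 < low <= high <= 65535):
        raise ValueError(f"invalid port range: {value!r}")
    return low, high


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For instance, can_connect("pcuwgrid2.cern.ch", 62223) tries the destination host and port that appear in the stage-out URL of the log quoted below. A successful connect only shows that the port is reachable at that moment; it would not reveal a router that misbehaves when the same port is reused rapidly.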
> This will probably result in the job showing up as Aborted on the user
> side, but also results in jobs cluttering the PBS queue, bouncing
> forever between states "W", "Q" and "R", until finally cleared with a
> "qdel".
>
> Does anyone have any expertise on how to:
> 1) prevent the stageout error (see detailed section of gram log attached
> below for clues)
> 2) prevent the cluttering of the PBS queue and the consequent cluttering
> of the pool accounts' home dirs?
>
> As background info, PBS, MAUI and gatekeeper logs offer no additional
> clues, and this problem affects only LCG jobs, not local users'.
>
> Many thanks,
> gianfranco
>
> =====
> 8/13 23:34:01 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_PRE_CLOSE_OUTPUT
> 8/13 23:34:01 JM: Writing state file
> 8/13 23:34:01 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_CLOSE_OUTPUT
> 8/13 23:34:01 JMI: testing job manager scripts for type lcgpbs exist and
> permissions are ok.
> 8/13 23:34:01 JMI: completed script validation: job manager type is lcgpbs.
> 8/13 23:34:01 JMI: in globus_gram_job_manager_script_stage_out()
> 8/13 23:34:01 JMI: cmd = stage_out
> 8/13 23:34:01 JMI: returning with success
> Sun Aug 13 23:34:01 2006 JM_SCRIPT: New Perl JobManager created.
> Sun Aug 13 23:34:01 2006 JM_SCRIPT: Entering Job Manager submit-helper
> implementation of stage_out
> Sun Aug 13 23:34:01 2006 JM_SCRIPT: stage_out(enter)
> Sun Aug 13 23:34:02 2006 JM_SCRIPT: Sent NFS sync for
> /grid/home/atlas037/.globus/.gass_cache/local/md5/6d/155ed872d289541f383af141ec59d1/md5/36/fe8be5d42805976a95fbc5fce7eaf5/data
>
> Sun Aug 13 23:37:11 2006 JM_SCRIPT: filestageout staging failed with
> error: [globus_l_gass_copy_gass_setup_callback]: url:
> https://pcuwgrid2.cern.ch:62223/home/xxx/LCG/w5jets_scale2/w5j_15022_49882.out
> request was DENIED, for reason: 400, Bad Request
> Sun Aug 13 23:37:11 2006 JM_SCRIPT: Leaving Job Manager submit-helper
> implementation of stage_out
> 8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_MESSAGE =
> error: [globus_l_gass_copy_gass_setup_callback]: url:
> https://pcuwgrid2.cern.ch:62223/home/xxx/LCG/w5jets_scale2/w5j_15022_49882.out
> request was DENIED, for reason: 400, Bad Request
> 8/13 23:37:11 JMI: while return_buf =
> GRAM_SCRIPT_GT3_FAILURE_DESTINATION =
> https://pcuwgrid2.cern.ch:62223/home/xxx/LCG/w5jets_scale2/w5j_15022_49882.out
>
> 8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_SOURCE =
> x-gass-cache://pc90.hep.ucl.ac.uk/28249.1155505558/dev/stdout
> 8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_GT3_FAILURE_TYPE =
> filestageout
> 8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_ERROR = 155
> 8/13 23:37:11 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_STAGE_OUT
> 8/13 23:37:11 JM: in globus_gram_job_manager_history_file_create()
> 8/13 23:37:11 JM: NOT empty client callback list.
> 8/13 23:37:11 JM: sending callback of status 4 (failure code 155) to
> https://pcuwgrid2.cern.ch:62222/.
> 8/13 23:40:07 globus_gram_job_manager_query_callback() not a literal URI
> match
> 8/13 23:40:07 JM : in globus_l_gram_job_manager_query_callback,
> query=status
> 8/13 23:40:07 JM : reply: (status=4 failure code=0 (Success))
> 8/13 23:40:07 JM : sending reply:
> protocol-version: 2
> status: 4
> failure-code: 0
> job-failure-code: 155
> ^@8/13 23:40:07 -------------------
> 8/13 23:46:05 globus_gram_job_manager_query_callback() not a literal URI
> match
> 8/13 23:46:05 JM : in globus_l_gram_job_manager_query_callback,
> query=signal 10
> 8/13 23:46:05 JM : reply: (status=4 failure code=0 (Success))
> 8/13 23:46:05 JM : sending reply:
> protocol-version: 2
> status: 4
> failure-code: 0
> job-failure-code: 155
> ^@8/13 23:46:05 -------------------
> 8/13 23:46:05 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_TWO_PHASE_COMMITTED
> 8/13 23:46:05 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_FILE_CLEAN_UP
> 8/13 23:46:05 JMI: testing job manager scripts for type lcgpbs exist and
> permissions are ok.
> 8/13 23:46:05 JMI: completed script validation: job manager type is lcgpbs.
> 8/13 23:46:05 JMI: in globus_gram_job_manager_rm_scratchdir()
> 8/13 23:46:05 JMI: cmd = remove_scratchdir
> 8/13 23:46:06 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_SCRATCH_CLEAN_UP
> 8/13 23:46:06 JMI: testing job manager scripts for type lcgpbs exist and
> permissions are ok.
> 8/13 23:46:06 JMI: completed script validation: job manager type is lcgpbs.
> 8/13 23:46:06 JMI: cmd = cache_cleanup
> Sun Aug 13 23:46:06 2006 JM_SCRIPT: New Perl JobManager created.
> Sun Aug 13 23:46:06 2006 JM_SCRIPT: Entering Job Manager submit-helper
> implementation of cache_cleanup
> 8/13 23:46:07 Job Manager State Machine (entering):
> GLOBUS_GRAM_JOB_MANAGER_STATE_FAILED_CACHE_CLEAN_UP
> 8/13 23:46:07 JM: in globus_gram_job_manager_reporting_file_remove()
> 8/13 23:46:07 JM: exiting globus_gram_job_manager.
> =====
>
>
>
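To see how often this happens and for which destinations, the failure fields can be pulled out of the job manager logs with a small script. This is a minimal sketch, assuming that in the real log file each "JMI: while return_buf = KEY = VALUE" entry sits on a single line (the wrapping above comes from the mail client) and that GRAM_SCRIPT_ERROR is the last field reported for each failure, as in the excerpt. The name extract_failures is hypothetical.

```python
import re

# Matches lines such as:
#   8/13 23:37:11 JMI: while return_buf = GRAM_SCRIPT_ERROR = 155
FIELD_RE = re.compile(r"JMI: while return_buf = (GRAM_SCRIPT_\w+) = (.*)$")


def extract_failures(lines):
    """Group GRAM_SCRIPT_* key/value pairs into one dict per failure."""
    records = []
    current = {}
    for line in lines:
        match = FIELD_RE.search(line.rstrip("\n"))
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip()
        current[key] = value
        # Assumption: GRAM_SCRIPT_ERROR closes each group of fields.
        if key == "GRAM_SCRIPT_ERROR":
            records.append(current)
            current = {}
    return records
```

Feeding it the lines of a log file gives a list of dicts, so counting entries by GRAM_SCRIPT_GT3_FAILURE_DESTINATION would show whether the failures are tied to one RB or spread over several.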