Dear David,
The 4th job timed out on purpose (I was trying to find out how many
events I could process in one hour). But I am pretty sure that the
2nd and 3rd jobs finished smoothly.
Best regards,
Frederic
>
> Frederic,
>
> This is most interesting - your comments from probing our frontend
> gatekeeper explain some of our pbs_mom logs.
>
> I can see 4 jobs for you on Friday evening. They all ran on node62 of our
> cluster and I have attached the edited highlights from the pbs_mom log.
> The 2nd job looks quite normal and the 4th looks like it timed out after 1
> hour. The 1st and 3rd had a problem that I have seen before but could not
> explain.
>
> As I understand the scheme, the RB wraps the user's job in a script that
> does
> globus-url-copy input sandbox and brokerinfo RB -> WN
> run user's job
> globus-url-copy output sandbox WN -> RB
>
> and then ~globus-job-submits it to the CE's globus gatekeeper. The
> gatekeeper in turn wraps the RB script in a few lines that
>
> setup globus environment on execution host
> execute received job from deep in user's ~/.globus/.gass_cache
>
> and qsub's this globus gatekeeper wrapper to the pbs_server.
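
[Replying inline] As I read that scheme, the two layers look roughly like
the sketch below. This is a non-runnable illustration, not the real
wrappers: the URLs, paths and script names are all invented, and in
practice both layers are generated on the fly by the RB and the gatekeeper
rather than written by hand.

```shell
# --- Layer 1: RB wrapper (runs on the WN), roughly ---
# stage in the input sandbox and .BrokerInfo from the RB
globus-url-copy gsiftp://rb.example.org/sandbox/input.tar \
                file://$PWD/input.tar
./user_job.sh                         # run the user's job
# stage the output sandbox back to the RB
globus-url-copy file://$PWD/output.tar \
                gsiftp://rb.example.org/sandbox/output.tar

# --- Layer 2: gatekeeper wrapper (what actually gets qsub'd) ---
. /opt/globus/etc/globus-user-env.sh  # set up the Globus environment
# execute the received RB wrapper from deep in the GASS cache
exec $HOME/.globus/.gass_cache/local/md5/.../data
```

The point relevant to the failure below is that the second layer runs out
of ~/.globus/.gass_cache, so if that cache directory is cleaned away early,
both the job script and the staging of its stdout/stderr lose their ground.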
>
> The attached log shows a failure to rcp/scp the stdout and stderr of the
> "globus gatekeeper wrapper" back to the gatekeeper. The rcp would fail
> due to firewall/libwrap restrictions, but it looks like the scp fails
> because the ~/.globus/.gass_cache subdirectory is no longer there.
>
> (it should really have $usecp directives and do "cp" directly
> but that is another story I think)
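
[Replying inline] For what it's worth, the effect of a $usecp directive can
be sketched in plain sh: if the staging source lies under a remote prefix
that is actually a locally mounted filesystem, rewrite it to a straight
"cp"; otherwise fall back to scp through the firewall. The prefix and
mount point below are invented for illustration, and the real jobmanager
does this mapping internally.

```shell
# Sketch of a usecp-style mapping (example values, not a real site config).
REMOTE_PREFIX="gsiftp://ce.example.ac.uk/home"
LOCAL_MOUNT="/home"

stage() {
    src="$1"
    case "$src" in
        "$REMOTE_PREFIX"*)
            # Source is on a locally mounted path: plain cp, no network hop.
            echo "cp ${LOCAL_MOUNT}${src#"$REMOTE_PREFIX"} ."
            ;;
        *)
            # No mapping known: fall back to scp.
            echo "scp $src ."
            ;;
    esac
}

stage "gsiftp://ce.example.ac.uk/home/fred/stdout"
stage "gsiftp://elsewhere.org/data/stdout"
```

That would sidestep both the libwrap-blocked rcp and the scp into a
vanished .gass_cache, at least for sites where the home filesystem is
shared between the gatekeeper and the worker nodes.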
>
> Your observations suggest it may not be there because someone thinks the
> job is finished and has cleaned away the ~/.globus/.gass_cache
> sub-directory used by the job.
>
> I have no idea what sort of error I am looking for - has anyone seen
> anything similar?
>
>
>
> David Martin
>
> Dept of Physics and Astronomy,
> University of Glasgow,
> Glasgow, G12 8QQ,
> United Kingdom
>
> tel: (0)141 330 4197 fax: (0)141 330 5881
> email: [log in to unmask]
>