On 11 Jul 2011, at 11:14, Govind Songara wrote:
> Do we need to disable checkpoint at all?
Grid jobs _won't_ checkpoint. Even if they could, it wouldn't work (timeouts on the remote end of file transfers being the obvious problem).
However, Torque won't try to checkpoint a job unless the job is specifically marked as checkpointable and checkpointing is enabled in the server. Unless you've hacked your CEs (don't do that), the CEs don't mark jobs as checkpointable.
Checkpointing therefore shouldn't happen, and thus it won't cause problems.
The problem is really proxy expiry / long queuing times.
(I have mixed feelings about Torque's rerunable=true default setting. I suspect that if it were off, many of these problem jobs would disappear. What's not clear to me is how many non-problem jobs would then fail instead of succeeding. I have no data on that.)
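If you want to see what your own jobs are flagged with, something like the following rough sketch might help. It scans text in the style of `qstat -f` output and pulls out the Checkpoint and Rerunable attributes per job. The sample text, the host names and the attribute values are all made up for illustration, and the exact layout is an assumption about what your Torque version prints, so check against real output before trusting it.

```python
import re

# Made-up sample in the style of `qstat -f` output (assumption: real output
# has many more fields, and the values shown here are purely illustrative).
SAMPLE = """\
Job Id: 1001.ce.example.org
    Job_Name = cream_1
    Checkpoint = u
    Rerunable = True

Job Id: 1002.ce.example.org
    Job_Name = cream_2
    Checkpoint = enabled
    Rerunable = False
"""

def job_flags(text):
    """Return {job_id: {"Checkpoint": ..., "Rerunable": ...}} from qstat -f style text."""
    jobs = {}
    current = None
    for line in text.splitlines():
        # A new job record starts with an unindented "Job Id:" line.
        m = re.match(r"Job Id:\s*(\S+)", line)
        if m:
            current = jobs.setdefault(m.group(1), {})
            continue
        # Attributes are indented "Name = value" lines; keep only the two of interest.
        m = re.match(r"\s+(Checkpoint|Rerunable)\s*=\s*(.+)", line)
        if m and current is not None:
            current[m.group(1)] = m.group(2).strip()
    return jobs

flags = job_flags(SAMPLE)
for job_id, attrs in flags.items():
    print(job_id, attrs)
```

On a real system you would feed it the captured output of `qstat -f` instead of SAMPLE.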
> On Mon, Jul 11, 2011 at 10:10 AM, Alessandra Forti wrote:
> That'll be because the same proxy is shared over a group of jobs - when the pilot factory submits jobs in batches, it does this to
> reduce demand on the proxy renewals, and to save copying the same file multiple times. (I think it's hard links on the backend
> that are used...)
>
>
> On my system, the proxies that disappeared this weekend all belonged to jobs that had queued for too long.
>
> cheers
> alessandra
>
>
>
> On 11/07/2011 10:03, Stuart Purdie wrote:
> On 11 Jul 2011, at 09:58, Stephen Jones wrote:
>
> Ben Waugh wrote:
> Looks like at the moment I've got a lot of missing stagein files but they are job wrapper scripts, not proxies. Is this normal?
>
> Job Id: 277752.lcg-ce04.hep.ucl.ac.uk - missing stagein: /opt/glite/var/cream_sandbox/atlaspil/_C_UK_O_eScience_OU_Glasgow_L_Compserv_CN_graeme_stewart_atlas_Role_pilot_Capability_NULL_pilatl04/76/CREAM768345533/CREAM768345533_jobWrapper.sh
> On 08/07/11 16:59, Stephen Jones wrote:
> It's senseless. We may have situations where (a) the proxy is gone (b) the job wrapper is gone (c) job wrapper and proxy are both gone. The script lists all missing stagein files. Your output shows that your proxy IS available, yet the job wrapper isn't (to prove it, you could put this "patch" into the script to explicitly list the stagein files that _are_ there).
> That'll be because the same proxy is shared over a group of jobs - when the pilot factory submits jobs in batches, it does this to reduce demand on the proxy renewals, and to save copying the same file multiple times. (I think it's hard links on the backend that are used...)
>
> It's likely that another job in the same group of jobs is still running, but that one died.
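To make the "wrapper gone" versus "something else gone" distinction from the report above easier to see, here is a minimal sketch that groups "missing stagein" lines by job. It is not the script discussed in this thread. The only thing taken from the thread is the line layout and the `_jobWrapper.sh` suffix; the sample paths and host names are invented, and anything that is not a wrapper is simply counted as "other" because the proxy file naming is not given here.

```python
import re
from collections import defaultdict

WRAPPER_SUFFIX = "_jobWrapper.sh"  # suffix seen in the report quoted in this thread

# Assumed line layout: "Job Id: <id> - missing stagein: <path>"
LINE_RE = re.compile(r"Job Id:\s*(\S+)\s+-\s+missing stagein:\s*(\S+)")

def classify_missing(lines, wrapper_suffix=WRAPPER_SUFFIX):
    """Group missing stagein paths per job into 'wrapper' and 'other'."""
    missing = defaultdict(lambda: {"wrapper": [], "other": []})
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # ignore lines that do not match the assumed layout
        job_id, path = m.groups()
        kind = "wrapper" if path.endswith(wrapper_suffix) else "other"
        missing[job_id][kind].append(path)
    return dict(missing)

# Invented sample lines, for illustration only.
report = [
    "Job Id: 1.ce.example.org - missing stagein: /sandbox/a/CREAM1/CREAM1_jobWrapper.sh",
    "Job Id: 2.ce.example.org - missing stagein: /sandbox/b/proxy/some_proxy_file",
    "Job Id: 2.ce.example.org - missing stagein: /sandbox/b/CREAM2/CREAM2_jobWrapper.sh",
]
result = classify_missing(report)
for job_id, kinds in sorted(result.items()):
    print(job_id, "wrapper missing:", len(kinds["wrapper"]),
          "other missing:", len(kinds["other"]))
```

A job that shows only a missing wrapper fits the shared-proxy explanation: another job in the same batch is still holding the proxy.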
>