On 2 Jun 2011, at 09:49, Gustav Wikström wrote:
> Hi all,
>
> I'm having serious problems with running my VO t2k.org jobs, currently
> 95% of them are being cancelled by the WMSs (lcgwms03.gridpp.rl.ac.uk
> and wms02.grid.hep.ic.ac.uk) or the CEs. As I understand it, when a
> WMS stops a job, it is labeled Aborted, and then Cancelled is when a
> CE stops a job? The bad thing is that there is no information about a
> job after it has been stopped unless it failed.
>
> So, what could cause a job to be cancelled? Is memory usage one of the reasons?
Not the most likely culprit, as it's not the most strongly enforced constraint across all sites, but it is possible. It does vary from site to site, so if the 5% that don't get cancelled end up on a different site, that's useful data. Job CPU time and wall time limits are more strongly enforced, but it could also be missing input files causing the jobs to die on start-up.
If it's (apparently) randomly distributed across all sites, the first things I'd check are proxy lifetimes, job queueing times and the myproxy setup (if used).
There might be more information lurking around which, if you've not tried already, can be extracted with 'glite-wms-job-status --verbosity 3 <jid>' and 'glite-wms-job-logging-info --verbosity 3 <jid>'. That might give a better idea of where to poke next. In particular, the WMS (by default) will try re-submitting a failed job a couple of times, and walking through that process might be informative. The amount of time jobs spend running might also help identify the root problem.
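
If there are a lot of cancelled jobs to go through, something along these lines might save some typing. It's only a rough sketch: the function name, the file of job IDs (one per line) and the output directory are all made up for illustration, and only the two commands mentioned above are assumed to exist on your UI.

```shell
# Hypothetical helper: save verbose status and logging info for each job ID.
# $1 = file with one job ID per line (assumed name: jobids.txt)
# $2 = directory to write the output into (assumed name: joblogs)
dump_job_info() {
    idfile=${1:-jobids.txt}
    outdir=${2:-joblogs}
    mkdir -p "$outdir"
    n=0
    # The "|| [ -n ... ]" part picks up a last line with no trailing newline.
    while IFS= read -r jid || [ -n "$jid" ]; do
        # Skip blank lines in the ID file.
        [ -z "$jid" ] && continue
        n=$((n + 1))
        glite-wms-job-status --verbosity 3 "$jid" > "$outdir/status_$n.txt" 2>&1
        glite-wms-job-logging-info --verbosity 3 "$jid" > "$outdir/logging_$n.txt" 2>&1
    done < "$idfile"
    # Print how many job IDs were processed.
    echo "$n"
}
```

Calling 'dump_job_info jobids.txt joblogs' then leaves one status file and one logging file per job, which makes it easier to compare where each job ran, how long it ran for, and how the resubmissions went.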