Olivier van der Aa wrote:
> Dear All,
>
> We have a case here at Imperial where jobs get stuck for a long time
> in the waiting state.
> Our RB is gfe01.hep.ph.ic.ac.uk. If I look in the events table of the
> RB for a given job, I get this:
>
> select prog,arrived from events where jobid="mjzBSsixc2QHmKyGZvjKaw"
> +-----------------+---------------------+
> | UserInterface | 2007-02-14 10:07:38 |
> | UserInterface | 2007-02-14 10:07:41 |
> | NetworkServer | 2007-02-14 10:07:46 |
> | UserInterface | 2007-02-14 10:07:51 |
> | NetworkServer | 2007-02-14 10:07:51 |
> | WorkloadManager | 2007-02-14 11:45:54 |
> | WorkloadManager | 2007-02-14 11:47:18 |
> | WorkloadManager | 2007-02-14 11:47:19 |
> | WorkloadManager | 2007-02-14 11:47:20 |
> | JobController | 2007-02-14 11:47:22 |
> | JobController | 2007-02-14 11:47:24 |
> | JobController | 2007-02-14 11:47:26 |
> | LogMonitor | 2007-02-14 12:45:57 |
> | LogMonitor | 2007-02-14 12:46:00 |
> | LogMonitor | 2007-02-14 13:00:31 |
> | LogMonitor | 2007-02-14 13:00:32 |
> | LogMonitor | 2007-02-14 13:00:33 |
> | LogMonitor | 2007-02-14 13:00:35 |
> +-----------------+---------------------+
>
> Clearly the NetworkServer accepted my request at 10:07, but the
> WorkloadManager only received it at 11:45!
> What could be the cause of such a long delay?
>
> I observe that there are quite a lot of files in
> /var/edgwl/workload_manager, such as
> input.fl.1171462935.27646.wrong, containing stack traces...
>
> Does this mean that the workload manager is crashing?
Yes, if those files are recent. Did the file system get full recently?
Please send the output of the attached "chk-wl.sh" script.
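
In the meantime, a quick way to check whether those .wrong files are recent is sketched below. This is not the attached script, only a minimal illustration: the function names, the "*.wrong" glob pattern and the 24-hour window are my own assumptions, and it reports only the current disk usage, so it cannot show whether the file system was full earlier.

```python
import glob
import os
import shutil
import time


def recent_wrong_files(directory, max_age_hours=24.0, now=None):
    """Return (path, age_in_hours) for *.wrong files modified within max_age_hours.

    The "*.wrong" pattern and the default window are assumptions for this sketch.
    """
    now = time.time() if now is None else now
    hits = []
    for path in glob.glob(os.path.join(directory, "*.wrong")):
        age_hours = (now - os.path.getmtime(path)) / 3600.0
        if age_hours <= max_age_hours:
            hits.append((path, age_hours))
    # Sort by age so the newest files come first.
    return sorted(hits, key=lambda item: item[1])


def disk_usage_percent(path):
    """Return the percentage of the file system holding `path` that is in use now."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    wm_dir = "/var/edgwl/workload_manager"  # directory named in the report above
    if os.path.isdir(wm_dir):
        for path, age in recent_wrong_files(wm_dir):
            print(f"{age:6.1f} h  {path}")
        print(f"disk used: {disk_usage_percent(wm_dir):.1f}%")
```

If this lists files that are only a few hours old, the crashes are still happening.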