Here is the output of the script. The problem has been solved by removing
the offending input.fl file.
--------
1: POSIX ADVISORY WRITE 5452 3a:01:213636 0 EOF f7fdcc40 c03b05c8 c3695e24 00000000 f7fdcc4c
2: POSIX ADVISORY WRITE 3886 3a:04:16429 0 EOF c3695e20 f7fdcc44 f4b91fa4 00000000 c3695e2c
3: FLOCK ADVISORY WRITE 3412 3a:04:16426 0 EOF f4b91fa0 c3695e24 c03b05c8 00000000 f4b91fac
PID TTY TIME CMD
5463 ? 00:00:02 bdii-fwd
4094 ? 00:07:49 bdii-update
5582 ? 00:10:37 condor_gridmana
5452 ? 00:00:01 condor_master
5454 ? 00:01:26 condor_schedd
4432 ? 00:00:00 edg-wl-bkserver
4383 ? 00:00:42 edg-wl-interlog
5323 ? 00:00:30 edg-wl-job_cont <
16227 ? 00:00:00 edg-wl-job_cont < <defunct>
5501 ? 00:01:11 edg-wl-log_moni <
4391 ? 00:00:01 edg-wl-logd
5751 ? 00:01:02 edg-wl-ns_daemo <
4489 ? 00:00:00 edg-wl-renewd
4147 ? 00:34:50 edg-wl-workload <
5583 ? 00:10:41 gahp_server
16400 ? 00:00:00 sh
16401 ? 00:00:05 slapadd
14376 ? 00:00:08 slapd
=== WM ===
14 Feb, 16:43:51 [4] -I- RBSimpleImpl::findSuitableCEs: Will not consider lcgce01.gridpp.rl.ac.uk:2119/jobmanager-lcgpbs-cmsL700 because of previous match
14 Feb, 16:43:51 [4] -W- findLCGApplicationDir: InformationIndex search (no tuples): filter = (&(&(objectClass=GlueVOView)(GlueChunkKey=GlueCEUniqueID=ce1.pp.rhul.ac.uk:2119/jobmanager-pbs-cmsgrid)(GlueCEAccessControlBaseRule=VO:dteam)))
14 Feb, 16:43:51 [4] -I- Helper::resolve: Selected ce1.pp.rhul.ac.uk:2119/jobmanager-pbs-cmsgrid for job https://gfe01.hep.ph.ic.ac.uk:9000/Lit1M_CJPn3mqC1Kqnk8dg
=== NS ===
14 Feb, 16:44:34 [3] -I- "CFSI::serializeServer": Serializing Server.
14 Feb, 16:44:34 [3] -I- "CFSI::listOutputFiles": Preparing to Purge.
14 Feb, 16:44:34 [3] -I- "Manager::run": Command done
=== JC ===
14 Feb, 16:44:10 -C- JobControllerReal::submit(...): Added condor id 468620 for job id https://gfe01.hep.ph.ic.ac.uk:9000/svaXungwElhdFb6TAapWnw
14 Feb, 16:44:10 -I- JobControllerClientReal::get_next_request(): Waiting for requests...
14 Feb, 16:44:10 -C- JobControllerReal::submit(...): Job submitted to Condor cluster: 468621
=== LM ===
14 Feb, 16:44:19 -I- MonitorLoop::run(): Must wait for other 11 seconds.
14 Feb, 16:44:30 -I- MonitorLoop::run(): No new event found, going to sleep.
14 Feb, 16:44:30 -I- MonitorLoop::run(): Checking each 10 seconds for new events.
---------------------------
On 14 Feb 2007, at 16:42, Maarten Litmaath wrote:
> Olivier van der Aa wrote:
>
>> Dear All,
>> We have a case here at Imperial where jobs get stuck for a long
>> time in the waiting state.
>> Our rb is gfe01.hep.ph.ic.ac.uk. If I look in the events table of
>> the rb for a given job I get this:
>> select prog,arrived from events where jobid="mjzBSsixc2QHmKyGZvjKaw"
>> +-----------------+---------------------+
>> | UserInterface | 2007-02-14 10:07:38 |
>> | UserInterface | 2007-02-14 10:07:41 |
>> | NetworkServer | 2007-02-14 10:07:46 |
>> | UserInterface | 2007-02-14 10:07:51 |
>> | NetworkServer | 2007-02-14 10:07:51 |
>> | WorkloadManager | 2007-02-14 11:45:54 |
>> | WorkloadManager | 2007-02-14 11:47:18 |
>> | WorkloadManager | 2007-02-14 11:47:19 |
>> | WorkloadManager | 2007-02-14 11:47:20 |
>> | JobController | 2007-02-14 11:47:22 |
>> | JobController | 2007-02-14 11:47:24 |
>> | JobController | 2007-02-14 11:47:26 |
>> | LogMonitor | 2007-02-14 12:45:57 |
>> | LogMonitor | 2007-02-14 12:46:00 |
>> | LogMonitor | 2007-02-14 13:00:31 |
>> | LogMonitor | 2007-02-14 13:00:32 |
>> | LogMonitor | 2007-02-14 13:00:33 |
>> | LogMonitor | 2007-02-14 13:00:35 |
>> +-----------------+---------------------+
>> Clearly the NetworkServer accepted my request at 10:07, but the
>> WorkloadManager only received it at 11:45!
>> What could be the cause of such a long delay?
>> I also see that there are quite a lot of files in
>> /var/edgwl/workload_manager, such as
>> input.fl.1171462935.27646.wrong, containing stack traces.
>> Does that mean the workload manager is crashing?
>
> Yes, if those files are recent. Did the file system get full
> recently?
> Please send the output of the attached "chk-wl.sh" script.
> <chk-wl.sh>
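
For reference, the delay in the quoted events table can be checked with a
short script. This is only a sketch: the rows are copied by hand from the
table above, and the function name largest_gaps is made up for
illustration; it is not part of any RB tooling.

```python
from datetime import datetime

# Rows copied by hand from the quoted events table: (prog, arrived).
EVENTS = [
    ("UserInterface", "2007-02-14 10:07:38"),
    ("UserInterface", "2007-02-14 10:07:41"),
    ("NetworkServer", "2007-02-14 10:07:46"),
    ("UserInterface", "2007-02-14 10:07:51"),
    ("NetworkServer", "2007-02-14 10:07:51"),
    ("WorkloadManager", "2007-02-14 11:45:54"),
    ("WorkloadManager", "2007-02-14 11:47:18"),
    ("WorkloadManager", "2007-02-14 11:47:19"),
    ("WorkloadManager", "2007-02-14 11:47:20"),
    ("JobController", "2007-02-14 11:47:22"),
    ("JobController", "2007-02-14 11:47:24"),
    ("JobController", "2007-02-14 11:47:26"),
    ("LogMonitor", "2007-02-14 12:45:57"),
    ("LogMonitor", "2007-02-14 12:46:00"),
    ("LogMonitor", "2007-02-14 13:00:31"),
    ("LogMonitor", "2007-02-14 13:00:32"),
    ("LogMonitor", "2007-02-14 13:00:33"),
    ("LogMonitor", "2007-02-14 13:00:35"),
]


def largest_gaps(events, top=3):
    """Return the `top` largest gaps between consecutive events.

    Each result is (gap, previous_prog, next_prog), largest gap first.
    Hypothetical helper; assumes rows are already in arrival order.
    """
    parsed = [
        (prog, datetime.strptime(arrived, "%Y-%m-%d %H:%M:%S"))
        for prog, arrived in events
    ]
    gaps = [
        (next_ts - prev_ts, prev_prog, next_prog)
        for (prev_prog, prev_ts), (next_prog, next_ts) in zip(parsed, parsed[1:])
    ]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]


for gap, prev_prog, next_prog in largest_gaps(EVENTS):
    print(f"{gap}  {prev_prog} -> {next_prog}")
```

The largest gap it reports is the one between the last NetworkServer event
and the first WorkloadManager event, i.e. the delay asked about above.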
--
- O. van der Aa - Imperial College London -
- LT2 Technical Coordinator -
- tel: +442075947810, -
- fax: +442078238830 -
- http://surl.se/agtu -