Hi all,
About a month ago, it was noticed that the Oxford DPM's WebDAV interface
had died, leaving only this in its logs:
[Mon Jan 13 12:34:19 2014] [notice] Apache/2.2.3 (Scientific Linux) configured -- resuming normal operations
[Mon Jan 13 12:57:23 2014] [error] [client 185.4.227.194] Could not fetch resource information. [500, #0]
[Mon Jan 13 12:57:23 2014] [error] [client 185.4.227.194] (11)Resource temporarily unavailable: Could not instantiate a context: [#00.000012] (null): () [500, #0]
[Mon Jan 13 14:29:12 2014] [error] [client 200.52.205.120] Could not fetch resource information. [500, #0]
[Mon Jan 13 14:29:12 2014] [error] [client 200.52.205.120] (11)Resource temporarily unavailable: Could not instantiate a context: [#00.000012] (null): () [500, #0]
[Mon Jan 13 15:23:44 2014] [error] [client 92.240.68.152] request failed: error reading the headers
[Mon Jan 13 18:27:37 2014] [error] [client 98.27.131.3] Could not fetch resource information. [500, #0]
[Mon Jan 13 18:27:37 2014] [error] [client 98.27.131.3] (11)Resource temporarily unavailable: Could not instantiate a context: [#00.000012] (null): () [500, #0]
[Mon Jan 13 18:33:59 2014] [notice] SIGHUP received. Attempting to restart
[Mon Jan 13 18:33:59 2014] [notice] seg fault or similar nasty error detected in the parent process
Everything was restarted, and it ran fine for a while, but now it's done it
again:
[Sun Mar 23 11:40:34 2014] [error] [client 41.189.60.102] Could not fetch resource information. [500, #0]
[Sun Mar 23 11:40:34 2014] [error] [client 41.189.60.102] (11)Resource temporarily unavailable: Could not instantiate a context: [#00.000012] (null): () [500, #0]
[Sun Mar 23 15:15:00 2014] [error] [client 142.132.1.15] Could not fetch resource information. [500, #0]
[Sun Mar 23 15:15:00 2014] [error] [client 142.132.1.15] (11)Resource temporarily unavailable: Could not instantiate a context: [#00.000012] (null): () [500, #0]
[Sun Mar 23 18:30:34 2014] [notice] SIGHUP received. Attempting to restart
[Sun Mar 23 18:30:34 2014] [notice] seg fault or similar nasty error detected in the parent process
Indeed, _all_ the pool nodes have done it, but not at exactly the same time.
Mostly the same time, ish, but not exactly. For example, the last lines in
the logs for a handful of nodes:
t2se21 [Sun Mar 23 06:35:07 2014] [notice] seg fault or similar nasty error detected in the parent process
t2se22 [Mon Mar 24 00:36:39 2014] [notice] seg fault or similar nasty error detected in the parent process
t2se23 [Sun Mar 23 06:24:29 2014] [notice] seg fault or similar nasty error detected in the parent process
t2se24 [Sun Mar 23 18:25:01 2014] [notice] seg fault or similar nasty error detected in the parent process
However, and I'm not sure whether this is significant, these messages
all seem to occur about 25-40 minutes past a 6-hour mark in the day
(six am, six pm, or midnight; no apparent deaths at lunchtime, though).
Some of the nodes also died with (e.g.):
[Sun Mar 23 04:02:04 2014] [notice] SIGHUP received. Attempting to restart
and nothing about nasty seg faults. Or indeed, anything nasty at all. But they're
still dead, and they died at about the same sort of time (but not exactly).
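
To check the timing pattern across all the pool nodes rather than by eye,
something like the following Python sketch could be used. It assumes the
Apache error-log timestamp format shown above (first bracketed field on the
line); the function name minutes_past_six_hour_mark is made up for
illustration.

```python
from datetime import datetime

def minutes_past_six_hour_mark(line):
    """Return how many minutes after the latest 00:00/06:00/12:00/18:00
    mark the log line's timestamp falls.

    Assumes the first [...] field on the line is an Apache-style
    timestamp such as 'Sun Mar 23 06:35:07 2014'.
    """
    start = line.index("[") + 1
    end = line.index("]", start)
    ts = datetime.strptime(line[start:end], "%a %b %d %H:%M:%S %Y")
    return (ts.hour % 6) * 60 + ts.minute

# The last log lines from the four nodes quoted above.
last_lines = [
    "t2se21 [Sun Mar 23 06:35:07 2014] [notice] seg fault or similar nasty error detected in the parent process",
    "t2se22 [Mon Mar 24 00:36:39 2014] [notice] seg fault or similar nasty error detected in the parent process",
    "t2se23 [Sun Mar 23 06:24:29 2014] [notice] seg fault or similar nasty error detected in the parent process",
    "t2se24 [Sun Mar 23 18:25:01 2014] [notice] seg fault or similar nasty error detected in the parent process",
]

for line in last_lines:
    node = line.split()[0]
    print(node, minutes_past_six_hour_mark(line))
```

For the four lines above this gives offsets of 35, 36, 24 and 25 minutes.
The plain SIGHUP at 04:02:04 comes out at 242 minutes, so it does not fit
the same window.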
I am at something[1] of a loss here. Does anyone have any ideas:
- what's going on?
- how I can find out what's going on?
- and mostly, how I can stop it[2]?
Ewan
[1] For values of 'something' equal to 'totally'.
[2] Yes, I could/can/might just do a 'while true; do if dead; then
restart; fi; done' sort of thing, but it's icky.
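
For completeness, the icky option in [2] could look roughly like the Python
sketch below. The 'service httpd status' and 'service httpd restart'
commands and the 60-second interval are assumptions about how httpd is
managed on the nodes, not something checked against the Oxford setup, and
this only papers over the deaths rather than explaining them.

```python
import subprocess
import time

# Assumed commands; adjust to however httpd is actually managed on the node.
STATUS_CMD = ["service", "httpd", "status"]
RESTART_CMD = ["service", "httpd", "restart"]

def httpd_is_alive():
    """Return True if the (assumed) status command exits with code 0."""
    rc = subprocess.call(STATUS_CMD,
                         stdout=subprocess.DEVNULL,
                         stderr=subprocess.DEVNULL)
    return rc == 0

def restart_httpd():
    """Run the (assumed) restart command."""
    subprocess.call(RESTART_CMD)

def watchdog(check=httpd_is_alive, restart=restart_httpd,
             interval=60, max_checks=None, sleep=time.sleep):
    """Poll `check` every `interval` seconds and call `restart` when it fails.

    Runs forever when max_checks is None; otherwise stops after that many
    checks. Returns the number of restarts performed.
    """
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        if not check():
            restart()
            restarts += 1
        checks += 1
        sleep(interval)
    return restarts

# To run it for real (loops forever, checking once a minute):
# watchdog()
```

The check and restart functions are passed in as parameters so the loop
can be tried out with stand-ins before pointing it at a real service.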