Dear All,
At Birmingham, we're running DPM 1.8 and everything was fine up until
about 48 hours ago when the SE basically stopped responding to requests.
A basic restart of the services and a reboot of the machine helped for
~30 minutes, and then the same 'hanging' problem recurred. After some messing around
yesterday, it seemed that the dpns daemon *might* have been to blame (a
couple of times it wasn't responding to shutdown requests for a reboot).
However, having scoured all associated logs (srm, dpm, dpns, system) I
couldn't find anything that looked wrong (or at least, nothing that
wasn't there 4 days ago!). In addition, the services themselves were
being reported as running. This morning, the same thing had happened again, so
I poked around the SE and found that in the last few days we had a lot
of H1 requests which (for some reason best known to someone else) were
all hitting files on the one pool node (also the one we use for software
NFS mounts).
Now my questions are:
1) Could the dpns (or some other component - pool node?) have been
overloaded by a large number of requests? And if so, what's considered a
large number?
2) Where could I look to check this or, in fact, to debug why the system
wouldn't respond to a simple lcg-cr command? The logs don't seem to help
with this.
3) Assuming it's not a good idea for most of H1's files to be on a
single disk/pool node, is there an easy way to redistribute them?
Any help with this problem would be greatly appreciated!!
Many Thanks,
Mark