Hi Mark
On 26 May 2011, at 10:18, Mark Slater wrote:
> Dear All,
>
> At Birmingham, we're running DPM 1.8 and everything was fine up until about 48 hours ago when the SE basically stopped responding to requests. A basic restart of services and reboot of machine helped for ~30mins and then the same 'hanging' problem happened. After some messing around yesterday, it seemed that the dpns daemon *might* have been to blame (a couple of times it wasn't responding to shutdown requests for a reboot). However, having scoured all associated logs (srm, dpm, dpns, system) I couldn't find anything that looked wrong (or at least, nothing that wasn't there 4 days ago!). In addition, the services themselves were being reported as running. This morning, the same thing had happened so I poked around the SE and found that in the last few days we had a lot of H1 requests which (for some reason best known to someone else) were all hitting files on the one pool node (also the one we use for software NFS mounts).
>
> Now my questions are:
>
> 1) Could the dpns (or some other component - pool node?) have been overloaded by a large number of requests? And if so, what's considered a large number?
>
Potentially. The number of dpns threads is configurable/increasable, but I doubt it is that limit being hit; SRM would hit a limit before that. More likely it is the pool node being overloaded. How many rfiod or gridftpd processes are running on that node? John Bland had an issue with H1 before - hundreds of connections to remote sites. You could ticket H1 and ask them to stop doing this (even though they keep doing so). John may have some tips on how he controlled them.
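A quick way to check is to count the transfer daemons on the pool node - a rough sketch (rfiod and gridftpd are the usual daemon names on a DPM disk server; adjust if yours differ):

```shell
#!/bin/sh
# Rough load check on a DPM disk server: count running transfer daemons.
# The bracketed first letter stops grep from matching its own command line;
# grep -c exits non-zero when the count is 0, so guard with || true.
echo "rfiod:    $(ps -ef | grep -c '[r]fiod' || true)"
echo "gridftpd: $(ps -ef | grep -c '[g]ridftp' || true)"
```

If those numbers run into the hundreds, that would fit the overload picture.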
> 2) Where could I look to check this or in fact, debug why the system wouldn't respond to a simple lcg-cr command? The logs don't seem to help with this....
>
You can use ps / pstree to list the threads. The logs are difficult to decode, but if you send them to Jean-Philippe on the dpm-user-forum he is usually pretty good/keen to debug this kind of thing.
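For example (the daemon name dpnsdaemon is an assumption; check what your init scripts actually start):

```shell
#!/bin/sh
# List the dpns daemon's threads: ps -L prints one line per thread (LWP column).
ps -eLf | grep '[d]pnsdaemon' || echo "dpnsdaemon not running"
# Or view it in the process tree with thread IDs:
#   pstree -p $(pgrep -o dpnsdaemon)
```

If the thread count sits at the configured maximum, that is a sign the daemon is saturated rather than dead.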
> 3) Assuming it's not a good idea for most of H1's files to be on a single disk/pool node, is there an easy way to redistribute them?
>
Are H1 in their own pool? You can see with dpm-qryconf. If so, then you can either add another filesystem to that pool using dpm-addfs, or you can allow them to write to another pool with

dpm-modifypool --poolname MYPOOL --group +hone

You can then drain some of the existing data across with

dpm-drain --size 2T

(but if the data is in a spacetoken you will not be able to drain it across pools.)
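Sketched end to end (pool name, server hostname and filesystem paths below are placeholders for your site, not real values):

```shell
# 1. Inspect pools and which filesystems/groups they hold:
dpm-qryconf
# 2. Either add a new filesystem to the existing pool...
dpm-addfs --poolname H1POOL --server pool02.example.ac.uk --fs /storage2
# 3. ...or open another pool to the hone group:
dpm-modifypool --poolname MYPOOL --group +hone
# 4. Then move some of the existing data off the hot filesystem:
dpm-drain --server pool01.example.ac.uk --fs /storage --size 2T
```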
Anyway
>
> Any help with this problem would be greatly appreciated!!
>
> Many Thanks,
>
> Mark
>
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.