Hi Matt,
I'm not worried about the messages in your logs; they just tell you that
the dCache cells could not communicate with the door cells. The cells
constantly pass messages to each other, so if the doors all died I would
expect this response. 'lm' stands for location manager, just in case you
didn't know. It helps coordinate the cells' communication.
I am a bit worried about Java running out of memory, though. Has
your dCache been seeing a lot of traffic?
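If you want to see how many distinct cells show up in those "Cell Not Found" lines, something like the Python sketch below might help. It is only a rough illustration: the regular expression is an assumption based on the lines quoted in your mail (taking each log entry to be a single line in the real file), and tally_missing_cells is a made-up helper name, not anything that ships with dCache.

```python
import re
from collections import Counter

# Assumed format, inferred from the quoted log lines:
#   "... : Problem killing : <cell> -> Cell Not Found : <cell>"
PATTERN = re.compile(r"Problem killing : (\S+) -> Cell Not Found")


def tally_missing_cells(lines):
    """Count 'Cell Not Found' messages per cell name (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Two sample entries taken from the quoted log, joined onto single lines.
sample = [
    "03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : "
    "Problem killing : lm -> Cell Not Found : lm",
    "03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : "
    "Problem killing : GFTP-fal-pygrid-26-Unknown-158 -> "
    "Cell Not Found : GFTP-fal-pygrid-26-Unknown-158",
]

for cell, count in tally_missing_cells(sample).items():
    print(cell, count)
```

The function accepts any iterable of lines, so an open log file can be passed to it directly in place of the sample list.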
Cheers,
Greig
On Mon, 6 Mar 2006, Matt Doidge wrote:
> Hello,
> Today we've had a nasty incident where all the gridFTP doors on our
> pool nodes here at Lancaster spontaneously keeled over and died.
> Checking the logs saw a lot of "java Out of Memory" errors. The doors
> are coming back up cleanly with a dcache-core restart on each pool
> node, but looking in the log file on one of them after a restart I see
> (and this is just from the end of the file, there's several hundred
> lines of the same);
>
> 03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : Problem
> killing : lm -> Cell Not Found : lm
> 03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : Problem
> killing : GFTP-fal-pygrid-26-Unknown-158 -> Cell Not Found :
> GFTP-fal-pygrid-26-Unknown-158
> 03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : Problem
> killing : GFTP-fal-pygrid-26-Unknown-155 -> Cell Not Found :
> GFTP-fal-pygrid-26-Unknown-155
> 03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : Problem
> killing : GFTP-fal-pygrid-26-Unknown-154 -> Cell Not Found :
> GFTP-fal-pygrid-26-Unknown-154
> 03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : Problem
> killing : GFTP-fal-pygrid-26-Unknown-153 -> Cell Not Found :
> GFTP-fal-pygrid-26-Unknown-153
> 03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : Problem
> killing : GFTP-fal-pygrid-26-Unknown-152 -> Cell Not Found :
> GFTP-fal-pygrid-26-Unknown-152
> Starting BatchCell on /opt/d-cache/config/gridftpdoor.batch
> Main thread finished
>
>
> The door seems to start up okay, but the impression I get from this is
> that it didn't kill the previous door cleanly.
>
> Anyone seen similar behaviour? If it's just a memory leak type problem
> would restarting the processes weekly be a good idea?
>
> cheers,
> matt
>
--
=======================================================================
Dr Greig A Cowan http://www.ph.ed.ac.uk/~gcowan1
School of Physics, University of Edinburgh, James Clerk Maxwell Building
TIER-2 STORAGE SUPPORT PAGES: http://wiki.gridpp.ac.uk/wiki/Grid_Storage
=======================================================================