Hello,
Today we've had a nasty incident where all the GridFTP doors on our
pool nodes here at Lancaster spontaneously keeled over and died.
Checking the logs, I saw a lot of "java Out of Memory" errors. The
doors are coming back up cleanly with a dcache-core restart on each
pool node, but looking in the log file on one of them after a restart
I see the following (and this is just from the end of the file; there
are several hundred lines of the same):
03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : Problem killing : lm -> Cell Not Found : lm
03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : Problem killing : GFTP-fal-pygrid-26-Unknown-158 -> Cell Not Found : GFTP-fal-pygrid-26-Unknown-158
03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : Problem killing : GFTP-fal-pygrid-26-Unknown-155 -> Cell Not Found : GFTP-fal-pygrid-26-Unknown-155
03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : Problem killing : GFTP-fal-pygrid-26-Unknown-154 -> Cell Not Found : GFTP-fal-pygrid-26-Unknown-154
03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : Problem killing : GFTP-fal-pygrid-26-Unknown-153 -> Cell Not Found : GFTP-fal-pygrid-26-Unknown-153
03/06 15:16:55 Cell(System@gridftp-fal-pygrid-26Domain) : Problem killing : GFTP-fal-pygrid-26-Unknown-152 -> Cell Not Found : GFTP-fal-pygrid-26-Unknown-152
Starting BatchCell on /opt/d-cache/config/gridftpdoor.batch
Main thread finished
The door seems to start up okay, but the impression I get from this is
that the restart didn't shut the previous door's cells down cleanly.
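
In the meantime we're thinking of giving the door JVMs more heap as a
stopgap. Roughly what I mean is below: a sketch only, assuming your
setup (like ours) takes JVM flags from the java_options line in
/opt/d-cache/config/dCacheSetup; the variable name and defaults may
differ between dCache versions, so check before copying.

# /opt/d-cache/config/dCacheSetup (path from our install; check yours)
# Raise -Xmx so the doors have more headroom before OutOfMemory.
# 1024m here is an arbitrary example, not a recommendation.
java_options="-server -Xmx1024m"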
Anyone seen similar behaviour? If it's just a memory-leak-type
problem, would restarting the processes weekly (something like the
cron sketch below) be a good idea?
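
For concreteness, the sort of thing I had in mind on each pool node is
this. The schedule is arbitrary and the script path is from our
install, so treat it as a sketch rather than a recommendation:

# root's crontab: restart the dCache services early Sunday morning
0 6 * * 0 /opt/d-cache/bin/dcache-core restart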
cheers,
matt