I spoke too soon; things aren't okay. At around 9:30pm EST the
gatekeeper ran out of memory again and the kernel started killing
processes. It didn't kill anything critical this time, so the system is
still partially functional and I was able to log in and look around.
There were over 400 processes running, mostly globus-job-manager
processes and related perl scripts. The oldest job-manager process
still alive was started at the same time that I last rebooted the node.
Are other sites having the same problem? I don't think this was happening
here at BNL before I upgraded to LCG1-1_1_3. If this problem is really
caused by condor's grid_manager, then shouldn't the jobs that are being
submitted stop requesting that the broken condor grid_manager be turned on?
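
For anyone who wants to check their own CE for the same pile-up, here is a
rough sketch of the kind of count I did by hand. It is only an illustration,
not a tested tool: the function name summarize is made up, and it assumes a
ps that supports the etimes (elapsed seconds) output field, which older ps
versions may not have.

```python
import subprocess


def summarize(ps_lines, needle="globus-job-manager"):
    """Count processes whose command line contains `needle`, and find the oldest.

    Each line is expected to look like "<pid> <elapsed_seconds> <command...>".
    Returns (count, oldest_pid, oldest_elapsed_seconds); the last two values
    are None when nothing matches.
    """
    count = 0
    oldest_pid = None
    oldest_age = None
    for line in ps_lines:
        parts = line.split(None, 2)
        # Skip malformed lines and processes that do not match the name.
        if len(parts) < 3 or needle not in parts[2]:
            continue
        try:
            pid, age = int(parts[0]), int(parts[1])
        except ValueError:
            continue
        count += 1
        if oldest_age is None or age > oldest_age:
            oldest_pid, oldest_age = pid, age
    return count, oldest_pid, oldest_age


if __name__ == "__main__":
    # Assumption: this ps accepts the pid, etimes and args output fields.
    out = subprocess.run(
        ["ps", "-eo", "pid=,etimes=,args="],
        capture_output=True, text=True, check=True,
    ).stdout
    print(summarize(out.splitlines()))
```

If the oldest elapsed time is close to the machine's uptime, the job managers
have been accumulating since the last reboot, which is what I saw here.
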
~Jason
On Thu, 2003-12-11 at 03:43, Marco Serra wrote:
> On Wed, 10 Dec 2003, Jason A. Smith wrote:
>
> > Today I had to reboot the BNL CE; it looks like there were hundreds of
> > lcgpbs jobmanager processes still active. The server was unresponsive
> > except for many kernel out-of-memory messages on the console, trying to
> > kill the globus-job-manager and related perl scripts. Was there a
> > problem with the automated test scripts that are being run? Things look
> > okay now.
>
> Jason, this looks like a known bug in the grid_manager module of condor-g.
> http://marianne.in2p3.fr/datagrid/bugzilla/show_bug.cgi?id=2414
> "VDT" is already informed about this bug and they are actively working
> with us to have a fix for the LCG2 release.
> This could be related to your automated test scripts (which one is it?)
> if the submission rate is "too high" (hard to define a precise number), but
> it is not obvious.
>
> Marco
>
> /-- [log in to unmask] -- LCG Project --\
> | Marco Serra INFN -/- CERN |
> | IT Div. - Bat.31 S-013 - CERN - CH-1211 Geneva 23 |
> | phone@cern: +41-22-7674758 |
> | At the moment I am ... at CERN |
> | --|
> | There is a difference between knowing the |
> | path and walking the path |
> \--/
--
/------------------------------------------------------------------\
| Jason A. Smith Email: [log in to unmask] |
| Atlas Computing Facility, Bldg. 510M Phone: (631)344-4226 |
| Brookhaven National Lab, P.O. Box 5000 Fax: (631)344-7616 |
| Upton, NY 11973-5000 |
\------------------------------------------------------------------/