We could use some guidance on solving this grid-manager-monitor problem.
The lock file for one user is corrupted and we are seeing the
grid-manager-monitor thrashing that was reported a few days ago. The
problem seems to be reproducible for this user, even though he is
submitting through the same RB as several other people and using the
same infrastructure to configure the jobs. He has submitted 200 jobs,
but another user without the problem has also submitted more than 200.
The gateway is using all 4 GB of real memory plus 2 GB of swap.
The lock file for cms002 is corrupted. It should be about 16 bytes
long, but it is 7289 bytes:
-rw-r--r-- 1 atlas001 atlas   17 May  7 02:14 /opt/globus/tmp/grid_manager_monitor_agent_log.12536.lock
-rw-r--r-- 1 cms002   cms   7289 May 24 08:48 /opt/globus/tmp/grid_manager_monitor_agent_log.12337.lock
-rw-r--r-- 1 cms003   cms     16 May 24 08:48 /opt/globus/tmp/grid_manager_monitor_agent_log.12338.lock
-rw-r--r-- 1 dteam002 dteam   17 May 24 08:48 /opt/globus/tmp/grid_manager_monitor_agent_log.12437.lock
-rw-r--r-- 1 cms007   cms     17 May 24 08:48 /opt/globus/tmp/grid_manager_monitor_agent_log.12342.lock
-rw-r--r-- 1 cms004   cms     16 May 24 08:48 /opt/globus/tmp/grid_manager_monitor_agent_log.12339.lock
-rw-r--r-- 1 cms006   cms     16 May 24 08:48 /opt/globus/tmp/grid_manager_monitor_agent_log.12341.lock
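
A check along the following lines could flag oversized lock files
like the cms002 one. This is only a sketch: the glob pattern is taken
from the listing above, and the 32-byte threshold is an assumption
based on the healthy files being 16 or 17 bytes. The function name
find_suspect_locks is made up for illustration. It only reports files
and does not remove anything.

```python
import glob
import os

# Assumed values, based on the listing above; adjust for your site.
LOCK_GLOB = "/opt/globus/tmp/grid_manager_monitor_agent_log.*.lock"
MAX_EXPECTED_BYTES = 32  # healthy files are 16-17 bytes; threshold is a guess


def find_suspect_locks(pattern=LOCK_GLOB, max_bytes=MAX_EXPECTED_BYTES):
    """Return (path, size) pairs for lock files larger than max_bytes."""
    suspects = []
    for path in sorted(glob.glob(pattern)):
        try:
            size = os.path.getsize(path)
        except OSError:
            # The file disappeared between the glob and the stat; skip it.
            continue
        if size > max_bytes:
            suspects.append((path, size))
    return suspects


if __name__ == "__main__":
    # Report only; cleaning up is left to the administrator.
    for path, size in find_suspect_locks():
        print(f"{size:>8} bytes  {path}")
```

Run against the listing above, this should report only the cms002
file (12337).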
We're not sure how to proceed except to clean the corrupted logs and
watch the system.
Thanks, Ian