On Fri, 27 May 2005, Ian Fisk wrote:

> We are seeing the lock corruption problem again. After stable
> running for 5 days we have two corrupted Grid-manager-monitor lock
> files. The machine swapping does not appear to be the cause of the
> problem, but it is the result of the problem.

I have a hunch the problem is due to your system:

-----------------------------------------------------------------------------
$ globus-job-run cmslcgce.fnal.gov /bin/pwd
/uscms_data/d1/grid_home/dteam006
$ globus-job-run cmslcgce.fnal.gov /bin/df .
Filesystem           1K-blocks      Used Available Use% Mounted on
131.225.207.94:fs_ibrix02
                     2836649880 991990964 1838253096  36% /uscms_data/d1
-----------------------------------------------------------------------------

Did Lisa not send a message to the roll-out list a few days ago, about
problems with the Ibrix file server? You may want to mount the thing
with the "noac" option.

> We have hundreds of the grid-monitor-processes. Today the users in
> question are new people and they both came in through the CERN RB.
> In some cases the grid-monitor-manager doesn't finish so the job
> state is not updated and users don't get the results, because they
> never change state.
>
> The grid-monitor-manager perl script seems to be set to delete itself
> after it starts and it's submitted with the job, so I'm not sure how
> to debug. The instance of the script recorded in the gass cache
> appears to be consistent across users. At the moment we have to
> watch and respond before the system has so few resources that it has
> to be rebooted to clear them.
>
> -Ian
>
>
> On May 24, 2005, at 12:13 PM, David Smith wrote:
>
> > On Tue, 24 May 2005, Ian Fisk wrote:
> >
> >
> >> We could use some guidance on solving this grid-manager-monitor
> >> problem.
> >>
> >> The lock file for one user is corrupted and we are seeing the
> >> grid-manager-monitor thrashing reported a few days ago.
> >> Somehow the problem seems reproducible for the user, but he is
> >> submitting through the same RB as several other people and using
> >> the same infrastructure to configure the jobs. He has submitted
> >> 200 jobs, but another user without the problem has also submitted
> >> more than 200.
> >>
> >> The gateway has all 4GB of real memory used and 2GB of swap used.
> >>
> >> The lock file is corrupted for cms002. It should be about 16
> >> bytes long
> >>
> >
> > Hello Ian,
> >
> > I wasn't able to understand exactly what was happening - there was a
> > problem with some internal locks for the cms002 user
> > (~cms002/.lcgjm/*lock) which was making the job managers hang. I
> > think it is likely this was from earlier problems with the machine
> > running low on memory. I was able to correct that, and now things
> > appear to be running smoothly. However, since I wasn't able to
> > understand the corruption of the lock file nor the cause of the
> > original problem exactly, it is quite possible there is a further
> > problem or bug that needs to be found. I'll try to keep a periodic
> > check on your CE, and likewise if you notice any problem send
> > another mail, and CC me directly on it.
> >
> > Yours,
> > David
> > --
> > -------------------------------------------------------------------------
> > David Smith   e-mail: [log in to unmask]   tel: +41 22 76 74462
> > Address: D. Smith, CERN G06610, Bat 28 R-007, 1211 Geneva 23, Switzerland
> > -------------------------------------------------------------------------
> >
>
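
Since the thread says the only way to cope at the moment is to "watch and
respond", a small check on the lock files might help catch the corruption
early. The following is only a rough sketch, not anything taken from the
job manager itself. It assumes the lock files match *lock in a per-user
directory such as ~cms002/.lcgjm (as quoted above) and that a healthy file
is "about 16 bytes"; the function name, the 16-byte constant and the
tolerance of 4 bytes are hypothetical choices for illustration.

```python
import glob
import os
import sys

# Assumption: "about 16 bytes" is taken from the report in this thread.
EXPECTED_SIZE = 16
# Assumption: hypothetical slack around the expected size.
TOLERANCE = 4


def find_suspect_locks(lock_dir, expected=EXPECTED_SIZE, tolerance=TOLERANCE):
    """Return (path, size) pairs for *lock files whose size is far from expected."""
    suspects = []
    for path in sorted(glob.glob(os.path.join(lock_dir, "*lock"))):
        size = os.path.getsize(path)
        if abs(size - expected) > tolerance:
            suspects.append((path, size))
    return suspects


if __name__ == "__main__":
    # Each argument is a directory to scan, e.g. a user's .lcgjm directory.
    for lock_dir in sys.argv[1:]:
        for path, size in find_suspect_locks(lock_dir):
            print(f"{path}: {size} bytes")
```

Run from cron with the relevant directories as arguments, it would print
any lock file whose size looks wrong. It only reports; it does not try to
repair or remove anything.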