Hi Cristina,

Cristina Aiftimiei wrote:
> we had a prod-site that had problems with job-submission, solved only by 
> removing the old files present in the /opt/globus/tmp/gram_job_state/ 
> directory.
> 
> The symptoms were that a submitted job passed from the WMS to the 
> CE, then on the CE from globus to the batch system (LSF 7.3), and finished 
> correctly... and everything stopped there, with no error messages to the 
> user. The status always showed the job in one of the states 
> "Scheduled" or "Running", but never "Done".
> 
> The number of files accumulated in the directory 
> /opt/globus/tmp/gram_job_state/ was ~31000. Once they were removed, the 
> situation improved, but it is still a little slow in presenting the 
> status "Done" to the user.
> I checked the communication between the CE and the WMS: it is working.
> 
> The versions of the CE and WMS are the latest ones released to 
> production (Update 41).
> Is there any way I could understand what happened, i.e. why there was 
> such a huge number of files in that directory?

Please do the following on your CE node:
1. Edit the /opt/globus/etc/globus-gma.conf file and add a "debug 1" line 
   to it (no equals sign, just a space as a separator).
2. Restart globus-gma with `service globus-gma restart'.
3. Wait for 20-30 minutes, then send me (not the list, as it will be 
   megabytes in size) the log file /opt/globus/var/log/globus-gma.log
4. Please also include the output of the `ps auxfww' command from your CE.
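
If it helps, the config edit and a check of the state directory can be 
scripted roughly as below. This is only a sketch: the function names 
enable_gma_debug and count_state_files are made up for illustration, the 
paths are the ones mentioned above, and it assumes the config file ends 
with a newline.

```shell
# Sketch only; helper names are hypothetical, not part of globus-gma.

# Append a "debug 1" line to the given config file unless it is already there.
# Assumes the file ends with a newline, so the new line starts on its own row.
enable_gma_debug() {
    grep -qx 'debug 1' "$1" || echo 'debug 1' >> "$1"
}

# Print the number of regular files under the given directory,
# e.g. /opt/globus/tmp/gram_job_state
count_state_files() {
    find "$1" -type f | wc -l | tr -d ' '
}
```

As root on the CE you would call 
enable_gma_debug /opt/globus/etc/globus-gma.conf, then run 
`service globus-gma restart', and use 
count_state_files /opt/globus/tmp/gram_job_state from time to time to see 
whether the files start piling up again.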
-- 
Cheers,
         Andrey Kiryanov.