Hi Antun,

> Seems that your proposed solution works: after the changes you
> suggested, I successfully executed one SAM job on our gCE using the SAM
> Admin's page, which had not been possible for several days. Thanks!
>
> I would personally prefer a more elegant solution, i.e. your option 2.

One of my colleagues is testing what the best solution is, as I mentioned.
Currently he prefers option 2 as well.

> Note that this should be widely publicized, since many gCEs are failing SAM
> tests due to this problem, and this badly affects their availability.

Sure, we will document this trick somewhere, for example on the GOC wiki
page in Taiwan, after the test.

Cheers,

Di

> Thanks again!
>
> Best regards, Antun
>
> -----
> Antun Balaz
> Research Assistant
> E-mail: [log in to unmask]
> Web: http://scl.phy.bg.ac.yu/
>
> Phone: +381 11 3713152
> Fax: +381 11 3162190
>
> Scientific Computing Laboratory
> Institute of Physics, Belgrade, Serbia
> -----
>
> ---------- Original Message -----------
> From: Di Qing <[log in to unmask]>
> To: [log in to unmask]
> Sent: Tue, 15 May 2007 11:59:46 +0200
> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>
>> Antun Balaz wrote:
>>> Hi Di,
>>>
>>> This is the situation:
>>>
>>> 1) /opt/globus/tmp/gram_job_state was not empty, but contained some files
>>> created up to February. I removed them and created a soft link:
>>>
>>> ln -s /var/glite/gram_job_state /opt/globus/tmp/gram_job_state
>>>
>>> BTW, /var/glite/gram_job_state contains a lot of gram_job_state files.
>>> A cleanup cron job could do that directory some good.
>> If it works properly, there should not be so many files left.
>>
>>> 2) Both /opt/glite/etc/grid-services/jobmanager-fork and
>>> /opt/globus/etc/grid-services/jobmanager exist.
>>> /opt/globus/etc/grid-services/jobmanager is a link to
>>> /opt/globus/etc/grid-services/jobmanager-fork.
>>> This file uses the
>>> /opt/globus/etc/globus-job-manager.conf file, while
>>> /opt/glite/etc/grid-services/jobmanager-fork uses
>>> /opt/glite/etc/globus-job-manager.conf. In the end, these two
>>> conf files differ only in the directory for gram_job_state, which is fixed by
>>> point 1), so there is no need to change anything here. Correct?
>>>
>> Yes, if you have already fixed it according to point 1), you don't
>> need to do this. They are two different solutions. We are still
>> testing which one is the best.
>>
>>> Is it enough to restart gLite on this gCE to see if the situation improves?
>> I don't think you need to restart gLite, since now the gridmonitor can
>> find the state files in the symlinked directory.
>>
>> Di
>>
>>> Thanks, Antun
>>>
>>> ---------- Original Message -----------
>>> From: Di Qing <[log in to unmask]>
>>> To: [log in to unmask]
>>> Sent: Tue, 15 May 2007 11:17:47 +0200
>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>>>
>>>> Hi Antun,
>>>>
>>>> There are two possible solutions on the CE: 1) create a link from
>>>> /var/glite/gram_job_state to /opt/globus/tmp/gram_job_state, or from
>>>> /opt/globus/tmp/gram_job_state to /var/glite/gram_job_state; 2)
>>>> link /opt/glite/etc/grid-services/jobmanager-fork to
>>>> /opt/globus/etc/grid-services/jobmanager .
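[Editor's note: for concreteness, the state-directory workaround and the conf-file check discussed above can be sketched as shell commands. This is only a rehearsal against a scratch prefix (the `PREFIX` variable is a stand-in so the sketch runs anywhere; on a real gCE the paths are the literal /var/glite and /opt/globus ones from the thread), and the stub conf contents are illustrative, not copied from a real CE.]

```shell
#!/bin/sh
# Rehearsal of workaround 1) from the thread: make the (empty) Globus-side
# state directory a symlink to the gLite one, so the gridmonitor sees the
# state files GRAM actually writes. PREFIX is a scratch stand-in.
PREFIX=$(mktemp -d)
mkdir -p "$PREFIX/var/glite/gram_job_state" "$PREFIX/opt/globus/tmp"
touch "$PREFIX/var/glite/gram_job_state/job.example.state"   # stand-in state file

rm -rf "$PREFIX/opt/globus/tmp/gram_job_state"
ln -s "$PREFIX/var/glite/gram_job_state" "$PREFIX/opt/globus/tmp/gram_job_state"
ls "$PREFIX/opt/globus/tmp/gram_job_state/"   # state file now visible via the link

# Check related to workaround 2): the two job-manager configs should differ
# only in the state-file directory. The stub contents below are illustrative.
cat > "$PREFIX/jm.conf.globus" <<'EOF'
-state-file-dir /opt/globus/tmp/gram_job_state
EOF
cat > "$PREFIX/jm.conf.glite" <<'EOF'
-state-file-dir /var/glite/gram_job_state
EOF
diff "$PREFIX/jm.conf.globus" "$PREFIX/jm.conf.glite" || true
```

If the diff shows only the state-file directory, the symlink from point 1) already reconciles the two views and nothing else needs changing.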
>>>> Very recently Francesco
>>>> Prelz found there was an inconsistency between the WMS and the gLite CE:
>>>> GRAM saves its state files under
>>>> /var/glite/gram_job_state/, as specified by
>>>> /opt/glite/etc/gatekeeper.conf; however, for job status checking,
>>>> the gridmonitor looks for the state file directory starting
>>>> from $GLOBUS_LOCATION/etc/grid-services, which leads to the Globus default
>>>> location /opt/globus/tmp/gram_job_state/, which is empty, so the
>>>> state of fork jobs is never correctly updated.
>>>>
>>>> We are testing whether this is true and which is the best solution for
>>>> it. But you can try it as well.
>>>>
>>>> Cheers,
>>>>
>>>> Di
>>>>
>>>> Antun Balaz wrote:
>>>>> Hi Di,
>>>>>
>>>>> It is OK if you are the admin of the WMS, but here we are talking about the WMS
>>>>> servers used by SAM - if they are malfunctioning due to this bug, then a whole
>>>>> lot of sites are affected!
>>>>>
>>>>> Something similar probably happened to rb108.cern.ch, which started to give
>>>>> PeriodicHold errors almost every time SAM tests were sent through it. It was
>>>>> recently replaced by rb118.cern.ch, but this has started to happen from time
>>>>> to time again.
>>>>>
>>>>> I was thinking more about a permanent fix...
>>>>>
>>>>> Thanks, Antun
>>>>>
>>>>> ---------- Original Message -----------
>>>>> From: Di Qing <[log in to unmask]>
>>>>> To: [log in to unmask]
>>>>> Sent: Tue, 15 May 2007 10:49:55 +0200
>>>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>>>>>
>>>>>> Hi Antun,
>>>>>>
>>>>>> Usually we try to find the launcher job in the Condor queue on the
>>>>>> WMS (its name looks like condorc-launcher-s), then remove it with condor_rm.
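[Editor's note: the manual cleanup described just above can be sketched as below. The `condorc-launcher` name fragment follows the thread; the guard makes the sketch a harmless no-op on hosts without Condor tools. This is a hedged recipe, not a verified procedure, and exact job names and `condor_q` output columns may differ per Condor version.]

```shell
#!/bin/sh
# Sketch of the cleanup Di describes: find the stuck launcher job in the
# Condor queue on the WMS and remove it with condor_rm. Guarded so the
# sketch is a no-op where Condor tools are not installed.
if command -v condor_q >/dev/null 2>&1; then
    # List candidate launcher jobs (names like condorc-launcher-...).
    condor_q | grep condorc-launcher || echo "no launcher jobs queued"
    # condor_rm 1234.0   # then remove the offending job by its cluster id
    STATUS="checked queue"
else
    STATUS="condor_q not found: run this on the WMS host"
fi
echo "$STATUS"
```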
>>>>>>
>>>>>> Di
>>>>>>
>>>>>> Antun Balaz wrote:
>>>>>>> Hi Di,
>>>>>>>
>>>>>>> And how can this problem be solved?
>>>>>>>
>>>>>>> Thanks, Antun
>>>>>>>
>>>>>>> ---------- Original Message -----------
>>>>>>> From: Di Qing <[log in to unmask]>
>>>>>>> To: [log in to unmask]
>>>>>>> Sent: Tue, 15 May 2007 10:41:56 +0200
>>>>>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>>>>>>>
>>>>>>>> If the condor instances for the jobs submitted by the SAM portal are
>>>>>>>> running on the gLite CE, then when new jobs come in, the WMS will bypass
>>>>>>>> the gatekeeper and submit jobs directly to the condor instance. As for the
>>>>>>>> periodic log message in the gatekeeper log or /var/log/messages, I
>>>>>>>> think the WMS tried to launch the condor instance but failed,
>>>>>>>> and then retried again and again.
>>>>>>>>
>>>>>>>> Di
>>>>>>>>
>>>>>>>> Alexander Piavka wrote:
>>>>>>>>> Hi Antun,
>>>>>>>>>
>>>>>>>>> What disturbs me more is that on the PPS site the SAM portal jobs
>>>>>>>>> are successfully executed, but the only
>>>>>>>>> trace of LCAS is in /var/log/gridftp-lcas_lcmaps.log.
>>>>>>>>> There are no traces in /var/log/glite/gatekeeper.log & /var/log/messages.
>>>>>>>>> So it looks like a security problem, but I can't understand how this can be
>>>>>>>>> happening only for jobs submitted from the SAM portal and not for all jobs,
>>>>>>>>> since it is the gatekeeper authentication, which is always running, and it
>>>>>>>>> is not related to https://gus.fzk.de/pages/ticket_details.php?ticket=20625
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Alex
>>>>>>>>>
>>>>>>>>> On Tue, 15 May 2007, Antun Balaz wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We see this almost all the time, and it is a long-standing problem. Since
>>>>>>>>>> it appears from time to time (it is not always there), without any changes
>>>>>>>>>> from our side, we think that it is related to some WMS problem, and not to
>>>>>>>>>> gCE problems.
>>>>>>>>>>
>>>>>>>>>> Somewhat related is the following ticket (although no mapping problems
>>>>>>>>>> there):
>>>>>>>>>> https://gus.fzk.de/pages/ticket_details.php?ticket=20625
>>>>>>>>>>
>>>>>>>>>> However, I don't know what the status of the improvements mentioned
>>>>>>>>>> there is...
>>>>>>>>>> Regards, Antun
>>>>>>>>>>
>>>>>>>>>> ---------- Original Message -----------
>>>>>>>>>> From: Esteban Freire Garcia <[log in to unmask]>
>>>>>>>>>> To: [log in to unmask]
>>>>>>>>>> Sent: Mon, 14 May 2007 22:50:47 +0200
>>>>>>>>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>>>>>>>>>>
>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>
>>>>>>>>>>> Since upgrade 29 we have had a very similar incident on PPS, with
>>>>>>>>>>> similar logs, although I am not sure that the problem started with the
>>>>>>>>>>> upgrade; in principle I didn't observe anything strange after the
>>>>>>>>>>> upgrade. What is curious is that on the monitoring page the
>>>>>>>>>>> tests that are run automatically every hour have a status of OK on
>>>>>>>>>>> PPS; however, if I try to send a test from the SAM Admin's page, the
>>>>>>>>>>> job is aborted with the following error: (reason = Got a job held
>>>>>>>>>>> event, reason: "The job attribute PeriodicHold expression 'Matched
>>>>>>>>>>> =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE"). After
>>>>>>>>>>> reviewing all the running services, I do not observe anything
>>>>>>>>>>> strange, and I think that it is an authentication problem, although
>>>>>>>>>>> I do not observe anything strange in this sense either. So from here I
>>>>>>>>>>> ask the same question as you: has anyone seen similar behaviour?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Esteban
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> On both my production & PPS sites, on the gLite CEs I've got the
>>>>>>>>>>>> following logged exactly every 5 minutes and 30 seconds:
>>>>>>>>>>>> -----------------------------------------------------
>>>>>>>>>>>> Notice: 6: Got connection 131.154.100.148 at Sun May 13 07:08:59 2007
>>>>>>>>>>>> Notice: 5: Trying to use delegated user proxy
>>>>>>>>>>>> Notice: 5: Authenticated globus user: /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
>>>>>>>>>>>> Notice: 0: GRID_SECURITY_HTTP_BODY_FD=9
>>>>>>>>>>>> Notice: 0: JOB_REPOSITORY_ID 2007-05-13.07:09:00.123457.0000000507.0000004146 (unique id used for Job Repository)
>>>>>>>>>>>> Notice: 0: FORMAT: YYYY-MM-DD.hh:mm:ss.micros.pid.connection
>>>>>>>>>>>> Notice: 0: (Format: <date>.<time (with microsecs)>.<pid>.<connection counter>)
>>>>>>>>>>>> Notice: 0: temporarily ALLOW empty credentials
>>>>>>>>>>>> Notice: 0: Using dlopen version of LCAS
>>>>>>>>>>>> Notice: 0: lcasmod_name = /opt/glite/lib/lcas.mod
>>>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 :
>>>>>>>>>>>> LCAS 7: 2007-05-13.07:09:00.123457.0000000507.0000004146 : Initialization LCAS version 1.3.1
>>>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_init(): Reading LCAS database /opt/glite/etc/lcas/lcas.db
>>>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 :
>>>>>>>>>>>> LCAS 5: 2007-05-13.07:09:00.123457.0000000507.0000004146 : LCAS authorization request
>>>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): user is /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
>>>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_userban.mod-plugin_confirm_authorization(): checking banned users in /opt/glite/etc/lcas/ban_users.db
>>>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): authorization granted by plugin /opt/glite/lib/modules/lcas_userban.mod
>>>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_plugin_voms-plugin_confirm_authorization_from_x509(): Generic verification error for VOMS (failure)!
>>>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_plugin_voms-plugin_confirm_authorization_from_x509(): voms plugin failed
>>>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): authorization failed for plugin /opt/glite/lib/modules/lcas_voms.mod
>>>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): failed
>>>>>>>>>>>> Failure: LCAS failed authorization.
>>>>>>>>>>>> Failure: LCAS failed authorization.
>>>>>>>>>>>> -----------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> AFAIK /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS is the DN used to
>>>>>>>>>>>> submit tests from the SAM Admin Portal. The connection is coming from
>>>>>>>>>>>> the glite-rb-01.cnaf.infn.it WMS.
>>>>>>>>>>>> Any ideas why it tries exactly every 5 minutes and 30 seconds? Does the
>>>>>>>>>>>> WMS try to monitor some previously sent jobs, or what?
>>>>>>>>>>>>
>>>>>>>>>>>> What is more interesting is that when I try to submit jobs from the SAM
>>>>>>>>>>>> Admin Portal to the production gLite CE, the job gets Aborted due to:
>>>>>>>>>>>> Job got an error while in the CondorG queue.
>>>>>>>>>>>> hit job shallow retry count (0)
>>>>>>>>>>>> In the job logging info I see that the job is submitted by
>>>>>>>>>>>> /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS,
>>>>>>>>>>>> but nothing is logged in /var/log/glite/gatekeeper.log &
>>>>>>>>>>>> /var/log/messages regarding LCAS & LCMAPS authentication.
>>>>>>>>>>>> Also there is nothing in /var/log/gridftp-lcas_lcmaps.log for the user.
>>>>>>>>>>>> But there is a mapping under /etc/grid-security/gridmapdir for the
>>>>>>>>>>>> /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS DN to ops003.
>>>>>>>>>>>>
>>>>>>>>>>>> But what is even stranger is that when I submit from the SAM Admin
>>>>>>>>>>>> Portal to the PPS gLite CE, the job is successfully submitted and
>>>>>>>>>>>> executed by PBS, and a blah record is inserted into
>>>>>>>>>>>> /var/log/glite/accounting/blahp.log-200705,
>>>>>>>>>>>> but again nothing is logged in either /var/log/glite/gatekeeper.log or
>>>>>>>>>>>> /var/log/messages. However, the authentication is logged in
>>>>>>>>>>>> /var/log/gridftp-lcas_lcmaps.log.
>>>>>>>>>>>>
>>>>>>>>>>>> How can this be? On both PPS & production, authentication works OK
>>>>>>>>>>>> for all other users, with LCAS & LCMAPS messages logged as usual in
>>>>>>>>>>>> /var/log/glite/gatekeeper.log & /var/log/messages.
>>>>>>>>>>>> And why does the submission work for the PPS site only?
>>>>>>>>>>>>
>>>>>>>>>>>> Has anyone seen similar behaviour?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Alex
>>>>>>>>>> ------- End of Original Message -------
>>>>>>> ------- End of Original Message -------
>>>>> ------- End of Original Message -------
>>> ------- End of Original Message -------
> ------- End of Original Message -------
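[Editor's note: as a footnote to Alex's question about the 5m30s period, the interval is easy to confirm by pulling the "Got connection" lines out of the gatekeeper log. The sketch below runs against a stub log so it is runnable anywhere; on a real CE the input would be /var/log/glite/gatekeeper.log, and the stub contents follow the excerpt in the thread.]

```shell
#!/bin/sh
# Extract the "Got connection" lines to eyeball the retry period.
# A stub log stands in for /var/log/glite/gatekeeper.log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
Notice: 6: Got connection 131.154.100.148 at Sun May 13 07:08:59 2007
Notice: 5: Trying to use delegated user proxy
Notice: 6: Got connection 131.154.100.148 at Sun May 13 07:14:29 2007
Notice: 5: Trying to use delegated user proxy
EOF
# Successive timestamps 07:08:59 and 07:14:29 are 5m30s apart, matching
# the period reported in the thread.
grep 'Got connection 131.154.100.148' "$LOG"
```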