Hi Di,
It seems that your proposed solution works: after the changes you
suggested, I successfully executed a SAM job on our gCE using the SAM
Admin's page, which had not been possible for several days. Thanks!
I would personally prefer the more elegant solution, i.e. your option 2.
Note that this should be widely publicized, since many gCEs are failing SAM
tests due to this problem, and this badly affects their availability.
Thanks again!
Best regards, Antun
-----
Antun Balaz
Research Assistant
E-mail: [log in to unmask]
Web: http://scl.phy.bg.ac.yu/
Phone: +381 11 3713152
Fax: +381 11 3162190
Scientific Computing Laboratory
Institute of Physics, Belgrade, Serbia
-----
---------- Original Message -----------
From: Di Qing <[log in to unmask]>
To: [log in to unmask]
Sent: Tue, 15 May 2007 11:59:46 +0200
Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
> Antun Balaz wrote:
> > Hi Di,
> >
> > This is the situation:
> >
> > 1) /opt/globus/tmp/gram_job_state was not empty, but contained some files
> > created until February. I removed them, and created a soft link:
> >
> > ln -s /var/glite/gram_job_state /opt/globus/tmp/gram_job_state
> >
> > Btw., /var/glite/gram_job_state contains a lot of gram_job_state files. A cleanup
> > cron job could do some good to that directory.
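Such a cleanup could be a one-line daily cron script; here is a minimal sketch (the 7-day age threshold and the cron location are assumptions, not site policy):

```shell
# Hypothetical daily cron script (e.g. dropped into /etc/cron.daily/):
# delete GRAM job-state files older than 7 days. The path default and the
# age threshold are assumptions; adjust to local policy.
STATE_DIR=${STATE_DIR:-/var/glite/gram_job_state}
if [ -d "$STATE_DIR" ]; then
    find "$STATE_DIR" -type f -mtime +7 -delete
fi
```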
>
> If it works properly, there should not be many files left.
>
> > 2) Both /opt/glite/etc/grid-services/jobmanager-fork and
> > /opt/globus/etc/grid-services/jobmanager exist.
> > /opt/globus/etc/grid-services/jobmanager is a link to
> > /opt/globus/etc/grid-services/jobmanager-fork. This file uses
> > /opt/globus/etc/globus-job-manager.conf file, while
> > /opt/glite/etc/grid-services/jobmanager-fork uses
> > /opt/glite/etc/globus-job-manager.conf. In the end, these two
> > conf files differ only in the directory for gram_job_state, which is fixed by
> > point 1), so there is no need to change anything here. Correct?
> >
> Yes, if you already fixed it according to point 1), you don't
> need to do this. They are two different solutions; we are still
> testing which one is best.
>
> > Is it enough to restart gLite on this gCE to see if the situation improved?
>
> I don't think you need to restart gLite, since gridmonitor can now
> find the state files in the symlinked directory.
>
> Di
>
> >
> > Thanks, Antun
> >
> > ---------- Original Message -----------
> > From: Di Qing <[log in to unmask]>
> > To: [log in to unmask]
> > Sent: Tue, 15 May 2007 11:17:47 +0200
> > Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
> >
> >> Hi Antun,
> >>
> >> There are two possible solutions on the CE: 1) create a link from
> >> /var/glite/gram_job_state to /opt/globus/tmp/gram_job_state, or from
> >> /opt/globus/tmp/gram_job_state to /var/glite/gram_job_state; 2)
> >> create a link between /opt/glite/etc/grid-services/jobmanager-fork and
> >> /opt/globus/etc/grid-services/jobmanager. Very recently Francesco
> >> Prelz found there was an inconsistency between the WMS and the gLite CE:
> >> GRAM saves its state files under
> >> /var/glite/gram_job_state/, as specified by
> >> /opt/glite/etc/gatekeeper.conf; however, for job status checking,
> >> the gridmonitor looks for the state file directory starting
> >> from $GLOBUS_LOCATION/etc/grid-services, which leads to the Globus default
> >> location /opt/globus/tmp/gram_job_state/, which is empty, so the
> >> state of fork jobs is never correctly updated.
> >>
> >> We are still testing whether this is indeed the cause and which
> >> solution is best, but you can try it as well.
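Concretely, the two options amount to something like the sketch below. The link directions are my reading of the description above (the text only says to "create" the second link, so its direction is an assumption), and the guards just make the commands safe to re-run; verify against your own layout before applying:

```shell
# Option 1 (sketch): point the Globus default state directory at the
# directory where GRAM actually writes its state files.
if [ -d /var/glite/gram_job_state ] && [ ! -e /opt/globus/tmp/gram_job_state ]; then
    ln -s /var/glite/gram_job_state /opt/globus/tmp/gram_job_state
fi

# Option 2 (sketch): link the service definitions so the gridmonitor and the
# gatekeeper resolve the same jobmanager-fork entry. The direction of this
# link is an assumption.
if [ -e /opt/glite/etc/grid-services/jobmanager-fork ] \
   && [ ! -e /opt/globus/etc/grid-services/jobmanager-fork ]; then
    ln -s /opt/glite/etc/grid-services/jobmanager-fork \
          /opt/globus/etc/grid-services/jobmanager-fork
fi
```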
> >>
> >> Cheers,
> >>
> >> Di
> >>
> >> Antun Balaz wrote:
> >>> Hi Di,
> >>>
> >>> It is OK if you are the admin of the WMS, but here we are talking about the WMS
> >>> servers used by SAM - if they are malfunctioning due to this bug, then a
> >>> whole lot of sites are affected!
> >>>
> >>> Something similar probably happened to rb108.cern.ch, which started to give
> >>> PeriodicHold errors almost every time SAM tests were sent through it. It was
> >>> recently replaced by rb118.cern.ch, but this has started to happen from time to
> >>> time again.
> >>>
> >>> I was thinking more about a permanent fix...
> >>>
> >>> Thanks, Antun
> >>>
> >>> ---------- Original Message -----------
> >>> From: Di Qing <[log in to unmask]>
> >>> To: [log in to unmask]
> >>> Sent: Tue, 15 May 2007 10:49:55 +0200
> >>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
> >>>
> >>>> Hi Antun,
> >>>>
> >>>> Usually we try to find the launcher job in the Condor queue on the
> >>>> WMS - its name looks like condorc-launcher-s - and then remove it with condor_rm.
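A sketch of that procedure, with the queue parsing factored into a helper so it can be shown self-contained. The column layout ("cluster.proc" in the first column, command name later on the line) is an assumption about the default condor_q output, and condorc-launcher is the name pattern mentioned above:

```shell
# Hypothetical helper: read condor_q-style output on stdin and print the
# cluster ids of condorc-launcher jobs, one per line, deduplicated.
launcher_clusters() {
    awk '/condorc-launcher/ { sub(/\..*$/, "", $1); print $1 }' | sort -u
}

# On the WMS one would then run (not executed here; requires HTCondor):
#   condor_q | launcher_clusters | xargs -r -n1 condor_rm
```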
> >>>>
> >>>> Di
> >>>>
> >>>> Antun Balaz wrote:
> >>>>> Hi Di,
> >>>>>
> >>>>> And how to solve this problem?
> >>>>>
> >>>>> Thanks, Antun
> >>>>>
> >>>>> ---------- Original Message -----------
> >>>>> From: Di Qing <[log in to unmask]>
> >>>>> To: [log in to unmask]
> >>>>> Sent: Tue, 15 May 2007 10:41:56 +0200
> >>>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
> >>>>>
> >>>>>> If the Condor instances for the jobs submitted by the SAM portal are
> >>>>>> running on the gLite CE, then when new jobs come in, the WMS will bypass the
> >>>>>> gatekeeper and submit the jobs directly to the Condor instance. As for the
> >>>>>> periodic log message in the gatekeeper log or /var/log/messages, I
> >>>>>> think the WMS tried to launch the Condor instance but failed,
> >>>>>> and then retried again and again.
> >>>>>>
> >>>>>> Di
> >>>>>>
> >>>>>> Alexander Piavka wrote:
> >>>>>>> Hi Antun,
> >>>>>>>
> >>>>>>> What disturbs me more is that on the PPS site the SAM portal jobs
> >>>>>>> are successfully executed, but the only
> >>>>>>> trace of LCAS is in /var/log/gridftp-lcas_lcmaps.log.
> >>>>>>> There are no traces in /var/log/glite/gatekeeper.log & /var/log/messages.
> >>>>>>> So it looks like a security problem, but I can't understand how this can be
> >>>>>>> happening only for jobs submitted from the SAM portal and not for all jobs,
> >>>>>>> since it is the gatekeeper authentication, which is always running, and it is
> >>>>>>> not related to https://gus.fzk.de/pages/ticket_details.php?ticket=20625
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>> Alex
> >>>>>>>
> >>>>>>> On Tue, 15 May 2007, Antun Balaz wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> We see this almost all the time, and it is a long-standing problem. Since it
> >>>>>>>> appears from time to time (it is not always there), without any changes from
> >>>>>>>> our side, we think that it is related to some WMS problem, and not to gCE
> >>>>>>>> problems.
> >>>>>>>>
> >>>>>>>> Somewhat related is the following ticket (although no mapping problems there):
> >>>>>>>> https://gus.fzk.de/pages/ticket_details.php?ticket=20625
> >>>>>>>>
> >>>>>>>> However, I don't know the status of the improvements mentioned there...
> >>>>>>>>
> >>>>>>>> Regards, Antun
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ---------- Original Message -----------
> >>>>>>>> From: Esteban Freire Garcia <[log in to unmask]>
> >>>>>>>> To: [log in to unmask]
> >>>>>>>> Sent: Mon, 14 May 2007 22:50:47 +0200
> >>>>>>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
> >>>>>>>>
> >>>>>>>>> Hi Alex,
> >>>>>>>>>
> >>>>>>>>> Since upgrade 29 we have had a very similar incident on PPS, with similar
> >>>>>>>>> logs, although I am not sure that the problem started with the
> >>>>>>>>> upgrade; in principle I didn't observe anything strange after the
> >>>>>>>>> upgrade. What is curious is that on the monitoring page, the
> >>>>>>>>> tests that are run automatically every hour have a status of OK on
> >>>>>>>>> PPS; however, if I try to send a test from the SAM Admin's page, the
> >>>>>>>>> job is aborted with the following error: (reason = Got a job held
> >>>>>>>>> event, reason: "The job attribute PeriodicHold expression 'Matched
> >>>>>>>>> =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE"). After
> >>>>>>>>> reviewing all the running services, I do not observe anything
> >>>>>>>>> strange, and I think that it is an authentication problem, although
> >>>>>>>>> I do not observe anything strange in this sense either. So I ask
> >>>>>>>>> the same question as you: has anyone seen similar behaviour?
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>> Esteban
> >>>>>>>>>
> >>>>>>>>>> Hi all,
> >>>>>>>>>>
> >>>>>>>>>> On the gliteCEs of both my production & PPS sites I've got the following
> >>>>>>>>>> logged exactly every 5 minutes and 30 seconds:
> >>>>>>>>>> -----------------------------------------------------
> >>>>>>>>>> Notice: 6: Got connection 131.154.100.148 at Sun May 13 07:08:59 2007
> >>>>>>>>>> Notice: 5: Trying to use delegated user proxy
> >>>>>>>>>> Notice: 5: Authenticated globus user: /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
> >>>>>>>>>> Notice: 0: GRID_SECURITY_HTTP_BODY_FD=9
> >>>>>>>>>> Notice: 0: JOB_REPOSITORY_ID 2007-05-13.07:09:00.123457.0000000507.0000004146 (unique id used for Job Repository)
> >>>>>>>>>> Notice: 0: FORMAT: YYYY-MM-DD.hh:mm:ss.micros.pid.connection
> >>>>>>>>>> Notice: 0: (Format: <date>.<time (with microsecs)>.<pid>.<connection counter>)
> >>>>>>>>>> Notice: 0: temporarily ALLOW empty credentials
> >>>>>>>>>> Notice: 0: Using dlopen version of LCAS
> >>>>>>>>>> Notice: 0: lcasmod_name = /opt/glite/lib/lcas.mod
> >>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 :
> >>>>>>>>>> LCAS 7: 2007-05-13.07:09:00.123457.0000000507.0000004146 : Initialization LCAS version 1.3.1
> >>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_init(): Reading LCAS database /opt/glite/etc/lcas/lcas.db
> >>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 :
> >>>>>>>>>> LCAS 5: 2007-05-13.07:09:00.123457.0000000507.0000004146 : LCAS authorization request
> >>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): user is /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
> >>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_userban.mod-plugin_confirm_authorization(): checking banned users in /opt/glite/etc/lcas/ban_users.db
> >>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): authorization granted by plugin /opt/glite/lib/modules/lcas_userban.mod
> >>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_plugin_voms-plugin_confirm_authorization_from_x509(): Generic verification error for VOMS (failure)!
> >>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_plugin_voms-plugin_confirm_authorization_from_x509(): voms plugin failed
> >>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): authorization failed for plugin /opt/glite/lib/modules/lcas_voms.mod
> >>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): failed
> >>>>>>>>>> Failure: LCAS failed authorization.
> >>>>>>>>>> Failure: LCAS failed authorization.
> >>>>>>>>>> -----------------------------------------------------
> >>>>>>>>>>
> >>>>>>>>>> AFAIK /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS is the DN used to
> >>>>>>>>>> submit tests from the SAM Admin Portal. The connection is coming from
> >>>>>>>>>> the glite-rb-01.cnaf.infn.it WMS.
> >>>>>>>>>> Any ideas why it tries exactly every 5:30 minutes? Does the WMS try to
> >>>>>>>>>> monitor some previously sent jobs, or what?
> >>>>>>>>>>
> >>>>>>>>>> What is more interesting is that when I try to submit jobs from the SAM
> >>>>>>>>>> Admin Portal to the production gliteCE, the job gets Aborted due to:
> >>>>>>>>>> Job got an error while in the CondorG queue.
> >>>>>>>>>> hit job shallow retry count (0)
> >>>>>>>>>> In the job logging info I see that the job is submitted by
> >>>>>>>>>> /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
> >>>>>>>>>> But nothing is logged in /var/log/glite/gatekeeper.log or
> >>>>>>>>>> /var/log/messages regarding LCAS & LCMAPS authentication.
> >>>>>>>>>> There is also nothing in /var/log/gridftp-lcas_lcmaps.log for the user.
> >>>>>>>>>> But there is a mapping under /etc/grid-security/gridmapdir for the
> >>>>>>>>>> /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS DN to ops003.
> >>>>>>>>>>
> >>>>>>>>>> What is even stranger is that when I submit from the SAM Admin Portal
> >>>>>>>>>> to the PPS gliteCE, the job is successfully submitted and executed by
> >>>>>>>>>> PBS, and a blah record is inserted into
> >>>>>>>>>> /var/log/glite/accounting/blahp.log-200705, but again nothing is logged
> >>>>>>>>>> in either /var/log/glite/gatekeeper.log or /var/log/messages. However,
> >>>>>>>>>> the authentication is logged in /var/log/gridftp-lcas_lcmaps.log.
> >>>>>>>>>>
> >>>>>>>>>> How can this be? On both the PPS & production sites, authentication works
> >>>>>>>>>> OK for all other users, with LCAS & LCMAPS messages logged as usual in
> >>>>>>>>>> /var/log/glite/gatekeeper.log & /var/log/messages.
> >>>>>>>>>> And why does the submission work for the PPS site only?
> >>>>>>>>>>
> >>>>>>>>>> Has anyone seen similar behaviour?
> >>>>>>>>>>
> >>>>>>>>>> Thanks
> >>>>>>>>>> Alex
> >>>>>>>> ------- End of Original Message -------
> >>>>>>>>
> >>>>> ------- End of Original Message -------
> >>> ------- End of Original Message -------
> > ------- End of Original Message -------
------- End of Original Message -------