Hi Di,
This is the situation:
1) /opt/globus/tmp/gram_job_state was not empty; it contained some files
created up to February. I removed them and created a soft link:
ln -s /var/glite/gram_job_state /opt/globus/tmp/gram_job_state
By the way, /var/glite/gram_job_state contains a lot of gram_job_state files. A
cleanup cron job would do that directory some good.
2) Both /opt/glite/etc/grid-services/jobmanager-fork and
/opt/globus/etc/grid-services/jobmanager exist.
/opt/globus/etc/grid-services/jobmanager is a link to
/opt/globus/etc/grid-services/jobmanager-fork. This file uses
/opt/globus/etc/globus-job-manager.conf file, while
/opt/glite/etc/grid-services/jobmanager-fork uses
/opt/glite/etc/globus-job-manager.conf. In the end, these two
conf files differ only in the directory for gram_job_state, which is fixed in
point 1), so there is no need to change anything here. Correct?
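The cleanup suggested in point 1) could be sketched roughly as below. This is only an illustration: the function name cleanup_gram_state and the 30-day default are my assumptions, not gLite settings, and it relies on GNU find (-maxdepth, -delete). State files of long-running jobs might also be old, so the age limit should be chosen with care.

```shell
#!/bin/sh
# Sketch only: delete state files older than a given number of days.
# cleanup_gram_state is a hypothetical helper; 30 days is an assumed default.
cleanup_gram_state() {
    state_dir=${1:-/var/glite/gram_job_state}
    max_age_days=${2:-30}

    # Refuse to run if the directory does not exist.
    [ -d "$state_dir" ] || { echo "no such directory: $state_dir" >&2; return 1; }

    # Regular files directly inside the directory whose last modification
    # is more than max_age_days days ago: print each one, then delete it.
    find "$state_dir" -maxdepth 1 -type f -mtime +"$max_age_days" -print -delete
}
```

A daily cron job could source this file and call cleanup_gram_state with the directory and age limit that suit the site.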
Is it enough to restart gLite on this gCE to see whether the situation has improved?
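Before or after the restart, the two points above could be double-checked with something like the following sketch. The function names check_state_link and compare_jobmanager_conf are hypothetical; the default paths are the ones quoted in this message, and readlink -f is assumed to be available (GNU coreutils).

```shell
#!/bin/sh
# Sketch only: verify that the Globus default state directory is a symlink
# that resolves to the gLite state directory.
check_state_link() {
    link=${1:-/opt/globus/tmp/gram_job_state}
    target=${2:-/var/glite/gram_job_state}

    [ -L "$link" ] || { echo "not a symlink: $link"; return 1; }

    # readlink -f canonicalises both paths so they can be compared as strings.
    if [ "$(readlink -f "$link")" = "$(readlink -f "$target")" ]; then
        echo "ok: $link -> $target"
    else
        echo "mismatch: $link points to $(readlink -f "$link")"
        return 1
    fi
}

# Show how the two job manager configuration files differ.
# diff exits with status 1 when differences are found; that is not an error here.
compare_jobmanager_conf() {
    diff "${1:-/opt/globus/etc/globus-job-manager.conf}" \
         "${2:-/opt/glite/etc/globus-job-manager.conf}"
}
```

If compare_jobmanager_conf shows only the gram_job_state directory line, that would support the conclusion in point 2).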
Thanks, Antun
-----
Antun Balaz
Research Assistant
E-mail: [log in to unmask]
Web: http://scl.phy.bg.ac.yu/
Phone: +381 11 3713152
Fax: +381 11 3162190
Scientific Computing Laboratory
Institute of Physics, Belgrade, Serbia
-----
---------- Original Message -----------
From: Di Qing <[log in to unmask]>
To: [log in to unmask]
Sent: Tue, 15 May 2007 11:17:47 +0200
Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
> Hi Antun,
>
> There are two possible solutions on the CE: 1) link
> /var/glite/gram_job_state to /opt/globus/tmp/gram_job_state, or
> /opt/globus/tmp/gram_job_state to /var/glite/gram_job_state; 2)
> link /opt/glite/etc/grid-services/jobmanager-fork to
> /opt/globus/etc/grid-services/jobmanager. Very recently Francesco
> Prelz found an inconsistency between the WMS and the gLite CE:
> GRAM saves its state files under
> /var/glite/gram_job_state/, as specified by
> /opt/glite/etc/gatekeeper.conf. For job status checking, however,
> the gridmonitor looks for the state file directory starting
> from $GLOBUS_LOCATION/etc/grid-services, which leads to the Globus default
> location /opt/globus/tmp/gram_job_state/. That directory is empty, so the
> state of fork jobs is never correctly updated.
>
> We are testing whether this is true and which solution is the best for
> it. But you can try it as well.
>
> Cheers,
>
> Di
>
> Antun Balaz wrote:
> > Hi Di,
> >
> > That is OK if you are the admin of the WMS, but here we are talking about the WMS
> > servers used by SAM. If they are malfunctioning due to this bug, then a whole
> > lot of sites are affected!
> >
> > Something similar probably happened to rb108.cern.ch, which started to give
> > PeriodicHold errors almost every time a SAM test was sent through it. It was
> > recently replaced by rb118.cern.ch, but this has started to happen from time to
> > time again.
> >
> > I was thinking more about a permanent fix...
> >
> > Thanks, Antun
> >
> > ---------- Original Message -----------
> > From: Di Qing <[log in to unmask]>
> > To: [log in to unmask]
> > Sent: Tue, 15 May 2007 10:49:55 +0200
> > Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
> >
> >> Hi Antun,
> >>
> >> Usually we look for the launcher job in the Condor queue on the
> >> WMS (its name is like condorc-launcher-s) and then remove it with condor_rm.
> >>
> >> Di
> >>
> >> Antun Balaz wrote:
> >>> Hi Di,
> >>>
> >>> And how can this problem be solved?
> >>>
> >>> Thanks, Antun
> >>>
> >>> ---------- Original Message -----------
> >>> From: Di Qing <[log in to unmask]>
> >>> To: [log in to unmask]
> >>> Sent: Tue, 15 May 2007 10:41:56 +0200
> >>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
> >>>
> >>>> If the Condor instances for the jobs submitted by the SAM portal are
> >>>> running on the gLite CE, then when new jobs come in, the WMS will bypass
> >>>> the gatekeeper and submit jobs directly to the Condor instance. As for the
> >>>> periodic log messages in the gatekeeper log or /var/log/messages, I
> >>>> think the WMS tried to launch the Condor instance but failed,
> >>>> and then retried again and again.
> >>>>
> >>>> Di
> >>>>
> >>>> Alexander Piavka wrote:
> >>>>> Hi Antun,
> >>>>>
> >>>>> What disturbs me more is that on the PPS site the SAM portal jobs
> >>>>> are successfully executed, but the only
> >>>>> trace of LCAS is in /var/log/gridftp-lcas_lcmaps.log.
> >>>>> There are no traces in /var/log/glite/gatekeeper.log or /var/log/messages.
> >>>>> So it looks like a security problem, but I can't understand how this can be
> >>>>> happening only for jobs submitted from the SAM portal and not for all jobs,
> >>>>> since gatekeeper authentication is always running. It is
> >>>>> not related to https://gus.fzk.de/pages/ticket_details.php?ticket=20625
> >>>>>
> >>>>> Thanks
> >>>>> Alex
> >>>>>
> >>>>> On Tue, 15 May 2007, Antun Balaz wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> We see this almost all the time, and it is a long-standing problem. Since it
> >>>>>> appears only from time to time (it is not always there), without any changes on
> >>>>>> our side, we think that it is related to some WMS problem and not to gCE
> >>>>>> problems.
> >>>>>>
> >>>>>> Somewhat related is the following ticket (although there are no mapping problems
> >>>>>> there):
> >>>>>> https://gus.fzk.de/pages/ticket_details.php?ticket=20625
> >>>>>>
> >>>>>> However, I don't know the status of the improvements mentioned there...
> >>>>>>
> >>>>>> Regards, Antun
> >>>>>>
> >>>>>> ---------- Original Message -----------
> >>>>>> From: Esteban Freire Garcia <[log in to unmask]>
> >>>>>> To: [log in to unmask]
> >>>>>> Sent: Mon, 14 May 2007 22:50:47 +0200
> >>>>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
> >>>>>>
> >>>>>>> Hi Alex,
> >>>>>>>
> >>>>>>> Since upgrade 29 we have had a very similar incident on PPS, with similar
> >>>>>>> logs, although I am not sure that the problem started with the
> >>>>>>> upgrade; in principle I didn't observe anything strange after the
> >>>>>>> upgrade. What is curious is that, on the monitoring page, the
> >>>>>>> tests that are run automatically every hour have a status of OK on
> >>>>>>> PPS. However, if I try to send a test from the SAM Admin's page, the
> >>>>>>> job is aborted with the following error: (reason = Got a job held
> >>>>>>> event, reason: "The job attribute PeriodicHold expression 'Matched
> >>>>>>> =!= TRUE && CurrentTime > QDate + 900' evaluated to TRUE"). After
> >>>>>>> reviewing all the running services, I do not observe anything
> >>>>>>> strange. I think that it is an authentication problem, although
> >>>>>>> I do not see anything strange in that respect either. So I
> >>>>>>> ask the same question as you: has anyone seen similar behaviour?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Esteban
> >>>>>>>
> >>>>>>>> Hi all,
> >>>>>>>>
> >>>>>>>> On the gLite CEs at both my production and PPS sites, I get the following
> >>>>>>>> logged exactly every 5 minutes and 30 seconds:
> >>>>>>>> -----------------------------------------------------
> >>>>>>>> Notice: 6: Got connection 131.154.100.148 at Sun May 13 07:08:59 2007
> >>>>>>>>
> >>>>>>>> Notice: 5: Trying to use delegated user proxy
> >>>>>>>> Notice: 5: Authenticated globus user: /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
> >>>>>>>> Notice: 0: GRID_SECURITY_HTTP_BODY_FD=9
> >>>>>>>> Notice: 0: JOB_REPOSITORY_ID 2007-05-13.07:09:00.123457.0000000507.0000004146 (unique id used for Job Repository)
> >>>>>>>> Notice: 0: FORMAT: YYYY-MM-DD.hh:mm:ss.micros.pid.connection
> >>>>>>>> Notice: 0: (Format: <date>.<time (with microsecs)>.<pid>.<connection counter>)
> >>>>>>>> Notice: 0: temporarily ALLOW empty credentials
> >>>>>>>> Notice: 0: Using dlopen version of LCAS
> >>>>>>>> Notice: 0: lcasmod_name = /opt/glite/lib/lcas.mod
> >>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 :
> >>>>>>>> LCAS 7: 2007-05-13.07:09:00.123457.0000000507.0000004146 : Initialization LCAS version 1.3.1
> >>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_init(): Reading LCAS database /opt/glite/etc/lcas/lcas.db
> >>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 :
> >>>>>>>> LCAS 5: 2007-05-13.07:09:00.123457.0000000507.0000004146 : LCAS authorization request
> >>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): user is /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
> >>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_userban.mod-plugin_confirm_authorization(): checking banned users in /opt/glite/etc/lcas/ban_users.db
> >>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): authorization granted by plugin /opt/glite/lib/modules/lcas_userban.mod
> >>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_plugin_voms-plugin_confirm_authorization_from_x509(): Generic verification error for VOMS (failure)!
> >>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_plugin_voms-plugin_confirm_authorization_from_x509(): voms plugin failed
> >>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): authorization failed for plugin /opt/glite/lib/modules/lcas_voms.mod
> >>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): failed
> >>>>>>>> Failure: LCAS failed authorization.
> >>>>>>>> Failure: LCAS failed authorization.
> >>>>>>>> -----------------------------------------------------
> >>>>>>>>
> >>>>>>>> AFAIK /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS is the DN used to
> >>>>>>>> submit tests from the SAM Admin Portal. The connection is coming from
> >>>>>>>> the glite-rb-01.cnaf.infn.it WMS.
> >>>>>>>> Any ideas why it tries exactly every 5 minutes and 30 seconds? Does the WMS try to
> >>>>>>>> monitor some previously sent jobs, or what?
> >>>>>>>>
> >>>>>>>> What is more interesting is that when I try to submit jobs from the SAM
> >>>>>>>> Admin Portal
> >>>>>>>> to the production gLite CE, the job gets aborted due to:
> >>>>>>>> Job got an error while in the CondorG queue.
> >>>>>>>> hit job shallow retry count (0)
> >>>>>>>> In the job logging info I see that the job is submitted by
> >>>>>>>> /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
> >>>>>>>> But nothing is logged in /var/log/glite/gatekeeper.log or
> >>>>>>>> /var/log/messages regarding LCAS and LCMAPS authentication.
> >>>>>>>> Also, there is nothing in /var/log/gridftp-lcas_lcmaps.log for the user.
> >>>>>>>> But there is a mapping under /etc/grid-security/gridmapdir from the
> >>>>>>>> /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS DN to ops003.
> >>>>>>>> But what is even stranger is that when I submit from the SAM Admin Portal
> >>>>>>>> to the PPS gLite CE, the job is successfully submitted and executed by PBS, and a
> >>>>>>>> blah record is inserted into /var/log/glite/accounting/blahp.log-200705,
> >>>>>>>> but again nothing is logged in either /var/log/glite/gatekeeper.log or
> >>>>>>>> /var/log/messages. However, the authentication is logged in
> >>>>>>>> /var/log/gridftp-lcas_lcmaps.log
> >>>>>>>>
> >>>>>>>> How can this be? At both PPS and production, authentication works
> >>>>>>>> fine for all other users, with LCAS and LCMAPS messages logged as usual in
> >>>>>>>> /var/log/glite/gatekeeper.log and /var/log/messages.
> >>>>>>>> And why does the submission work for the PPS site only?
> >>>>>>>>
> >>>>>>>> Has anyone seen similar behaviour?
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Alex
> >>>>>> ------- End of Original Message -------
> >>>>>>
> >>> ------- End of Original Message -------
> > ------- End of Original Message -------
------- End of Original Message -------