Antun Balaz wrote:
> Hi Di,
>
> This is the situation:
>
> 1) /opt/globus/tmp/gram_job_state was not empty, but contained some files
> created up until February. I removed them, and created a soft link:
>
> ln -s /var/glite/gram_job_state /opt/globus/tmp/gram_job_state
>
> Btw. /var/glite/gram_job_state contains a lot of gram_job_state files. A cleanup
> cron job could do that directory some good.
If it works properly, there should not be so many files left.
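As a rough illustration of such a cleanup, here is a minimal sketch that only lists candidate files and deletes nothing. It assumes that state files untouched for some number of days are no longer needed, which has not been checked against GRAM's behaviour; the function name list_stale_state_files and the 30-day default are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print regular files directly under a state directory
# that were last modified more than DAYS days ago. It only lists them.
list_stale_state_files() {
    dir=$1
    days=${2:-30}   # assumed threshold, adjust to your site
    # -H makes find follow the directory argument if it is itself a symlink.
    find -H "$dir" -maxdepth 1 -type f -mtime +"$days" -print
}
```

For example, `list_stale_state_files /var/glite/gram_job_state 30` prints the candidates so they can be reviewed before any removal is wired into cron.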
> 2) Both /opt/glite/etc/grid-services/jobmanager-fork and
> /opt/globus/etc/grid-services/jobmanager exist.
> /opt/globus/etc/grid-services/jobmanager is a link to
> /opt/globus/etc/grid-services/jobmanager-fork. This file uses
> /opt/globus/etc/globus-job-manager.conf file, while
> /opt/glite/etc/grid-services/jobmanager-fork uses
> /opt/glite/etc/globus-job-manager.conf. At the end of the story, these two
> conf files differ just in the directory for gram_job_state, which is fixed in
> point 1), so no need to change anything here. Correct?
>
Yes, if you already fixed it according to point 1), you don't
need to do this. They are two different solutions. We are still testing
which one is better.
> Is it enough to restart gLite on this gCE to see if the situation has improved?
I don't think you need to restart gLite, since gridmonitor can now find
the state files in the symlinked directory.
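One way to confirm that both paths now lead to the same place is sketched below. It assumes only a POSIX shell; the function name same_directory is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if both paths resolve to the same
# physical directory (symlinks followed). Fails if either path is missing.
same_directory() {
    # cd runs inside a subshell, so the caller's working directory is untouched.
    a=$(cd "$1" 2>/dev/null && pwd -P) || return 1
    b=$(cd "$2" 2>/dev/null && pwd -P) || return 1
    [ "$a" = "$b" ]
}
```

For example: `same_directory /opt/globus/tmp/gram_job_state /var/glite/gram_job_state && echo "link resolves correctly"`.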
Di
>
> Thanks, Antun
>
> -----
> Antun Balaz
> Research Assistant
> E-mail: [log in to unmask]
> Web: http://scl.phy.bg.ac.yu/
>
> Phone: +381 11 3713152
> Fax: +381 11 3162190
>
> Scientific Computing Laboratory
> Institute of Physics, Belgrade, Serbia
> -----
>
> ---------- Original Message -----------
> From: Di Qing <[log in to unmask]>
> To: [log in to unmask]
> Sent: Tue, 15 May 2007 11:17:47 +0200
> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>
>> Hi Antun,
>>
>> There are two possible solutions on CE: 1) create link
>> /var/glite/gram_job_state to /opt/globus/tmp/gram_job_state or
>> /opt/globus/tmp/gram_job_state to /var/glite/gram_job_state, 2)
>> create /opt/glite/etc/grid-services/jobmanager-fork to
>> /opt/globus/etc/grid-services/jobmanager . Very recently Francesco
>> Prelz found there was an inconsistency between WMS and glite CE,
>> e.g., GRAM is saving its state files under
>> /var/glite/gram_job_state/ specified by
>> /opt/glite/etc/gatekeeper.conf, however, for job status checking,
>> the gridmonitor is looking for the state file directory starting
>> from $GLOBUS_LOCATION/etc/grid-services, which leads to the globus default
>> location /opt/globus/tmp/gram_job_state/ which is empty, so the
>> state of fork jobs is never correctly updated.
>>
>> We are testing if it is true and which one is the best solution for
>> this. But you can try it as well.
>>
>> Cheers,
>>
>> Di
>>
>> Antun Balaz wrote:
>>> Hi Di,
>>>
>>> It is OK if you are the admin of the WMS, but here we are talking about WMS
>>> servers used by SAM - if they are malfunctioning due to this bug, then a whole
>>> lot of sites are affected!
>>>
>>> Something similar probably happened to rb108.cern.ch, which started to give
>>> PeriodicHold errors almost every time SAM tests were sent through it. It was
>>> recently replaced by rb118.cern.ch, but this has started to happen from time to
>>> time again.
>>>
>>> I was thinking more about a permanent fix...
>>>
>>> Thanks, Antun
>>>
>>> -----
>>> Antun Balaz
>>> Research Assistant
>>> E-mail: [log in to unmask]
>>> Web: http://scl.phy.bg.ac.yu/
>>>
>>> Phone: +381 11 3713152
>>> Fax: +381 11 3162190
>>>
>>> Scientific Computing Laboratory
>>> Institute of Physics, Belgrade, Serbia
>>> -----
>>>
>>> ---------- Original Message -----------
>>> From: Di Qing <[log in to unmask]>
>>> To: [log in to unmask]
>>> Sent: Tue, 15 May 2007 10:49:55 +0200
>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>>>
>>>> Hi Antun,
>>>>
>>>> Usually we try to find the launcher job in the condor queue on the
>>>> WMS (its name is like condorc-launcher-s), then remove it with condor_rm.
>>>>
>>>> Di
>>>>
>>>> Antun Balaz wrote:
>>>>> Hi Di,
>>>>>
>>>>> And how to solve this problem?
>>>>>
>>>>> Thanks, Antun
>>>>>
>>>>> -----
>>>>> Antun Balaz
>>>>> Research Assistant
>>>>> E-mail: [log in to unmask]
>>>>> Web: http://scl.phy.bg.ac.yu/
>>>>>
>>>>> Phone: +381 11 3713152
>>>>> Fax: +381 11 3162190
>>>>>
>>>>> Scientific Computing Laboratory
>>>>> Institute of Physics, Belgrade, Serbia
>>>>> -----
>>>>>
>>>>> ---------- Original Message -----------
>>>>> From: Di Qing <[log in to unmask]>
>>>>> To: [log in to unmask]
>>>>> Sent: Tue, 15 May 2007 10:41:56 +0200
>>>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>>>>>
>>>>>> If the condor instances for the jobs submitted by the SAM portal are
>>>>>> running on the glite CE, then when new jobs come in, the WMS will bypass
>>>>>> the gatekeeper and submit jobs directly to the condor instance. As for the
>>>>>> periodic log messages in the gatekeeper log or /var/log/messages, I
>>>>>> think the WMS tried to launch the condor instance but failed,
>>>>>> then retried again and again.
>>>>>>
>>>>>> Di
>>>>>>
>>>>>> Alexander Piavka wrote:
>>>>>>> Hi Antun,
>>>>>>>
>>>>>>> What disturbs me more is that on the PPS site the SAM portal jobs
>>>>>>> are successfully executed, but the only
>>>>>>> trace of lcas is in /var/log/gridftp-lcas_lcmaps.log
>>>>>>> There are no traces in /var/log/glite/gatekeeper.log & /var/log/messages
>>>>>>> So it looks like a security problem, but I can't understand how this can be
>>>>>>> happening only for jobs submitted from the SAM portal and not for all jobs,
>>>>>>> since it's gatekeeper authentication, which is always running, and it is
>>>>>>> not related to https://gus.fzk.de/pages/ticket_details.php?ticket=20625
>>>>>>>
>>>>>>> Thanks
>>>>>>> Alex
>>>>>>>
>>>>>>> On Tue, 15 May 2007, Antun Balaz wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We see this almost all the time, and it is a long standing problem. Since it
>>>>>>>> appears from time to time (it is not always there), without any changes from
>>>>>>>> our side, we think that it is related to some WMS problem, and not to gCE
>>>>>>>> problems.
>>>>>>>>
>>>>>>>> Somewhat related is the following ticket (although no mapping problems there):
>>>>>>>> https://gus.fzk.de/pages/ticket_details.php?ticket=20625
>>>>>>>>
>>>>>>>> However, I don't know what is the status of improvements mentioned there...
>>>>>>>> Regards, Antun
>>>>>>>>
>>>>>>>> -----
>>>>>>>> Antun Balaz
>>>>>>>> Research Assistant
>>>>>>>> E-mail: [log in to unmask]
>>>>>>>> Web: http://scl.phy.bg.ac.yu/
>>>>>>>>
>>>>>>>> Phone: +381 11 3713152
>>>>>>>> Fax: +381 11 3162190
>>>>>>>>
>>>>>>>> Scientific Computing Laboratory
>>>>>>>> Institute of Physics, Belgrade, Serbia
>>>>>>>> -----
>>>>>>>>
>>>>>>>> ---------- Original Message -----------
>>>>>>>> From: Esteban Freire Garcia <[log in to unmask]>
>>>>>>>> To: [log in to unmask]
>>>>>>>> Sent: Mon, 14 May 2007 22:50:47 +0200
>>>>>>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>>>>>>>>
>>>>>>>>> Hi Alex,
>>>>>>>>>
>>>>>>>>> Since upgrade 29 we have had a very similar incident on PPS, with similar
>>>>>>>>> logs, although I am not sure the problem started with the upgrade; in
>>>>>>>>> principle I didn't observe anything strange right after upgrading. What
>>>>>>>>> is curious is that on the monitoring page the tests that run
>>>>>>>>> automatically every hour have a status of OK on PPS, but if I try to
>>>>>>>>> send a test from the SAM Admin's page, the job is aborted with the
>>>>>>>>> following error: (reason = Got a job held event, reason: "The job
>>>>>>>>> attribute PeriodicHold expression 'Matched =!= TRUE && CurrentTime >
>>>>>>>>> QDate + 900' evaluated to TRUE")
>>>>>>>>> After reviewing all the running services, I do not observe anything
>>>>>>>>> strange. I think it is an authentication problem, although I do not
>>>>>>>>> see anything strange in that respect either. So I ask the same
>>>>>>>>> question as you: has anyone seen similar behaviour?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Esteban
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> Both on my production & pps sites on gliteCEs i've got the following
>>>>>>>>>> logged exactly every 5 minutes and 30 seconds:
>>>>>>>>>> -----------------------------------------------------
>>>>>>>>>> Notice: 6: Got connection 131.154.100.148 at Sun May 13 07:08:59 2007
>>>>>>>>>>
>>>>>>>>>> Notice: 5: Trying to use delegated user proxy
>>>>>>>>>> Notice: 5: Authenticated globus user: /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
>>>>>>>>>> Notice: 0: GRID_SECURITY_HTTP_BODY_FD=9
>>>>>>>>>> Notice: 0: JOB_REPOSITORY_ID 2007-05-13.07:09:00.123457.0000000507.0000004146 (unique id used for Job Repository)
>>>>>>>>>> Notice: 0: FORMAT: YYYY-MM-DD.hh:mm:ss.micros.pid.connection
>>>>>>>>>> Notice: 0: (Format: <date>.<time (with microsecs)>.<pid>.<connection counter>)
>>>>>>>>>> Notice: 0: temporarily ALLOW empty credentials
>>>>>>>>>> Notice: 0: Using dlopen version of LCAS
>>>>>>>>>> Notice: 0: lcasmod_name = /opt/glite/lib/lcas.mod
>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 :
>>>>>>>>>> LCAS 7: 2007-05-13.07:09:00.123457.0000000507.0000004146 : Initialization LCAS version 1.3.1
>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_init(): Reading LCAS database /opt/glite/etc/lcas/lcas.db
>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 :
>>>>>>>>>> LCAS 5: 2007-05-13.07:09:00.123457.0000000507.0000004146 : LCAS authorization request
>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): user is /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_userban.mod-plugin_confirm_authorization(): checking banned users in /opt/glite/etc/lcas/ban_users.db
>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): authorization granted by plugin /opt/glite/lib/modules/lcas_userban.mod
>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_plugin_voms-plugin_confirm_authorization_from_x509(): Generic verification error for VOMS (failure)!
>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_plugin_voms-plugin_confirm_authorization_from_x509(): voms plugin failed
>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): authorization failed for plugin /opt/glite/lib/modules/lcas_voms.mod
>>>>>>>>>> LCAS 0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): failed
>>>>>>>>>> Failure: LCAS failed authorization.
>>>>>>>>>> Failure: LCAS failed authorization.
>>>>>>>>>> -----------------------------------------------------
>>>>>>>>>>
>>>>>>>>>> AFAIK /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS is the DN used to
>>>>>>>>>> submit tests from the SAM Admin Portal. The connection is coming from
>>>>>>>>>> the glite-rb-01.cnaf.infn.it WMS.
>>>>>>>>>> Any ideas why it tries exactly every 5 minutes 30 seconds? Does the WMS try to
>>>>>>>>>> monitor some previously sent jobs or what?
>>>>>>>>>>
>>>>>>>>>> What is more interesting is that when I try to submit jobs from the SAM
>>>>>>>>>> Admin Portal
>>>>>>>>>> to the production gliteCE, the job gets Aborted due to:
>>>>>>>>>> Job got an error while in the CondorG queue.
>>>>>>>>>> hit job shallow retry count (0)
>>>>>>>>>> In the job logging info I see that the job is submitted by
>>>>>>>>>> /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
>>>>>>>>>> But nothing is logged in /var/log/glite/gatekeeper.log &
>>>>>>>>>> /var/log/messages regarding lcas & lcmaps authentication.
>>>>>>>>>> Also there is nothing in /var/log/gridftp-lcas_lcmaps.log for the user.
>>>>>>>>>> But there is a mapping under /etc/grid-security/gridmapdir for the
>>>>>>>>>> /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS DN to ops003
>>>>>>>>>>
>>>>>>>>>> But what is even more strange is that when I submit from the SAM Admin Portal
>>>>>>>>>> to the pps gliteCE, the job is successfully submitted and executed by pbs and
>>>>>>>>>> a blah record is inserted into /var/log/glite/accounting/blahp.log-200705 ,
>>>>>>>>>> but again nothing is logged in either /var/log/glite/gatekeeper.log or
>>>>>>>>>> /var/log/messages. However, the authentication is logged in
>>>>>>>>>> /var/log/gridftp-lcas_lcmaps.log
>>>>>>>>>>
>>>>>>>>>> How can this be? On both pps & production, authentication is working
>>>>>>>>>> ok for all other users, with lcas & lcmaps messages logged as usual in
>>>>>>>>>> /var/log/glite/gatekeeper.log & /var/log/messages
>>>>>>>>>> And why does the submission work for the pps site only?
>>>>>>>>>>
>>>>>>>>>> Has anyone seen similar behaviour?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Alex
>>>>>>>> ------- End of Original Message -------
>>>>>>>>
>>>>> ------- End of Original Message -------
>>> ------- End of Original Message -------
> ------- End of Original Message -------