JISCMail - LCG-ROLLOUT Archives

Hi Antun,

> 
> Seems that your proposed solution works, since after the changes you
> suggested, I successfully executed one SAM job on our gCE using the SAM
> Admin's page, and this was not possible for several days now. Thanks!
> 
> I would personally prefer a more elegant solution, i.e. your option 2.

As I mentioned, one of my colleagues is testing which solution is best. 
Currently he prefers point 2 as well.

> 
> Note that this should be widely publicized, since many gCEs are failing SAM
> tests due to this problem, and this badly affects their availabilities.

Sure, we will write up the trick somewhere, for example on the GOC wiki page in 
Taiwan, after the test.
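
Until that write-up exists, option 1 as applied in this thread could be sketched roughly as below. This is only an illustration under assumptions: link_state_dir is a hypothetical helper name, and unlike the manual steps quoted further down it moves an existing directory aside instead of deleting its contents.

```shell
#!/bin/sh
# Sketch of option 1: make LINK a symlink to TARGET so that both
# paths resolve to the same state-file directory.
# link_state_dir is a hypothetical helper, not part of gLite or Globus.
link_state_dir() {
    target=$1
    link=$2
    # Refuse to create a dangling link.
    [ -d "$target" ] || { echo "missing target: $target" >&2; return 1; }
    if [ -L "$link" ]; then
        rm -f "$link"                # replace an existing symlink
    elif [ -e "$link" ]; then
        mv "$link" "$link.bak.$$"    # keep the old contents as a backup
    fi
    ln -s "$target" "$link"
}
```

On a gCE this would presumably be run as root as
link_state_dir /var/glite/gram_job_state /opt/globus/tmp/gram_job_state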

Cheers,

Di
> 
> Thanks again!
> 
> Best regards, Antun
> 
> 
> -----
> Antun Balaz
> Research Assistant
> E-mail: [log in to unmask]
> Web: http://scl.phy.bg.ac.yu/
> 
> Phone: +381 11 3713152
> Fax: +381 11 3162190
> 
> Scientific Computing Laboratory
> Institute of Physics, Belgrade, Serbia
> -----
> 
> ---------- Original Message -----------
> From: Di Qing <[log in to unmask]>
> To: [log in to unmask]
> Sent: Tue, 15 May 2007 11:59:46 +0200
> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
> 
>> Antun Balaz wrote:
>>> Hi Di,
>>>
>>> This is the situation:
>>>
>>> 1) /opt/globus/tmp/gram_job_state was not empty, but contained some files
>>> created until February. I removed them, and created a soft link:
>>>
>>> ln -s /var/glite/gram_job_state /opt/globus/tmp/gram_job_state
>>>
>>> Btw. /var/glite/gram_job_state contains a lot of gram_job_state files. A cleanup
>>> cron job could do that directory some good.
>> If it works properly, there should not be so many files left.
>>
>>> 2) Both /opt/glite/etc/grid-services/jobmanager-fork and
>>> /opt/globus/etc/grid-services/jobmanager exist.
>>> /opt/globus/etc/grid-services/jobmanager is a link to
>>> /opt/globus/etc/grid-services/jobmanager-fork. This file uses
>>> /opt/globus/etc/globus-job-manager.conf file, while
>>> /opt/glite/etc/grid-services/jobmanager-fork uses
>>> /opt/glite/etc/globus-job-manager.conf. At the end of the story, these two
>>> conf files differ just in the directory for gram_job_state, which is fixed in
>>> point 1), so no need to change anything here. Correct?
>>>
>> Yes, if you already tried to fix it according to point 1), you don't 
>> need to do this. They are two different solutions. We are still 
>> testing which one is the best.
>>
>>> Is it enough to restart gLite on this gCE to see if the situation has improved?
>> I don't think you need to restart gLite, since gridmonitor can now 
>> find the state files in the symlinked directory.
>>
>> Di
>>
>>> Thanks, Antun
>>>
>>> -----
>>> Antun Balaz
>>> Research Assistant
>>> E-mail: [log in to unmask]
>>> Web: http://scl.phy.bg.ac.yu/
>>>
>>> Phone: +381 11 3713152
>>> Fax: +381 11 3162190
>>>
>>> Scientific Computing Laboratory
>>> Institute of Physics, Belgrade, Serbia
>>> -----
>>>
>>> ---------- Original Message -----------
>>> From: Di Qing <[log in to unmask]>
>>> To: [log in to unmask]
>>> Sent: Tue, 15 May 2007 11:17:47 +0200
>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>>>
>>>> Hi Antun,
>>>>
>>>> There are two possible solutions on the CE: 1) create a link 
>>>> /var/glite/gram_job_state to /opt/globus/tmp/gram_job_state or 
>>>> /opt/globus/tmp/gram_job_state to /var/glite/gram_job_state, 2) 
>>>> create a link /opt/glite/etc/grid-services/jobmanager-fork to 
>>>> /opt/globus/etc/grid-services/jobmanager. Very recently Francesco 
>>>> Prelz found an inconsistency between the WMS and the gLite CE: 
>>>> GRAM saves its state files under 
>>>> /var/glite/gram_job_state/, as specified by 
>>>> /opt/glite/etc/gatekeeper.conf; however, for job status checking, 
>>>> the gridmonitor looks for the state file directory starting 
>>>> from $GLOBUS_LOCATION/etc/grid-services, which leads to the Globus default 
>>>> location /opt/globus/tmp/gram_job_state/, which is empty, so the 
>>>> state of fork jobs is never correctly updated.
>>>>
>>>> We are testing if it is true and which one is the best solution for 
>>>> this. But you can try it as well.
>>>>
>>>> Cheers,
>>>>
>>>> Di
>>>>
>>>> Antun Balaz wrote:
>>>>> Hi Di,
>>>>>
>>>>> It is OK if you are the admin of the WMS, but here we are talking about WMS
>>>>> servers used by SAM - if they are malfunctioning due to this bug, then a
>>>>> whole lot of sites are affected!
>>>>>
>>>>> Something similar probably happened to rb108.cern.ch, which started to give
>>>>> PeriodicHold errors almost every time SAM tests were sent through it. It was
>>>>> recently replaced by rb118.cern.ch, but this has started to happen from time to
>>>>> time again.
>>>>>
>>>>> I was thinking more about a permanent fix...
>>>>>
>>>>> Thanks, Antun
>>>>>
>>>>> -----
>>>>> Antun Balaz
>>>>> Research Assistant
>>>>> E-mail: [log in to unmask]
>>>>> Web: http://scl.phy.bg.ac.yu/
>>>>>
>>>>> Phone: +381 11 3713152
>>>>> Fax: +381 11 3162190
>>>>>
>>>>> Scientific Computing Laboratory
>>>>> Institute of Physics, Belgrade, Serbia
>>>>> -----
>>>>>
>>>>> ---------- Original Message -----------
>>>>> From: Di Qing <[log in to unmask]>
>>>>> To: [log in to unmask]
>>>>> Sent: Tue, 15 May 2007 10:49:55 +0200
>>>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>>>>>
>>>>>> Hi Antun,
>>>>>>
>>>>>> Usually we look for the launcher job in the Condor queue on the 
>>>>>> WMS (its name is like condorc-launcher-s) and then remove it with condor_rm.
>>>>>>
>>>>>> Di
>>>>>>
>>>>>> Antun Balaz wrote:
>>>>>>> Hi Di,
>>>>>>>
>>>>>>> And how to solve this problem?
>>>>>>>
>>>>>>> Thanks, Antun
>>>>>>>
>>>>>>> -----
>>>>>>> Antun Balaz
>>>>>>> Research Assistant
>>>>>>> E-mail: [log in to unmask]
>>>>>>> Web: http://scl.phy.bg.ac.yu/
>>>>>>>
>>>>>>> Phone: +381 11 3713152
>>>>>>> Fax: +381 11 3162190
>>>>>>>
>>>>>>> Scientific Computing Laboratory
>>>>>>> Institute of Physics, Belgrade, Serbia
>>>>>>> -----
>>>>>>>
>>>>>>> ---------- Original Message -----------
>>>>>>> From: Di Qing <[log in to unmask]>
>>>>>>> To: [log in to unmask]
>>>>>>> Sent: Tue, 15 May 2007 10:41:56 +0200
>>>>>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>>>>>>>
>>>>>>>> If the Condor instances for the jobs submitted by the SAM portal are 
>>>>>>>> running on the gLite CE, then when new jobs arrive the WMS will bypass 
>>>>>>>> the gatekeeper and submit jobs directly to the Condor instance. As for the 
>>>>>>>> periodic log messages in the gatekeeper log or /var/log/messages, I 
>>>>>>>> think the WMS tried to launch the Condor instance but failed, 
>>>>>>>> and then retried again and again.
>>>>>>>>
>>>>>>>> Di
>>>>>>>>
>>>>>>>> Alexander Piavka wrote:
>>>>>>>>>  Hi Antun,
>>>>>>>>>
>>>>>>>>> What disturbs me more is that on the PPS site the SAM portal jobs
>>>>>>>>> are successfully executed, but the only
>>>>>>>>> trace of LCAS is in /var/log/gridftp-lcas_lcmaps.log.
>>>>>>>>> There are no traces in /var/log/glite/gatekeeper.log & /var/log/messages.
>>>>>>>>> So it looks like a security problem, but I can't understand how this can be
>>>>>>>>> happening only for jobs submitted from the SAM portal and not for all jobs,
>>>>>>>>> since gatekeeper authentication is always running, and it is
>>>>>>>>> not related to https://gus.fzk.de/pages/ticket_details.php?ticket=20625
>>>>>>>>>
>>>>>>>>>  Thanks
>>>>>>>>>  Alex
>>>>>>>>>
>>>>>>>>> On Tue, 15 May 2007, Antun Balaz wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We see this almost all the time, and it is a long-standing problem. Since it
>>>>>>>>>> appears from time to time (it is not always there), without any changes from
>>>>>>>>>> our side, we think that it is related to some WMS problem, and not to gCE
>>>>>>>>>> problems.
>>>>>>>>>>
>>>>>>>>>> Somewhat related is the following ticket (although no mapping problems there):
>>>>>>>>>> https://gus.fzk.de/pages/ticket_details.php?ticket=20625
>>>>>>>>>>
>>>>>>>>>> However, I don't know what the status of the improvements mentioned there is...
>>>>>>>>>>
>>>>>>>>>> Regards, Antun
>>>>>>>>>>
>>>>>>>>>> -----
>>>>>>>>>> Antun Balaz
>>>>>>>>>> Research Assistant
>>>>>>>>>> E-mail: [log in to unmask]
>>>>>>>>>> Web: http://scl.phy.bg.ac.yu/
>>>>>>>>>>
>>>>>>>>>> Phone: +381 11 3713152
>>>>>>>>>> Fax: +381 11 3162190
>>>>>>>>>>
>>>>>>>>>> Scientific Computing Laboratory
>>>>>>>>>> Institute of Physics, Belgrade, Serbia
>>>>>>>>>> -----
>>>>>>>>>>
>>>>>>>>>> ---------- Original Message -----------
>>>>>>>>>> From: Esteban Freire Garcia <[log in to unmask]>
>>>>>>>>>> To: [log in to unmask]
>>>>>>>>>> Sent: Mon, 14 May 2007 22:50:47 +0200
>>>>>>>>>> Subject: Re: [LCG-ROLLOUT] LCAS/LCMAPS strange behaviour
>>>>>>>>>>
>>>>>>>>>>> Hi Alex,
>>>>>>>>>>>
>>>>>>>>>>> Since upgrade 29 we have had a very similar incident on PPS, with
>>>>>>>>>>> similar logs, although I am not sure that the problem dates from the
>>>>>>>>>>> upgrade; in principle I didn't observe anything strange after the
>>>>>>>>>>> upgrade. What is curious is that, on the monitoring page, the
>>>>>>>>>>> tests that are run automatically every hour have a status of OK on
>>>>>>>>>>> PPS; however, if I try to send a test from the SAM Admin's page, the
>>>>>>>>>>> job is aborted with the following error:
>>>>>>>>>>>
>>>>>>>>>>> (reason = Got a job held event, reason: "The job attribute
>>>>>>>>>>> PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate +
>>>>>>>>>>> 900' evaluated to TRUE")
>>>>>>>>>>>
>>>>>>>>>>> After reviewing all the running services, I do not observe anything
>>>>>>>>>>> strange. I think that it is an authentication problem, although
>>>>>>>>>>> I do not see anything strange in that respect either. So I
>>>>>>>>>>> ask the same question as you: has anyone seen similar behaviour?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Esteban
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> Both on my production & pps sites on gliteCEs i've got the following
>>>>>>>>>>>> logged exactly every 5 minutes and 30 seconds:
>>>>>>>>>>>> -----------------------------------------------------
>>>>>>>>>>>> Notice: 6: Got connection 131.154.100.148 at Sun May 13 07:08:59 2007
>>>>>>>>>>>>
>>>>>>>>>>>> Notice: 5: Trying to use delegated user proxy
>>>>>>>>>>>> Notice: 5: Authenticated globus user: /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
>>>>>>>>>>>> Notice: 0: GRID_SECURITY_HTTP_BODY_FD=9
>>>>>>>>>>>> Notice: 0: JOB_REPOSITORY_ID 2007-05-13.07:09:00.123457.0000000507.0000004146 (unique id used for Job Repository)
>>>>>>>>>>>> Notice: 0: FORMAT: YYYY-MM-DD.hh:mm:ss.micros.pid.connection
>>>>>>>>>>>> Notice: 0: (Format: <date>.<time (with microsecs)>.<pid>.<connection counter>)
>>>>>>>>>>>> Notice: 0: temporarily ALLOW empty credentials
>>>>>>>>>>>> Notice: 0: Using dlopen version of LCAS
>>>>>>>>>>>> Notice: 0: lcasmod_name = /opt/glite/lib/lcas.mod
>>>>>>>>>>>> LCAS   0: 2007-05-13.07:09:00.123457.0000000507.0000004146 :
>>>>>>>>>>>> LCAS   7: 2007-05-13.07:09:00.123457.0000000507.0000004146 : Initialization LCAS version 1.3.1
>>>>>>>>>>>> LCAS   0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_init(): Reading LCAS database /opt/glite/etc/lcas/lcas.db
>>>>>>>>>>>> LCAS   0: 2007-05-13.07:09:00.123457.0000000507.0000004146 :
>>>>>>>>>>>> LCAS   5: 2007-05-13.07:09:00.123457.0000000507.0000004146 : LCAS authorization request
>>>>>>>>>>>> LCAS   0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): user is /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
>>>>>>>>>>>> LCAS   0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_userban.mod-plugin_confirm_authorization(): checking banned users in /opt/glite/etc/lcas/ban_users.db
>>>>>>>>>>>> LCAS   0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): authorization granted by plugin /opt/glite/lib/modules/lcas_userban.mod
>>>>>>>>>>>> LCAS   0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_plugin_voms-plugin_confirm_authorization_from_x509(): Generic verification error for VOMS (failure)!
>>>>>>>>>>>> LCAS   0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas_plugin_voms-plugin_confirm_authorization_from_x509(): voms plugin failed
>>>>>>>>>>>> LCAS   0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): authorization failed for plugin /opt/glite/lib/modules/lcas_voms.mod
>>>>>>>>>>>> LCAS   0: 2007-05-13.07:09:00.123457.0000000507.0000004146 : lcas.mod-lcas_run_va(): failed
>>>>>>>>>>>> Failure: LCAS failed authorization.
>>>>>>>>>>>> Failure: LCAS failed authorization.
>>>>>>>>>>>> -----------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> AFAIK /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS is the DN used to
>>>>>>>>>>>> submit tests from the SAM Admin Portal. The connection is coming from
>>>>>>>>>>>> the glite-rb-01.cnaf.infn.it WMS.
>>>>>>>>>>>> Any ideas why it tries exactly every 5:30 minutes? Does the WMS try to
>>>>>>>>>>>> monitor some previously sent jobs, or what?
>>>>>>>>>>>>
>>>>>>>>>>>> What is more interesting is that when I try to submit jobs from the SAM
>>>>>>>>>>>> Admin Portal
>>>>>>>>>>>> to the production gliteCE, the job gets aborted due to:
>>>>>>>>>>>> Job got an error while in the CondorG queue.
>>>>>>>>>>>> hit job shallow retry count (0)
>>>>>>>>>>>> In the job logging info I see that the job is submitted by
>>>>>>>>>>>> /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS
>>>>>>>>>>>> But nothing is logged in /var/log/glite/gatekeeper.log &
>>>>>>>>>>>> /var/log/messages regarding LCAS & LCMAPS authentication.
>>>>>>>>>>>> Also there is nothing in /var/log/gridftp-lcas_lcmaps.log for the user.
>>>>>>>>>>>> But there is a mapping under /etc/grid-security/gridmapdir for the
>>>>>>>>>>>> /C=PL/O=GRID/O=PSNC/CN=Rafal Lichwala - OPS DN to ops003
>>>>>>>>>>>>
>>>>>>>>>>>> But what is even more strange is that when I submit from the SAM Admin Portal
>>>>>>>>>>>> to the pps gliteCE, the job is successfully submitted and executed by PBS and
>>>>>>>>>>>> a blah record is inserted into /var/log/glite/accounting/blahp.log-200705,
>>>>>>>>>>>> but again nothing is logged in either /var/log/glite/gatekeeper.log or
>>>>>>>>>>>> /var/log/messages. However, the authentication is logged in
>>>>>>>>>>>> /var/log/gridftp-lcas_lcmaps.log
>>>>>>>>>>>>
>>>>>>>>>>>> How can this be? At both pps & production I have authentication working
>>>>>>>>>>>> OK for all other users, with LCAS & LCMAPS messages logged as usual in
>>>>>>>>>>>> /var/log/glite/gatekeeper.log & /var/log/messages.
>>>>>>>>>>>> And why does the submission work for the pps site only?
>>>>>>>>>>>>
>>>>>>>>>>>> Has anyone seen similar behaviour?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Alex
>>>>>>>>>> ------- End of Original Message -------
>>>>>>>>>>
>>>>>>> ------- End of Original Message -------
>>>>> ------- End of Original Message -------
>>> ------- End of Original Message -------
> ------- End of Original Message -------
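
The 5 minute 30 second retry period reported in the first message of this thread can be checked against a saved gatekeeper log with a sketch like the one below. It assumes the "Got connection ... at <weekday> <month> <day> <hh:mm:ss> <year>" line format shown in the quoted log excerpt, and connection_intervals is a hypothetical helper name. It prints the gap in seconds between consecutive connection lines, so a steady retry loop would show up as a run of 330s.

```shell
#!/bin/sh
# Print the number of seconds between consecutive "Got connection" lines.
# Reads the files given as arguments, or standard input if none are given.
# connection_intervals is a hypothetical helper, not a gLite tool.
connection_intervals() {
    awk '/Got connection/ {
        # The time of day is the next-to-last field, e.g. 07:08:59.
        split($(NF-1), t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (seen) {
            d = now - prev
            if (d < 0) d += 86400   # the gap crossed midnight
            print d
        }
        prev = now
        seen = 1
    }' "$@"
}
```

For example, connection_intervals /var/log/glite/gatekeeper.log | sort -n | uniq -c would summarise how often each gap occurs. Only the time of day is compared, so gaps longer than a day are not handled.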