For what it's worth, no 'segfault' seen in 2008 logs but we're running
glite 3.0 on our CE :-( or maybe that should be :-)
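
For anyone who wants a bit more than the grep suggested below, here is a
minimal Python sketch that counts gatekeeper segfaults per user DN. It is
only a heuristic: the function name and the two regular expressions are my
own, modelled on the two /var/log/messages lines quoted below (unwrapped, one
record per line), and it simply blames the most recently authenticated DN,
which can be wrong on a busy CE.

```python
import re
from collections import Counter

# Patterns modelled on the excerpt quoted in this thread (assumed format).
AUTH_RE = re.compile(
    r"GRAM gatekeeper\[\d+\]: Authenticated globus user: (?P<dn>/.+)$"
)
SEGV_RE = re.compile(r"kernel: globus-gatekeep\[\d+\]: segfault at ")


def segfaults_by_dn(lines):
    """Count segfault records, keyed by the last DN authenticated before each.

    Heuristic only: the kernel line carries no DN, so the most recent
    'Authenticated globus user' line is used as a best guess.
    """
    counts = Counter()
    last_dn = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = AUTH_RE.search(line)
        if match:
            last_dn = match.group("dn").strip()
            continue
        if SEGV_RE.search(line):
            counts[last_dn or "<unknown>"] += 1
    return counts


def report(path):
    """Print 'count DN' pairs for one log file, most frequent first."""
    with open(path, errors="replace") as handle:
        for dn, count in segfaults_by_dn(handle).most_common():
            print(count, dn)
```

Calling report("/var/log/messages") as root on the CE should list the DNs
most often seen just before a segfault, which makes it easier to compare
before and after a restart.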
Graeme Stewart wrote:
> This has also been cropping up at Glasgow. In particular we can
> identify that the gatekeeper segfaults, e.g., an excerpt from
> /var/log/messages looks like:
>
> Apr 15 16:23:08 svr021 GRAM gatekeeper[31646]: Authenticated globus
> user: /O=GRID-FR/C=FR/O=INSERM/OU=GEMPC/CN=Alexandru Munteanu
> Apr 15 16:23:08 svr021 kernel: globus-gatekeep[31685]: segfault at
> 0000000000000046 rip 0000000000b86259 rsp 00000000ffff9d68 error 4
>
> strace doesn't seem to add much to the equation.
>
> It doesn't segfault reliably, but it only does it when people have
> VOMS credentials. It's not one VO, but it's reasonably consistent
> about who it will segfault on at any particular time.
>
> Restarts or reconfiguration seem to provoke it to segfault for
> different users (we iterate restarts until it doesn't fail SAM tests
> or any of our heavy users - then we tend to leave it be).
>
> If the segfault happens when the job is being submitted it provokes a
> "10 data transfer to the server failed" when gridftp fails. If it
> happens during job polling it seems to be fairly harmless.
>
> I have tried commenting out the lcas_voms.mod in lcas.db, but this
> doesn't seem to help. It seems that the segfault happens a while after
> this.
>
> Does anyone else see this? ("grep segfault /var/log/messages" is a
> quick way to look.)
>
> Cheers
>
> Graeme
>
> On 15 Apr 2008, at 11:15, Phil Roffe wrote:
> >I deleted the file
> >/etc/grid-security/vomsdir/atlas/lcg-voms.cern.ch.lsc but no success...
> >
> >TIME: Tue Apr 15 11:05:47 2008
> >PID: 8611 -- Notice: 5: Authenticated globus user: /C=UK/O=eScience/
> >OU=QueenMaryLondon/L=Physics/CN=steve lloyd
> >lcas client name: /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/
> >CN=steve lloyd
> >LCAS 0:
> >LCAS 1: Initialization LCAS version 1.3.7.0
> >allowing empty credentials
> >LCAS 2: LCAS authorization request
> >LCAS 0: lcas_userban.mod-plugin_confirm_authorization():
> >checking banned users in /opt/glite/etc/lcas/ban_users.db
> >LCAS 0: lcas_plugin_voms-
> >plugin_confirm_authorization_from_x509(): Did not find a matching VO
> >entry in the authorization file
> >LCAS 0: 2008-04-15.11:05:47 : lcas_plugin_voms-
> >plugin_confirm_authorization_from_x509(): voms plugin failed
> >LCAS 0: lcas.mod-lcas_run_va(): authorization failed for plugin /
> >opt/glite/lib/modules/lcas_voms.mod
> >LCAS 0: lcas.mod-lcas_run_va(): failed
> >
> >I'm going to spend a bit of time this afternoon looking at it again
> >so I'll keep you informed with my progress. If you have any ideas
> >let me know.
> >
> >Cheers,
> >Phil
> >
> >---
> >Phil Roffe
> >
> >IPPP, Department of Physics, Durham University,
> >Science Laboratories, South Road, Durham, DH1 3LE
> >Direct Dial: +44 (0)191 3343704
> >Office: +44 (0)191 334 3811
> >
> >
> >
> >Alessandra Forti wrote:
> >>Hi Phil,
> >>
> >>it shouldn't make any difference, but could you try removing
> >>lcg-voms.cern.ch, as it is obsolete?
> >>
> >>cheers
> >>alessandra
> >>
> >>
> >>
> >>Phil Roffe wrote:
> >>>Hi Alessandra,
> >>>
> >>>[root@ce01 ~]# ls -la /etc/grid-security/vomsdir/
> >>>total 144
> >>>drwxr-xr-x 20 root root 4096 Apr 10 17:57 .
> >>>drwxr-xr-x 6 root root 4096 Apr 14 17:16 ..
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 atlas
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 biomed
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 camont
> >>>-rw-r--r-- 1 root root 4627 Feb 21 03:48 cclcgvomsli01.in2p3.fr.1881
> >>>-rw-r--r-- 1 root root 4612 Feb 21 03:48 cclcgvomsli01.in2p3.fr.3292
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 cdf
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 cms
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 dteam
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 gridpp
> >>>-rw-r--r-- 1 root root 5132 Mar 5 13:32 grid-voms.desy.de
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 ilc
> >>>-rw-r--r-- 1 root root 5932 Feb 21 03:48 lcg-voms.cern.ch.2007-05-07
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 lhcb
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 mice
> >>>-rw-r--r-- 1 root root 2138 Apr 10 17:09 NEW.voms.gridpp.ac.uk.cert.pem
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 ngs.ac.uk
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 ops
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 pheno
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 supernemo.vo.eu-egee.org
> >>>drwxr-xr-x 7 root root 4096 Mar 28 10:09 .svn
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 totalep
> >>>-rw-r--r-- 1 root root 5912 Feb 21 03:48 voms.cern.ch.2007-10-15
> >>>-rw-r--r-- 1 root root 2138 Jan 2 23:08 voms.gridpp.ac.uk.hostcert.pem
> >>>-rw-r--r-- 1 root root 5938 Feb 21 03:48 voms-test.cern.ch.2007-10-15
> >>>-rw-r--r-- 1 root root 4821 Feb 21 03:48 vo.racf.bnl.gov.15998
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 vo.scotgrid.ac.uk
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 zeus
> >>>
> >>>[root@ce01 ~]# ls -la /etc/grid-security/vomsdir/atlas/
> >>>total 16
> >>>drwxr-xr-x 2 root root 4096 Mar 19 20:10 .
> >>>drwxr-xr-x 20 root root 4096 Apr 10 17:57 ..
> >>>-rw-r--r-- 1 root root 103 Apr 11 10:25 lcg-voms.cern.ch.lsc
> >>>-rw-r--r-- 1 root root 99 Apr 11 10:25 voms.cern.ch.lsc
> >>>
> >>>I investigated the certificates and even removed them all and
> >>>replaced them with the working ones from the backup before the
> >>>upgrade... but no success. I even copied the certs from Glasgow's
> >>>working CE but again to no avail. From what I can tell the
> >>>vomsdir is correct... unless you can see something?
> >>>
> >>>Also, the atlas VO is configured correctly...
> >>>[root@ce01 ~]# cat /opt/glite/yaim/etc/vo.d/atlas
> >>>SW_DIR=$VO_SW_DIR/atlas
> >>>DEFAULT_SE=$DPM_HOST
> >>>STORAGE_DIR=$DPM_BASE_PATH/atlas
> >>>VOMS_SERVERS="'vomss://voms.cern.ch:8443/voms/atlas?/atlas/'"
> >>>VOMSES="'atlas lcg-voms.cern.ch 15001 /DC=ch/DC=cern/OU=computers/
> >>>CN=lcg-voms.cern.ch atlas' 'atlas voms.cern.ch 15001 /DC=ch/
> >>>DC=cern/OU=computers/CN=voms.cern.ch atlas'"
> >>>VOMS_CA_DN="'/DC=ch/DC=cern/CN=CERN Trusted Certification
> >>>Authority' '/DC=ch/DC=cern/CN=CERN Trusted Certification Authority'"
> >>>
> >>>Cheers,
> >>>Phil
> >>>
> >>>Alessandra Forti wrote:
> >>>>Hi Phil,
> >>>>
> >>>>what is the content of your /etc/grid-security/vomsdir ?
> >>>>
> >>>>Do you have the correct configuration for the CERN VOMS DN and
> >>>>VOMS CA DNs?
> >>>>
> >>>>cheers
> >>>>alessandra
> >>>>
> >>>>
> >>>>Phil Roffe wrote:
> >>>>>Morning all,
> >>>>>
> >>>>>Durham are having a problem passing Steve Lloyd's tests due to
> >>>>>LCAS authorization failures. Last month we performed a clean
> >>>>>reinstall to SL4 on the CE and WNs and the problem has occurred since.
> >>>>>Interestingly some users are authenticated fine, but others are
> >>>>>not (e.g. Steve Lloyd). The error message is...
> >>>>>
> >>>>>TIME: Mon Apr 14 09:43:05 2008
> >>>>>PID: 6131 -- Notice: 5: Authenticated globus user: /C=UK/
> >>>>>O=eScience/OU=QueenMaryLondon/L=Physics/CN=steve lloyd
> >>>>>lcas client name: /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/
> >>>>>CN=steve lloyd
> >>>>>LCAS 0:
> >>>>>LCAS 1: Initialization LCAS version 1.3.7.0
> >>>>>allowing empty credentials
> >>>>>LCAS 2: LCAS authorization request
> >>>>>LCAS 0: lcas_userban.mod-plugin_confirm_authorization():
> >>>>>checking banned users in /opt/glite/etc/lcas/ban_users.db
> >>>>>LCAS 0: lcas_plugin_voms-
> >>>>>plugin_confirm_authorization_from_x509(): Did not find a
> >>>>>matching VO entry in the authorization file
> >>>>>LCAS 0: 2008-04-14.09:43:05 : lcas_plugin_voms-
> >>>>>plugin_confirm_authorization_from_x509(): voms plugin failed
> >>>>>LCAS 0: lcas.mod-lcas_run_va(): authorization failed for
> >>>>>plugin /opt/glite/lib/modules/lcas_voms.mod
> >>>>>LCAS 0: lcas.mod-lcas_run_va(): failed
> >>>>>
> >>>>>This results in the "Job RetryCount (3) hit" and "10 data
> >>>>>transfer to the server failed" error.
> >>>>>http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/atest.html
> >>>>>
> >>>>>I have narrowed the problem down to the LCAS plugin in
> >>>>>/opt/glite/etc/lcas/lcas.db...
> >>>>>#pluginname=lcas_voms.mod,pluginargs="-vomsdir /etc/grid-security/vomsdir/ -certdir /etc/grid-security/certificates/ -authfile /etc/grid-security/grid-mapfile -authformat simple -use_user_dn"
> >>>>>
> >>>>>With the above line commented out (which I tried over the w/e)
> >>>>>then the user is successfully mapped and the job runs
> >>>>>successfully. This does not seem to be the right solution to me
> >>>>>so I put the line back in and we are failing ATLAS tests again.
> >>>>>Glasgow have seen something similar with the CE segfaulting, but
> >>>>>fixed it by reconfiguring and running YAIM... which doesn't seem
> >>>>>to work for me.
> >>>>>
> >>>>>Currently up to date with all RPMs, running 32-bit SL4.6. I have
> >>>>>tried reinstalling the RPMs and running YAIM again but to no
> >>>>>avail. Certificates seem to be installed correctly and the user is
> >>>>>in the grid-mapfile.
> >>>>># rpm -qa | grep lcas
> >>>>>glite-security-lcas-plugins-check-executable-1.2.1-1.slc4
> >>>>>glite-security-lcas-interface-1.3.6-1.slc4
> >>>>>glite-security-lcas-plugins-voms-1.3.3-1.slc4
> >>>>>glite-security-lcas-1.3.7-0.slc4
> >>>>>glite-security-lcas-plugins-basic-1.3.2-2.slc4
> >>>>>glite-security-lcas-lcmaps-gt4-interface-0.0.13-1.slc4
> >>>>># rpm -qa | grep lcg-CA
> >>>>>lcg-CA-1.20-1
> >>>>>
> >>>>>Has anyone else seen this issue? Any ideas?
> >>>>>
> >>>>>Cheers,
> >>>>>Phil
> >>>>>
> >>>>
> >>>
> >>
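
A rough Python sketch for summarising the LCAS failures quoted above, in case
it helps to see which users and which plugin are involved across a whole
log. The function name and regular expressions are hypothetical, modelled on
the "lcas client name:" and "authorization failed for plugin" lines in Phil's
excerpts, and they assume each record sits on a single line in the real log
(the excerpts here were wrapped by the mail client). The log path is not
given in this thread, so it is left as an argument.

```python
import re
from collections import Counter

# Patterns modelled on the LCAS excerpts in this thread (assumed format).
CLIENT_RE = re.compile(r"^lcas client name: (?P<dn>/.+)$")
FAILED_RE = re.compile(r"authorization failed for plugin (?P<plugin>\S+)")


def failed_plugins(lines):
    """Count (DN, plugin) pairs for every 'authorization failed' record.

    The DN is taken from the most recent 'lcas client name:' line, so a
    failure seen before any client line is recorded with a DN of None.
    """
    failures = Counter()
    current_dn = None
    for raw in lines:
        line = raw.strip()
        match = CLIENT_RE.match(line)
        if match:
            current_dn = match.group("dn")
            continue
        match = FAILED_RE.search(line)
        if match:
            failures[(current_dn, match.group("plugin"))] += 1
    return failures


def report(path):
    """Print 'count plugin DN' for one log file, most frequent first."""
    with open(path, errors="replace") as handle:
        for (dn, plugin), count in failed_plugins(handle).most_common():
            print(count, plugin, dn)
```

If only lcas_voms.mod ever appears in the output, and only for a subset of
DNs, that would at least be consistent with what Phil describes.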