The other thing you can try is to check whether the CRL of the CA that
issued the user certificate is up to date on all WNs. As an example, my
certificate is issued by LIPCA:
[root@wn001 ~]# rpm -qa | grep LIP
ca_LIPCA-1.31-1
[root@wn001 ~]# rpm -ql ca_LIPCA-1.31-1
/etc/grid-security/certificates
/etc/grid-security/certificates/11b4a5a2.0
/etc/grid-security/certificates/11b4a5a2.crl_url
/etc/grid-security/certificates/11b4a5a2.info
/etc/grid-security/certificates/11b4a5a2.namespaces
/etc/grid-security/certificates/11b4a5a2.signing_policy
The LIPCA CRL should be /etc/grid-security/certificates/11b4a5a2.r0:
[root@wn001 ~]# ll /etc/grid-security/certificates/11b4a5a2.r0
-rw-r--r-- 1 root root 3469 Sep 18 09:53 /etc/grid-security/certificates/11b4a5a2.r0
You can check whether it is up to date using openssl:
[root@wn001 ~]# openssl crl -text -noout -in /etc/grid-security/certificates/11b4a5a2.r0 | grep -A 2 Issuer
Issuer: /C=PT/O=LIPCA/CN=LIP Certification Authority
Last Update: Sep 4 14:15:56 2009 GMT
Next Update: Oct 4 14:15:56 2009 GMT
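The field that matters is "Next Update": once that moment has passed, the CRL is stale and the node needs a fresh copy. The check can be scripted as below. This is a sketch, assuming bash and GNU date (which accepts strings such as "Oct  4 14:15:56 2009 GMT"); it uses openssl's -nextupdate option to print only that field. The function names next_update_passed and check_crl_freshness are hypothetical.

```shell
#!/usr/bin/env bash
# Return 0 (true) if the given nextUpdate timestamp is already in the past.
# Assumes GNU date, which can parse strings like "Oct  4 14:15:56 2009 GMT".
next_update_passed() {
    local next_epoch now_epoch
    next_epoch="$(date -u -d "$1" +%s)" || return 2   # unparsable date
    now_epoch="$(date -u +%s)"
    [ "$now_epoch" -ge "$next_epoch" ]
}

# Check one CRL file: print its status, return 1 if it is stale, 2 on error.
check_crl_freshness() {
    local crl="$1" next
    # openssl prints e.g. "nextUpdate=Oct  4 14:15:56 2009 GMT"
    next="$(openssl crl -noout -nextupdate -in "$crl")" || return 2
    next="${next#nextUpdate=}"
    if next_update_passed "$next"; then
        echo "STALE: $crl (nextUpdate $next)"
        return 1
    fi
    echo "OK: $crl (nextUpdate $next)"
}
```

On a WN, a loop such as for f in /etc/grid-security/certificates/*.r0; do check_crl_freshness "$f"; done lists every CRL with its status; running the same loop on each node (for example via ssh) shows whether any WN is lagging behind.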
Hope it helps
Cheers
Goncalo
On 09/18/2009 02:00 PM, Arnau Bria wrote:
> On Fri, 18 Sep 2009 13:53:31 +0100
> Gonçalo Borges wrote:
>
>
>> Hi Arnau...
>>
> Hi Gonçalo!
>
>
>
>> I think what I'm going to say is documented in some wiki, and you have
>> probably already tried it. Nevertheless, here it goes...
>>
>> One possible source of that error is that globus-url-copy is not working
>> between your WN and the WMS or CE. If you know on which WN the job ran
>> (which you could infer indirectly from the user/time of the job by
>> searching through the PBS accounting file...), you could try a
>> globus-url-copy as:
>>
> Our torque logs show no errors for the cmprd003 user. All their jobs
> finished fine from torque's point of view.
> I don't know when the error happens; I only know how many jobs
> fail per day...
>
> [...]
>
> I've already tried it with my dteam proxy. It worked fine on ALL nodes.
>
> Thank you anyway.
>
>
>
>> As an example, some time ago I had the same error message on some
>> WNs installed with gLite 3.2. Because these WNs also had InfiniBand
>> software installed, there were some conflicts during the gLite
>> installation, and some VDT packages were not installed, among them
>> the one providing the globus-url-copy command.
>>
> We use puppet to configure the nodes, and I've checked that all nodes
> have the same number/version of packages, so there are no differences
> between nodes. I've updated the yaim conf files and ran yaim on all
> nodes... so all nodes share the same conf. I'm really lost on this
> random error.
>
> From yesterday, other VOs will start running jobs on those nodes, so
> maybe the ATLAS or LHCb logs will point out some relevant info...
>
> I'll update this issue then...
>
> thanks for your reply!
>
>
>> Cheers
>> Goncalo
>>
> Cheers,
> Arnau
>
>