David,
This is an old and long story, but, to make it short:
- I wrote a script which automatically splits the wrapper script
into 3 parts
- This splitting script runs within the "globus-script-lsf-submit"
part of the job-manager process, while preparing the request to be submitted
to the Batch Scheduler (LSF in this case)
- The 3 parts:
. Part 1 runs within the job-manager process (on the CE), before
the LSF request is submitted, and it fetches all the required input files from the
RB into a special request-specific dir in the .gass_cache of the user
. Part 2 is simply the remaining part of the original wrapper script,
which runs the actual request; it runs on the WN, as you could guess,
and leaves the output files in the special request-specific dir
. Part 3 runs on the CE once the batch request is found to have finished
on the WN (which is rather NOT easy to work out), and it uploads all these
output files from the CE to the RB
- This is rather simple to explain, but error-prone to implement, and it
overloads the CE itself, of course. But it temporarily solves the problem
of the Worker Nodes being hidden behind the firewall.
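
As an illustration only, the three-way split described above could be sketched
as follows, in Python. The two marker comments are hypothetical: the real
wrapper script carries no such markers that I can point to, so an actual
splitter would have to recognise the staging sections some other way.

```python
# Sketch only: these marker lines are hypothetical, not real wrapper-script content.
STAGE_IN_END = "# --- end of input staging ---"
STAGE_OUT_BEGIN = "# --- start of output staging ---"


def split_wrapper(text):
    """Split a wrapper script into (stage_in, run, stage_out) strings."""
    lines = text.splitlines()
    try:
        in_end = lines.index(STAGE_IN_END)
        out_begin = lines.index(STAGE_OUT_BEGIN)
    except ValueError:
        raise ValueError("wrapper script lacks the expected markers") from None
    if out_begin <= in_end:
        raise ValueError("markers are out of order")
    stage_in = lines[: in_end + 1]       # Part 1: run on the CE before submission
    run = lines[in_end + 1 : out_begin]  # Part 2: run on the WN
    stage_out = lines[out_begin:]        # Part 3: run on the CE after completion
    return "\n".join(stage_in), "\n".join(run), "\n".join(stage_out)
```

Failing loudly when a marker is missing seems safer than guessing, since a
wrongly split script would run its staging commands on the wrong machine.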
The splitting script is currently an awful hack (I'm not proud of
it at all), and it already needed rewriting in December, before
edg-1.3.4. Since then, some developers introduced more stuff into the wrapper
script, which made my script totally obsolete, albeit still running. If this is
really useful, I can try to rewrite it, in a general way, in Perl (sorry),
taking care of the new stuff which was introduced. This could be a very
good opportunity to really check the return status of the "globus-url-copy"
commands which I found here, and also to check that there are no spurious
messages suggesting that the command did not run properly
even though its return status was 0 ;;-<, which I spotted on a large number of
occasions.
But plugging it into the "globus-script-XXX-submit" is probably more up
to the relevant customers.
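
The double check mentioned above (return status plus a scan of the output for
suspicious messages) might look roughly like this in Python. This is a sketch
under assumptions: the pattern list is hypothetical and would need tuning to
the messages actually seen, and `run_checked` is an illustrative name, not
part of any Globus tool.

```python
import re
import subprocess

# Hypothetical patterns: adjust to the spurious messages actually observed.
SUSPICIOUS = re.compile(r"error|fail|denied|refused", re.IGNORECASE)


def run_checked(cmd):
    """Run cmd (a list of arguments) and return (ok, reason)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # First check: the exit status itself.
    if proc.returncode != 0:
        return False, "exit status %d" % proc.returncode
    # Second check: a zero status is not trusted if the output looks wrong.
    for line in (proc.stdout + "\n" + proc.stderr).splitlines():
        if SUSPICIOUS.search(line):
            return False, "exit status 0 but suspicious output: " + line.strip()
    return True, "ok"
```

It would be called as `run_checked(["globus-url-copy", src_url, dst_url])`,
where `src_url` and `dst_url` stand for whatever source and destination URLs
the wrapper script uses.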
HTH, Cheers, Gilbert.
On Tue, 4 Mar 2003, Dr D J Colling wrote:
> Hi Gilbert,
>
> I would be very interested to know how you hacked the wrapper script... as
> you know I am looking at the possibility of providing a general solution
> for this (at least for the WP1 stuff).
>
> All the best,
> david
>
> On Tue, 4 Mar 2003, Gilbert Grosdidier wrote:
>
> > Hi David,
> >
> > You gave me a very strong hint below !
> >
> > On Tue, 4 Mar 2003, Dr D J Colling wrote:
> >
> > > Ok people,
> > >
> > > I see two different problems here, one more serious than the other. So we
> > > shall deal with the serious first. There is an authentication problem for
> > > Liverpool, Oxford, Cambridge and SLAC. All that I see in the log files are
> > > really useless messages such as " Reason: authentication with the remote
> > > server failed" and "Globus Failure".
> > >
> > > Gilbert tells me that the SLAC CRLs are up to date.... I have no reason to
> > > doubt him, besides I would normally get a more informative message in the
> > > logs if that were the case. The problem appears to happen at some point
> > > where some files are being transferred (maybe by the wrapper script).
> > > Have there been firewall changes at any of these sites? I shall have a
> > > deeper look at this and see if I can see something, but any ideas would be
> > > welcome.
> > > [...]
> >
> > Yes, this rings a bell for me: yes, I did have to hack the wrapper script,
> > since the worker nodes are located behind a firewall at SLAC, and I can't
> > make them access anything in the outside world. But this was running rather
> > smoothly until February 17th at least.
> >
> > But I got no trace of the request being even forwarded to the SLAC CE, since
> > this is the only message showing in the gatekeeper log file:
> >
> > ------------------
> > GRAM contact:
> > bbr-gate01.slac.stanford.edu:2119:/O=doesciencegrid.org/OU=Services/CN=bbr-gate01.slac.stanford.edu
> > Notice: 6: Got connection 155.198.216.19 at Mon Mar 3 05:31:46 2003
> >
> > Failed reading length 0
> > GSS authentication failure
> > Communications Error
> > globus_gss_assist token :3: failure: Connection closed
> > GSS status: major:00090000 minor: 00008000 token: 00000003
> > Failure: GSS failed Major:00090000 Minor:00008000 Token:00000003
> >
> > Failure: GSS failed Major:00090000 Minor:00008000 Token:00000003
> > ------------------
> >
> > which IMHO shows that the authentication step was not passed.
> >
> > Beyond this, here is the date stamp of all the CRL files that I know of:
> > [bbr-gate01] ~ > ll /etc/grid-security/certificates/*.r0
> > -rw-r--r-- 1 root root 30721 Mar 3 09:25 /etc/grid-security/certificates/0ed6468a.r0
> > -rw-r--r-- 1 root root 2354 Mar 3 09:25 /etc/grid-security/certificates/16da7552.r0
> > -rw-r--r-- 1 root root 2158 Mar 3 09:25 /etc/grid-security/certificates/1e43b9cc.r0
> > -rw-r--r-- 1 root root 3417 Mar 3 09:25 /etc/grid-security/certificates/1f0e8352.r0
> > -rw-r--r-- 1 root root 1783 Mar 3 09:25 /etc/grid-security/certificates/34a509c3.r0
> > -rw-r--r-- 1 root root 1847 Mar 3 09:25 /etc/grid-security/certificates/41380387.r0
> > -rw-r--r-- 1 root root 7614 Mar 3 09:25 /etc/grid-security/certificates/49f18420.r0
> > -rw-r--r-- 1 root root 1459 Mar 3 09:25 /etc/grid-security/certificates/6349a761.r0
> > -rw-r--r-- 1 root root 5805 Mar 3 09:25 /etc/grid-security/certificates/6b4ddd18.r0
> > -rw-r--r-- 1 root root 5065 Mar 3 09:25 /etc/grid-security/certificates/6df70cb1.r0
> > -rw-r--r-- 1 root root 4840 Mar 3 09:25 /etc/grid-security/certificates/90e2484f.r0
> > -rw-r--r-- 1 root root 11017 Mar 3 09:25 /etc/grid-security/certificates/9d8753eb.r0
> > -rw-r--r-- 1 root root 1888 Mar 3 09:25 /etc/grid-security/certificates/bc870044.r0
> > -rw-r--r-- 1 root root 1763 Mar 3 09:25 /etc/grid-security/certificates/cf4ba8c8.r0
> > -rw-r--r-- 1 root root 2182 Mar 3 09:25 /etc/grid-security/certificates/d64ccb53.r0
> > -rw-r--r-- 1 root root 5660 Mar 3 09:25 /etc/grid-security/certificates/df312a4e.r0
> > -rw-r--r-- 1 root root 3667 Mar 3 09:25 /etc/grid-security/certificates/ed99a497.r0
> >
> > Did I miss something ?
> >
> > Thanks, Cheers, Gilbert.
> >
> > --
> > *---------------------------------------------------------------------*
> > Gilbert Grosdidier
> > Ext 74462 when at CERN
> > LAL / IN2P3 / CNRS Phone : +33 1 6446 8909
> > Faculte des Sciences, Bat. 200 Fax : +33 1 6446 8546
> > B.P. 34, F-91898 Orsay Cedex (FRANCE)
> >
> >
> >
> >
>
>