Jeroen Craens wrote:
> Dear all,
>
> We are currently setting up a testbed grid (still LCG 2.2, we might
> upgrade next month) behind a NAT router, consisting of a CE and some
> WNs (and an LCFG node).
> To make sure a rb can transfer the jobs to our ce, we need to forward
> the SITE_GLOBUS_TCP_RANGE which normally is 20000-25000.
> Because the router can't handle forwarding of a range of ports, we are
> wondering if we could change the default range parameter in site-cfg.h
> to a range 20000-20100 without losing functionalities: the nodes of the
> site we will submit our jobs on (to our ce) will have the 20000-25000
> range but our site will then have the 20000-20100 range.
> Has anyone tried this before? Could we change the default value to the
> one proposed without experiencing problems?
You might see a problem occasionally. See below.
> By the way: how does the ce choose to which port the rb can send its
> data: (assuming none of these ports have been taken) randomly, or 20000
> for the first transfer, 20001 for the next one,...?
There seems to be a misconception here. What happens is this:
-----------------------------------------------------------------------------
1. The RB contacts the CE on port 2119 and indicates on which port the RB
should be called back by the globus-job-manager. That port is the first
free port in the port range on the RB. The range usually is 20000-25000,
so the first free port *usually* is 20000 + O(10).
2. The CE calls the RB back on that port.
3. The job wrapper gets submitted to the batch system and globus-job-manager
is told to exit.
4. The job wrapper eventually starts on the WN and copies the input sandbox
from the RB using globus-url-copy. The data port on the RB will again be
in the port range of the RB.
5. The user part of the job runs. It may do a globus-url-copy to/from an SE,
using a data port in the port range of that SE.
6. The job wrapper copies the output sandbox (and the "Maradona" file) back
to the RB and exits.
7. The grid_monitor running on the CE informs the RB that the job has exited.
The RB contacts the CE again on port 2119 to restart globus-job-manager,
which then cleans things up and sends back the stderr and stdout of the
job wrapper (stdout contains the exit status of the user part).
-----------------------------------------------------------------------------
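To illustrate the "first free port" rule from step 1, here is a minimal
Python sketch. It is only a simulation, not the Globus code: the function
name first_free_port and the set of busy ports are made up for the example.

```python
def first_free_port(low, high, in_use):
    """Return the lowest port in [low, high] that is not in in_use.

    Returns None when every port in the range is taken.
    Hypothetical helper: it only models the selection rule described above.
    """
    for port in range(low, high + 1):
        if port not in in_use:
            return port
    return None


# Assumed situation: an RB with a dozen ports already taken by other jobs.
busy = set(range(20000, 20012))
print(first_free_port(20000, 25000, busy))  # prints 20012
```

With only a few connections open, the chosen port stays close to the lower
bound, which is why the callback port usually is 20000 + O(10).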
So, your NAT router must allow outbound connections from the CE and WNs to
ports 20000+ of service nodes outside the local domain (RBs, SEs).
If the router only lets such connections reach ports up to 20100, you may
occasionally see a problem with a job submission callback or with an input
or output sandbox transfer when the RB is busy, or in the user part of the
job with a globus-url-copy to/from a very busy SE.
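A rough way to see when a 20100 limit starts to hurt, again as a Python
simulation under stated assumptions: the remote node hands out ports in
order starting at 20000, and the router passes only 20000-20100. The names
reachable and count_blocked are hypothetical, and real port usage on a
busy RB or SE will not be this tidy.

```python
def reachable(port, allowed_low=20000, allowed_high=20100):
    """True if the assumed router rule lets a connection to this port out."""
    return allowed_low <= port <= allowed_high


def count_blocked(concurrent, range_low=20000):
    """Count connections that land on a port the router would not pass.

    Assumes the remote node assigns ports in order, starting at range_low,
    one per concurrent connection.
    """
    ports = range(range_low, range_low + concurrent)
    return sum(1 for port in ports if not reachable(port))


for load in (10, 100, 150):
    print(load, count_blocked(load))
```

Under these assumptions nothing is blocked until the remote node has more
than 101 ports in use at once, so the problem would only show up under
heavy load.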