Hi Maarten,
This is really useful: is there a page or three on this in the
troubleshooting wiki?
cheers,
Owen.
Maarten Litmaath, CERN wrote:
> On Mon, 7 Feb 2005, Maarten Litmaath wrote:
>
>
>>Jeroen Craens wrote:
>>
>>
>>>Dear all,
>>>
>>>We are currently setting up a testbed grid (still LCG 2.2, we might
>>>upgrade next month) behind a NAT router, consisting of a CE and some
>>>WNs (and an LCFG server).
>>>To make sure an RB can transfer jobs to our CE, we need to forward
>>>the SITE_GLOBUS_TCP_RANGE, which normally is 20000-25000.
>>>Because the router can't handle forwarding a range of ports, we are
>>>wondering if we could change the default range parameter in site-cfg.h
>>>to 20000-20100 without losing functionality: the site whose nodes
>>>will submit jobs to our CE will keep the 20000-25000 range, while
>>>our site will then use the 20000-20100 range.
>>>Has anyone tried this before? Could we change the default value to the
>>>one proposed without experiencing problems?
>>
>>You might see a problem occasionally. See below.
>>
>>
>>>By the way: how does the CE choose which port the RB should send its
>>>data to (assuming none of these ports is taken): randomly, or 20000
>>>for the first transfer, 20001 for the next one, ...?
>>
>>There seems to be a misconception here. What happens is this:
>
>
> WARNING: the scenario I gave before is INCOMPLETE!!!
>
> Below I have inserted the missing pieces:
>
>
>>-----------------------------------------------------------------------------
>>1. The RB contacts the CE on port 2119 and indicates on which port the RB
>> should be called back by the globus-job-manager. That port is the first
>> free port in the port range on the RB. The range usually is 20000-25000,
>> so the first free port *usually* is 20000 + O(10).
>
>
> In fact there are 2 ports on which the RB is called back
> (say 20000 and 20001).
>
> 1a. The RB contacts globus-job-manager on various ports in the CE port range!
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>
>>2. The CE calls the RB back on that port.
>
>
> It calls the RB back a few times on both of the ports from step 1.
>
>
>>3. The job wrapper gets submitted to the batch system and globus-job-manager
>> is told to exit.
>>
>>4. The job wrapper eventually starts on the WN and copies the input sandbox
>> from the RB using globus-url-copy. The data port on the RB will again be
>> in the port range of the RB.
>>
>>5. The user part of the job runs. It may do a globus-url-copy to/from an SE,
>> using a data port in the port range of that SE.
>>
>>6. The job wrapper copies the output sandbox (and the "Maradona" file) back
>> to the RB and exits.
>>
>>7. The grid_monitor running on the CE informs the RB that the job has exited.
>> The RB contacts the CE again on port 2119 to restart globus-job-manager,
>> which then cleans things up and sends back the stderr and stdout of the
>> job wrapper (stdout contains the exit status of the user part).
>>-----------------------------------------------------------------------------
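[Editorial note: the "first free port in the range" behaviour from step 1 can be sketched in Python. This is a simplified illustration of how a GLOBUS_TCP_PORT_RANGE-style scan works, not the actual Globus code; the port numbers are examples.]

```python
import socket

def first_free_port(lo, hi, host="127.0.0.1"):
    """Bind a listening socket to the first free port in [lo, hi],
    mimicking (in simplified form) how a Globus-style port range
    is scanned for callback ports."""
    for port in range(lo, hi + 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((host, port))
            s.listen(128)
            return port, s          # caller must close the socket
        except OSError:
            s.close()               # port taken, try the next one
    raise RuntimeError("no free port in range %d-%d" % (lo, hi))

# The RB grabs two such callback ports (cf. 20000 and 20001 below):
port1, s1 = first_free_port(20000, 20100)
port2, s2 = first_free_port(20000, 20100)
```

Because each socket stays bound, the second call lands on the next free port, which is why a busy RB creeps upward through its range.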
>>
>>So, your NAT router must allow outbound connections from the CE and WNs to
>>ports 20000+ of service nodes outside the local domain (RBs, SEs).
>
>
> It also must allow inbound connections to the CE on the CE port range.
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Mea maxima culpa... :-(
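[Editorial note: a quick way to verify such inbound rules is a small TCP probe run from a machine outside the NAT. This is a hypothetical helper, not a standard tool; the host name and range in the comment are examples.]

```python
import socket

def probe_ports(host, lo, hi, timeout=1.0):
    """Return the ports in [lo, hi] on host that accept a TCP
    connection.  Run from outside the NAT to check that the inbound
    forwarding rules for the CE port range actually work."""
    open_ports = []
    for port in range(lo, hi + 1):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                open_ports.append(port)
        except OSError:
            pass                    # filtered, closed, or timed out
    return open_ports

# e.g. probe_ports("ce.example.org", 20000, 20100)  # hypothetical host
```

Note that a port only shows up if something is listening on it at probe time, so run it while a globus-job-manager (or a temporary listener) is active on the CE.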
>
>
>>If the upper bound is 20100, you may occasionally see a problem with a job
>>submission callback, input or output sandbox transfer when the RB is busy,
>>or in the user part of the job with a globus-url-copy to/from a very busy SE.
>
>
> It is still correct that a port range of 20000-20500 is largely sufficient.
>
> To show what is going on between the RB and the CE, I have captured all calls
> to bind(), connect(), listen() and accept() made by the "gahp_server" process
> on the RB for a single job submission to a CE, and the subsequent cleanup:
>
> ------------------------------------------------------------
> bind( 6, {AF_INET, 20000, 0}, 16 ) = 0
> listen( 6, 128 ) = 0
> bind( 7, {AF_INET, 20001, 0}, 16 ) = 0
> listen( 7, 128 ) = 0
> bind( 8, {AF_INET, 20002, 0}, 16 ) = 0
> connect( 8, {AF_INET, 2119, CE}, 16 ) = -1 EINPROGRESS
> bind( 8, {AF_INET, 20003, 0}, 16 ) = 0
> connect( 8, {AF_INET, 2119, CE}, 16 ) = -1 EINPROGRESS
> bind( 9, {AF_INET, 20005, 0}, 16 ) = 0
> connect( 9, {AF_INET, 2119, CE}, 16 ) = -1 EINPROGRESS
> bind( 8, {AF_INET, 20007, 0}, 16 ) = 0
> connect( 8, {AF_INET, 20007, CE}, 16 ) = -1 EINPROGRESS
> accept( 6, {AF_INET, 20009, CE}, [16]) = 10
> accept( 6, {AF_INET, 20010, CE}, [16]) = 8
> bind(10, {AF_INET, 20007, 0}, 16 ) = 0
> connect(10, {AF_INET, 20007, CE}, 16 ) = -1 EINPROGRESS
> accept( 6, {AF_INET, 20011, CE}, [16]) = 8
> accept( 7, {AF_INET, 20012, CE}, [16]) = 8
> accept( 7, {AF_INET, 20013, CE}, [16]) = 9
> accept( 7, {AF_INET, 20014, CE}, [16]) = 9
> accept( 7, {AF_INET, 20007, CE}, [16]) = 9
> accept( 7, {AF_INET, 20007, CE}, [16]) = 9
> accept( 7, {AF_INET, 20007, CE}, [16]) = 9
> accept( 7, {AF_INET, 20000, CE}, [16]) = 9
> accept( 7, {AF_INET, 20000, CE}, [16]) = 9
> accept( 7, {AF_INET, 20000, CE}, [16]) = 9
> accept( 7, {AF_INET, 20000, CE}, [16]) = 9
> accept( 7, {AF_INET, 20000, CE}, [16]) = 9
> bind( 9, {AF_INET, 20002, 0}, 16 ) = 0
> connect( 9, {AF_INET, 2119, CE}, 16 ) = -1 EINPROGRESS
> bind( 9, {AF_INET, 20003, 0}, 16 ) = 0
> connect( 9, {AF_INET, 20010, CE}, 16 ) = -1 EINPROGRESS
> accept( 6, {AF_INET, 20013, CE}, [16]) = 9
> accept( 7, {AF_INET, 20014, CE}, [16]) = 9
> accept( 7, {AF_INET, 20015, CE}, [16]) = 9
> accept( 6, {AF_INET, 20016, CE}, [16]) = 9
> bind( 9, {AF_INET, 20007, 0}, 16 ) = 0
> connect( 9, {AF_INET, 20010, CE}, 16 ) = -1 EINPROGRESS
> bind( 9, {AF_INET, 20007, 0}, 16 ) = 0
> connect( 9, {AF_INET, 20010, CE}, 16 ) = -1 EINPROGRESS
> ------------------------------------------------------------
>
> (Whenever a file descriptor is reused, it was first closed.)
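[Editorial note: the local ports actually consumed on the RB can be pulled out of a trace like the one above with a few lines of Python, which makes it easy to check how far into the range a submission really reaches. This is a sketch assuming the simplified bind() format shown; the sample lines are taken from the capture.]

```python
import re

def local_ports_used(trace):
    """Extract the local port of every bind() call in an strace-style
    capture like the one above."""
    ports = []
    for line in trace.splitlines():
        # e.g. "bind( 6, {AF_INET, 20000, 0}, 16 ) = 0"
        m = re.match(r"\s*bind\(\s*\d+,\s*\{AF_INET,\s*(\d+)", line)
        if m:
            ports.append(int(m.group(1)))
    return ports

sample = """\
bind( 6, {AF_INET, 20000, 0}, 16 ) = 0
bind( 7, {AF_INET, 20001, 0}, 16 ) = 0
bind( 8, {AF_INET, 20002, 0}, 16 ) = 0
bind( 8, {AF_INET, 20007, 0}, 16 ) = 0
"""
print(max(local_ports_used(sample)))   # highest local port bound -> 20007
```

Applied to the full capture, the highest bound port stays well below 20100, consistent with the remark that 20000-20500 is largely sufficient.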
>
> The calls to the gatekeeper (port 2119) are not only to submit and
> clean up the user job, but also for the grid_monitor job, which runs
> on the CE to monitor the user's real jobs.
>
> If the RB is not allowed to connect to the CE in the CE port range,
> one typically gets the following error for jobs submitted via the RB:
>
> -----------------------------------------------------------------------
> Got a job held event, reason: Globus error 79: connecting to the job
> manager failed. Possible reasons: job terminated, invalid job contact,
> network problems, ...
> -----------------------------------------------------------------------
>
> (It was while debugging such a problem that I discovered my earlier mistake.)
>
> A direct globus-job-run will work, however, because it does not use
> the two-phase commit feature of GRAM that the RB relies on.
>
> Cheers,
> Maarten
--
=======================================================
Dr O J E Maroney # London Tier 2 Technical Co-ordinator
Tel. (+44)20 759 47802
Imperial College London
High Energy Physics Department
The Blackett Laboratory
Prince Consort Road, London, SW7 2BW
====================================