Hi all,
Indeed, my WNs are on a private network segment - should've mentioned
that!
So what's the recommended procedure in this situation?
Do I configure my firewall à la Globus' recommended method: 100 ports per
WN, plus firewall forward statements so that ACTIVE connections end up in
the correct place?
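
A sketch of what I mean, with made-up private addresses, interface name
and port block (each WN would get its own 100-port block):

--------------------------------------------------------------------------
#!/bin/sh
# on the gateway: forward one 100-port block of inbound GridFTP data
# connections to the WN that owns it (example: pnet25 = 192.168.1.25)
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 20100:20199 \
    -j DNAT --to-destination 192.168.1.25
# matching WN-side setting, so transfers only listen in that block:
#   export GLOBUS_TCP_PORT_RANGE=20100,20199
--------------------------------------------------------------------------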
What do people do when their site is full of jobs doing nothing? Is it
reasonable for me to go along and just qdel all of those jobs? Do I need
to notify the user first or just hope that they will work it out?
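
For reference, the kind of thing I'd use to pick them out (PBS/Torque
here; the awk field positions match the qstat output further down, so
adjust for your batch system):

--------------------------------------------------------------------------
#!/bin/sh
# list running jobs that have used under ten minutes of CPU time -
# candidates for qdel, once I've checked how long they've really been
# on the site (qstat -f shows the start time)
qstat | awk '$5 == "R" && $4 ~ /^00:0[0-9]:/ { print $1 }'
# ... | xargs -r qdel    # only after contacting the owner?
--------------------------------------------------------------------------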
Thanx!
Marco
On Fri, 2006-10-06 at 18:46 +0200, Maarten Litmaath wrote:
> Rod Walker wrote:
>
> > Hi,
> > I thought multi-stream gridftp always failed with a firewall because it
> > was 'ACTIVE' and needed inbound connectivity. For sure, the ATLAS
> > production uses single stream lcg-cp for this reason.
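> >
> > A minimal way to confirm that from a WN, reusing the gsiftp URL from
> > Marco's transcript below (dropping -p gives a single stream, which
> > needs only outbound connections from the WN):
> >
> >   globus-url-copy gsiftp://ss2.hpc2n.umu.se:2811/<same path> file:///tmp/test.file
> >
> > If that works where -p 10 hangs, the inbound data connections are the
> > culprit.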
>
> Well spotted! I did not catch that the jobs were trying to download
> rather than upload files that way. Indeed, that can never work.
> Ergo: user error.
>
> > Cheers,
> > Rod.
> >
> > On Fri, 6 Oct 2006, Maarten Litmaath wrote:
> >
> >> Marco La Rosa wrote:
> >>
> >>> Hi all,
> >>>
> >>> This is a long mail... Apologies in advance!
> >>>
> >>> Periodically I have errors with ATLAS user jobs and globus-url-copy.
> >>> Unfortunately it's related neither to a specific user nor to a specific
> >>> SE, so I'm finding it difficult to track down the problem.
> >>
> >>
> >> You may have a campus firewall that does not like source ports
> >> immediately getting reused for independent connections. Try to ensure
> >> the environment variable GLOBUS_TCP_PORT_RANGE is unset on your WNs,
> >> e.g. through an extra script in /etc/profile.d like this:
> >>
> >> --------------------------------------------------------------------------
> >>
> >> #!/bin/sh
> >> # let transfers pick fresh ephemeral source ports rather than cycling
> >> # through a fixed range the firewall may still be tracking
> >> unset GLOBUS_TCP_PORT_RANGE
> >> --------------------------------------------------------------------------
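> >>
> >> A quick way to verify it took effect on a WN (assuming the batch system
> >> gives jobs a login environment that sources /etc/profile.d):
> >>
> >>   $ sh -lc 'echo ${GLOBUS_TCP_PORT_RANGE:-unset}'
> >>   unset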
> >>
> >>
> >>> What I do know:
> >>>
> >>> 1. It's almost always a globus-url-copy - occasionally it's lcg-cr
> >>> 2. At the moment I have 18 jobs from a particular user. Every job is
> >>> trying to do a globus-url-copy and seems to be frozen.
> >>> 3. It doesn't happen with all ATLAS users - just some - and it seems to
> >>> be the same ones over and over again.
> >>>
> >>> Sample qstat output:
> >>>
> >>> ---snip---
> >>> 12465.charm-mgt STDIN atlas029 00:00:09 R atlas
> >>> 12466.charm-mgt STDIN atlas029 00:00:09 R atlas
> >>> 12467.charm-mgt STDIN atlas029 00:00:09 R atlas
> >>> 12469.charm-mgt STDIN atlas029 00:00:09 R atlas
> >>> 12470.charm-mgt STDIN atlas029 00:00:09 R atlas
> >>> 12471.charm-mgt STDIN atlas029 00:00:09 R atlas
> >>> 12473.charm-mgt STDIN atlas029 00:00:09 R atlas
> >>> 12475.charm-mgt STDIN atlas029 00:00:09 R atlas
> >>> 12476.charm-mgt STDIN atlas029 00:00:09 R atlas
> >>> ---snip---
> >>>
> >>> Notice the jobs have only used a few seconds of CPU time. In reality,
> >>> they've been on the site since some time last night.
> >>>
> >>> At the moment, the following SEs are involved:
> >>> gsiftp://harry.hagrid.it.uu.se
> >>> gsiftp://ss1.hpc2n.umu.se
> >>> gsiftp://dcgftp.usatlas.bnl.gov
> >>>
> >>> In the past it's also involved SEs at *.usatlas.bnl.gov and *.se.
> >>> Unfortunately I don't have more info than that.
> >>>
> >>> From previous attempts at finding the problem, I've tried running the
> >>> user's command myself on the WNs, and I get errors like:
> >>>
> >>> (Note: on another attempt, the SE involved was different.)
> >>>
> >>> [atlas029@pnet25 tmp]$ globus-url-copy -p 10 -dbg
> >>> gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root
> >>> file:///tmp/test.file
> >>> debug: starting to get
> >>> gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root
> >>>
> >>> debug: connecting to
> >>> gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root
> >>>
> >>> debug: error reading response from
> >>> gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root:
> >>> an end-of-file was reached
> >>> debug: fault on connection to
> >>> gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root:
> >>> an end-of-file was reached
> >>> debug: data callback, error an end-of-file was reached, buffer
> >>> 0xb74d5008, length 0, offset=0, eof=true
> >>> debug: data callback, error an end-of-file was reached, buffer
> >>> 0xb6bcc008, length 0, offset=0, eof=true
> >>> debug: data callback, error an end-of-file was reached, buffer
> >>> 0xb6ccd008, length 0, offset=0, eof=true
> >>> debug: data callback, error an end-of-file was reached, buffer
> >>> 0xb6dce008, length 0, offset=0, eof=true
> >>> debug: data callback, error an end-of-file was reached, buffer
> >>> 0xb6ecf008, length 0, offset=0, eof=true
> >>> debug: data callback, error an end-of-file was reached, buffer
> >>> 0xb6fd0008, length 0, offset=0, eof=true
> >>> debug: data callback, error an end-of-file was reached, buffer
> >>> 0xb70d1008, length 0, offset=0, eof=true
> >>> debug: data callback, error an end-of-file was reached, buffer
> >>> 0xb71d2008, length 0, offset=0, eof=true
> >>> debug: data callback, error an end-of-file was reached, buffer
> >>> 0xb72d3008, length 0, offset=0, eof=true
> >>> debug: data callback, error an end-of-file was reached, buffer
> >>> 0xb73d4008, length 0, offset=0, eof=true
> >>> debug: operation complete
> >>> error: an end-of-file was reached
> >>>
> >>> globus-url-copy -dbg -p 10
> >>> gsiftp://pikolit.ijs.si:2811/SE1/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00478.pool.root
> >>>
> >>>
> >>> ---snip----
> >>>
> >>> debug: response from
> >>> gsiftp://pikolit.ijs.si:2811/SE1/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00478.pool.root:
> >>>
> >>> 150 Opening connection.
> >>>
> >>> debug: reading into data buffer 0xb74d8008, maximum length 1048576
> >>> debug: reading into data buffer 0xb6bcf008, maximum length 1048576
> >>> debug: reading into data buffer 0xb6cd0008, maximum length 1048576
> >>> debug: reading into data buffer 0xb6dd1008, maximum length 1048576
> >>> debug: reading into data buffer 0xb6ed2008, maximum length 1048576
> >>> debug: reading into data buffer 0xb6fd3008, maximum length 1048576
> >>> debug: reading into data buffer 0xb70d4008, maximum length 1048576
> >>> debug: reading into data buffer 0xb71d5008, maximum length 1048576
> >>> debug: reading into data buffer 0xb72d6008, maximum length 1048576
> >>> debug: reading into data buffer 0xb73d7008, maximum length 1048576
> >>> debug: response from
> >>> gsiftp://pikolit.ijs.si:2811/SE1/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00478.pool.root:
> >>>
> >>> 426 Transfer terminated
> >>>
> >>> The interesting thing is that, in the past when I've tried the copies
> >>> myself (su'ed to the Unix atlas* user in question at the time),
> >>> sometimes it would work and start transferring, and other times I would
> >>> get an error like those above.
> >>>
> >>> I suppose the point is that I don't know what to do to track this
> >>> problem down. Can anyone suggest how I might go about sorting out what's
> >>> happening?
> >>>
> >>> Thanx in advance!
> >>> Marco
> >>
> >>
> >>
> >
>