Hi Marco,
I find it's often useful to ask the user what's wrong. We've found a
few problems this way. One that happens is our LFC getting overloaded
(it doesn't let go of threads), and we see this rather quickly in the job
efficiency -- the jobs try to resolve an LFN, fail, wait 10 minutes, and
try again. On the other hand, we've also seen this happen because a VO's
metadata catalogue was overloaded (off-site).
Let's put it this way: if all your jobs have a cpu/wall ratio of >90%,
then your site is OK. If not, your site *might* have problems.
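As a rough sketch of that check -- assuming a Torque/PBS batch system, where
`qstat -f <jobid>` reports `resources_used.cput` and `resources_used.walltime`;
the helper names here are made up:

```shell
#!/bin/sh
# Convert a Torque-style HH:MM:SS time string to seconds.
hms_to_sec() {
    echo "$1" | awk -F: '{ print $1*3600 + $2*60 + $3 }'
}

# Print the cpu/wall ratio (as an integer percentage) for one job,
# given the resources_used.cput and resources_used.walltime values
# that `qstat -f <jobid>` reports on Torque/PBS.
cpu_wall_pct() {
    cput=$(hms_to_sec "$1")
    wall=$(hms_to_sec "$2")
    [ "$wall" -gt 0 ] || { echo 0; return; }
    echo $(( 100 * cput / wall ))
}

# Example: a healthy job vs. one stuck waiting on a transfer.
cpu_wall_pct 09:30:00 10:00:00   # 95 -> fine
cpu_wall_pct 00:00:09 12:00:00   # 0  -> probably stuck
```

In practice you would feed it the two `resources_used` values for each
running job and flag anything well below 90%.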
JT
Marco La Rosa wrote:
> Hi all,
>
> Indeed my WNs are on a private network segment - should've mentioned
> that!
>
> So what's the recommended procedure in this situation?
>
> Do I configure my firewall à la Globus' recommended method: 100 ports per
> WN and firewall forward statements so that ACTIVE connections end up in
> the correct place?
>
> What do people do when their site is full of jobs doing nothing? Is it
> reasonable for me to go along and just qdel all of those jobs? Do I need
> to notify the user first or just hope that they will work it out?
>
> Thanx!
> Marco
>
>
>
> On Fri, 2006-10-06 at 18:46 +0200, Maarten Litmaath wrote:
>> Rod Walker wrote:
>>
>>> Hi,
>>> I thought multi-stream gridftp always failed with a firewall because it
>>> was 'ACTIVE' and needed inbound connectivity. For sure, the ATLAS
>>> production uses single stream lcg-cp for this reason.
>> Well spotted! I did not catch that the jobs were trying to download
>> rather than upload files that way. Indeed, that can never work.
>> Ergo: user error.
>>
>>> Cheers,
>>> Rod.
>>>
>>> On Fri, 6 Oct 2006, Maarten Litmaath wrote:
>>>
>>>> Marco La Rosa wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> This is a long mail... Apologies in advance!
>>>>>
>>>>> Periodically I have errors with ATLAS user jobs and globus-url-copy.
>>>>> Unfortunately it's not related to a specific user nor to a specific SE.
>>>>> So, I'm finding it difficult to track down the problem.
>>>>
>>>> You may have a campus firewall that does not like source ports
>>>> immediately
>>>> getting reused for independent connections. Try to ensure the
>>>> environment
>>>> variable GLOBUS_TCP_PORT_RANGE is unset on your WNs, e.g. through an
>>>> extra
>>>> script in /etc/profile.d like this:
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> #!/bin/sh
>>>> unset GLOBUS_TCP_PORT_RANGE
>>>> --------------------------------------------------------------------------
>>>>
>>>>
>>>>> What I do know.
>>>>>
>>>>> 1. It's almost always a globus-url-copy - occasionally it's lcg-cr
>>>>> 2. At the moment I have 18 jobs from a particular user. Every job is
>>>>> trying to do a globus-url-copy and seems to be frozen.
>>>>> 3. It doesn't happen with all ATLAS users - just some - and it seems to
>>>>> be the same ones over and again.
>>>>>
>>>>> Sample qstat output:
>>>>>
>>>>> ---snip---
>>>>> 12465.charm-mgt STDIN atlas029 00:00:09 R atlas
>>>>> 12466.charm-mgt STDIN atlas029 00:00:09 R atlas
>>>>> 12467.charm-mgt STDIN atlas029 00:00:09 R atlas
>>>>> 12469.charm-mgt STDIN atlas029 00:00:09 R atlas
>>>>> 12470.charm-mgt STDIN atlas029 00:00:09 R atlas
>>>>> 12471.charm-mgt STDIN atlas029 00:00:09 R atlas
>>>>> 12473.charm-mgt STDIN atlas029 00:00:09 R atlas
>>>>> 12475.charm-mgt STDIN atlas029 00:00:09 R atlas
>>>>> 12476.charm-mgt STDIN atlas029 00:00:09 R atlas
>>>>> ---snip---
>>>>>
>>>>> Notice the jobs have only run for a few seconds. In reality, they've
>>>>> been on the site since last night some time.
>>>>>
>>>>> At the moment, the following SEs are involved:
>>>>> gsiftp://harry.hagrid.it.uu.se
>>>>> gsiftp://ss1.hpc2n.umu.se
>>>>> gsiftp://dcgftp.usatlas.bnl.gov
>>>>>
>>>>> In the past it's also involved SEs at *.usatlas.bnl.gov and *.se.
>>>>> Unfortunately I don't have more info than that.
>>>>>
>>>>> From previous attempts at trying to find the problem, I've tried
>>>>> running the user's command myself on the WNs and I get errors like:
>>>>>
>>>>> (Note: attempted another time and the SE was different then)
>>>>>
>>>>> [atlas029@pnet25 tmp]$ globus-url-copy -p 10 -dbg
>>>>> gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root
>>>>> file:///tmp/test.file
>>>>> debug: starting to get
>>>>> gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root
>>>>>
>>>>> debug: connecting to
>>>>> gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root
>>>>>
>>>>> debug: error reading response from
>>>>> gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root:
>>>>> an end-of-file was reached
>>>>> debug: fault on connection to
>>>>> gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root:
>>>>> an end-of-file was reached
>>>>> debug: data callback, error an end-of-file was reached, buffer
>>>>> 0xb74d5008, length 0, offset=0, eof=true
>>>>> debug: data callback, error an end-of-file was reached, buffer
>>>>> 0xb6bcc008, length 0, offset=0, eof=true
>>>>> debug: data callback, error an end-of-file was reached, buffer
>>>>> 0xb6ccd008, length 0, offset=0, eof=true
>>>>> debug: data callback, error an end-of-file was reached, buffer
>>>>> 0xb6dce008, length 0, offset=0, eof=true
>>>>> debug: data callback, error an end-of-file was reached, buffer
>>>>> 0xb6ecf008, length 0, offset=0, eof=true
>>>>> debug: data callback, error an end-of-file was reached, buffer
>>>>> 0xb6fd0008, length 0, offset=0, eof=true
>>>>> debug: data callback, error an end-of-file was reached, buffer
>>>>> 0xb70d1008, length 0, offset=0, eof=true
>>>>> debug: data callback, error an end-of-file was reached, buffer
>>>>> 0xb71d2008, length 0, offset=0, eof=true
>>>>> debug: data callback, error an end-of-file was reached, buffer
>>>>> 0xb72d3008, length 0, offset=0, eof=true
>>>>> debug: data callback, error an end-of-file was reached, buffer
>>>>> 0xb73d4008, length 0, offset=0, eof=true
>>>>> debug: operation complete
>>>>> error: an end-of-file was reached
>>>>>
>>>>> globus-url-copy -dbg -p 10
>>>>> gsiftp://pikolit.ijs.si:2811/SE1/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00478.pool.root
>>>>>
>>>>>
>>>>> ---snip----
>>>>>
>>>>> debug: response from
>>>>> gsiftp://pikolit.ijs.si:2811/SE1/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00478.pool.root:
>>>>>
>>>>> 150 Opening connection.
>>>>>
>>>>> debug: reading into data buffer 0xb74d8008, maximum length 1048576
>>>>> debug: reading into data buffer 0xb6bcf008, maximum length 1048576
>>>>> debug: reading into data buffer 0xb6cd0008, maximum length 1048576
>>>>> debug: reading into data buffer 0xb6dd1008, maximum length 1048576
>>>>> debug: reading into data buffer 0xb6ed2008, maximum length 1048576
>>>>> debug: reading into data buffer 0xb6fd3008, maximum length 1048576
>>>>> debug: reading into data buffer 0xb70d4008, maximum length 1048576
>>>>> debug: reading into data buffer 0xb71d5008, maximum length 1048576
>>>>> debug: reading into data buffer 0xb72d6008, maximum length 1048576
>>>>> debug: reading into data buffer 0xb73d7008, maximum length 1048576
>>>>> debug: response from
>>>>> gsiftp://pikolit.ijs.si:2811/SE1/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00478.pool.root:
>>>>>
>>>>> 426 Transfer terminated
>>>>>
>>>>> The interesting thing is that, in the past, when I've tried the copies
>>>>> myself (su'ed to the Unix atlas* user in question at the time),
>>>>> sometimes it would work and start transferring, and other times I would
>>>>> get an error like those above.
>>>>>
>>>>> I suppose the point is that I don't know what to do to track this
>>>>> problem down. Can anyone suggest how I might go about sorting out what's
>>>>> happening?
>>>>>
>>>>> Thanx in advance!
>>>>> Marco
>>>>
>>>>