Hi all,

This is a long mail... Apologies in advance!

Periodically I have errors with ATLAS user jobs and globus-url-copy.
Unfortunately it's not related to a specific user nor to a specific SE.
So I'm finding it difficult to track down the problem.

What I do know:

1. It's almost always globus-url-copy; occasionally it's lcg-cr.
2. At the moment I have 18 jobs from a particular user. Every job is
trying to do a globus-url-copy and seems to be frozen.
3. It doesn't happen with all ATLAS users, just some, and it seems to
be the same ones over and over again.

Sample qstat output:

---snip---
12465.charm-mgt    STDIN            atlas029         00:00:09 R atlas
12466.charm-mgt    STDIN            atlas029         00:00:09 R atlas
12467.charm-mgt    STDIN            atlas029         00:00:09 R atlas
12469.charm-mgt    STDIN            atlas029         00:00:09 R atlas
12470.charm-mgt    STDIN            atlas029         00:00:09 R atlas
12471.charm-mgt    STDIN            atlas029         00:00:09 R atlas
12473.charm-mgt    STDIN            atlas029         00:00:09 R atlas
12475.charm-mgt    STDIN            atlas029         00:00:09 R atlas
12476.charm-mgt    STDIN            atlas029         00:00:09 R atlas
---snip---

Notice the jobs have only used a few seconds of CPU time. In reality,
they've been on the site since some time last night.
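
Jobs in this state can be picked out mechanically: they are in state R
but have accumulated almost no CPU time. What follows is only a minimal
sketch, assuming the six-column qstat layout shown above (job ID, name,
user, time used, state, queue). The function names and the 60-second
threshold are my own illustrative choices, not part of the batch system.

```python
def hms_to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def low_cpu_running_jobs(qstat_text, max_cpu_seconds=60):
    """Return (job_id, user, cpu_seconds) for running jobs with little CPU time.

    Assumes six whitespace-separated columns per job row, as in the
    qstat sample above. The threshold is an arbitrary illustrative value.
    """
    suspects = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Skip headers, separators and anything that is not a job row.
        if len(fields) != 6 or fields[3].count(":") != 2:
            continue
        job_id, _name, user, time_used, state, _queue = fields
        try:
            cpu_seconds = hms_to_seconds(time_used)
        except ValueError:
            continue
        if state == "R" and cpu_seconds <= max_cpu_seconds:
            suspects.append((job_id, user, cpu_seconds))
    return suspects


sample = """\
12465.charm-mgt    STDIN            atlas029         00:00:09 R atlas
12466.charm-mgt    STDIN            atlas029         00:00:09 R atlas
"""
for job in low_cpu_running_jobs(sample):
    print(job)
```

Low CPU time on its own does not prove a hang, since a freshly started
job looks the same. The job's start time still has to be checked
separately.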

At the moment, the following SEs are involved:
gsiftp://harry.hagrid.it.uu.se
gsiftp://ss1.hpc2n.umu.se
gsiftp://dcgftp.usatlas.bnl.gov

In the past it's also involved SEs at *.usatlas.bnl.gov and *.se.
Unfortunately I don't have more info than that. 

In previous attempts to find the problem, I've tried running
the user's command myself on the WNs, and I get errors like:

(Note: this was attempted on another occasion, when the SE was different.)

 [atlas029@pnet25 tmp]$ globus-url-copy -p 10 -dbg
gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root file:///tmp/test.file
debug: starting to get
gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root
debug: connecting to
gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root
debug: error reading response from
gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root: an end-of-file was reached
debug: fault on connection to
gsiftp://ss2.hpc2n.umu.se:2811/ss2_se2/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00448.pool.root: an end-of-file was reached
debug: data callback, error an end-of-file was reached, buffer
0xb74d5008, length 0, offset=0, eof=true
debug: data callback, error an end-of-file was reached, buffer
0xb6bcc008, length 0, offset=0, eof=true
debug: data callback, error an end-of-file was reached, buffer
0xb6ccd008, length 0, offset=0, eof=true
debug: data callback, error an end-of-file was reached, buffer
0xb6dce008, length 0, offset=0, eof=true
debug: data callback, error an end-of-file was reached, buffer
0xb6ecf008, length 0, offset=0, eof=true
debug: data callback, error an end-of-file was reached, buffer
0xb6fd0008, length 0, offset=0, eof=true
debug: data callback, error an end-of-file was reached, buffer
0xb70d1008, length 0, offset=0, eof=true
debug: data callback, error an end-of-file was reached, buffer
0xb71d2008, length 0, offset=0, eof=true
debug: data callback, error an end-of-file was reached, buffer
0xb72d3008, length 0, offset=0, eof=true
debug: data callback, error an end-of-file was reached, buffer
0xb73d4008, length 0, offset=0, eof=true
debug: operation complete
error: an end-of-file was reached

globus-url-copy -dbg -p 10
gsiftp://pikolit.ijs.si:2811/SE1/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00478.pool.root

---snip----

debug: response from
gsiftp://pikolit.ijs.si:2811/SE1/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00478.pool.root:
150 Opening connection.

debug: reading into data buffer 0xb74d8008, maximum length 1048576
debug: reading into data buffer 0xb6bcf008, maximum length 1048576
debug: reading into data buffer 0xb6cd0008, maximum length 1048576
debug: reading into data buffer 0xb6dd1008, maximum length 1048576
debug: reading into data buffer 0xb6ed2008, maximum length 1048576
debug: reading into data buffer 0xb6fd3008, maximum length 1048576
debug: reading into data buffer 0xb70d4008, maximum length 1048576
debug: reading into data buffer 0xb71d5008, maximum length 1048576
debug: reading into data buffer 0xb72d6008, maximum length 1048576
debug: reading into data buffer 0xb73d7008, maximum length 1048576
debug: response from
gsiftp://pikolit.ijs.si:2811/SE1/atlas/sc3/csc11.005009.J0_pythia_jetjet.digit.RDO.v11004203._00478.pool.root:
426 Transfer terminated
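
The two transcripts above show two different failure signatures: "an
end-of-file was reached" and a "426" reply. If many such -dbg
transcripts are saved, they can be tallied per SE host to see whether a
pattern emerges. This is a rough sketch only. The function names, the
labels and the shortened example paths are hypothetical, and the
matching relies purely on the strings visible in the output above.

```python
import re
from collections import Counter

# Host part of the first gsiftp URL that appears in a transcript.
HOST_RE = re.compile(r"gsiftp://([^/:\s]+)")


def classify_transcript(text):
    """Return (host, label) for one saved globus-url-copy -dbg transcript."""
    host_match = HOST_RE.search(text)
    host = host_match.group(1) if host_match else "unknown"
    if re.search(r"^426 ", text, re.MULTILINE):
        label = "426 transfer terminated"
    elif "an end-of-file was reached" in text:
        label = "end-of-file"
    else:
        label = "other"
    return host, label


def tally(transcripts):
    """Count (host, label) pairs over many transcripts."""
    return Counter(classify_transcript(text) for text in transcripts)


# Shortened stand-ins for the transcripts above (paths are placeholders).
eof_log = (
    "debug: error reading response from\n"
    "gsiftp://ss2.hpc2n.umu.se:2811/some/file: an end-of-file was reached\n"
)
terminated_log = (
    "debug: response from\n"
    "gsiftp://pikolit.ijs.si:2811/some/file:\n"
    "426 Transfer terminated\n"
)
for key, count in tally([eof_log, terminated_log, eof_log]).items():
    print(key, count)
```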

The interesting thing is that, in the past when I've tried the copies
myself (su'ed to the unix atlas* user in question at the time),
sometimes it would work and start transferring, and other times I would
get an error like those above.
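
Because the behaviour is intermittent, one way to put numbers on it
would be to repeat the same transfer several times with a hard timeout,
recording whether each attempt succeeded, failed or hung. The sketch
below is only that, a sketch: the function name `probe`, the number of
attempts and the timeout are assumptions of mine, and the command run at
the bottom is a harmless stand-in so the example runs anywhere.

```python
import subprocess
import sys
import time


def probe(command, attempts=5, timeout_seconds=120, pause_seconds=0):
    """Run `command` repeatedly and return one outcome string per attempt.

    Outcomes are "ok" (exit status 0), "failed" (non-zero exit status)
    or "hung" (killed after timeout_seconds).
    """
    outcomes = []
    for _ in range(attempts):
        try:
            result = subprocess.run(
                command,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                timeout=timeout_seconds,
            )
            outcomes.append("ok" if result.returncode == 0 else "failed")
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this.
            outcomes.append("hung")
        if pause_seconds:
            time.sleep(pause_seconds)
    return outcomes


# Stand-in command; replace with the real transfer command as a list.
print(probe([sys.executable, "-c", "pass"], attempts=2))
```

On a WN the command list would be the same globus-url-copy invocation
shown earlier (with -p 10 -dbg, the source URL and the local file URL),
run as the affected atlas* user.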

I suppose the point is that I don't know what to do to track this
problem down. Can anyone suggest how I might go about sorting out what's
happening?

Thanks in advance!
Marco