Hi,

Could someone please flip the switch again which says "Make EDG Testbed
Work".  I guess it was flipped to ON for the EDG final review, but maybe
it got flipped to OFF accidentally last week.

Right now 50% of my jobs are going in, and of those only 50% are
returning output (even though they say they are completing successfully).

It is taking ages to submit jobs (5 to 10 minutes, each), and I'm
wondering how I can figure out why (i.e. an explanation of why it is
slow would be great, but I would rather know how I can figure out the
source of the problem myself, and use this in the future).

I am trying to submit a job to specific sites, following this algorithm:

edg-job-match -o sites --vo lhcb test.jdl

for site in `cat sites`; do
        edg-job-submit -o jobid -r $site --vo lhcb test.jdl
done

There are about 12 sites, each with about 3 queues, so a total of about
36 job submissions.
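
A minimal sketch of how the loop above could be instrumented to see where the time goes, assuming the same edg-job-submit options as shown. The function name submit_all, the SUBMIT_CMD variable and the log file layout are hypothetical, not part of the EDG tools. It records, per site, whether the submission succeeded and how many seconds it took. Standard input is redirected from /dev/null so the command cannot consume the site list; how edg-job-submit reacts to end-of-file at an interactive prompt is untested here.

```shell
#!/bin/sh
# Hypothetical wrapper around the submission loop: one log line per site,
# in the form "<site> <ok|fail> <elapsed seconds>".
# SUBMIT_CMD is an illustrative override hook; it defaults to the real client.
SUBMIT_CMD=${SUBMIT_CMD:-edg-job-submit}

submit_all() {
    sites_file=$1   # one site (CE queue) per line, e.g. from edg-job-match -o
    log_file=$2     # timing log, appended to

    while IFS= read -r site; do
        # Skip blank lines in the site list.
        [ -n "$site" ] || continue

        start=$(date +%s)
        # Same options as the original loop; command output is kept
        # alongside the timing log for later inspection.
        if "$SUBMIT_CMD" -o jobid -r "$site" --vo lhcb test.jdl \
               </dev/null >>"$log_file.out" 2>&1; then
            status=ok
        else
            status=fail
        fi
        end=$(date +%s)

        printf '%s %s %s\n' "$site" "$status" "$((end - start))" >>"$log_file"
    done <"$sites_file"
}

# Usage (after edg-job-match has written the "sites" file):
#   submit_all sites timing.log
```

Sorting the resulting log on the third column would show whether the delay is uniform across sites or concentrated on a few of them.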

I have been waiting about an hour and a half, and only 16 jobs have gone
through so far.  They are all using boszwijn.nikhef.nl as the RB.

Of these 16 jobs, only half have successfully been submitted.  The
others failed with:

ce.gridpp.shef.ac.uk (x2): NS_SUBMIT_FAIL
ccgridli05.in2p3.fr:       NS_SUBMIT_FAIL
ce001.fzk.de:              NS_SUBMIT_FAIL
ce01.ph.qmul.ac.uk:        NS_SUBMIT_FAIL
dgrid-2.srce.hr:           NS_SUBMIT_FAIL
epcf36.ph.bham.ac.uk:      NS_SUBMIT_FAIL
farm003.hep.phy.cam.ac.uk: NS_SUBMIT_FAIL
gppce05.gridpp.rl.ac.uk:   NS_SUBMIT_FAIL
grid-w2.ifae.es:           NS_SUBMIT_FAIL
grid0007.esrin.esa.int:    NS_SUBMIT_FAIL
gridy4.begrid.be:          NS_SUBMIT_FAIL
ce.gridpp.shef.ac.uk:      Seg fault

The NS_SUBMIT_FAIL failures reported:

"SandboxIOException: Globus Ftp API Failure in creating remote
Directories." received when submitting a job to NS

However, Sheffield, Lyon (in2p3.fr), FZK, SRCE (HR), RAL and B'HAM had
other jobs which succeeded.

All the jobs which were submitted completed (amazingly!), and took a
very consistent 7-8 minutes to move through the system (according to the
LB data), with the last 1-3 minutes occupied with executing "/bin/hostname".

My problem, now, is retrieving output.  Is there a known bug that zero
length files cannot be retrieved, or a problem with the standard error
output file?

Here is what I typically get:

error: a system call failed (Connection timed out)
**** Warning: NS_FILE_RETRIEVAL ****
Unable to retrieve the following output files:
/var/edgwl/SandboxDir/g8/https_3a_2f_2fboszwijn.nikhef.nl_3a9000_2fg8XmN8rT9OwrF2uc6CPndQ/output/std.err
for the job:
"https://boszwijn.nikhef.nl:9000/g8XmN8rT9OwrF2uc6CPndQ"

1 output file(s) out of 2 have been successfully retrieved
Do you wish anyway to remove the directory:
  /userdisk/stokes/JobOutput/stokes_g8XmN8rT9OwrF2uc6CPndQ? [y/n]n :

The wording of this message is *very* poor.  Now that I understand what
it does, I would change the message to "Do you want to abort this
operation and delete the local directory and retrieved files?"

But bizarrely *both* the std.err and std.out files have been retrieved
and it is the std.out file that is, unexpectedly, empty (the error
message suggests that it is std.err that couldn't be retrieved).

Only these have been retrieved successfully:

dgrid-2.srce.hr
dgrid-2.srce.hr
gppwn05.gridpp.rl.ac.uk
epcf33.ph.bham.ac.uk
wn001.fzk.de
gppwn05.gridpp.rl.ac.uk

For my last point in this long saga (it's taken an hour just to do the
tests and write this email), I now have the edg-job-submit (which is
still trucking away after 2 hours, trying to submit 36 "/bin/hostname"
jobs) telling me that my job list file already exists, and do I want to
overwrite it.  No, I want it to append job IDs to it, just like it was
doing reasonably successfully for the first 20-odd jobs it encountered
(even if only half of them didn't fail).  What could have made it
suddenly decide that because the file already exists the only options
are "Abort" or "Overwrite"?  What happened to "Append", which should be
the default behaviour?

Appropriately, I'll finish with an error message:

------ submitting to grid001.pd.infn.it:2119/jobmanager-pbs-medium

Selected Virtual Organisation name (from --vo option): lhcb
**** Warning: UI_FILE_EXISTS ****
"/userdisk/stokes/test/edg/joblist" file already exists.

Do you want to overwrite? [y/n]n :

bye
------ submitting to grid001.pd.infn.it:2119/jobmanager-pbs-short

Selected Virtual Organisation name (from --vo option): lhcb
**** Warning: UI_FILE_EXISTS ****
"/userdisk/stokes/test/edg/joblist" file already exists.

Do you want to overwrite? [y/n]n :y




--
Ian Stokes-Rees
Particle Physics, Oxford        http://www-pnp.physics.ox.ac.uk/~stokes