Hi,

Could someone please flip the switch again which says "Make EDG Testbed Work"? I guess it was flipped to ON for the EDG final review, but maybe it got flipped to OFF accidentally last week. Right now 50% of my jobs are going in, and of those only 50% are returning output (even though they say they are completing successfully).

It is taking ages to submit jobs (5 to 10 minutes each), and I'm wondering how I can figure out why (i.e. an explanation of why it is slow would be great, but I would rather know how I can figure out the source of the problem myself, and use this in the future).

I am trying to submit a job to specific sites, following this algorithm:

    edg-job-match -o sites --vo lhcb test.jdl
    for site in `cat sites`; do
        edg-job-submit -o jobid -r $site --vo lhcb test.jdl
    done

There are about 12 sites, each with about 3 queues, so a total of about 36 job submissions. I have been waiting about an hour and a half, and only 16 jobs have gone through so far. They are all using boszwijn.nikhef.nl as the RB. Of these 16 jobs, only half have successfully been submitted. The others failed with:

    ce.gridpp.shef.ac.uk (x2):  NS_SUBMIT_FAIL
    ccgridli05.in2p3.fr:        NS_SUBMIT_FAIL
    ce001.fzk.de:               NS_SUBMIT_FAIL
    ce01.ph.qmul.ac.uk:         NS_SUBMIT_FAIL
    dgrid-2.srce.hr:            NS_SUBMIT_FAIL
    epcf36.ph.bham.ac.uk:       NS_SUBMIT_FAIL
    farm003.hep.phy.cam.ac.uk:  NS_SUBMIT_FAIL
    gppce05.gridpp.rl.ac.uk:    NS_SUBMIT_FAIL
    grid-w2.ifae.es:            NS_SUBMIT_FAIL
    grid0007.esrin.esa.int:     NS_SUBMIT_FAIL
    gridy4.begrid.be:           NS_SUBMIT_FAIL
    ce.gridpp.shef.ac.uk:       Seg fault

The NS_SUBMIT_FAIL failures reported:

    "SandboxIOException: Globus Ftp API Failure in creating remote Directories." received when submitting a job to NS

However, Sheffield, Lyon (in2p3.fr), FZK, SRCE (HR), RAL and B'HAM had other jobs which succeeded.
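As a rough starting point for working out where the time goes, the submission loop above could be wrapped so that each submission is timed and its outcome logged. This is only a sketch: the function name `submit_timed`, the `SUBMIT_CMD` override and the log format are made up for illustration, and it assumes that edg-job-submit exits with a non-zero status when a submission fails, which I have not verified. It passes only the options already used in the loop above.

```shell
#!/bin/sh
# Sketch: time each submission and log "site outcome seconds" per attempt.
# submit_timed and SUBMIT_CMD are hypothetical names, not part of the EDG tools.
# Assumption (unverified): the submit command exits non-zero on failure.
submit_timed() {
    sites_file=$1   # whitespace-separated list of sites, as used by the loop above
    log_file=$2     # timing log; one line is appended per submission
    for site in $(cat "$sites_file"); do
        start=$(date +%s)   # seconds since the epoch (GNU/BSD date)
        if ${SUBMIT_CMD:-edg-job-submit} -o jobid -r "$site" --vo lhcb test.jdl; then
            outcome=ok
        else
            outcome=fail
        fi
        end=$(date +%s)
        printf '%s %s %s\n' "$site" "$outcome" "$((end - start))" >> "$log_file"
    done
}
```

Calling `submit_timed sites timing.log` after edg-job-match would then leave one line per submission in timing.log, which at least shows whether the delay is spread evenly or concentrated on particular sites. It does not say whether the time is spent in the UI, the RB or the sandbox transfer.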

All the jobs which were submitted completed (amazingly!), and took a very consistent 7-8 minutes to move through the system (according to the LB data), with the last 1-3 minutes occupied with executing "/bin/hostname".

My problem, now, is retrieving output. Is there a known bug that zero-length files cannot be retrieved, or a problem with the standard error output file? Here is what I typically get:

    error: a system call failed (Connection timed out)

    **** Warning: NS_FILE_RETRIEVAL ****
    Unable to retrieve the following output files:
    /var/edgwl/SandboxDir/g8/https_3a_2f_2fboszwijn.nikhef.nl_3a9000_2fg8XmN8rT9OwrF2uc6CPndQ/output/std.err
    for the job: "https://boszwijn.nikhef.nl:9000/g8XmN8rT9OwrF2uc6CPndQ"

    1 output file(s) out of 2 have been successfully retrieved
    Do you wish anyway to remove the directory:
    /userdisk/stokes/JobOutput/stokes_g8XmN8rT9OwrF2uc6CPndQ? [y/n]n :

The wording of this message is *very* poor. Now that I understand what it does, I would change the message to "Do you want to abort this operation and delete the local directory and retrieved files?"

But bizarrely *both* the std.err and std.out files have been retrieved, and it is the std.out file that is, unexpectedly, empty (the error message suggests that it is std.err that couldn't be retrieved). Only these have been retrieved successfully:

    dgrid-2.srce.hr
    dgrid-2.srce.hr
    gppwn05.gridpp.rl.ac.uk
    epcf33.ph.bham.ac.uk
    wn001.fzk.de
    gppwn05.gridpp.rl.ac.uk

For my last point in this long saga (it's taken an hour just to do the tests and write this email), I now have the edg-job-submit (which is still trucking away after 2 hours, trying to submit 36 "/bin/hostname" jobs) telling me that my job list file already exists, and asking whether I want to overwrite it. No, I want it to append job IDs to it, just like it was doing reasonably successfully for the first 20-odd jobs it encountered (even if only half of them didn't fail).
What could have made it suddenly decide that because the file already exists the only options are "Abort" or "Overwrite"? What happened to "Append", which should be the default behaviour?

Appropriately, I'll finish with an error message:

    ------ submitting to grid001.pd.infn.it:2119/jobmanager-pbs-medium
    Selected Virtual Organisation name (from --vo option): lhcb

    **** Warning: UI_FILE_EXISTS ****
    "/userdisk/stokes/test/edg/joblist" file already exists.
    Do you want to overwrite? [y/n]n : bye

    ------ submitting to grid001.pd.infn.it:2119/jobmanager-pbs-short
    Selected Virtual Organisation name (from --vo option): lhcb

    **** Warning: UI_FILE_EXISTS ****
    "/userdisk/stokes/test/edg/joblist" file already exists.
    Do you want to overwrite? [y/n]n :y

--
Ian Stokes-Rees
Particle Physics, Oxford
http://www-pnp.physics.ox.ac.uk/~stokes