Does anyone understand why it takes two hours and 24 tries at 9 sites (IC,
QMUL, RAL, BRIS, BHAM, CAM, LIV, IT, CERN) for the RB to find a site to
execute a "Hello World" job? I never heard back from anyone about what
causes this, and I think it is quite interesting.
Ian.
--
Ian Stokes-Rees
Particle Physics, Oxford http://www-pnp.physics.ox.ac.uk/~stokes/
> -----Original Message-----
> From: Ian Stokes-Rees
> Sent: 20 March 2003 17:06
> Subject: Job submission
>
> In an effort to further test Oxford job submission, I tried
> submitting any job from anywhere to anywhere. I couldn't do
> it. The interesting bit is watching the IC RB spend the last
> hour (and it is still going) trying to match my hello world
> job to some location. See below for a synopsis.
>
> RAL UI JOB SUBMIT TO ANY DESTINATION
>
> On gppui04 (RAL), using the IC RB config (BTW, the /etc/motd
> path is wrong) I execute:
>
> dg-job-submit -c /opt/edg/etc/IC_UI_ConfigENV.cfg helloworld.jdl
>
> I then watch the RB pass the job between numerous targets.
> Here is the current list:
>
> ResourceBroker/gm03 Job Accepted -- Waiting 15:41:41
> epcf36.ph.bham.ac.uk:2119/jobmanager-pbs-M 15:42:18
> tuber5.phy.bris.ac.uk:2119/jobmanager-pbs-tbq 15:42:46
> epcf36.ph.bham.ac.uk:2119/jobmanager-pbs-L 15:43:44
> lxshare0227.cern.ch:2119/jobmanager-pbs-short 15:47:40
> lxshare0227.cern.ch:2119/jobmanager-pbs-long 15:56:46
> lxshare0227.cern.ch:2119/jobmanager-pbs-infinite 15:58:20
> grid001.ct.infn.it:2119/jobmanager-pbs-short 16:09:23
> hepbf4.ph.qmul.ac.uk:2119/jobmanager-pbs-L 16:32:08
> grid002.to.infn.it:2119/jobmanager-pbs-medium 16:53:17
>
> ... And it is still going (17:06).
>
> Attempting "default" configuration:
>
> [stokes@gppui04 JSexercise1]$ dg-job-submit helloworld.jdl
>
> **** Warning: RB_CONNECTION_FAILURE ****
> Unable to connect to RB "lxshare0380.cern.ch"
>
> **** Error: UI_NO_RB_CONTACT ****
> Unable to contact any broker supplied
>
> Attempting from gppui06 resulted in (on the first dg-job-status):
>
> Transfer to UI failed: InputSanbox Transfer Error
>
> For completeness here is the JDL:
>
> Executable = "/bin/echo";
> Arguments = "Hello Thurs 1540";
> Stdoutput = "message.txt";
> StdError = "stderror";
> OutputSandbox = {"message.txt","stderror"};
[next email]
> > What's the output from dg-job-get-logging-info? Can you run jobs
> > directly with globus, e.g.
> >
> > globus-job-run epcf36.ph.bham.ac.uk:2119/jobmanager-pbs
> > /bin/hostname epcf35.ph.bham.ac.uk
>
> Globus-job-run works for me on most of those sites -- some of
> them never responded, but most were OK. The logging output
> can be seen here:
>
http://www-pnp.physics.ox.ac.uk/~stokes/drop/log.out
The log shows that the job went through 24 cycles of
Accept->Pending->Match->Refuse, where the refusal every time was:

Submitting job(s)ERROR: Failed to connect to local queue manager - condor command failed

Interestingly, at the end there was a job failure at IC with:

Cannot read JobWrapper output, both from Condor and from Maradona.

Finally, two hours after being submitted, it ran at grid002.to.infn.it and
produced:

Hello Thurs 1540
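
To see where the time went, the match list from the quoted message can be turned into per-step delays. The following is only a rough sketch: the timestamps are copied from that list (a partial snapshot taken before the job finally ran, not the full 24-cycle log), and the names TIMELINE, parse_timeline and gaps are made up for illustration. Reading each gap as time lost between successive matches is an assumption, not something the RB reports.

```python
from datetime import datetime

# Status lines copied from the quoted message: target, then time of day.
TIMELINE = """\
ResourceBroker/gm03 Job Accepted -- Waiting 15:41:41
epcf36.ph.bham.ac.uk:2119/jobmanager-pbs-M 15:42:18
tuber5.phy.bris.ac.uk:2119/jobmanager-pbs-tbq 15:42:46
epcf36.ph.bham.ac.uk:2119/jobmanager-pbs-L 15:43:44
lxshare0227.cern.ch:2119/jobmanager-pbs-short 15:47:40
lxshare0227.cern.ch:2119/jobmanager-pbs-long 15:56:46
lxshare0227.cern.ch:2119/jobmanager-pbs-infinite 15:58:20
grid001.ct.infn.it:2119/jobmanager-pbs-short 16:09:23
hepbf4.ph.qmul.ac.uk:2119/jobmanager-pbs-L 16:32:08
grid002.to.infn.it:2119/jobmanager-pbs-medium 16:53:17
"""


def parse_timeline(text):
    """Return a list of (label, datetime) pairs, one per non-empty line."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The timestamp is the last whitespace-separated field on the line.
        label, stamp = line.rsplit(None, 1)
        events.append((label, datetime.strptime(stamp, "%H:%M:%S")))
    return events


def gaps(events):
    """Return (label, seconds until the next event) for all but the last event."""
    return [
        (label, int((later - earlier).total_seconds()))
        for (label, earlier), (_, later) in zip(events, events[1:])
    ]


if __name__ == "__main__":
    # Print the steps with the longest delay first.
    for label, seconds in sorted(gaps(parse_timeline(TIMELINE)), key=lambda g: -g[1]):
        print(f"{seconds:5d} s after {label}")
```

On this snapshot the two longest gaps, each over 20 minutes, follow the matches at grid001.ct.infn.it and hepbf4.ph.qmul.ac.uk, so most of the elapsed time sits in a couple of slow steps rather than being spread evenly over the attempts.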