Hi Ian,
I am afraid it is now too late for me to trace this, as those logs are long
gone. Sorry I didn't do it at the time. Why not try it again...
david
On Fri, 28 Mar 2003, Ian Stokes-Rees wrote:
> Does anyone understand why it takes two hours and 24 tries at 9 sites (IC,
> QMUL, RAL, BRIS, BHAM, CAM, LIV, IT, CERN) for the RB to find a site to
> execute a "Hello World" job? I never heard back from anyone about what
> causes this, and I think it is quite interesting.
>
> Ian.
>
> --
> Ian Stokes-Rees [log in to unmask]
> Particle Physics, Oxford http://www-pnp.physics.ox.ac.uk/~stokes/
>
> > -----Original Message-----
> > From: Ian Stokes-Rees
> > Sent: 20 March 2003 17:06
> > To: [log in to unmask]
> > Subject: Job submission
> >
> > In an effort to further test Oxford job submission, I tried
> > submitting any job from anywhere to anywhere. I couldn't do
> > it. The interesting bit is watching the IC RB spend the last
> > hour (and it is still going) trying to match my hello world
> > job to some location. See below for a synopsis.
> >
> > RAL UI JOB SUBMIT TO ANY DESTINATION
> >
> > On gppui04 (RAL), using the IC RB config (BTW, the /etc/motd
> > path is wrong) I execute:
> >
> > dg-job-submit -c /opt/edg/etc/IC_UI_ConfigENV.cfg helloworld.jdl
> >
> > I then watch the RB pass the job between numerous targets.
> > Here is the current list:
> >
> > ResourceBroker/gm03 Job Accepted -- Waiting       15:41:41
> > epcf36.ph.bham.ac.uk:2119/jobmanager-pbs-M        15:42:18
> > tuber5.phy.bris.ac.uk:2119/jobmanager-pbs-tbq     15:42:46
> > epcf36.ph.bham.ac.uk:2119/jobmanager-pbs-L        15:43:44
> > lxshare0227.cern.ch:2119/jobmanager-pbs-short     15:47:40
> > lxshare0227.cern.ch:2119/jobmanager-pbs-long      15:56:46
> > lxshare0227.cern.ch:2119/jobmanager-pbs-infinite  15:58:20
> > grid001.ct.infn.it:2119/jobmanager-pbs-short      16:09:23
> > hepbf4.ph.qmul.ac.uk:2119/jobmanager-pbs-L        16:32:08
> > grid002.to.infn.it:2119/jobmanager-pbs-medium     16:53:17
> >
> > ... And it is still going (17:06).
> >
> > Attempting "default" configuration:
> >
> > [stokes@gppui04 JSexercise1]$ dg-job-submit helloworld.jdl
> >
> > **** Warning: RB_CONNECTION_FAILURE ****
> > Unable to connect to RB "lxshare0380.cern.ch"
> >
> > **** Error: UI_NO_RB_CONTACT ****
> > Unable to contact any broker supplied
> >
> > Attempting from gppui06 resulted in (on the first dg-job-status):
> >
> > Transfer to UI failed: InputSanbox Transfer Error
> >
> > For completeness here is the JDL:
> >
> > Executable = "/bin/echo";
> > Arguments = "Hello Thurs 1540";
> > Stdoutput = "message.txt";
> > StdError = "stderror";
> > OutputSandbox = {"message.txt","stderror"};
>
> [next email]
>
> > > What's the output from dg-job-get-logging-info? Can you run jobs
> > > directly with globus, e.g.
> > >
> > > globus-job-run epcf36.ph.bham.ac.uk:2119/jobmanager-pbs
> > > /bin/hostname epcf35.ph.bham.ac.uk
> >
> > Globus-job-run works for me on most of those sites -- some of
> > them never responded, but most were OK. The logging output
> > can be seen here:
> >
> http://www-pnp.physics.ox.ac.uk/~stokes/drop/log.out
>
> Which shows that it went through 24 cycles of
> Accept->Pending->Match->Refuse, where the refusal every time was:
>
> Submitting job(s)ERROR: Failed to connect to local queue manager - condor
> command failed
>
> Interestingly, at the end there was a job failure at IC with:
>
> Cannot read JobWrapper output, both from Condor and from Maradona.
>
> Finally, two hours after being submitted, it ran at grid002.to.infn.it and
> produced:
>
> Hello Thurs 1540
>
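For anyone repeating the test, the list of match attempts in the 20 March message can be turned into per-destination waiting times with a few lines of Python. This is only a rough sketch: the destinations and timestamps are copied from that message, the function names `dwell_times` and `total_elapsed` are made up for illustration, and it assumes every timestamp falls on the same day. It measures the gap between consecutive entries in the list, which is not necessarily the time the RB actually spent on each site.

```python
from datetime import datetime

# Destinations and times copied from the 20 March 2003 message.
ATTEMPTS = [
    ("ResourceBroker/gm03 Job Accepted -- Waiting", "15:41:41"),
    ("epcf36.ph.bham.ac.uk:2119/jobmanager-pbs-M", "15:42:18"),
    ("tuber5.phy.bris.ac.uk:2119/jobmanager-pbs-tbq", "15:42:46"),
    ("epcf36.ph.bham.ac.uk:2119/jobmanager-pbs-L", "15:43:44"),
    ("lxshare0227.cern.ch:2119/jobmanager-pbs-short", "15:47:40"),
    ("lxshare0227.cern.ch:2119/jobmanager-pbs-long", "15:56:46"),
    ("lxshare0227.cern.ch:2119/jobmanager-pbs-infinite", "15:58:20"),
    ("grid001.ct.infn.it:2119/jobmanager-pbs-short", "16:09:23"),
    ("hepbf4.ph.qmul.ac.uk:2119/jobmanager-pbs-L", "16:32:08"),
    ("grid002.to.infn.it:2119/jobmanager-pbs-medium", "16:53:17"),
]

TIME_FORMAT = "%H:%M:%S"


def dwell_times(attempts):
    """Return (destination, gap until the next entry) pairs.

    The last entry has no successor, so it is not included.
    """
    times = [datetime.strptime(stamp, TIME_FORMAT) for _, stamp in attempts]
    return [
        (dest, later - earlier)
        for (dest, _), earlier, later in zip(attempts, times, times[1:])
    ]


def total_elapsed(attempts):
    """Return the time between the first and last entries."""
    first = datetime.strptime(attempts[0][1], TIME_FORMAT)
    last = datetime.strptime(attempts[-1][1], TIME_FORMAT)
    return last - first


if __name__ == "__main__":
    for dest, gap in dwell_times(ATTEMPTS):
        print(f"{gap}  {dest}")
    print("total:", total_elapsed(ATTEMPTS))
```

Sorting the result of `dwell_times` by the gap shows which entries account for most of the delay, which may help narrow down where to look in the RB logs on a retry.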