On Thu, 3 May 2007, Antun Balaz wrote:
> Hi,
> You mean something like this:
> https://gus.fzk.de/pages/ticket_details.php?ticket=20732
Yes, but you also mention sgm accounts; what do they have to do with it
if the problem is with a specific WMS or with the network path to that WMS?
>
> Please read the comments until the end. I would appreciate if anybody can
> provide some insights into this...
I think that if the OutputSandbox can't be staged out back to the WMS on the
first try, the gCE should take care of this instead of the JobWrapper on the WN.
If the InputSandbox can't be staged in on the first (and maybe a second) try,
the job could simply be aborted: it has not yet started any computation,
so little work is lost by aborting it.
Otherwise, since we can't guarantee that all WMSes are in a good state,
we'll end up in this situation from time to time.
P.S. Could someone also answer my questions about the
torque/maui combination at the end of my original message?
Thanks
Alex
>
> Thanks, Antun
>
>
> -----
> Antun Balaz
> Research Assistant
> E-mail: [log in to unmask]
> Web: http://scl.phy.bg.ac.yu/
>
> Phone: +381 11 3713152
> Fax: +381 11 3162190
>
> Scientific Computing Laboratory
> Institute of Physics, Belgrade, Serbia
> -----
>
> ---------- Original Message -----------
> From: Alexander Piavka <[log in to unmask]>
> To: [log in to unmask]
> Sent: Thu, 3 May 2007 22:08:37 +0300
> Subject: [LCG-ROLLOUT] sleeping gCE JobWrappers on WN wasting resources.
>
> > Hi all,
> >
> > At the IL-BGU site I have a Torque server shared between the lcgCE and the gCE.
> > I've noticed that all jobs running on a WN are idle; these are all
> > gCE jobs (jobs submitted via the lcgCE finish OK). The reason these
> > gCE jobs are sleeping is that globus-url-copy fails inside the
> > globus_url_retry_copy function of the JobWrapper script:
> > globus_url_retry_copy "file://${workdir}/${f}" "${__output_base_url}${ff}"
> > Here ${f} is __output_file[1]="testjob-results.tgz",
> > meaning this is a SAM job that has finished all its work and is now
> > trying to pass the result testjob-results.tgz back to the WMS.
> > All these jobs are trying to globus-url-copy to rb108.cern.ch.
> > globus_url_retry_copy tries the globus-url-copy to the WMS up to
> > __file_tx_retry_count=6 times. The result is that all batch job
> > slots are occupied by these sleeping JobWrapper scripts, and no new
> > jobs can be processed.
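[Editor's note: the retry behaviour described above can be sketched as a plain
shell loop. This is a hedged reconstruction, not the actual gLite JobWrapper
code; the function name retry_copy, the RETRY_SLEEP backoff knob, and the
parameterized copy command are illustrative assumptions.]

```shell
#!/bin/sh
# Hedged sketch of a retry wrapper in the spirit of the JobWrapper's
# globus_url_retry_copy. Not the actual gLite code: the copy command is
# passed as a parameter (globus-url-copy in the real wrapper), and the
# backoff schedule is illustrative.

retry_copy() {
    # $1 = copy command, $2 = source URL, $3 = destination URL,
    # $4 = max attempts (defaults to 6, matching __file_tx_retry_count)
    cmd=$1; src=$2; dst=$3; max=${4:-6}
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$cmd" "$src" "$dst"; then
            return 0                     # transfer succeeded
        fi
        # wait before the next attempt; the batch slot stays occupied
        sleep "${RETRY_SLEEP:-10}"
        attempt=$(( attempt + 1 ))
    done
    return 1                             # all attempts failed
}
```

With six attempts and transfers that block until a long timeout, a single
unreachable WMS keeps the slot busy for the whole retry cycle, which matches
the symptom reported above.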
> >
> > - It would be nice to know: how can I reduce __file_tx_retry_count
> > from the pbs/glite environment to minimize the problem?
> >
> > - Also, it seems more reasonable that, when a gCE pbs job has
> > finished its computations and all that is left is to stage out the
> > output files to the WMS, a failed first globus-url-copy attempt
> > should make the JobWrapper hand the transfer over to the condor-c
> > on the gCE node itself, which could stage the files out once the
> > WMS is back in a good state, thus allowing new jobs to run on the
> > WNs. It would be nice if the JobWrapper were modified to support
> > this.
> >
> > - Also, some questions regarding the torque/maui combination:
> > From maui's point of view the node is busy but the load is low:
> > ------------------------------------------------
> > checknode wn02
> >
> > checking node wn02.bgu.ac.il
> >
> > State: Busy (in current state for 00:16:58)
> > Configured Resources: PROCS: 4 MEM: 3945M SWAP: 7583M DISK: 1M
> > Utilized Resources: PROCS: 4
> > Dedicated Resources: PROCS: 4
> > Opsys: linux Arch: [NONE]
> > Speed: 1.00 Load: 0.000
> > Network: [DEFAULT]
> > Features: [lcgpro]
> > Attributes: [Batch]
> > Classes: [dteam 3:4][ops 3:4][alice 4:4][atlas 2:4][biomed 4:4][cms
> > 4:4][lhcb 4:4]
> >
> > Total Time: 19:38:11 Up: 19:37:50 (99.97%) Active: 10:55:31 (55.64%)
> >
> > Reservations:
> > Job '335'(x1) -1:03:49 -> 2:22:56:11 (3:00:00:00)
> > Job '338'(x1) -00:40:35 -> 2:23:19:25 (3:00:00:00)
> > Job '340'(x1) -00:17:09 -> 2:23:42:51 (3:00:00:00)
> > Job '331'(x1) -2:13:47 -> 2:21:46:13 (3:00:00:00)
> > Job '341'(x1) 2:21:46:13 -> 5:21:46:13 (3:00:00:00)
> > JobList: 331,335,338,340
> > ALERT: node is in state Busy but load is low (0.000)
> > ------------------------------------------------
> >
> > From torque's point of view the WN is busy as well:
> > ------------------------------------------------
> > # pbsnodes wn02.bgu.ac.il
> > wn02.bgu.ac.il
> > state = job-exclusive
> > np = 4
> > properties = lcgpro
> > ntype = cluster
> > jobs = 0/335.cs-grid1.bgu.ac.il, 1/338.cs-grid1.bgu.ac.il,
> > 2/340.cs-grid1.bgu.ac.il, 3/331.cs-grid1.bgu.ac.il
> > status = opsys=linux,uname=Linux wn02.bgu.ac.il 2.6.9-
> > 42.0.2.EL.1.cernsmp #1 SMP Fri Sep 8 15:19:18 CEST 2006 x86_64,
> > sessions=11752 22736 2049 7827,nsessions=4,nusers=3,idletime=8918,
> > totmem=7853836kb,availmem=7765660kb, physmem=4039748kb,ncpus=4,
> > loadave=0.00,netload=2057275256,state=free,jobs=331.cs-
> > grid1.bgu.ac.il 335.cs-grid1.bgu.ac.il 338.cs-grid1.bgu.ac.il 340.cs-
> > grid1.bgu.ac.il,rectime=1178217867
> > ------------------------------------------------
> >
> > Maybe there is a way to allow the WN to accept more than 4 jobs when
> > the load is near zero or, better, below the pbs_mom-defined
> > $ideal_load value?
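[Editor's note: for reference, $ideal_load and $max_load live in pbs_mom's
config file on the WN; the values below are purely illustrative. As far as the
Torque documentation goes, these knobs only control when the node is reported
busy on load grounds (busy above $max_load, back to free below $ideal_load);
accepting more jobs than configured requires raising np itself.]

```
# $PBS_HOME/mom_priv/config on the WN (illustrative values for a 4-core node)
$ideal_load 3.5
$max_load   4.0
```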
> >
> > AFAIK, if the pbs_mom $max_load value is exceeded, the node won't
> > accept new jobs even when there are empty slots. Please correct me
> > if I'm wrong. Does it mean that maui-reserved jobs would not get
> > executed on time?
> >
> > How can glite+pbs check that, for example, a simple single-cpu job
> > does not start to use more than one cpu and thus starve other jobs?
> > The same question applies to any other WN resource that a
> > misbehaving job might exhaust, starving other jobs.
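[Editor's note: neither gLite nor Torque ships such a check out of the box, so
a site would have to run its own, e.g. from cron on each WN. The sketch below
sums the %CPU of all processes in a job's session and flags sessions over a
per-core budget; the function name, the 100%-per-core threshold, and the use
of the session id as the job boundary are all illustrative assumptions.]

```shell
#!/bin/sh
# Hedged sketch: flag a batch-job session whose total %CPU exceeds the
# single-core share it was allocated. Assumes a procps-style ps with the
# sess and pcpu output keywords.

check_session() {
    # $1 = session id of a batch job, $2 = allowed %CPU (e.g. 100 per core)
    sid=$1; limit=${2:-100}
    # sum %CPU over every process belonging to this session
    total=$(ps -e -o sess= -o pcpu= \
            | awk -v s="$sid" '$1 == s {t += $2} END {printf "%d", t}')
    if [ "${total:-0}" -gt "$limit" ]; then
        echo "session $sid uses ${total}%CPU (limit ${limit}%)"
        return 1
    fi
    return 0
}
```

A cron job could walk the session ids reported by pbs_mom and, for any session
over budget, alert the admin or renice the offending processes.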
> >
> > Thanks a lot
> > Alex
> ------- End of Original Message -------