Hi,
You mean something like this:
https://gus.fzk.de/pages/ticket_details.php?ticket=20732
Please read the comments through to the end. I would appreciate it if anybody
could provide some insights into this...
Thanks, Antun
-----
Antun Balaz
Research Assistant
E-mail: [log in to unmask]
Web: http://scl.phy.bg.ac.yu/
Phone: +381 11 3713152
Fax: +381 11 3162190
Scientific Computing Laboratory
Institute of Physics, Belgrade, Serbia
-----
---------- Original Message -----------
From: Alexander Piavka <[log in to unmask]>
To: [log in to unmask]
Sent: Thu, 3 May 2007 22:08:37 +0300
Subject: [LCG-ROLLOUT] sleeping gCE JobWrappers on WN wasting resources.
> Hi all,
>
> At the IL-BGU site I have a Torque server shared between an lcgCE and a gCE.
> I've noticed that all jobs running on a WN are idle; these jobs are
> all gCE jobs (jobs submitted via the lcgCE finish OK). The reason for
> these sleeping gCE jobs is that globus-url-copy fails inside the
> globus_url_retry_copy function of the JobWrapper script:
> globus_url_retry_copy "file://${workdir}/${f}" "${__output_base_url}${ff}"
> Here ${f} is __output_file[1]="testjob-results.tgz",
> i.e. this is a SAM job that has finished all its work and is now trying
> to pass the result testjob-results.tgz back to the WMS.
> All these jobs are trying to globus-url-copy to rb108.cern.ch.
> globus_url_retry_copy retries up to __file_tx_retry_count=6 times to
> globus-url-copy the file to the WMS. The result is that all batch job
> slots are occupied by these sleeping JobWrapper scripts, and no new
> jobs can be processed.
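For illustration, the retry behaviour described above boils down to a bounded retry loop. Here is a minimal sketch of the pattern (the function and helper names are hypothetical, not the actual JobWrapper code):

```shell
#!/bin/sh
# Minimal sketch of a bounded-retry transfer, illustrating the pattern
# that globus_url_retry_copy follows. copy_cmd stands in for the real
# globus-url-copy call; all names here are illustrative only.
copy_cmd() {
    # Placeholder transfer: a plain copy. The JobWrapper would run
    # globus-url-copy "$1" "$2" here instead.
    cp "$1" "$2"
}

retry_copy() {
    src="$1"; dst="$2"
    retry_count=6            # corresponds to __file_tx_retry_count=6
    i=0
    while [ "$i" -lt "$retry_count" ]; do
        if copy_cmd "$src" "$dst"; then
            return 0         # transfer succeeded
        fi
        i=$((i + 1))
        sleep 1              # the real wrapper also waits between attempts
    done
    return 1                 # every attempt failed: the job slot stays busy
}
```

When the WMS is unreachable, each such wrapper burns through all of its attempts while its batch slot sits idle, which matches the symptom above.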
>
> - It would be nice to know: how can I reduce __file_tx_retry_count
> from the pbs/glite environment to mitigate the problem?
>
> - It also seems more reasonable that when a gCE pbs job has finished
> its computations and all that is left to do is to stage out the output
> files to the WMS, and the first globus-url-copy attempt fails, the
> JobWrapper script should hand the transfer over to the condor-c on the
> gCE node itself, which can stage the files out once the WMS is back in
> a good state, thus allowing new jobs to run on the WNs. So it would be
> nice if the JobWrapper were modified to support this.
>
> - Also some questions regarding the torque/maui combination:
> From maui's point of view the node is busy but its load is low:
> ------------------------------------------------
> checknode wn02
>
> checking node wn02.bgu.ac.il
>
> State: Busy (in current state for 00:16:58)
> Configured Resources: PROCS: 4 MEM: 3945M SWAP: 7583M DISK: 1M
> Utilized Resources: PROCS: 4
> Dedicated Resources: PROCS: 4
> Opsys: linux Arch: [NONE]
> Speed: 1.00 Load: 0.000
> Network: [DEFAULT]
> Features: [lcgpro]
> Attributes: [Batch]
> Classes: [dteam 3:4][ops 3:4][alice 4:4][atlas 2:4][biomed 4:4][cms 4:4][lhcb 4:4]
>
> Total Time: 19:38:11 Up: 19:37:50 (99.97%) Active: 10:55:31 (55.64%)
>
> Reservations:
> Job '335'(x1) -1:03:49 -> 2:22:56:11 (3:00:00:00)
> Job '338'(x1) -00:40:35 -> 2:23:19:25 (3:00:00:00)
> Job '340'(x1) -00:17:09 -> 2:23:42:51 (3:00:00:00)
> Job '331'(x1) -2:13:47 -> 2:21:46:13 (3:00:00:00)
> Job '341'(x1) 2:21:46:13 -> 5:21:46:13 (3:00:00:00)
> JobList: 331,335,338,340
> ALERT: node is in state Busy but load is low (0.000)
> ------------------------------------------------
>
> From torque's point of view the WN is also busy:
> ------------------------------------------------
> # pbsnodes wn02.bgu.ac.il
> wn02.bgu.ac.il
> state = job-exclusive
> np = 4
> properties = lcgpro
> ntype = cluster
> jobs = 0/335.cs-grid1.bgu.ac.il, 1/338.cs-grid1.bgu.ac.il,
> 2/340.cs-grid1.bgu.ac.il, 3/331.cs-grid1.bgu.ac.il
> status = opsys=linux,
>   uname=Linux wn02.bgu.ac.il 2.6.9-42.0.2.EL.1.cernsmp #1 SMP Fri Sep 8 15:19:18 CEST 2006 x86_64,
>   sessions=11752 22736 2049 7827,nsessions=4,nusers=3,idletime=8918,
>   totmem=7853836kb,availmem=7765660kb,physmem=4039748kb,ncpus=4,
>   loadave=0.00,netload=2057275256,state=free,
>   jobs=331.cs-grid1.bgu.ac.il 335.cs-grid1.bgu.ac.il 338.cs-grid1.bgu.ac.il 340.cs-grid1.bgu.ac.il,
>   rectime=1178217867
> ------------------------------------------------
>
> Maybe there is a way to let the WN accept more than 4 jobs when the
> load is near zero or, alternatively, below the pbs_mom-defined
> $ideal_load value?
>
> AFAIK, if the pbs_mom $max_load value is exceeded, the node won't
> accept new jobs even when there are empty slots. Please correct me if
> I'm wrong. Does that mean that jobs with maui reservations would not
> get executed on time?
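For reference, both load thresholds are set in the MOM's configuration file on the WN (typically mom_priv/config); the values below are only illustrative for a 4-core node, not a recommendation:

```
# mom_priv/config -- illustrative values for a 4-core WN
$ideal_load 3.5   # below this load the node reports itself free again
$max_load   4.0   # above this load the node reports busy, takes no new work
```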
>
> How can glite+pbs check that, for example, a simple single-CPU job
> does not start to use more than one CPU and thereby starve other jobs?
> The same question applies to any other WN resource that a misbehaving
> job might exhaust, starving other jobs.
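As far as I know there is nothing built into this gLite/Torque setup that enforces it, but a small watchdog on the WN can at least detect the condition. A minimal sketch (the threshold logic, function name, and the ps-based input format are my assumptions, not part of any gLite or Torque tool):

```shell
#!/bin/sh
# Flag batch-job users whose total CPU usage exceeds the slots they hold.
# Reads "user %cpu" pairs (as produced by: ps -eo user,pcpu --no-headers)
# on stdin. Usage: flag_cpu_hogs <slots-per-job>
flag_cpu_hogs() {
    slots="$1"
    awk -v limit="$slots" '
        { total[$1] += $2 }                  # sum %CPU per user
        END {
            for (u in total)
                if (total[u] > limit * 100)  # 100% == one full CPU
                    printf "%s %.0f\n", u, total[u]
        }'
}
```

On a live node one would feed it real data, e.g. `ps -eo user,pcpu --no-headers | flag_cpu_hogs 1`, and then act on any output (mail the admin, or signal the offending job).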
>
> Thanks a lot
> Alex
------- End of Original Message -------