Hi all,
At the IL-BGU site I have a Torque server shared between an lcgCE and a gCE.
I've noticed that all jobs running on a WN are idle; these are all gCE jobs.
The same jobs run via the lcgCE finish OK.
The reason for these sleeping gCE jobs is that globus-url-copy fails inside the
globus_url_retry_copy function of the JobWrapper script:
globus_url_retry_copy "file://${workdir}/${f}" "${__output_base_url}${ff}"
Here ${f} is __output_file[1]="testjob-results.tgz", meaning this is a SAM job
that has finished all its computation and is now trying to pass the result
testjob-results.tgz back to the WMS. All of these jobs are trying to
globus-url-copy to rb108.cern.ch.
globus_url_retry_copy retries the globus-url-copy to the WMS up to
__file_tx_retry_count=6 times.
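For reference, the retry logic amounts to something like this (a minimal
sketch of the behaviour I observe, not the actual JobWrapper code; the sleep
interval is a guess):
------------------------------------------------
__file_tx_retry_count=6
globus_url_retry_copy()
{
    # try the copy up to $__file_tx_retry_count times, sleeping between
    # attempts; the job blocks here until every attempt has failed
    local src="$1" dst="$2" i=0
    while [ "$i" -lt "$__file_tx_retry_count" ]; do
        globus-url-copy "$src" "$dst" && return 0
        i=$((i+1))
        sleep 60   # guessed interval; the real back-off may differ
    done
    return 1
}
------------------------------------------------
With a destination like rb108.cern.ch unreachable, each job just sits in this
loop instead of finishing.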
The net result is that all batch job slots are occupied by these sleeping
JobWrapper scripts, and no new jobs can be processed.
- It would be nice to know: how can I reduce __file_tx_retry_count from the
PBS/gLite environment to minimize the problem? (A rough idea of what I mean
is sketched below.)
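I don't know where the value is actually set (WMS-side wrapper template or
CE-side?), so the snippet below is only a guess at a workaround; the grep is
just to locate the file first:
------------------------------------------------
# hypothetical workaround -- the config location is an assumption
grep -rl '__file_tx_retry_count' /opt/glite/etc 2>/dev/null
# then lower the value in whatever template turns up, e.g.:
# sed -i 's/__file_tx_retry_count=6/__file_tx_retry_count=2/' <template>
------------------------------------------------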
- Also, it seems more reasonable that when a gCE PBS job has finished its
computations and all that is left is to stage the output files out to the
WMS, then, if the first globus-url-copy attempt fails, the JobWrapper
script would hand the files over to the condor-c on the gCE node itself to
stage them out to the WMS (once the WMS is back in a good state), thus
allowing new jobs to run on the WNs. It would be nice if the JobWrapper
were modified to allow this functionality; a rough sketch of what I mean
follows.
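For illustration only (the spool directory is made up, and the CE-side pickup
process does not exist today -- that is exactly the feature I am asking for):
------------------------------------------------
if ! globus_url_retry_copy "file://${workdir}/${f}" "${__output_base_url}${ff}"; then
    # hypothetical spool area; a CE-side process (e.g. via condor-c)
    # would pick the files up from here and retry the transfer to the WMS
    cp "${workdir}/${f}" /var/spool/gce-stageout/
    exit 0   # free the batch slot instead of sleeping through retries
fi
------------------------------------------------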
- Also, some questions regarding the Torque/Maui combination:
From the Maui point of view the node is busy, but the load is low:
------------------------------------------------
checknode wn02
checking node wn02.bgu.ac.il
State: Busy (in current state for 00:16:58)
Configured Resources: PROCS: 4 MEM: 3945M SWAP: 7583M DISK: 1M
Utilized Resources: PROCS: 4
Dedicated Resources: PROCS: 4
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 0.000
Network: [DEFAULT]
Features: [lcgpro]
Attributes: [Batch]
Classes: [dteam 3:4][ops 3:4][alice 4:4][atlas 2:4][biomed 4:4][cms 4:4][lhcb 4:4]
Total Time: 19:38:11 Up: 19:37:50 (99.97%) Active: 10:55:31 (55.64%)
Reservations:
Job '335'(x1) -1:03:49 -> 2:22:56:11 (3:00:00:00)
Job '338'(x1) -00:40:35 -> 2:23:19:25 (3:00:00:00)
Job '340'(x1) -00:17:09 -> 2:23:42:51 (3:00:00:00)
Job '331'(x1) -2:13:47 -> 2:21:46:13 (3:00:00:00)
Job '341'(x1) 2:21:46:13 -> 5:21:46:13 (3:00:00:00)
JobList: 331,335,338,340
ALERT: node is in state Busy but load is low (0.000)
------------------------------------------------
Also, from the Torque point of view the WN is busy:
------------------------------------------------
# pbsnodes wn02.bgu.ac.il
wn02.bgu.ac.il
state = job-exclusive
np = 4
properties = lcgpro
ntype = cluster
jobs = 0/335.cs-grid1.bgu.ac.il, 1/338.cs-grid1.bgu.ac.il, 2/340.cs-grid1.bgu.ac.il, 3/331.cs-grid1.bgu.ac.il
status = opsys=linux,uname=Linux wn02.bgu.ac.il 2.6.9-42.0.2.EL.1.cernsmp #1 SMP Fri Sep 8 15:19:18 CEST 2006
x86_64,sessions=11752 22736 2049 7827,nsessions=4,nusers=3,idletime=8918,totmem=7853836kb,availmem=7765660kb,
physmem=4039748kb,ncpus=4,loadave=0.00,netload=2057275256,state=free,jobs=331.cs-grid1.bgu.ac.il
335.cs-grid1.bgu.ac.il 338.cs-grid1.bgu.ac.il 340.cs-grid1.bgu.ac.il,rectime=1178217867
------------------------------------------------
Maybe there is a way to allow the WN to accept more than 4 jobs when the
load is near zero or, rather, below the pbs_mom-defined $ideal_load value?
AFAIK, if the pbs_mom $max_load value is exceeded, the node won't accept new
jobs even when there are empty slots. Please correct me if I'm wrong. (The
mom config I have in mind is sketched below.)
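Something along these lines in the mom config, with illustrative values only
(the config path may differ per install):
------------------------------------------------
# /var/spool/pbs/mom_priv/config
# pbs_mom reports the node busy once the load exceeds $max_load and
# free again when it drops below $ideal_load; values are examples only
$ideal_load 3.5
$max_load   4.0
------------------------------------------------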
Does this mean that all the Maui-reserved jobs will not get executed on time?
How can gLite+PBS ensure that, for example, a simple single-CPU job does not
start to use more than one CPU, starving the other jobs? The same question
applies to any other WN resource that a misbehaving job may exhaust, starving
the other jobs. The only handle I can think of is sketched below.
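The best I can come up with is to set default/maximum resource limits per
queue, so that a misbehaving job gets killed rather than starving the others.
Queue name and values here are illustrative; note that a job burning two CPUs
exhausts a cput limit twice as fast as its walltime, so the damage is at
least bounded:
------------------------------------------------
qmgr -c "set queue ops resources_default.cput = 48:00:00"
qmgr -c "set queue ops resources_max.cput = 72:00:00"
qmgr -c "set queue ops resources_max.pmem = 1gb"
------------------------------------------------
This only limits the damage after the fact, though, hence the question.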
Thanks a lot
Alex