Hi David
Which VO are the jobs running under?
I found the policy document in the Deployment Board policies section at
http://www.gridpp.ac.uk/db/ (search for "inefficient" in the search box
on the homepage). The direct link for the summary page is
http://www.gridpp.ac.uk/db/Inefficient_Jobs.html and for the document
itself
http://www.gridpp.ac.uk/db/GridPP-PMB-113-Inefficient_Jobs_v1.1.doc (the
name also led me to find the document in the PMB documents list at
http://www.gridpp.ac.uk/pmb/docs/index.htm). So, take your pick!
Jeremy
-----Original Message-----
From: Testbed Support for GridPP member institutes
[mailto:[log in to unmask]] On Behalf Of David Colling
Sent: 18 October 2008 20:31
To: [log in to unmask]
Subject: [Fwd: Jobs idling on transfers..]
Dear TB people,
We have been seeing a lot of idle jobs recently because their transfers
are failing (see below). Now that we are in a survey period for our next
round of funding, these jobs are burning potential earnings and so should
be killed. However, this is clearly not always good for the poor user. I
know that Graeme wrote a suggested policy on idle jobs, but I cannot
find it (I tried the documents section of the GridPP web pages and
Google with no luck), so I was wondering if anybody can point me to a
link to it.
What are other people planning to do about this? Clearly, the way to
maximise your GridPP income is to kill a job as soon as you detect that
it has stopped using CPU time (OK, after a short while, so that working
transfers succeed and you don't get blacklisted by any experiments).
However, it does seem rather unfair to kill a job that has been running
for 65 hours if it fails to transfer its output in 30 minutes.
All the best,
david
-------- Original Message --------
Subject: Jobs idling on transfers..
Date: Thu, 16 Oct 2008 22:35:17 +0100
From: Kostas Georgiou <[log in to unmask]>
To: lcg-site <[log in to unmask]>
Hi,
I had a look at why so many jobs in the farm are idle, and it looks like
most of them are just waiting to copy data back to
lcgwms02.gridpp.rl.ac.uk, which I suspect is down (it doesn't reply to a
ping).
The default job script does something like the following:
    timewait=300
    copy_retry_count=6
    retry_count=0
    while retry_count <= copy_retry_count
        try copy
        if failed
            timewait = timewait*2
            retry_count++
            sleep timewait
        else
            return
    end
If my maths don't fail me, the jobs will be waiting for more than
10 hours, which probably isn't something that we want.
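As a rough check of that arithmetic, here is a minimal Python sketch of the back-off loop above, assuming every copy attempt fails and ignoring the time spent in the copy attempts themselves. The function name and the double_before_sleep switch are illustrative only and are not part of the job script; the switch covers both readings of the pseudocode (doubling the wait before or after the sleep), since the real script may differ from the summary given here.

```python
def total_wait_seconds(first_wait=300, retry_limit=6, double_before_sleep=True):
    """Worst-case total sleep time when every copy attempt fails.

    first_wait and retry_limit mirror timewait=300 and copy_retry_count=6
    from the pseudocode; double_before_sleep is a hypothetical switch.
    """
    timewait = first_wait
    total = 0
    retry_count = 0
    while retry_count <= retry_limit:
        # Assume the copy attempt failed on this pass.
        if double_before_sleep:
            timewait *= 2          # as written in the pseudocode
            total += timewait
        else:
            total += timewait      # alternative: sleep first, then double
            timewait *= 2
        retry_count += 1
    return total


for flag in (True, False):
    seconds = total_wait_seconds(double_before_sleep=flag)
    print(flag, seconds, round(seconds / 3600, 2))
# True 76200 21.17
# False 38100 10.58
```

Either way the total is over 10 hours with the default values, so a job stuck on a dead destination holds its slot for most of a day at worst.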
Can you please decide on a sensible number for
GLITE_LOCAL_COPY_RETRY_COUNT and GLITE_LOCAL_COPY_RETRY_FIRST_WAIT and
put the environment variables in the job manager prologue script so we
aren't wasting our CPU?
Kostas