Hi,

I can take a detailed look next week (I am still on vacation this week),
but my gut feeling is that the negative response times come from flaws in
the algorithm used to compute values such as "free cpus" and "total
cpus".  I've had to do similar things for the Copernican ERT algorithm,
and it is *really hard* to get it right.  Here is an example:

In the latest release of Torque, when a WN goes down, the node is
reported as "down" by a command like 'pbsnodes -a'.  However, any jobs
that were running on that node when it went down are still listed as
running by a command like 'qstat -f'.  So if you count running jobs
solely on the basis of 'qstat -f', and free/total CPUs solely on the
basis of 'pbsnodes -a', you can wind up with a negative value for
FreeCPUs: some of your jobs are running on DeadCPUs, but you are
subtracting them from an AvailableJobSlots value that does not include
DeadCPUs (unless your algorithm is really bad).
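
To make the mismatch concrete, here is a minimal sketch in Python.  The
dictionaries stand in for already-parsed 'pbsnodes -a' and 'qstat -f'
output; their layout, the node and job names, and both function names
are assumptions made up for illustration, not anything Torque or the
information provider actually uses.  It also assumes one job per CPU.

```python
# Hypothetical snapshot of the two data sources. Real 'pbsnodes -a' and
# 'qstat -f' output would have to be parsed into something like this.
nodes = {
    "wn01": {"state": "free", "cpus": 2},
    "wn02": {"state": "down", "cpus": 2},
}
running_jobs = [
    {"id": "101", "node": "wn01"},
    {"id": "102", "node": "wn01"},
    {"id": "103", "node": "wn02"},  # node is down, job still listed as running
    {"id": "104", "node": "wn02"},
]


def free_cpus_naive(nodes, running_jobs):
    """Slots from live nodes only, minus ALL jobs listed as running."""
    available = sum(n["cpus"] for n in nodes.values() if n["state"] != "down")
    return available - len(running_jobs)


def free_cpus_consistent(nodes, running_jobs):
    """Count only jobs on live nodes, so both sides use the same node set."""
    live = {name for name, n in nodes.items() if n["state"] != "down"}
    available = sum(nodes[name]["cpus"] for name in live)
    running = sum(1 for job in running_jobs if job["node"] in live)
    # Clamp at zero as a last line of defence against stale data.
    return max(available - running, 0)


print(free_cpus_naive(nodes, running_jobs))       # -2
print(free_cpus_consistent(nodes, running_jobs))  # 0
```

The naive version subtracts four running jobs from two available slots
and goes negative; the second version drops the jobs on the dead node
before subtracting.  Whether this is what actually happens in the
information provider is something I have not checked.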

Clear?  Too bad LCG rollout rejected my last post, explaining the James
Kirk / Elvis Presley theory.  Maybe I will talk about that at the next
LCG workshop.  Fits in well with spacetime warping.

        J "yes I really did try to post it" T

On Fri, 2005-01-07 at 00:47, Dimitris Zilaskos wrote:
> Maarten Litmaath, CERN wrote:
> > On Thu, 6 Jan 2005, Rod Walker wrote:
> >
> >
> >>Hi,
> >>The following CE's are showing negative gluecestateestimatedresponsetime:
> >>
> >>ce001.m45.ihep.su:2119/jobmanager-pbs-infinite -1
> >>heplnx131.pp.rl.ac.uk:2119/jobmanager-lcgpbs-lhcbL -3
> >>lunegw.lancs.ac.uk:2119/jobmanager-lcgpbs-infinite -1
> >>node001.grid.auth.gr:2119/jobmanager-lcgpbs-infinite -2147483647
> >>testbed001.phys.sinica.edu.tw:2119/jobmanager-lcgpbs-infinite -624366
> >>
> >>Actually sinica attracts all my jobs until a few are queueing and then
> >>the ERT goes sensible again.
> >
> >
> > The Greek site is seriously warping space-time, allowing a job submitted
> > today to finish some 68 years ago...  :-)
>
>         Major breakthrough of Grid Computing :)
>
> I suspect that this is caused by some variable wrapping when exceeding a
> particular value; we have around 15 jobs in the infinite queue at this
> moment. Perhaps someone with knowledge of how this
> gluecestateestimatedresponsetime value is calculated can help.
>
> Best regards,
>
> --
> =============================================================================
>
> Dimitris Zilaskos
>
> Department of Physics @ Aristotle University of Thessaloniki, Greece
> PGP key : http://tassadar.physics.auth.gr/~dzila/pgp_public_key.asc
>            http://egnatia.ee.auth.gr/~dzila/pgp_public_key.asc
> MD5sum  : de2bd8f73d545f0e4caf3096894ad83f  pgp_public_key.asc
> =============================================================================