Hi,
One setting which may help move the Glue issue along is to set the
TotalCPU counts for CEs to *zero*. This has the advantage of being a
definite algorithm that is implementable across all sites.
A refinement which might suit sites wishing to implement their own
algorithm is to apply a site-independent scaling factor: compute the
TotalCPU count as you think is correct, and multiply the result by zero.
Seriously, the approach suggested in Stephen's link (if it's the one I
think it is) is correct: a *cluster* is a physical group of machines and
should have a TotalCPU count. A *CE* maps onto an LRMS queue and may or
may not share CPUs with other CEs, so "TotalCPUs" for a CE is
meaningless. Unless, perhaps, you would like to make TotalCPUs ==
RunningJobs; that might actually be a reasonable alternative to the
"zero" algorithm.
The relevant numbers for a CE are 1) how long a job submitted now might
wait in the queue before executing, and 2) how many such jobs might be
submitted before they start piling up in the queue. In other words, for
2): how many jobs can one submit to this queue before one expects the
response time of 1) to start increasing?
These numbers need to be available per VO since they are likely to be
different per VO.
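(A minimal sketch of how these two per-VO numbers might be estimated from a
queue snapshot. None of this is from any real GLUE or LRMS API; the
`QueueSnapshot` type, its fields, and the drain-rate model are all
illustrative assumptions.)

```python
# Sketch: estimating the two per-VO numbers described above, given a
# hypothetical snapshot of LRMS queue state. All names here are
# illustrative assumptions, not part of any real GLUE or LRMS interface.
from dataclasses import dataclass

@dataclass
class QueueSnapshot:
    vo: str
    free_slots: int         # CPUs currently idle and usable by this VO
    waiting_jobs: int       # jobs already queued for this VO
    avg_job_runtime: float  # seconds; a rough historical average

def estimated_response_time(q: QueueSnapshot) -> float:
    """1) How long might a job submitted now wait before executing?"""
    if q.free_slots > 0 and q.waiting_jobs == 0:
        return 0.0  # a free slot is available immediately
    # Crude model: the backlog drains at one job per free slot per
    # average runtime; treat a fully busy queue as draining one at a time.
    slots = max(q.free_slots, 1)
    return (q.waiting_jobs + 1) / slots * q.avg_job_runtime

def free_capacity(q: QueueSnapshot) -> int:
    """2) How many jobs can be submitted before they start piling up?"""
    return max(q.free_slots - q.waiting_jobs, 0)

atlas = QueueSnapshot(vo="atlas", free_slots=4,
                      waiting_jobs=0, avg_job_runtime=3600.0)
print(estimated_response_time(atlas))  # 0.0 -- runs immediately
print(free_capacity(atlas))            # 4 free submissions before queuing
```

The point of publishing these per VO rather than a raw TotalCPUs is that two
VOs sharing the same physical CPUs will generally see different values.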
JT
On Sun, 2005-01-30 at 10:19, pierre girard wrote:
> Hi Stephen,
>
> Burke, S (Stephen) wrote:
>
> >LHC Computer Grid - Rollout
> >
> >
> >>[mailto:[log in to unmask]] On Behalf Of pierre girard said:
> >>The solution consisting in adding CPU counts at Subcluster level,
> >>proposed by stephen, should be the solution to our problem.
> >>
> >>
> >
> >The proposal has now been sent to the glue mailing list for discussion, but
> >I wouldn't like to predict how long it will take before we see it in
> >production!
> >
> >http://www.hicb.org/archives/glue-schema/2005/frm00001.html
> >
> >
> Thanks for this link, it should be very interesting for us.
>
> >
> >
> >>But a solution could be to sum systematically the CPUs of
> >>each queue by site. Indeed, this value has the same meaning for all the
> >>sites.
> >
> >This doesn't really work for most sites. PBS sites generally have all the
> >CPUs listed in every queue, so if there are ten queues you would overcount
> >by a factor of ten.
> >
> >
> Exactly. So what I meant is that the CPU concept defined from the queue
> is not the same as the "real CPU" concept. I call it a virtual CPU, as
> opposed to a real CPU, but maybe we should use another term for the
> queue CPU, something like "job entry point", to avoid any confusion.
>
> As there is, for the moment, no sure way to compute the real CPU count,
> I agree with you that we should get it once the Glue schema allows a
> site to provide it directly.
>
> So, what I was simply suggesting is to compute what we can for the
> moment, if we really want to compute something. ;)
> And I think that we can already count the number of job entry points a
> site provides to the grid, that is, the sum of the site's queue CPUs.
>
> What is the interest of this? Certainly none for the moment, except
> that it is a well-defined value which is computed uniformly over all
> the sites. Comparisons between sites would then be possible, although
> I'm not sure that such a comparison would be of real interest. But it
> could be, if the site queues have been reasonably defined by the site
> administrators.
>
> Moreover, once we are able to get the real CPU count, it could later be
> used to compute an indicator of the quality of service provided by the
> sites. Indeed, we could then estimate something like the load factor
> (?) of the site CPUs.
>
> Indeed, I guess that a site which defines 10 queues on the same cluster
> will not be able to provide the same quality of service as a site which
> defines only 1 queue on the same kind of cluster. This is close to the
> "overbooking" problem of an airline ;): when all the clients arrive at
> the same time, you cannot live up to your offer, and the clients must
> wait.
>
> But we are now quite far from our initial problem ;). And I think that
> we all finally agree on the initial problem.
>
> Pierre
>
> >Stephen
> >
> >
> >
>
> --
> ______________________
> Pierre GIRARD
> Grid Computing Team Member
> IN2P3/CNRS Computing Centre - Lyon (FRANCE)
> http://cc.in2p3.fr
> Tel. +33 4.78.93.08.80 | Fax. +33 4.72.69.41.70 | e-mail: [log in to unmask]