Burke, S (Stephen) wrote:
>LHC Computer Grid - Rollout
>
>
>>Wei Xing said:
>>We have together 13 WNs, which 6 dedicate BioMed DC to
>>running BioMed
>>jobs. However, since we have other 7 WNs free, RB sends us
>>BioMed jobs
>>continuously (so far we have 164 jobs). How can we let RB know we
>>are now overload?
>>
>>
>
>You need to know what Rank and Requirements expressions they are using,
>e.g. they should probably modify them to reduce the Rank as the number
>of queued jobs goes up. It might be useful for them to talk to people in
>atlas, they have quite a lot of experience dealing with this sort of
>thing.
>
>Stephen
>
>
>
Dear Wei Xing,
We have already observed the same problem you are now encountering at
different sites (PIC, LPC).
* One solution when a queue is overloaded is to close it to incoming
Biomed jobs. There are two places to do that. One is the PBS queue
manager:
$ qmgr -c 'set queue .. enabled=false'
The other is the Information Service: in the ce-static.ldif file, add
this line for each queue used by Biomed:
GlueCEStateStatus: Closed
* Another solution, as suggested, is to prevent the overload in the
first place, for example with:
GlueCEPolicyMaxTotalJobs: 10
The Rank we use is the same one the Atlas VO uses:
Rank = (other.GlueCEStateWaitingJobs == 0 ?
other.GlueCEStateFreeCPUs : -other.GlueCEStateWaitingJobs);
But we found that this Rank does not work well for sites that dedicate
many CPUs to Biomed and only a few to other VOs. In that case some jobs
from other VOs are waiting while many Biomed CPUs are free. The Rank
above is then negative because of the waiting (other VO) jobs, so the
dedicated Biomed CPUs stay idle.
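To make the effect concrete, here is a small Python sketch of the Rank
above. The function name and the site numbers are invented for
illustration only; they are not taken from any real site.

```python
def rank_waiting_first(free_cpus, waiting_jobs):
    """Mirror of the Rank above: free CPUs only count when nothing is waiting."""
    return free_cpus if waiting_jobs == 0 else -waiting_jobs

# Hypothetical site: 20 CPUs dedicated to Biomed are free,
# but 5 jobs from other VOs are queued.
dedicated_site = rank_waiting_first(free_cpus=20, waiting_jobs=5)

# Hypothetical site: only 2 free CPUs, empty queue.
small_site = rank_waiting_first(free_cpus=2, waiting_jobs=0)

print(dedicated_site, small_site)  # -5 2
```

The site with 20 idle Biomed CPUs ranks below the site with 2 free
CPUs, so it gets no Biomed jobs.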
So we suggested changing the Rank to
Rank = (other.GlueCEStateFreeCPUs == 0 ?
-other.GlueCEStateWaitingJobs : other.GlueCEStateFreeCPUs);
That way jobs can go to sites with many dedicated Biomed CPUs.
But now the problem is with sites that have a policy like yours. When
all other sites have fewer than 7 free CPUs, your site receives all the
jobs. And we cannot distinguish the two policies (in the case
other.GlueCEStateWaitingJobs > 0 and other.GlueCEStateFreeCPUs > 0).
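A short Python sketch of the revised Rank shows why the two policies
look the same to the RB. The site names and most numbers are
hypothetical; the 7 free CPUs and 164 waiting jobs come from the
original question, assuming one CPU per WN.

```python
def rank_free_first(free_cpus, waiting_jobs):
    """Mirror of the revised Rank: waiting jobs only count when no CPU is free."""
    return -waiting_jobs if free_cpus == 0 else free_cpus

# Published values only; the RB sees nothing about who may use the free CPUs.
sites = {
    # Free CPUs are NOT available to Biomed (policy like Wei Xing's site).
    "reserved_for_others": {"free_cpus": 7, "waiting_jobs": 164},
    # Hypothetical sites with fewer than 7 free CPUs.
    "site_x": {"free_cpus": 3, "waiting_jobs": 0},
    "site_y": {"free_cpus": 0, "waiting_jobs": 4},
}

ranks = {name: rank_free_first(**state) for name, state in sites.items()}
best = max(ranks, key=ranks.get)

print(ranks)  # {'reserved_for_others': 7, 'site_x': 3, 'site_y': -4}
print(best)   # reserved_for_others
```

The overloaded site still wins, because the 164 waiting jobs are
ignored as long as it publishes at least one free CPU.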
To sum up, since the Information System carries no information about
the site policy, there is currently no way to avoid this kind of problem.
Personally, I think the RB should send jobs to each site with a
probability proportional to its total CPU speed, as described here:
(http://www.ulb.ac.be/di/ssd/vberten/Papers/RandomBrokering-Full.ps)
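As a rough illustration of that idea (my own sketch, not code from the
paper or from the RB), a weighted random choice could look like this in
Python. The site names, the speed values and the pick_site helper are
all hypothetical.

```python
import random
from collections import Counter

def pick_site(cpu_speed_by_site, rng):
    """Pick one site with probability proportional to its total CPU speed."""
    names = list(cpu_speed_by_site)
    weights = [cpu_speed_by_site[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

# Hypothetical total CPU speed per site (arbitrary units).
total_cpu_speed = {"site_a": 10.0, "site_b": 30.0, "site_c": 60.0}

rng = random.Random(0)  # fixed seed so the run is repeatable
n_jobs = 10_000
draws = Counter(pick_site(total_cpu_speed, rng) for _ in range(n_jobs))
shares = {name: draws[name] / n_jobs for name in total_cpu_speed}

print(shares)  # roughly 0.1, 0.3 and 0.6
```

With such a scheme no site receives all the jobs just because it
publishes the highest Rank at a given moment.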
Best regards,
--
Emmanuel Medernach