The VO tags are definitely something that we have to worry about; however,
there are a few tricks we can use to obtain that information remotely
without mounting NFS.
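
One such trick, sketched below, is to read the tag files through a fork
job on the CE. This is only a sketch: the host, VO and tag path are
illustrative, it assumes the usual /opt/edg/var/info tag location, and it
needs a valid grid proxy on the querying host.

    # Read a VO's tag list from the CE without an NFS mount, by running
    # /bin/cat on the CE through the Globus fork jobmanager.
    import subprocess

    ce = "ce101.example.org"                     # hypothetical CE host
    vo = "atlas"                                 # hypothetical VO
    tag_file = "/opt/edg/var/info/%s/%s.list" % (vo, vo)

    p = subprocess.Popen(["globus-job-run", ce + "/jobmanager-fork",
                          "/bin/cat", tag_file], stdout=subprocess.PIPE)
    out, _ = p.communicate()
    print out.strip()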
Laurence
Brew, CAJ (Chris) wrote:
> Hi,
>
> Ah, but the VO Tags are normally on the CE, but published as part of the
> subcluster. For my pair of CEs I put both the tags and the gridmapdir
> onto NFS mounts of the torque server on the grounds that if that's down
> then the batch system is down anyway.
>
> Chris.
>
>
>> -----Original Message-----
>> From: LHC Computer Grid - Rollout
>> [mailto:[log in to unmask]] On Behalf Of Laurence
>> Sent: 20 September 2007 14:27
>> To: [log in to unmask]
>> Subject: Re: [LCG-ROLLOUT] most sites publishing zero for
>> cluster sizes
>>
>> I completely agree with Ulrich that one thing that we are
>> missing is a CE which can be deployed as a load balanced service.
>>
>> For the publishing, as the batch system can be queried
>> remotely (it is anyway at CERN), the information service part
>> could be moved to the site BDII.
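
As an illustration of the remote query (not a recipe): the sketch below
asks a Torque/PBS server for its queue status from another host. The
server name is made up, and the querying node would of course have to be
authorised in the server's ACLs.

    # Query a remote Torque/PBS server; the "@server" form makes qstat
    # contact that server directly instead of the local one.
    import subprocess

    torque = "torque01.example.org"    # hypothetical batch head node
    p = subprocess.Popen(["qstat", "-q", "@" + torque],
                         stdout=subprocess.PIPE)
    out, _ = p.communicate()
    print out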
>>
>> Laurence
>>
>>
>> Ulrich Schwickerath wrote:
>>
>>> Hi,
>>>
>>> my 2 cents ;-)
>>>
>>>
>>>
>>>> But isn't that exactly what e.g. the CERN CEs are doing?
>>>> For example, ce101 and ce102 are fronting the same WNs.
>>>>
>>>>
>>> Plus some more identical CEs because a single CE is not able to cope
>>> with the work load. If this service could be distributed over a load
>>> balanced cluster of (stateless) machines this problem would not exist,
>>> right ?
>>>
>>> If only one CE is supposed to publish these values per sub cluster,
>>> which one should be picked ? What happens if this CE goes down (*) ?
>>> Also, the number of cores aka CPUs in this environment is not at all
>>> static but very dynamic. It changes every day because machines come
>>> and go for various reasons. Is the number of cores/cpus a useful
>>> number at all, if the resources are in fact shared (as they are at
>>> CERN) with local users ? (*)
>>>
>>> I'm surprised to hear statements like "misuse of the schema" while
>>> this is the only possibility for us to actually survive the load.
>>> Please believe me that it is not easy to maintain a cluster of 24
>>> machines in a CE cluster, and if we have to start to make individual
>>> machines special in some way, the system will become unmaintainable
>>> very soon. As a matter of fact we have a job throughput right now of
>>> 130k jobs per day and about 50k jobs in the system, increasing.
>>>
>>> Practical suggestions for a solution are very welcome.
>>>
>>> Cheers,
>>> Ulrich
>>>
>>> (*) one could indeed set up a non-existing "fake" CE per cluster in
>>> the BDII which would publish these numbers. I do not really like this
>>> idea because it is just another ugly hack to get around a limitation
>>> of the system, isn't it ?
>>>
>>>
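
For completeness: the "fake" CE footnote above would boil down to a few
lines of static LDIF fed to the site BDII. A minimal sketch, with made-up
names and numbers, trimmed to the Glue attributes under discussion:

    # Print a static GlueSubCluster entry carrying the per-cluster CPU
    # counts; the output would go into the BDII's static LDIF.
    cluster = "lxbatch.example.org"            # hypothetical cluster name
    physical_cpus, logical_cpus = 1800, 3600   # made-up CPU counts

    print "dn: GlueSubClusterUniqueID=%s,GlueClusterUniqueID=%s,mds-vo-name=local,o=grid" % (cluster, cluster)
    print "objectClass: GlueSubCluster"
    print "GlueSubClusterUniqueID: %s" % cluster
    print "GlueChunkKey: GlueClusterUniqueID=%s" % cluster
    print "GlueSubClusterPhysicalCPUs: %d" % physical_cpus
    print "GlueSubClusterLogicalCPUs: %d" % logical_cpus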