Hi Owen,
We have asked for another option to be added to the EGEE broadcast tool
to email all the sys admins directly. This will be used for all
notifications such as this one, and for new releases etc.
Laurence
owen maroney wrote:
> For the record, the fix for this problem is described in the wiki for
> LCG Release Fixes:
>
> http://goc.grid.sinica.edu.tw/gocwiki/LCG_Release_Fixes
>
> (This is indeed a big problem: at IC this morning it was discovered
> the site bdii had stopped late yesterday evening, and lhcb had merrily
> put over 600 jobs in the queue, in the belief that the farm was still
> empty. BDII now updated!)
>
> While the wiki pages are very good and should be kept up to date
>
> (and probably ought to be linked to by the release notes:
> http://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/index.cgi?var=releases)
>
>
> there does need to be some mechanism by which Release Fixes are more
> clearly flagged for sysadmins (e.g. message header: NEW RELEASE FIX:
> xyz, with the body of the message a link to the wiki).
>
>
> Laurence wrote:
>
>> Hi Rod,
>>
>> The "sticky bdii" problem is well understood and a fix has been out
>> for some time. The bdii update thread will die under certain
>> circumstances. The GStat Monitor includes a check on BDIIs with stale
>> data. Any sites worried about having this problem can check the
>> GStat page (which they should do regularly anyway!). Maarten Litmaath
>> sent an email to the Rollout list some time ago informing everyone
>> that there is a new rpm that fixes the problem. An update will also
>> be in the apt repository.
>>
>> Laurence
>>
>>
>>
>>
>>
>> Rod Walker wrote:
>>
>>> Hi,
>>> I suspect that sg01-lcg.cr.cnaf.infn.it:2170 is stuck, since it is
>>> publishing zero queued jobs despite many of my jobs queuing there
>>> (it's lsf so I can't check for sure - where is qstat?).
>>> I've mailed the admins directly but if this is due to the bad bdii
>>> version distributed with 2.4.0 then I would think many sites still
>>> use this. It's a particularly nasty bug as it can attract thousands
>>> of jobs to the affected site. As such I would say it's a candidate
>>> for an "urgent patch", if such a thing exists.
>>>
>>> Cheers,
>>> Rod.
>>>
>>>
>>>
>>>
>