FYI

-------- Original Message --------
Subject: ATLAS sites, please attention to this!
Date: Mon, 1 Oct 2007 12:36:18 +0200
From: Simone Campana <[log in to unmask]>
To: atlas-comp-oper (ATLAS Computing Operations) <[log in to unmask]>

Dear ATLAS sites, please pay particular attention to this.

Most sites are publishing wrong numbers in the ATLAS VOView. This is
particularly bad for job distribution, since the WMS looks at the VOView
information to decide whether a site is empty or full. As a result, jobs
are piling up at sites that are already full, while sites that could
run jobs are left empty.

Jeff offered an explanation (see mail below) with a description of how
he fixed the problem. The problem was also reported at the last LCG
operations meeting, but the situation still looks particularly bad.

I put a list of problematic CEs in

http://voatlas01.cern.ch/atlas/data/VOViewProblem.log

Next to each CE name you will find some numbers. They give the number of
waiting and running jobs from the "all inclusive" view (which shows
information about all VOs supported in that queue) and the number of
waiting and running jobs obtained by adding up all the VOViews for the
VOs supported by that site. In general the two numbers should be the
same for both waiting and running jobs, but they are not.
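
The comparison described above can be sketched in a few lines of Python.
This is only an illustration of the check, not the script that produced
the log: the function names and the job counts are made up, and it
assumes the (waiting, running) pairs have already been extracted from
the information system.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int]  # (waiting jobs, running jobs)


def sum_voviews(voviews: Dict[str, Counts]) -> Counts:
    """Add up waiting and running jobs over all VOViews of one CE."""
    waiting = sum(w for w, _ in voviews.values())
    running = sum(r for _, r in voviews.values())
    return waiting, running


def is_consistent(all_inclusive: Counts, voviews: Dict[str, Counts]) -> bool:
    """True if the summed VOViews match the "all inclusive" view."""
    return sum_voviews(voviews) == all_inclusive


# Hypothetical numbers for one CE: the queue as a whole is busy, but
# the per-VO views account for only part of the jobs.
ce_total = (120, 300)
views = {"atlas": (0, 10), "lhcb": (20, 90)}

print(sum_voviews(views), is_consistent(ce_total, views))
```

A CE for which is_consistent() returns False is a candidate for the
list, subject to the false positives mentioned below.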


Some further documentation on debugging Information Providers, besides
Jeff's explanation, is in this Twiki from Laurence.

http://twiki.cern.ch/twiki/bin/view/EGEE/TestingDynamicInformation

Could ATLAS sites please investigate and give a statement, and if
possible fix the problem (there may be some false positives, but
generally the problem is there).

If this situation lasts long it will be quite bad for production, and at
some point drastic measures like site banning will have to be
enforced.

Thanks for the attention.

Simone



> -----Original Message-----
> From: Jeff Templon [mailto:[log in to unmask]]
> Sent: Friday, September 28, 2007 3:43 PM
> To: Nicholas Thackray; atlas-comp-oper (ATLAS Computing Operations);
> Gergely Debreczeni
> Subject: changed lcg-info-dynamic-scheduler.conf
> 
> Hi *,
> 
>    It was reported at the ops meeting (and associated tickets opened)
> that most ATLAS jobs were invisible from the VOViews published by
> sites.
> 
>    The case here was that the dynamic-scheduler was being configured
> to map special groups to FQANs, while publishing of these FQANs was
> turned off in the machine BDII.  Hence yes, the special groups have
> been configured to be invisible.
> 
>    I turned them back on, by configuring the dynamic scheduler (by
> hand) to map all VO special groups to the generic VO.  I suspect this
> is what is needed at other sites as well, perhaps you could check and
> announce the necessary changes.  Please make sure that the
> announcement is worded correctly, people seem to be getting the
> impression that
> there is a bug in the information provider, this is definitely not
> responsible for what is seen; the info provider is doing exactly what
> is requested!
> 
>    Gergo, could you check: I suspect that YAIM is still doing this
> group-to-FQAN mapping, even though publishing is turned off.  That's
> the only way I can understand that 130 separate queues are affected.
> 
> 			JT
> 
> ps: what I did here to fix it is in the attachment.  Don't blindly apply the
> patch because your site's mix of VOs may be different.
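
The effect Jeff describes can be illustrated with a small Python model.
This is a sketch of the idea only, not the dynamic scheduler's code or
configuration format: the group names, the FQAN strings and the function
visible_jobs() are hypothetical. It assumes each job is counted under
the view its group is mapped to, and that jobs mapped to a view that is
not published are simply not seen.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def visible_jobs(job_groups: Iterable[str],
                 group_map: Dict[str, str],
                 published_views: Set[str]) -> Counter:
    """Count jobs per published view; jobs mapped elsewhere are dropped."""
    counts: Counter = Counter()
    for group in job_groups:
        # Groups without an explicit mapping are counted under their own name.
        view = group_map.get(group, group)
        if view in published_views:
            counts[view] += 1
    return counts


# Hypothetical site: four ATLAS jobs, three of them from special groups.
jobs = ["atlas", "atlasprd", "atlasprd", "atlassgm"]
published = {"atlas"}  # FQAN views are not published

# Special groups mapped to FQANs: their jobs become invisible.
fqan_map = {"atlasprd": "/atlas/Role=production",
            "atlassgm": "/atlas/Role=lcgadmin"}

# Special groups mapped to the generic VO: all jobs are counted.
generic_map = {"atlasprd": "atlas", "atlassgm": "atlas"}

print(visible_jobs(jobs, fqan_map, published)["atlas"])     # 1
print(visible_jobs(jobs, generic_map, published)["atlas"])  # 4
```

With the first mapping only one of the four jobs shows up in the ATLAS
VOView; with the second all four do, which is the change described in
the mail above.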