Thanks for your efforts here, Santanu. You've clearly improved the
plugin a lot just by getting it to run correctly.
I would suggest that you submit a patch against the Condor plugin, so
that your hard work makes its way into the middleware and isn't lost
to other sites.
Things are not, however, quite right yet. I can see 5 atlas jobs
running, but none in the VOView:
azure:~/teaching/linux_c$ !487
ldapsearch -x -H ldap://serv03.hep.phy.cam.ac.uk:2170 \
  -b mds-vo-name=UKI-SOUTHGRID-CAM-HEP,o=grid \
  '(&(|(objectclass=GlueVOView)(objectclass=GlueCE))(GlueCEAccessControlBaseRule=VO:atlas))' \
  GlueCEStateTotalJobs GlueCEStateWaitingJobs GlueCEStateRunningJobs GlueCEAccessControlBaseRule
# extended LDIF
#
# LDAPv3
# base <mds-vo-name=UKI-SOUTHGRID-CAM-HEP,o=grid> with scope sub
# filter: (&(|(objectclass=GlueVOView)(objectclass=GlueCE))(GlueCEAccessControlBaseRule=VO:atlas))
# requesting: GlueCEStateTotalJobs GlueCEStateWaitingJobs GlueCEStateRunningJobs GlueCEAccessControlBaseRule
#

# serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas, UKI-SOUTHGRID-CAM-HEP, grid
dn: GlueCEUniqueID=serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas,mds-vo-name=UKI-SOUTHGRID-CAM-HEP,o=grid
GlueCEStateRunningJobs: 5
GlueCEStateRunningJobs: 0
GlueCEStateTotalJobs: 5
GlueCEStateTotalJobs: 0
GlueCEStateWaitingJobs: 0
GlueCEStateWaitingJobs: 0
GlueCEAccessControlBaseRule: VO:atlas
# atlas, serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas, UKI-SOUTHGRID-CAM-HEP, grid
dn: GlueVOViewLocalID=atlas,GlueCEUniqueID=serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas,mds-vo-name=UKI-SOUTHGRID-CAM-HEP,o=grid
GlueCEAccessControlBaseRule: VO:atlas
GlueCEStateRunningJobs: 0
GlueCEStateWaitingJobs: 0
GlueCEStateTotalJobs: 0
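Since the mismatch is between the aggregate GlueCE counters and the per-VO
VOView counters, a small script can make the comparison mechanical. This is
only an illustrative sketch (the parser and helper names are invented, and
real ldapsearch output folds long lines, which a robust parser would have to
unfold first):

```python
# Sketch: compare running-job counts from GlueCE entries against GlueVOView
# entries in (already unfolded) LDIF text. Helper names are invented.

def parse_ldif(text):
    """Return a list of (dn, attrs); attrs maps attribute name -> values."""
    entries, dn, attrs = [], None, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            if dn is not None:          # a blank/comment line ends an entry
                entries.append((dn, attrs))
                dn, attrs = None, {}
            continue
        name, _, value = line.partition(": ")
        if name == "dn":
            dn, attrs = value, {}
        elif dn is not None:
            attrs.setdefault(name, []).append(value)
    if dn is not None:
        entries.append((dn, attrs))
    return entries

def sum_running(entries, voview):
    """Sum GlueCEStateRunningJobs over VOView entries (voview=True)
    or over plain GlueCE entries (voview=False)."""
    total = 0
    for dn, attrs in entries:
        if dn.startswith("GlueVOViewLocalID=") == voview:
            total += sum(int(v) for v in attrs.get("GlueCEStateRunningJobs", []))
    return total

# Toy data shaped like the output above: the CE claims 5 running jobs,
# the VOView claims 0 -- exactly the inconsistency being discussed.
sample = """\
dn: GlueCEUniqueID=serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas,mds-vo-name=UKI-SOUTHGRID-CAM-HEP,o=grid
GlueCEStateRunningJobs: 5

dn: GlueVOViewLocalID=atlas,GlueCEUniqueID=serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas,mds-vo-name=UKI-SOUTHGRID-CAM-HEP,o=grid
GlueCEStateRunningJobs: 0
"""

entries = parse_ldif(sample)
if sum_running(entries, voview=False) != sum_running(entries, voview=True):
    print("CE totals and VOView totals disagree")
```

Run against a full site dump, the same comparison per VO would flag exactly
the queues Simone's log lists.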
Have you checked the lcg-info-dynamic-scheduler.conf file to ensure
it's not mapping production accounts to a FQAN?
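For comparison, the piece to look at is the vomap stanza in that file. The
sketch below is only a guide, not a drop-in: the group names are invented
and the exact syntax may differ between plugin versions, so check it against
your installed copy.

```ini
[Main]
static_ldif_file: /opt/lcg/var/gip/ldif/static-file-CE.ldif
# Map every Unix group -- including special/production groups such as
# atlasprd -- back to the generic VO, so their jobs are counted in the
# VOView. Group names here are illustrative; check them at your site.
vomap :
    atlas:atlas
    atlasprd:atlas
    atlassgm:atlas
```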
Thanks
Graeme
On 8 Oct 2007, at 10:45, Santanu Das wrote:
> Hi Simone,
>
> Just to let you know that VOView simply doesn't work for a site
> running Condor, because of the non-functional "lrmsinfo-condor" and
> the broken "lcg-info-dynamic-scheduler" script. That's the reason we
> were always publishing 4444 waiting jobs. I really doubt that the
> Atlas team thought about batch systems other than PBS and took care
> to check compatibility. Apart from that, there are a number of
> unprotected operations in the Python code ... lcg-info-dynamic-
> scheduler is one example.
>
> We've made some changes in the scripts and the VOView part is
> working now, I believe. Running the script, we now get output like
> this:
>
> [root@serv03 libexec]# /opt/lcg/libexec/lcg-info-dynamic-scheduler \
>     -c /opt/lcg/etc/lcg-info-dynamic-scheduler.conf
> dn: GlueVOViewLocalID=alice,GlueCEUniqueID=serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-alice,mds-vo-name=local,o=grid
> GlueVOViewLocalID: alice
> GlueCEAccessControlBaseRule: VO:alice
> GlueCEStateRunningJobs: 1
> GlueCEStateWaitingJobs: 0
> GlueCEStateTotalJobs: 1
> GlueCEStateFreeJobSlots: 108
> GlueCEStateEstimatedResponseTime: 0
> GlueCEStateWorstResponseTime: 0
>
> dn: GlueVOViewLocalID=atlas,GlueCEUniqueID=serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas,mds-vo-name=local,o=grid
> GlueVOViewLocalID: atlas
> GlueCEAccessControlBaseRule: VO:atlas
> GlueCEStateRunningJobs: 17
> GlueCEStateWaitingJobs: 0
> GlueCEStateTotalJobs: 17
> GlueCEStateFreeJobSlots: 108
> GlueCEStateEstimatedResponseTime: 0
> GlueCEStateWorstResponseTime: 0
>
> dn: GlueVOViewLocalID=biomed,GlueCEUniqueID=serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-biomed,mds-vo-name=local,o=grid
> GlueVOViewLocalID: biomed
> GlueCEAccessControlBaseRule: VO:biomed
> GlueCEStateRunningJobs: 3
> GlueCEStateWaitingJobs: 1
> GlueCEStateTotalJobs: 4
> GlueCEStateFreeJobSlots: 0
> GlueCEStateEstimatedResponseTime: 465856
> GlueCEStateWorstResponseTime: 2146060842
> ....................
> ....................
>
> and, also:
>
> [root@serv03 libexec]# ldapsearch -x -H ldap://serv03.hep.phy.cam.ac.uk:2170 \
>     -b mds-vo-name=UKI-SOUTHGRID-CAM-HEP,o=grid \
>     '(&(|(objectclass=GlueVOView)(objectclass=GlueCE))(GlueCEAccessControlBaseRule=VO:atlas))' \
>     GlueCEStateTotalJobs GlueCEStateWaitingJobs GlueCEStateRunningJobs GlueCEAccessControlBaseRule
> version: 2
>
> #
> # filter: (&(|(objectclass=GlueVOView)(objectclass=GlueCE))(GlueCEAccessControlBaseRule=VO:atlas))
> # requesting: GlueCEStateTotalJobs GlueCEStateWaitingJobs GlueCEStateRunningJobs GlueCEAccessControlBaseRule
> #
>
> # serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas, UKI-SOUTHGRID-CAM-HEP, grid
> dn: GlueCEUniqueID=serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas,mds-vo-name=UKI-SOUTHGRID-CAM-HEP,o=grid
> GlueCEStateRunningJobs: 17
> GlueCEStateRunningJobs: 17
> GlueCEStateTotalJobs: 17
> GlueCEStateTotalJobs: 17
> GlueCEStateWaitingJobs: 0
> GlueCEStateWaitingJobs: 0
> GlueCEAccessControlBaseRule: VO:atlas
>
> # atlas, serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas, UKI-SOUTHGRID-CAM-HEP, grid
> dn: GlueVOViewLocalID=atlas,GlueCEUniqueID=serv03.hep.phy.cam.ac.uk:2119/jobmanager-lcgcondor-atlas,mds-vo-name=UKI-SOUTHGRID-CAM-HEP,o=grid
> GlueCEAccessControlBaseRule: VO:atlas
> GlueCEStateRunningJobs: 17
> GlueCEStateWaitingJobs: 0
> GlueCEStateTotalJobs: 17
>
> # search result
> search: 2
> result: 0 Success [What does it mean??]
>
> # numResponses: 3
> # numEntries: 2
>
>
> The atlas jobs that are running are not real production jobs; they
> are only from atlasprd. Any idea why we are still not getting any
> jobs? Any advice or suggestion in this regard would be very much
> appreciated.
>
> Regards,
> Santanu
> HEP, Cavendish Lab
> Cambridge
>
>
>
> Alessandra Forti wrote:
>> FYI
>>
>> -------- Original Message --------
>> Subject: ATLAS sites, please attention to this!
>> Date: Mon, 1 Oct 2007 12:36:18 +0200
>> From: Simone Campana <[log in to unmask]>
>> To: atlas-comp-oper (ATLAS Computing Operations) <[log in to unmask]>
>>
>> Dear ATLAS sites, please pay particular attention to this. Most
>> sites are publishing wrong numbers in the ATLAS VOView. This is
>> particularly bad for job distribution, since the WMS looks at the
>> VOView info to decide whether a site is empty or full. Therefore
>> jobs are piling up at sites which are already full, leaving empty
>> some sites which could run jobs.
>>
>> Jeff offered an explanation (see mail below) with a description of
>> how he fixed the problem; in addition, the problem was reported at
>> the last LCG operations meeting, but the situation still looks
>> particularly bad. I put a list of problematic CEs in
>> http://voatlas01.cern.ch/atlas/data/VOViewProblem.log
>> Beside the CE name you find some numbers, which represent the
>> number of waiting and running jobs from the "all inclusive" view
>> (showing info about all VOs supported in that queue) and the number
>> of waiting and running jobs obtained by adding up all the VOViews
>> for VOs supported by that site. Generally the two numbers for both
>> waiting and running jobs should be the same, but they don't match.
>>
>> Some further documentation about debugging Information Providers,
>> besides Jeff's explanation, is in this Twiki from Laurence:
>> http://twiki.cern.ch/twiki/bin/view/EGEE/TestingDynamicInformation
>>
>> Could ATLAS sites please investigate and give a statement, possibly
>> fixing the problem (there may be some false positives, but
>> generally the problem is there). If this situation lasts long it
>> would be quite bad for production, and at some point drastic
>> measures like site banning will necessarily be enforced.
>>
>> Thanks for the attention.
>> Simone
>>> -----Original Message-----
>>> From: Jeff Templon [mailto:[log in to unmask]]
>>> Sent: Friday, September 28, 2007 3:43 PM
>>> To: Nicholas Thackray; atlas-comp-oper (ATLAS Computing
>>> Operations); Gergely Debreczeni
>>> Subject: changed lcg-info-dynamic-scheduler.conf
>>>
>>> Hi *,
>>>
>>> It was reported at the ops meeting (and associated tickets opened)
>>> that most ATLAS jobs were invisible in the VOViews published by
>>> sites. The case here was that the dynamic scheduler was being
>>> configured to map special groups to FQANs, while publishing of
>>> these FQANs was turned off in the machine BDII. Hence, yes, the
>>> special groups had been configured to be invisible.
>>>
>>> I turned them back on by configuring the dynamic scheduler (by
>>> hand) to map all VO special groups to the generic VO. I suspect
>>> this is what is needed at other sites as well; perhaps you could
>>> check and announce the necessary changes. Please make sure that
>>> the announcement is worded correctly: people seem to be getting
>>> the impression that there is a bug in the information provider,
>>> which is definitely not responsible for what is seen; the info
>>> provider is doing exactly what is requested!
>>>
>>> Gergo, could you check: I suspect that YAIM is still doing this
>>> group-to-FQAN mapping, even though publishing is turned off.
>>> That's the only way I can understand that 130 separate queues are
>>> affected.
>>>
>>> JT
>>>
>>> ps: what I did here to fix it is attached. Don't blindly apply the
>>> patch, because your site's mix of VOs may be different.
>
--
Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/