Hi all,
I'm trying this (Stephen's lrmsinfo-condor back end to Jeff's
lcg-info-dynamic-scheduler) out along with Leslie Groer's hacked
lcg-info-dynamic-condor. I'm still missing some pieces of the puzzle,
but I noticed a couple of issues, reported below.
More later as I work through the issues. I'm still fuzzy on the
relationship between /opt/lcg/libexec/lcg-info-dynamic-condor
(the original, or the one hacked by Leslie Groer) and Jeff's
lcg-info-dynamic-scheduler (with Stephen Childs's lrmsinfo-condor
back end). They both call condor_q and condor_status. Should they
not both be used at the same time, or do they fill in different
sections of the GLUE schema?
Problems, in no particular order...
1) Leslie's lcg-info-dynamic-condor won't work if the Condor central
manager isn't running on the standard port; we are running ours on
9660. The full value, including the port, seems to be retrievable
with 'condor_config_val COLLECTOR_HOST', but since I am not a Condor
pro, I don't know whether COLLECTOR_HOST and CONDOR_HOST would always
be the same.
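For what it's worth, here is a rough sketch of how a plugin might split
host and port out of the COLLECTOR_HOST value, assuming output of the
form 'host:port' (the function name and the 9618 fallback, Condor's
usual default collector port, are my own; I don't know the plugin's
internals):

def parse_collector_host(value, default_port=9618):
    """Split a COLLECTOR_HOST string into (host, port).

    If no port is given, fall back to Condor's usual default, 9618.
    """
    value = value.strip()
    if ":" in value:
        host, port = value.rsplit(":", 1)
        return host, int(port)
    return value, default_port

print(parse_collector_host("lcg01.usatlas.bnl.gov:9660"))
print(parse_collector_host("lcg01.usatlas.bnl.gov"))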
2) In Stephen's lrmsinfo-condor, if you are using UNIX secondary groups,
getusergroup=commands.getstatusoutput("id -Gn %s" %user)
will return *all* of the user's groups in the group field. Switching to
"id -gn %s" %user, i.e. lowercase 'g' for the effective group, worked
for me, but again, I'm not sure what the wider ramifications of using
that might be.
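To illustrate the difference (using subprocess.getstatusoutput, the
Python 3 equivalent of the old commands.getstatusoutput, and running
against the current user rather than a "%s" % user substitution): for a
user in several groups, "id -Gn" returns a space-separated list, while
"id -gn" returns only the effective group.

import subprocess

# All groups the current user belongs to, e.g. "atlas wheel users"
status, all_groups = subprocess.getstatusoutput("id -Gn")

# Effective group only, e.g. "atlas"
status, eff_group = subprocess.getstatusoutput("id -gn")

print(all_groups)
print(eff_group)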
3) What are the units for the cycle_time config variable? And is this
referring to how frequently condor worker nodes ask for jobs, i.e.
NEGOTIATOR_INTERVAL? That does appear to be available from
condor_config_val.
4) lrmsinfo-condor appears to fail when there are *no* jobs in Condor
from this CE. See attached Python trace.
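My guess at the failure mode (I'm working only from the trace, not the
source): 'user' is presumably bound inside the per-job loop over the
condor_q output, so when there are zero jobs the later "id -gn" lookup
hits an unbound name. A minimal guard would be to do the group lookup
per job, inside the loop; everything below (function name, fake job-line
format) is illustrative, not the actual plugin code:

import subprocess

def jobs_with_groups(condor_q_lines):
    """Parse fake job lines of the form 'jobid owner state' and attach
    each owner's effective group; an empty input yields an empty list."""
    jobs = []
    for line in condor_q_lines:
        if not line.strip():
            continue
        jobid, user, state = line.split()
        _, group = subprocess.getstatusoutput("id -gn %s" % user)
        jobs.append({"jobid": jobid, "user": user, "group": group})
    return jobs  # no jobs -> [], and no NameError on 'user'

print(jobs_with_groups([]))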
Cheers,
--john
Stephen Childs wrote:
> Please could Condor sites test my plugin for the dynamic scheduler info
> provider and give me some feedback. I had a query about how to install
> it. Here's how:
>
> 1. Install the file lrmsinfo-condor at /opt/lcg/libexec/
>
> 2. Create a dynamic scheduler configuration file something like this:
> [root@gridgate plugin]# cat
> /opt/lcg/etc/lcg-info-dynamic-scheduler-lcgcondor.conf
> [Main]
> static_ldif_file: /opt/lcg/var/gip/ldif/static-file-CE.ldif
> vomap :
> dteam:dteam
> lhcb:lhcb
> atlas:atlas
> alice:alice
> cms:cms
> ops:ops
> module_search_path : ../lrms:../ett
> [LRMS]
> lrms_backend_cmd: /opt/lcg/libexec/lrmsinfo-condor
> [Scheduler]
> cycle_time : 0
>
> 3. Create a dynamic scheduler wrapper that looks something like this:
> [root@gridgate plugin]# cat
> /opt/lcg/var/gip/plugin/lcg-info-dynamic-scheduler-wrapper
> #!/bin/sh
> /opt/lcg/libexec/lcg-info-dynamic-scheduler -c
> /opt/lcg/etc/lcg-info-dynamic-scheduler-lcgcondor.conf
>
> 4. Try running the command as edginfo:
> su - edginfo -c
> /opt/lcg/var/gip/plugin/lcg-info-dynamic-scheduler-wrapper
>
> You should get sensible output something like this:
>
> dn:
> GlueVOViewLocalID=gitest,GlueCEUniqueID=gridgate.cs.tcd.ie:2119/jobmanager-lcgcondor-condor,mds-vo-name=local,o=grid
>
> GlueVOViewLocalID: gitest
> GlueCEAccessControlBaseRule: VO:gitest
> GlueCEStateRunningJobs: 0
> GlueCEStateWaitingJobs: 0
> GlueCEStateTotalJobs: 0
> GlueCEStateFreeJobSlots: 7
> GlueCEStateEstimatedResponseTime: 0
> GlueCEStateWorstResponseTime: 0
>
> dn:
> GlueCEUniqueID=gridgate.cs.tcd.ie:2119/jobmanager-lcgcondor-condor,mds-vo-name=local,o=grid
>
> GlueCEStateRunningJobs: 0
> GlueCEStateWaitingJobs: 0
> GlueCEStateTotalJobs: 0
> GlueCEStateEstimatedResponseTime: 0
> GlueCEStateWorstResponseTime: 0
>
>
> Stephen
--
John R. Hover
RHIC/Atlas Computing Facility, Bldg. 510M
Physics Department
Brookhaven National Laboratory
Upton, NY 11793
email: [log in to unmask]
tel: 631-344-5828
[root@lcg01 root]# condor_q
-- Submitter: lcg01.usatlas.bnl.gov : <130.199.185.48:21848> : lcg01.usatlas.bnl.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
8931.0 atlas004 1/31 11:19 0+00:00:00 I 0 0.0 data
1 jobs; 1 idle, 0 running, 0 held
[root@lcg01 libexec]# ./lrmsinfo-condor
nactive 5007
nfree 4660
now 1170260379
schedCycle 300
{'queue': 'condor', 'state': 'running', 'cpucount': '1', 'group': 'atlas', 'user': 'atlas004', 'maxwalltime': '999999999.0', 'qtime': '1170260361.0', 'jobid': 'lcg01.usatlas.bnl.gov#1170260361#8931.0'}
[root@lcg01 libexec]# condor_q
-- Submitter: lcg01.usatlas.bnl.gov : <130.199.185.48:21848> : lcg01.usatlas.bnl.gov
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held
[root@lcg01 libexec]# ./lrmsinfo-condor
nactive 5007
nfree 4659
now 1170260405
schedCycle 300
Traceback (most recent call last):
File "./lrmsinfo-condor", line 63, in ?
getusergroup=commands.getstatusoutput("id -gn %s" %user)
NameError: name 'user' is not defined