Hi all,
I've been trying out the various new Condor info scripts offered on this
list and am having some problems. Namely,
1) /opt/lcg/var/gip/plugin/lcg-info-dynamic-ce, which is just a wrapper
that runs

    /opt/lcg/libexec/lcg-info-dynamic-condor /opt/condor/bin/ /opt/lcg/var/gip/ldif/static-file-CE.ldif

works, giving real values for, e.g., GlueCEStateTotalJobs.
2) /opt/lcg/var/gip/plugin/lcg-info-dynamic-scheduler-wrapper, which is
just a wrapper that runs

    /opt/lcg/libexec/lcg-info-dynamic-scheduler -c /opt/lcg/etc/lcg-info-dynamic-scheduler-lcgcondor.conf

*sometimes* works, giving real values for, e.g., GlueCEStateFreeJobSlots.
But other times it dumps a Python stack trace:
Traceback (most recent call last):
  File "/opt/lcg/libexec/lcg-info-dynamic-scheduler", line 306, in ?
    ert = wq.estimate(bq,vo,debug=0)
  File "/opt/lcg/lib/python/EstTT.py", line 183, in estimate
    return self.strategy.algorithm(lrms, vo, debug)
  File "/opt/lcg/lib/python/EstTT.py", line 47, in algorithm
    return ett(lrms, self.queue, vo, algorithm='longest',debug=debug)
  File "/opt/lcg/lib/python/EstTT.py", line 245, in ett
    est = _ALGS[algorithm](server,server.jobs_last_query())
  File "/opt/lcg/lib/python/EstTT.py", line 194, in ett_longest_queue_time
    qtlist.append(server.now - j.get('qtime'))
TypeError: unsupported operand type(s) for -: 'int' and 'str'
3) Running

    /opt/lcg/bin/lcg-info-generic /opt/lcg/etc/lcg-info-generic.conf

which I thought basically just runs the plugins, never shows an error,
but it never shows real values for the examples above either. Just zeros.
My static-file-CE.ldif just has zeros for the values above. I thought
the dynamic values were supposed to replace the zeros. Should the static
files have blanks?
Any idea what might be going wrong? Sorry if I've missed something obvious.
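
On the TypeError in 2): Python raises it when a str is subtracted from
an int. Below is a minimal sketch of how that could happen here, assuming
the scheduler receives 'qtime' as a quoted string (which is how the
lrmsinfo-condor output quoted below prints it) while server.now is an
int. The helper to_epoch_seconds is hypothetical and is not part of
EstTT.py; the sketch illustrates the error and is not a confirmed
diagnosis.

```python
# Sketch only: reproduces the failing subtraction from the traceback and
# shows one way to coerce the value before subtracting.
# to_epoch_seconds is a hypothetical helper, not part of EstTT.py.

def to_epoch_seconds(value):
    """Accept 1169724411, 1169724411.0 or '1169724411.0'; return an int."""
    return int(float(value))

now = 1169724422                     # 'now' as printed by lrmsinfo-condor
job = {'qtime': '1169724411.0'}      # qtime as a quoted string (assumption)

failure = None
try:
    waited = now - job.get('qtime')  # same shape as EstTT.py line 194
except TypeError as err:
    failure = str(err)               # int minus str is not defined

# Coercing first makes the subtraction well defined.
waited = now - to_epoch_seconds(job.get('qtime'))
print(waited)                        # seconds the job has been queued: 11
```

If the failing line only runs when there are queued jobs, this would also
be consistent with the script working at times when the queue is empty,
but that is a guess.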
Thanks,
--john
Stephen Childs wrote:
>> Anyway, to get back on-topic, I started hacking an lrmsinfo-condor
>> script yesterday evening. So far I have this:
>
> I have attached a reasonably functional lrmsinfo-condor. This is what
> the output looks like (after the condor output):
>
> [root@gridgate libexec]# condor_status -total; condor_q ; python lrmsinfo-condor
>
> Condor output
> -------------
>
>                Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
>
>   INTEL/LINUX     17      6        0         11        0           0         0
> INTEL/WINNT51      1      1        0          0        0           0         0
>
>         Total     18      7        0         11        0           0         0
>
>
> -- Submitter: gridgate.cs.tcd.ie : <134.226.53.57:44399> : gridgate.cs.tcd.ie
> ID       OWNER      SUBMITTED   RUN_TIME    ST  PRI  SIZE  CMD
> 16470.0  gitest042  1/11 17:36  0+00:18:16  H   0    9.8   data
> 22766.0  gitest042  1/25 11:26  0+00:00:03  R   0    9.8   data
> 22767.0  gitest042  1/25 11:26  0+00:00:00  I   0    9.8   data
> 22768.0  gitest042  1/25 11:26  0+00:00:00  I   0    9.8   data
>
> 4 jobs; 2 idle, 1 running, 1 held
>
>
>
> lrmsinfo-condor output
> ----------------------
> nactive         11
> nfree           10
> now             1169724422
> schedCycle      300
> {'start': '1168537036.0', 'queue': 'condor', 'state': 'pending',
>  'cpucount': '1', 'group': 'gitest', 'user': 'gitest042',
>  'maxwalltime': '999999999.0', 'qtime': '1168537013.0',
>  'jobid': 'gridgate.cs.tcd.ie#1168537013#16470.0'}
> {'start': '1169724418.0', 'queue': 'condor', 'state': 'running',
>  'cpucount': '1', 'group': 'gitest', 'user': 'gitest042',
>  'maxwalltime': '999999999.0', 'qtime': '1169724411.0',
>  'jobid': 'gridgate.cs.tcd.ie#1169724412#22766.0'}
> {'queue': 'condor', 'state': 'queued', 'cpucount': '1',
>  'group': 'gitest', 'user': 'gitest042',
>  'maxwalltime': '999999999.0', 'qtime': '1169724412.0',
>  'jobid': 'gridgate.cs.tcd.ie#1169724412#22767.0'}
> {'queue': 'condor', 'state': 'queued', 'cpucount': '1',
>  'group': 'gitest', 'user': 'gitest042',
>  'maxwalltime': '999999999.0', 'qtime': '1169724413.0',
>  'jobid': 'gridgate.cs.tcd.ie#1169724413#22768.0'}
>
>
> I am neither a Condor nor a Python expert so comments welcome. In
> particular, schedCycle, cpucount and maxwalltime are hard-coded for
> the moment.
>
> Stephen
>
>
> ------------------------------------------------------------------------
>
>
>
> #!/usr/bin/python
>
> import commands
> import time
>
> # First list is of Condor's job states, second is of the states used by the
> # lrmsinfo spec. lrmsinfo list is accessed directly, I've left them both
> # in to show the mapping.
> condorJobStates=['Unexpanded','Idle','Running','Removed','Completed','Held','Submission_err']
> lrmsinfoJobStates=['queued','queued','running','done','done','pending','done']
>
> # Get the current time
> now=int(time.time())
>
> # Get the status of pool nodes using condor_status
> condor_output=commands.getstatusoutput('condor_status -format "%s\n" State')
>
> if (condor_output[0] == 0):
>     condor_states=condor_output[1].split()
>
> freq={}
> for state in ['Owner','Claimed','Unclaimed','Matched']:
>     freq[state]=condor_states.count(state)
> total_nodes=len(condor_states)
>
> # number of nodes available to run jobs is the number of
> # nodes up and unused by their owner
> nactive=total_nodes-freq['Owner']
> print "nactive\t\t%s" %(nactive)
>
> # number of free nodes is the number of active nodes
> # minus the number of claimed nodes
> nfree=freq['Unclaimed']
> print "nfree\t\t%s" %(nfree)
>
> # now is the number of seconds since start of epoch
> print "now\t\t%s" %(now)
>
> # schedCycle is NEGOTIATOR_INTERVAL? Can it be queried using
> # condor_config_val? Hard-coded to 300 for now.
> schedCycle=300
> print "schedCycle\t%d" %(schedCycle)
>
> # Get the info we need about jobs from condor_q
>
> condorq_output=commands.getstatusoutput(
>     'condor_q -format "%.1f," JobStartDate -format "%d," JobStatus '
>     '-format "%s," Owner -format "%.1f," QDate -format "%s\n" GlobalJobId')
>
> if (condorq_output[0] == 0):
>     # Break up into individual jobs
>     job_list=condorq_output[1].split('\n')
>
>     # Process each job
>     for job in job_list:
>         job=job.split(',')
>         # If the job hasn't started yet, there will be no start time.
>         if len(job) == 5:
>             (start,state,user,qtime,jobid)=job
>         elif len(job) == 4:
>             (state,user,qtime,jobid)=job
>             start='no'
>
>         # get user's group
>         getusergroup=commands.getstatusoutput("id -Gn %s" %user)
>         if (getusergroup[0] == 0):
>             group=getusergroup[1]
>         else:
>             group="error"
>
>         # set CPU count (not sure how to get this in Condor)
>         cpucount="1"
>
>         jobDescr= "{"
>
>         state=int(state)
>         if (start != 'no'):
>             jobDescr+="'start': '%s', " %start
>         jobDescr+=("'queue': 'condor', 'state': '%s', 'cpucount': '%s', "
>                    "'group': '%s', 'user': '%s', "
>                    "'maxwalltime': '999999999.0', 'qtime': '%s', "
>                    "'jobid': '%s'}"
>                    %(lrmsinfoJobStates[state], cpucount, group,
>                      user, qtime, jobid))
>         print jobDescr
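
The per-job handling in the script above could also be sketched as a
small function that skips blank lines and keeps timestamps numeric. This
assumes the consumer prefers numbers to quoted strings, which this thread
does not confirm. parse_condor_q_line and LRMSINFO_STATES are hypothetical
names, and the two sample lines are reconstructed from the job records
shown earlier rather than captured from condor_q.

```python
# Sketch only: turns one line of the condor_q -format output used above
# into a job record with numeric timestamps. All names are hypothetical.

# Index is Condor's numeric JobStatus; value is the lrmsinfo state.
# Same mapping as lrmsinfoJobStates in the quoted script.
LRMSINFO_STATES = ['queued', 'queued', 'running', 'done',
                   'done', 'pending', 'done']

def parse_condor_q_line(line):
    """Return a dict for one job, or None for blank or malformed lines."""
    fields = line.strip().split(',')
    if len(fields) == 5:
        start, state, user, qtime, jobid = fields
    elif len(fields) == 4:
        # The job has not started, so JobStartDate printed nothing.
        start = None
        state, user, qtime, jobid = fields
    else:
        return None
    job = {
        'queue': 'condor',
        'state': LRMSINFO_STATES[int(state)],
        'user': user,
        'qtime': float(qtime),   # numeric, so "now - qtime" is well defined
        'jobid': jobid,
    }
    if start is not None:
        job['start'] = float(start)
    return job

# Sample input reconstructed from the records above (not real condor_q output).
running = parse_condor_q_line(
    '1169724418.0,2,gitest042,1169724411.0,'
    'gridgate.cs.tcd.ie#1169724412#22766.0')
idle = parse_condor_q_line(
    '1,gitest042,1169724412.0,gridgate.cs.tcd.ie#1169724412#22767.0')
print(running['state'])
print(idle['state'])
```

Returning None for anything that is not four or five fields means a
trailing empty line from split('\n') is ignored instead of reusing values
from the previous job.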
--
John R. Hover
RHIC/Atlas Computing Facility, Bldg. 510M
Physics Department
Brookhaven National Laboratory
Upton, NY 11793
tel: 631-344-5828