Hi,
I understand that for many values there are signal flags that need to
be set as the default (e.g., 4444 for WaitingJobs). I just wanted to
double-check that the other fields, e.g. "GlueCEStateTotalJobs: 0", are
supposed to have a 0 default.
Further testing on lrmsinfo-condor seems to show that the problem is
that it throws an error when there are no jobs in the queue. This output
confuses lcg-info-dynamic-scheduler, causing the int/str error.
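For reference, here is a minimal sketch of the kind of failure involved (an assumption on my part: that lcg-info-dynamic-scheduler converts 'qtime' to a number before using it, so a stray error message in the plugin's stdout breaks the conversion). The helper name to_qtime is made up for illustration:

```python
# Sketch (assumption): the scheduler needs qtime as a number; a line of
# error text leaking into the plugin's stdout cannot be converted.
def to_qtime(field):
    try:
        return int(float(field))   # accepts values like '1170000000.0'
    except ValueError:
        return None                # e.g. an error message instead of a number

print(to_qtime('1170000000.0'))
print(to_qtime('Error: no jobs in queue'))
```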
Stephen, I've attached an edited version with my changes. The necessary
change in logic is basically just that the condor_q command can exit with
a non-error status yet still produce no output.
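That guard can be illustrated like this (a sketch in modern Python with subprocess; the attached script uses the older commands.getstatusoutput, but the logic is the same, and 'true' merely stands in for condor_q on an empty queue):

```python
import subprocess

# 'true' stands in for condor_q with no jobs queued: it exits 0 but
# writes nothing to stdout, just like the case that tripped the script.
result = subprocess.run(['true'], capture_output=True, text=True)
print(result.returncode)    # 0: the command succeeded...
print(repr(result.stdout))  # '': ...but there is nothing to parse
# hence the script must test both: status == 0 and len(output) > 0
```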
Jeff, I just checked the GIP permissions and the groups and they look
just like yours. I'll investigate further.
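For the record, this is the shape of the permissions check (a sketch on a temp directory; the real paths are the tmp/ and ldif/ directories under /opt/lcg/var/gip, which need to be group-writable by infosys, mode 0775 as in your listing):

```python
import os
import stat
import tempfile

# Stand-in for /opt/lcg/var/gip/tmp; 0o775 matches drwxrwxr-x in the listing.
d = tempfile.mkdtemp()
os.chmod(d, 0o775)
mode = os.stat(d).st_mode
group_writable = bool(mode & stat.S_IWGRP)
print(group_writable)
os.rmdir(d)
```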
Thanks,
--john
Jeff Templon wrote:
> Hi,
>
> First about the static-file-CE.ldif : it should not have zeros or blanks
> for the fields. It should have for WaitingJobs the value 4444, and for
> EstimatedResponseTime and WorstResponseTime it should have the value
>
> 2146060842
>
> (currently YAIM configures 2146660842 but this is a bug I still need to
> submit).
>
> Secondly, the python error you have, this sounds like the backend plugin
>
> lrmsinfo-condor
>
> (or whatever Stephen called it) is sometimes returning a string instead
> of an integer for 'qtime' for one of the jobs. The best thing to do to
> debug this, is, if you see this stack dump, run the condor backend
> plugin (Stephen's script), dump the stdout from that into a text file
> and post that. That's the most likely spot for the error.
>
> As far as the last point, about never seeing the real values, it sounds
> like a GIP problem. The standard thing to check is permissions in the
> GIP var directories.
>
>> tbn20:gip> cd /opt/lcg/var/gip
>> tbn20:gip> ls -ltr
>> total 72
>> drwxr-xr-x 2 root root 4096 Nov 13 10:16 plugin
>> drwxr-xr-x 2 root root 4096 Jan 9 10:09 provider
>> -rw-r--r-- 1 root root 552 Jan 30 16:45
>> lcg-info-static-site.conf
>> -rw-r--r-- 1 root root 5232 Jan 30 16:45
>> lcg-info-static-cluster.conf
>> -rw-r--r-- 1 root root 32239 Jan 30 16:45
>> lcg-info-static-ce.conf
>> -rw-r--r-- 1 root root 10538 Jan 30 16:46
>> lcg-info-static-cesebind.conf
>> drwxrwxr-x 2 edguser infosys 4096 Feb 1 10:07 tmp
>> drwxrwxr-x 2 edguser infosys 4096 Feb 1 10:07 ldif
>> tbn20:gip> groups rgma
>> rgma : rgma infosys
>> tbn20:gip> groups edginfo
>> edginfo : edginfo infosys
--
John R. Hover
RHIC/Atlas Computing Facility, Bldg. 510M
Physics Department
Brookhaven National Laboratory
Upton, NY 11793
email: [log in to unmask]
tel: 631-344-5828
#!/usr/bin/python
#
# Author: Stephen Childs <[log in to unmask]>
# Edits:
# John Hover <[log in to unmask]>
#
import commands
import time
# The first list holds Condor's job states, the second the states used by
# the lrmsinfo spec. Only the lrmsinfo list is accessed directly; I've left
# both in to show the mapping.
condorJobStates=['Unexpanded','Idle','Running','Removed','Completed','Held','Submission_err']
lrmsinfoJobStates=['queued','queued','running','done','done','pending','done']
# Get the current time
now=int(time.time())
# Get the status of pool nodes using condor_status
(status,output)=commands.getstatusoutput('condor_status -format "%s\n" State')
if status == 0 and len(output) > 0 :
    condor_states=output.split('\n')
    freq={}
    for state in ['Owner','Claimed','Unclaimed','Matched']:
        freq[state]=condor_states.count(state)
    total_nodes=len(condor_states)
    # number of nodes available to run jobs is the number of
    # nodes up and unused by their owner
    nactive=total_nodes-freq['Owner']
    print "nactive\t\t%s" %(nactive)
    # number of free nodes is the number of unclaimed nodes
    nfree=freq['Unclaimed']
    print "nfree\t\t%s" %(nfree)

# now is the number of seconds since start of epoch
print "now\t\t%s" %(now)

# schedCycle is NEGOTIATOR_INTERVAL? Can it be queried using
# condor_config_val? Hard-coded to 300 for now.
# Note: condor_config_val returns non-zero status if the variable is not
# set, and getstatusoutput returns non-zero if the command is not in the path.
schedCycle=300
(status,output)=commands.getstatusoutput('condor_config_val NEGOTIATOR_INTERVAL')
if status == 0 :
    schedCycle=int(output)
print "schedCycle\t%d" %(schedCycle)

# Get the info we need about jobs from condor_q
(status,output)=commands.getstatusoutput('condor_q -format "%.1f," JobStartDate -format "%d," JobStatus -format "%s," Owner -format "%.1f," QDate -format "%s\n" GlobalJobId')
if status == 0 and len(output) > 0 :
    # Break up into individual jobs
    job_list=output.split('\n')
    # Process each job
    for job in job_list:
        job=job.split(',')
        # If the job hasn't started yet, there will be no start time.
        if len(job) == 5:
            (start,state,user,qtime,jobid)=job
        elif len(job) == 4:
            (state,user,qtime,jobid)=job
            start='no'
        else:
            # unexpected field count; skip this line
            continue
        # get user's group
        getusergroup=commands.getstatusoutput("id -gn %s" %user)
        if getusergroup[0] == 0:
            group=getusergroup[1]
        else:
            group="error"
        # set CPU count (not sure how to get this in Condor)
        cpucount="1"
        jobDescr="{"
        state=int(state)
        if start != 'no':
            jobDescr+="'start': '%s', " %start
        jobDescr+="'queue': 'condor', 'state': '%s', 'cpucount': '%s', 'group': '%s', 'user': '%s', 'maxwalltime': '240.0', 'qtime': '%s', 'jobid': '%s'}" %(lrmsinfoJobStates[state], cpucount, group, user, qtime, jobid)
        print jobDescr