Hi Jeff
1) I might have forgotten to restart maui the last time I added
edguser to the ADMIN3 entry in the maui.cfg file.
The output is now:
[edguser@ce201 root]$ diagnose -g
Displaying group information...
Name      Priority  Flags   QDef    QOSList*  PartitionList  Target  Limits
see              0  [NONE]  [NONE]  [NONE]    [NONE]           0.00  [NONE]
dteam            0  [NONE]  [NONE]  [NONE]    [NONE]           0.00  [NONE]
DEFAULT          0  [NONE]  [NONE]  [NONE]    [NONE]           0.00  [NONE]
[edguser@ce201 root]$ /opt/lcg/libexec/vomaxjobs-maui
{}
[edguser@ce201 root]$
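
(As far as I can tell the empty {} just means that maui has no per-group
job limits configured, which matches the Limits column above being [NONE]
everywhere. Purely for illustration -- assuming the standard maui GROUPCFG
syntax -- a cap such as

  # illustrative only: limit dteam to 20 simultaneously running jobs
  GROUPCFG[dteam]  MAXJOB=20

in maui.cfg should then make vomaxjobs-maui report something like
{'dteam': 20} instead of the empty dictionary.)
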
2) But I still get the following in /var/log/messages every minute:
Oct 9 11:06:23 ce201 lcg-info-dynamic-scheduler: VO max jobs backend
command returned nonzero exit status
Oct 9 11:06:23 ce201 lcg-info-dynamic-scheduler: Exiting without
output, GIP will use static values
Oct 9 11:06:23 ce201 lcg-info-dynamic-scheduler: VO max jobs backend
command returned nonzero exit status
Oct 9 11:06:23 ce201 lcg-info-dynamic-scheduler: Exiting without
output, GIP will use static values
This makes the EstRespTime alternate between 0, 77777, and 99999 at
regular intervals.
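
(To rule out environment differences, I suppose the next step is to run
the whole chain exactly as the GIP does -- same user, clean login shell --
and look at what comes out on stderr. This is only a debugging sketch; the
paths are the ones quoted further down in this thread:

  # run the scheduler plugin as edguser, throw away the LDIF on stdout,
  # keep stderr, and print the exit status
  su - edguser -c '/opt/lcg/libexec/lcg-info-dynamic-scheduler -c /opt/lcg/etc/lcg-info-dynamic-scheduler.conf >/dev/null; echo "exit: $?"'

  # same for the backend command on its own, also checking that
  # 'diagnose' is on edguser's PATH in a clean login shell
  su - edguser -c 'which diagnose; /opt/lcg/libexec/vomaxjobs-maui; echo "exit: $?"'

If 'diagnose' turns out not to be on edguser's PATH when invoked this way,
that could explain why the interactive test works while the plugin run
still fails.)
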
Thanks,
Harald Gjermundrod
On Oct 9, 2006, at 10:12 AM, Steve Traylen wrote:
>
> On Oct 9, 2006, at 8:22 AM, Harald Gjermundrod wrote:
>
>> Hi Jeff
>>
>> 1) This is the output of running the commands suggested:
>>
>> [root@ce201 root]# su edguser
>> [edguser@ce201 root]$ /opt/lcg/libexec/vomaxjobs-maui
>> /opt/lcg/libexec/vomaxjobs-maui: command 'diagnose -g' exited with
>> nonzero status
>> [edguser@ce201 root]$ diagnose -g
>> ERROR: 'diagnose' failed
>> ERROR: user 'edguser' is not authorized to execute command
>> 'diagnose'
>> [edguser@ce201 root]$
>>
>
> Hi Harald
>
>> Where do I add edguser so that it will be authorized to execute
>> 'diagnose -g'?
>>
>
> You should add edguser to /var/spool/maui/maui.cfg as documented here:
> http://www.clusterresources.com/products/maui/docs/a.fparameters.shtml#rmconfigfile
>
> The user should be added as at least ADMIN3, which has read access to
> maui's runtime state.
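>
> Purely as an illustration (the exact admin levels are site policy; the
> account list below is just the one quoted later in this thread), the
> relevant lines in /var/spool/maui/maui.cfg would look something like:
>
>   ADMIN1  root
>   ADMIN3  edguser edginfo rgma
>
> and note that maui only picks this up after a restart, e.g.
>
>   service maui restart
>
> (assuming the usual init script; otherwise restart the maui daemon
> however it is managed on your CE).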
>
>> 2) The versions of torque and maui:
>>
>> Maui version 3.2.6p11
>> Torque version torque-1.0.1p6
>>
>> Thanks,
>> Harald Gjermundrod
>>
>>
>> On Oct 6, 2006, at 9:57 PM, Jeff Templon wrote:
>>
>>> Hi,
>>>
>>> try as edguser:
>>>
>>> /opt/lcg/libexec/vomaxjobs-maui
>>>
>>> as well as
>>>
>>> diagnose -g
>>>
>>> also what version of torque and maui are you running?
>>>
>>> JT
>>>
>>> No Name Available wrote:
>>>> Hi Jeff
>>>> I have the same problem as Serge on my gLite CE, but the problem
>>>> still persists even when I have the following in my maui.cfg:
>>>> ADMIN3 edginfo rgma edguser
>>>> So this does not solve the problem. Thanks,
>>>> Harald
>>>> Quoting Jeff Templon <[log in to unmask]>:
>>>>> Hi
>>>>>
>>>>> yep. 'edguser' needs to be in maui.cfg as ADMIN3
>>>>>
>>>>> JT
>>>>>
>>>>>
>>>>> Serge Vrijaldenhoven wrote:
>>>>>> Hi Jeff,
>>>>>>
>>>>>> update:
>>>>>> edguser# /opt/lcg/var/gip/plugin/lcg-info-dynamic-scheduler-wrapper
>>>>>> 2006-10-06 16:07:07 lcg-info-dynamic-scheduler: VO max jobs
>>>>>> backend command returned nonzero exit status
>>>>>> 2006-10-06 16:07:07 lcg-info-dynamic-scheduler: Exiting
>>>>>> without output, GIP will use static values
>>>>>>
>>>>>> so probably some access problem?
>>>>>> breaks on:
>>>>>> cmd: /opt/lcg/libexec/vomaxjobs-maui <CEnode>
>>>>>> and that breaks on:
>>>>>> /opt/lcg/libexec/vomaxjobs-maui: command 'diagnose -g' exited
>>>>>> with nonzero status
>>>>>>
>>>>>> root# diagnose -g <CEnode>
>>>>>> Displaying group information...
>>>>>> Name      Priority  Flags   QDef    QOSList*  PartitionList  Target  Limits
>>>>>> DEFAULT          0  [NONE]  [NONE]  [NONE]    [NONE]           0.00  [NONE]
>>>>>>
>>>>>> edguser# diagnose -g <CEnode>
>>>>>> ERROR: 'diagnose' failed
>>>>>> ERROR: user 'edguser' is not authorized to execute command
>>>>>> 'diagnose'
>>>>>>
>>>>>> however: -rwxr-xr-x 30 root root 997454 Dec 5 2004 /usr/bin/diagnose*
>>>>>> Guess this is for after the w/end.
>>>>>>
>>>>>> Grtz,
>>>>>> Serge
>>>>>>
>>>>>> LHC Computer Grid - Rollout <[log in to unmask]>
>>>>>> wrote on 06-10-2006 15:31:50:
>>>>>>
>>>>>> >
>>>>>> > LHC Computer Grid - Rollout <LCG-[log in to unmask]> wrote on 06-10-2006 14:34:40:
>>>>>> >
>>>>>> > > Hi Serge,
>>>>>> > >
>>>>>> > > very strange .. can you do the following for me:
>>>>>> > > take a look in syslog and see if you have any messages
>>>>>> > > from the dynamic-scheduler plugin?
>>>>>> >
>>>>>> > Oct 6 14:51:45 deimos lcg-info-dynamic-scheduler: VO max jobs
>>>>>> > backend command returned nonzero exit status
>>>>>> > Oct 6 14:51:45 deimos lcg-info-dynamic-scheduler: Exiting without
>>>>>> > output, GIP will use static values
>>>>>> > Oct 6 14:52:11 deimos lcg-info-dynamic-scheduler: VO max jobs
>>>>>> > backend command returned nonzero exit status
>>>>>> > Oct 6 14:52:11 deimos lcg-info-dynamic-scheduler: Exiting without
>>>>>> > output, GIP will use static values
>>>>>> > Oct 6 14:52:11 deimos lcg-info-dynamic-scheduler: VO max jobs
>>>>>> > backend command returned nonzero exit status
>>>>>> > Oct 6 14:52:11 deimos lcg-info-dynamic-scheduler: Exiting without
>>>>>> > output, GIP will use static values
>>>>>> > Oct 6 14:52:45 deimos lcg-info-dynamic-scheduler: VO max jobs
>>>>>> > backend command returned nonzero exit status
>>>>>> > Oct 6 14:52:45 deimos lcg-info-dynamic-scheduler: Exiting without
>>>>>> > output, GIP will use static values
>>>>>> >
>>>>>> > > Also run it by
>>>>>> > > hand and see what you get, make sure you run it as the
>>>>>> > > same user as the GIP user. Finally, try
>>>>>> > uhm... how do I find out the GIP user?
>>>>>> >
>>>>>> > root# /opt/lcg/var/gip/plugin/lcg-info-dynamic-scheduler-wrapper
>>>>>> > (contains: /opt/lcg/libexec/lcg-info-dynamic-scheduler -c
>>>>>> > /opt/lcg/etc/lcg-info-dynamic-scheduler.conf)
>>>>>> > In the conf file I see:
>>>>>> > +- - - - - - - - -
>>>>>> > | [Main]
>>>>>> > | static_ldif_file: /opt/lcg/var/gip/ldif/lcg-info-static-ce.ldif
>>>>>> > | vomap :
>>>>>> > | phicos:phicos
>>>>>> > | dteam:dteam
>>>>>> > | module_search_path : ../lrms:../ett
>>>>>> > | [LRMS]
>>>>>> > | lrms_backend_cmd: /opt/lcg/libexec/lrmsinfo-pbs
>>>>>> > | [Scheduler]
>>>>>> > | cycle_time : 0
>>>>>> > | vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui <CEnode>
>>>>>> > +- - - - - - - - -
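>>>>>> >
>>>>>> > (Just to spell out how these hang together -- this is my
>>>>>> > understanding, not gospel: the scheduler plugin runs both
>>>>>> > lrms_backend_cmd and vo_max_jobs_cmd, and if either of them exits
>>>>>> > nonzero it gives up, which is the "Exiting without output, GIP
>>>>>> > will use static values" message. So a quick test is to run both
>>>>>> > commands exactly as configured, as the GIP user (edguser on a
>>>>>> > standard install, I believe):
>>>>>> >
>>>>>> >   su - edguser -c '/opt/lcg/libexec/lrmsinfo-pbs >/dev/null; echo "lrms exit: $?"'
>>>>>> >   su - edguser -c '/opt/lcg/libexec/vomaxjobs-maui <CEnode> >/dev/null; echo "vomaxjobs exit: $?"'
>>>>>> >
>>>>>> > Both have to exit 0 before the dynamic values can be published.)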
>>>>>> >
>>>>>> > dn: GlueVOViewLocalID=phicos,GlueCEUniqueID=<CEnode>:2119/blah-pbs-phicos,mds-vo-name=local,o=grid
>>>>>> > GlueVOViewLocalID: phicos
>>>>>> > GlueCEAccessControlBaseRule: VO:phicos
>>>>>> > GlueCEStateRunningJobs: 0
>>>>>> > GlueCEStateWaitingJobs: 0
>>>>>> > GlueCEStateTotalJobs: 0
>>>>>> > GlueCEStateFreeJobSlots: 4
>>>>>> > GlueCEStateEstimatedResponseTime: 0
>>>>>> > GlueCEStateWorstResponseTime: 0
>>>>>> >
>>>>>> > dn: GlueVOViewLocalID=dteam,GlueCEUniqueID=<CEnode>:2119/blah-pbs-dteam,mds-vo-name=local,o=grid
>>>>>> > GlueVOViewLocalID: dteam
>>>>>> > GlueCEAccessControlBaseRule: VO:dteam
>>>>>> > GlueCEStateRunningJobs: 0
>>>>>> > GlueCEStateWaitingJobs: 0
>>>>>> > GlueCEStateTotalJobs: 0
>>>>>> > GlueCEStateFreeJobSlots: 4
>>>>>> > GlueCEStateEstimatedResponseTime: 0
>>>>>> > GlueCEStateWorstResponseTime: 0
>>>>>> >
>>>>>> > dn: GlueCEUniqueID=<CEnode>:2119/blah-pbs-phicos,mds-vo-name=local,o=grid
>>>>>> > GlueCEStateRunningJobs: 0
>>>>>> > GlueCEStateWaitingJobs: 0
>>>>>> > GlueCEStateTotalJobs: 0
>>>>>> > GlueCEStateEstimatedResponseTime: 0
>>>>>> > GlueCEStateWorstResponseTime: 0
>>>>>> >
>>>>>> > dn: GlueCEUniqueID=<CEnode>:2119/blah-pbs-dteam,mds-vo-name=local,o=grid
>>>>>> > GlueCEStateRunningJobs: 0
>>>>>> > GlueCEStateWaitingJobs: 0
>>>>>> > GlueCEStateTotalJobs: 0
>>>>>> > GlueCEStateEstimatedResponseTime: 0
>>>>>> > GlueCEStateWorstResponseTime: 0
>>>>>> >
>>>>>> > Does this relate to: http://savannah.cern.ch/bugs/?func=detailitem&item_id=13952 ?
>>>>>> >
>>>>>> > >
>>>>>> > > cd /opt/lcg
>>>>>> > > find lib libexec | xargs grep 999
>>>>>> >
>>>>>> > # find lib libexec | xargs grep 999
>>>>>> > libexec/lcg-info-dynamic-lsf: $Time=999999;
>>>>>> > libexec/lcg-info-dynamic-pbs: $MaxRunningJobs = 9999999;
>>>>>> > > as far as I can tell, your reported EstimatedResponseTime is *not*
>>>>>> > > coming from my info provider, unless you have a really ancient
>>>>>> > > version. It may be that the gLite CE default info is different than
>>>>>> > > for the LCG CE. You can try this:
>>>>>> > >
>>>>>> > > cd /opt/lcg
>>>>>> > > find var | xargs grep 999
>>>>>> >
>>>>>> > var/gip/ldif/lcg-info-static-ce.ldif:GlueCEStateEstimatedResponseTime: 999999
>>>>>> > var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxCPUTime: 999999
>>>>>> > var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxRunningJobs: 999999
>>>>>> > var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxTotalJobs: 999999
>>>>>> > var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxWallClockTime: 999999
>>>>>> > var/gip/ldif/lcg-info-static-ce.ldif:GlueCEStateEstimatedResponseTime: 999999
>>>>>> > var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxCPUTime: 999999
>>>>>> > var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxRunningJobs: 999999
>>>>>> > var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxTotalJobs: 999999
>>>>>> > var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxWallClockTime: 999999
>>>>>> >
>>>>>> > > and see what you get.
>>>>>> > >
>>>>>> > > JT
>>>>>> >
>>>>>> > In light of "VO max jobs backend command returned nonzero exit status":
>>>>>> > # /opt/lcg/libexec/vomaxjobs-maui <CEnode>
>>>>>> > {}
>>>>>> >
>>>>>> > Grtz,
>>>>>> > Serge
>>>>>> >
>>>>>> >
>>>>>> ---------------------------------------------------------------
>>>>>> > >
>>>>>> > > Serge Vrijaldenhoven wrote:
>>>>>> > > >
>>>>>> > > > Hi Jeff,
>>>>>> > > >
>>>>>> > > > I saw that the number of free CPUs had dropped to 0; this had
>>>>>> > > > something to do with firewall settings (only tcp was allowed
>>>>>> > > > from the WN to the CE). I fixed this and then got the
>>>>>> > > > following results:
>>>>>> > > >
>>>>>> > > > # /opt/lcg/libexec/lrmsinfo-pbs
>>>>>> > > > nactive 4
>>>>>> > > > nfree 4
>>>>>> > > > now 1160137743
>>>>>> > > > schedCycle 26
>>>>>> > > > {}
>>>>>> > > >
>>>>>> > > > (for "more output" from commands, see at the bottom)
>>>>>> > > > However.... I still get after job submission:
>>>>>> > > >
>>>>>> > > > # glite-job-status <jobid>
>>>>>> > > > BOOKKEEPING INFORMATION:
>>>>>> > > > Current Status: Waiting
>>>>>> > > > Status Reason: BrokerHelper: Problems during rank evaluation
>>>>>> > > > (e.g. GRISes down, wrong JDL rank expression, etc.)
>>>>>> > > >
>>>>>> > > > #lcg-info --list-ce --attrs EstRespTime
>>>>>> > > > - EstRespTime 999999    <- - - hmmm... at least we're getting to the top?
>>>>>> > > >
>>>>>> > > > PS: notes noted. Thanks for replying even on your day off...
>>>>>> > > >
>>>>>> > > > Grtz,
>>>>>> > > > Serge
>>>>>> > > >
>>>>>> > > > More output
>>>>>> > > > ===========
>>>>>> > > > #pbsnodes -a
>>>>>> > > > <node>
>>>>>> > > > state = free
>>>>>> > > > np = 2
>>>>>> > > > properties = glite
>>>>>> > > > ntype = cluster
>>>>>> > > > status = arch=linux,uname=Linux <node> 2.4.21-47.ELsmp #1 SMP
>>>>>> > > >   Thu Jul 20 09:54:04 CDT 2006 i686,sessions=? 0,nsessions=? 0,
>>>>>> > > >   nusers=0,idletime=3252,totmem=4194303kb,availmem=4194303kb,
>>>>>> > > >   physmem=8195020kb,ncpus=2,loadave=0.12,rectime=1160134328
>>>>>> > > >
>>>>>> > > >
>>>>>> > > > <node>
>>>>>> > > > state = free
>>>>>> > > > np = 2
>>>>>> > > > properties = glite
>>>>>> > > > ntype = cluster
>>>>>> > > > status = arch=linux,uname=Linux <node> 2.4.21-47.ELsmp #1 SMP
>>>>>> > > >   Thu Jul 20 09:54:04 CDT 2006 i686,sessions=? 0,nsessions=? 0,
>>>>>> > > >   nusers=0,idletime=246260,totmem=4194303kb,availmem=4194303kb,
>>>>>> > > >   physmem=8195020kb,ncpus=2,loadave=0.00,rectime=1160134336
>>>>>> > > >
>>>>>> > > >
>>>>>> > > > #qmgr
>>>>>> > > > Qmgr: list server
>>>>>> > > > Server <node>
>>>>>> > > > server_state = Active
>>>>>> > > > scheduling = True
>>>>>> > > > total_jobs = 0
>>>>>> > > > state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
>>>>>> > > > acl_host_enable = False
>>>>>> > > > managers = root@<node>
>>>>>> > > > default_queue = dteam
>>>>>> > > > log_events = 511
>>>>>> > > > mail_from = adm
>>>>>> > > > query_other_jobs = True
>>>>>> > > > scheduler_iteration = 600
>>>>>> > > > node_ping_rate = 300
>>>>>> > > > node_check_rate = 600
>>>>>> > > > tcp_timeout = 6
>>>>>> > > > default_node = glite
>>>>>> > > > node_pack = False
>>>>>> > > > pbs_version = torque_1.0.1p5
>>>>>> > > > Qmgr: list queue
>>>>>> > > > No Active Queues, nothing done.
>>>>>> > > > Qmgr: list node
>>>>>> > > > No Active Nodes, nothing done.
>>>>>> > > > Qmgr:
>>>>>> > > > Qmgr: list queue phicos
>>>>>> > > > Queue phicos
>>>>>> > > > queue_type = Execution
>>>>>> > > > total_jobs = 0
>>>>>> > > > state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
>>>>>> > > > resources_max.cput = 48:00:00
>>>>>> > > > resources_max.walltime = 96:00:00
>>>>>> > > > acl_group_enable = True
>>>>>> > > > acl_groups = +phicos
>>>>>> > > > enabled = True
>>>>>> > > > started = True
>>>>>> > > > Qmgr: list node <node1>
>>>>>> > > > Node <node1>
>>>>>> > > > state = free
>>>>>> > > > np = 2
>>>>>> > > > properties = glite
>>>>>> > > > ntype = cluster
>>>>>> > > > status = arch=linux,uname=Linux <node1> 2.4.21-47.ELsmp #1 SMP
>>>>>> > > >   Thu Jul 20 09:54:04 CDT 2006 i686,sessions=? 0,nsessions=? 0,
>>>>>> > > >   nusers=0,idletime=3792,totmem=4194303kb,availmem=4194303kb,
>>>>>> > > >   physmem=8195020kb,ncpus=2,loadave=0.00,rectime=1160134871
>>>>>> > > > Qmgr: list node <node2>
>>>>>> > > > Node <node2>
>>>>>> > > > state = free
>>>>>> > > > np = 2
>>>>>> > > > properties = glite
>>>>>> > > > ntype = cluster
>>>>>> > > > status = arch=linux,uname=Linux <node2> 2.4.21-47.ELsmp #1 SMP
>>>>>> > > >   Thu Jul 20 09:54:04 CDT 2006 i686,sessions=? 0,nsessions=? 0,
>>>>>> > > >   nusers=0,idletime=246860,totmem=4194303kb,availmem=4194303kb,
>>>>>> > > >   physmem=8195020kb,ncpus=2,loadave=0.00,rectime=1160134937
>>>>>> > > >
>>>>>> > > > #qstat -q
>>>>>> > > > server: <node>
>>>>>> > > > Queue            Memory CPU Time Walltime Node  Run Que Lm  State
>>>>>> > > > ---------------- ------ -------- -------- ----  --- --- --  -----
>>>>>> > > > phicos             --   48:00:00 96:00:00  --     0   0 --  E R
>>>>>> > > >
>>>>>> > > > # qstat -Q phicos
>>>>>> > > > Queue            Max Tot Ena Str Que Run Hld Wat Trn Ext Type
>>>>>> > > > ---------------- --- --- --- --- --- --- --- --- --- --- ----------
>>>>>> > > > phicos             0   0 yes yes   0   0   0   0   0   0 Execution
>>>>>> > > >
>>>>>> > > > #qstat -f
>>>>>> > > > <nothing>
>>>>>> > > >
>>>>>> > > >
>>>>>> > > > LHC Computer Grid - Rollout <LCG-[log in to unmask]> wrote on 06-10-2006 11:28:32:
>>>>>> > > >
>>>>>> > > > > Hi
>>>>>> > > > >
>>>>>> > > > > the EstRespTime 77777 is usually indicative of an inconsistency
>>>>>> > > > > in the batch system. One problem might be that the user under
>>>>>> > > > > which the info provider is running can do the equivalent of
>>>>>> > > > > 'pbsnodes' but not 'qstat -f'. Another is if, for some reason,
>>>>>> > > > > the jobs on the WNs are in a funny state so that they appear to
>>>>>> > > > > be running but you can't see their wall or cpu times. A final
>>>>>> > > > > one is if for example the system can't detect that you have any
>>>>>> > > > > WNs at all.
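>>>>>> > > > >
>>>>>> > > > > A quick way to check the first of those (sketch only; run it
>>>>>> > > > > as whichever account the info provider uses, edguser on a
>>>>>> > > > > standard install):
>>>>>> > > > >
>>>>>> > > > >   su - edguser -c 'pbsnodes -a >/dev/null; echo "pbsnodes exit: $?"; qstat -f >/dev/null; echo "qstat -f exit: $?"'
>>>>>> > > > >
>>>>>> > > > > Both should print 0; a nonzero status from qstat -f would
>>>>>> > > > > point at torque's query_other_jobs / permission settings.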
>>>>>> > > > >
>>>>>> > > > > Can you run
>>>>>> > > > >
>>>>>> > > > > /opt/lcg/libexec/lrmsinfo-pbs
>>>>>> > > > >
>>>>>> > > > > and attach the output? That would help. Two notes:
>>>>>> > > > >
>>>>>> > > > > 1) Louis Poncet showed me yesterday a gLite-CE site that had
>>>>>> > > > > the same problem. It may be a problem specific to the gLite CE.
>>>>>> > > > >
>>>>>> > > > > 2) I have the day off today, so it might be a while before I
>>>>>> > > > > answer :)
>>>>>> > > > >
>>>>>> > > > > JT
>>>>>> > > > >
>>>>>> > > > >
>>>>>> > > > > Serge Vrijaldenhoven wrote:
>>>>>> > > > > >
>>>>>> > > > > > Hi,
>>>>>> > > > > >
>>>>>> > > > > > we want to test our fresh gLite site.
>>>>>> > > > > > Is there already something similar to
>>>>>> > > > > > http://grid-deployment.web.cern.ch/grid-deployment/documentation/LCG2-Site-Testing/ ?
>>>>>> > > > > >
>>>>>> > > > > > Already found the first problem:
>>>>>> > > > > > #glite-job-status <jobid>
>>>>>> > > > > > Current Status: Waiting
>>>>>> > > > > > Status Reason: BrokerHelper: Problems during rank
>>>>>> > > > > > evaluation (e.g. GRISes down, wrong JDL rank expression, etc.)
>>>>>> > > > > >
>>>>>> > > > > > #lcg-info --list-ce --attrs EstRespTime
>>>>>> > > > > > - EstRespTime 77777
>>>>>> > > > > >
>>>>>> > > > > > So it seems that information is not published correctly?
>>>>>> > > > > >
>>>>>> > > > > > Grtz,
>>>>>> > > > > > Serge
>>>>>
>>>
>>
>
> --
> Steve Traylen
> [log in to unmask]
> CERN, IT-GD-OPS.
>
>
>