Hi
I suspect that you do not have the GIP user (usually edginfo, edguser,
or rgma) listed with ADMIN3 rights in the maui.cfg file. You can find
out what the GIP user is by running "top" and seeing which user the
information-provider processes, like lcg-info-dynamic-scheduler, run
as.
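For example, "ps -ef | grep lcg-info-dynamic" will show who owns those
processes. Once you know the user, a minimal sketch of the fix (with
'edginfo' standing in for your actual GIP user) is to add a line like

  ADMIN3 edginfo    # example user; list your GIP user here

to maui.cfg and restart the maui daemon. As far as I remember,
vomaxjobs-maui just parses the output of Maui's 'diagnose -g', so you
can test the rights with something like

  su - edginfo -s /bin/sh -c 'diagnose -g'

which should stop failing once the user has ADMIN3.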
JT
Serge Vrijaldenhoven wrote:
>
> LHC Computer Grid - Rollout <[log in to unmask]> wrote on
> 06-10-2006 14:34:40:
>
> > Hi Serge,
> >
> > very strange .. can you do the following for me:
> > take a look in syslog and see if you have any messages
> > from the dynamic-scheduler plugin?
>
> Oct 6 14:51:45 deimos lcg-info-dynamic-scheduler: VO max jobs backend
> command returned nonzero exit status
> Oct 6 14:51:45 deimos lcg-info-dynamic-scheduler: Exiting without
> output, GIP will use static values
> Oct 6 14:52:11 deimos lcg-info-dynamic-scheduler: VO max jobs backend
> command returned nonzero exit status
> Oct 6 14:52:11 deimos lcg-info-dynamic-scheduler: Exiting without
> output, GIP will use static values
> Oct 6 14:52:11 deimos lcg-info-dynamic-scheduler: VO max jobs backend
> command returned nonzero exit status
> Oct 6 14:52:11 deimos lcg-info-dynamic-scheduler: Exiting without
> output, GIP will use static values
> Oct 6 14:52:45 deimos lcg-info-dynamic-scheduler: VO max jobs backend
> command returned nonzero exit status
> Oct 6 14:52:45 deimos lcg-info-dynamic-scheduler: Exiting without
> output, GIP will use static values
>
> > Also run it by
> > hand and see what you get, make sure you run it as the
> > same user as the GIP user. Finally, try
> uhm... how do I find out the GIP user?
>
> root# /opt/lcg/var/gip/plugin/lcg-info-dynamic-scheduler-wrapper
> (contains: /opt/lcg/libexec/lcg-info-dynamic-scheduler -c
> /opt/lcg/etc/lcg-info-dynamic-scheduler.conf)
> In the conf file I see:
> +- - - - - - - - -
> | [Main]
> | static_ldif_file: /opt/lcg/var/gip/ldif/lcg-info-static-ce.ldif
> | vomap :
> | phicos:phicos
> | dteam:dteam
> | module_search_path : ../lrms:../ett
> | [LRMS]
> | lrms_backend_cmd: /opt/lcg/libexec/lrmsinfo-pbs
> | [Scheduler]
> | cycle_time : 0
> | vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui <CEnode>
> +- - - - - - - - -
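> (So the "VO max jobs backend command" from the syslog messages above
> is presumably this vomaxjobs-maui call.)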
>
> dn:
> GlueVOViewLocalID=phicos,GlueCEUniqueID=<CEnode>:2119/blah-pbs-phicos,mds-vo-name=local,o=grid
>
> GlueVOViewLocalID: phicos
> GlueCEAccessControlBaseRule: VO:phicos
> GlueCEStateRunningJobs: 0
> GlueCEStateWaitingJobs: 0
> GlueCEStateTotalJobs: 0
> GlueCEStateFreeJobSlots: 4
> GlueCEStateEstimatedResponseTime: 0
> GlueCEStateWorstResponseTime: 0
>
> dn:
> GlueVOViewLocalID=dteam,GlueCEUniqueID=<CEnode>:2119/blah-pbs-dteam,mds-vo-name=local,o=grid
>
> GlueVOViewLocalID: dteam
> GlueCEAccessControlBaseRule: VO:dteam
> GlueCEStateRunningJobs: 0
> GlueCEStateWaitingJobs: 0
> GlueCEStateTotalJobs: 0
> GlueCEStateFreeJobSlots: 4
> GlueCEStateEstimatedResponseTime: 0
> GlueCEStateWorstResponseTime: 0
>
> dn: GlueCEUniqueID=<CEnode>:2119/blah-pbs-phicos,mds-vo-name=local,o=grid
> GlueCEStateRunningJobs: 0
> GlueCEStateWaitingJobs: 0
> GlueCEStateTotalJobs: 0
> GlueCEStateEstimatedResponseTime: 0
> GlueCEStateWorstResponseTime: 0
>
> dn: GlueCEUniqueID=<CEnode>:2119/blah-pbs-dteam,mds-vo-name=local,o=grid
> GlueCEStateRunningJobs: 0
> GlueCEStateWaitingJobs: 0
> GlueCEStateTotalJobs: 0
> GlueCEStateEstimatedResponseTime: 0
> GlueCEStateWorstResponseTime: 0
>
> Does this relate to:
> http://savannah.cern.ch/bugs/?func=detailitem&item_id=13952 ?
>
> >
> > cd /opt/lcg
> > find lib libexec | xargs grep 999
>
> # find lib libexec | xargs grep 999
> libexec/lcg-info-dynamic-lsf: $Time=999999;
> libexec/lcg-info-dynamic-pbs: $MaxRunningJobs = 9999999;
>
> > as far as I can tell, your reported EstimatedResponseTime is *not*
> > coming from my info provider, unless you have a really ancient version.
> > It may be that the gLite CE default info is different than for the
> > LCG CE. You can try this:
> >
> > cd /opt/lcg
> > find var | xargs grep 999
>
> var/gip/ldif/lcg-info-static-ce.ldif:GlueCEStateEstimatedResponseTime:
> 999999
> var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxCPUTime: 999999
> var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxRunningJobs: 999999
> var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxTotalJobs: 999999
> var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxWallClockTime: 999999
> var/gip/ldif/lcg-info-static-ce.ldif:GlueCEStateEstimatedResponseTime:
> 999999
> var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxCPUTime: 999999
> var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxRunningJobs: 999999
> var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxTotalJobs: 999999
> var/gip/ldif/lcg-info-static-ce.ldif:GlueCEPolicyMaxWallClockTime: 999999
>
> > and see what you get.
> >
> > JT
>
> In light of "VO max jobs backend command returned nonzero exit status":
> # /opt/lcg/libexec/vomaxjobs-maui <CEnode>
> {}
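> (If I understand the plugin interface correctly, this command should
> print a Python-style dictionary mapping each VO/group to its maximum
> number of running jobs, e.g. something like {'phicos': 10, 'dteam': 5};
> those numbers are purely illustrative. As root it at least exits
> cleanly, so the nonzero exit status presumably only occurs when it is
> run as the GIP user.)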
>
> Grtz,
> Serge
>
> ---------------------------------------------------------------
> >
> > Serge Vrijaldenhoven wrote:
> > >
> > > Hi Jeff,
> > >
> > > I saw that the number of free CPUs had dropped to 0; this had
> > > something to do with firewall settings (only TCP was allowed from
> > > the WN to the CE). I fixed this and then got the following results:
> > >
> > > # /opt/lcg/libexec/lrmsinfo-pbs
> > > nactive 4
> > > nfree 4
> > > now 1160137743
> > > schedCycle 26
> > > {}
> > >
> > > (for "more output" from commands, see at the bottom)
> > > However.... I still get after job submission:
> > >
> > > # glite-job-status <jobid>
> > > BOOKKEEPING INFORMATION:
> > > Current Status: Waiting
> > > Status Reason: BrokerHelper: Problems during rank evaluation (e.g.
> > > GRISes down, wrong JDL rank expression, etc.)
> > >
> > > #lcg-info --list-ce --attrs EstRespTime
> > > - EstRespTime 999999 <- - - hmmm... at least we're getting
> > > to the top?
> > >
> > > PS: notes noted. Thanks for replying even on your day off...
> > >
> > > Grtz,
> > > Serge
> > >
> > > More output
> > > ===========
> > > #pbsnodes -a
> > > <node>
> > > state = free
> > > np = 2
> > > properties = glite
> > > ntype = cluster
> > > status = arch=linux,uname=Linux <node> 2.4.21-47.ELsmp #1 SMP Thu
> > > Jul 20 09:54:04 CDT 2006 i686,sessions=? 0,nsessions=?
> > > 0,nusers=0,idletime=3252,totmem=4194303kb,availmem=4194303kb,
> > > physmem=8195020kb,ncpus=2,loadave=0.12,rectime=1160134328
> > >
> > >
> > > <node>
> > > state = free
> > > np = 2
> > > properties = glite
> > > ntype = cluster
> > > status = arch=linux,uname=Linux <node> 2.4.21-47.ELsmp #1 SMP Thu
> > > Jul 20 09:54:04 CDT 2006 i686,sessions=? 0,nsessions=?
> > > 0,nusers=0,idletime=246260,totmem=4194303kb,availmem=4194303kb,
> > > physmem=8195020kb,ncpus=2,loadave=0.00,rectime=1160134336
> > >
> > >
> > > #qmgr
> > > Qmgr: list server
> > > Server <node>
> > > server_state = Active
> > > scheduling = True
> > > total_jobs = 0
> > > state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0
> > > Exiting:0
> > > acl_host_enable = False
> > > managers = root@<node>
> > > default_queue = dteam
> > > log_events = 511
> > > mail_from = adm
> > > query_other_jobs = True
> > > scheduler_iteration = 600
> > > node_ping_rate = 300
> > > node_check_rate = 600
> > > tcp_timeout = 6
> > > default_node = glite
> > > node_pack = False
> > > pbs_version = torque_1.0.1p5
> > > Qmgr: list queue
> > > No Active Queues, nothing done.
> > > Qmgr: list node
> > > No Active Nodes, nothing done.
> > > Qmgr:
> > > Qmgr: list queue phicos
> > > Queue phicos
> > > queue_type = Execution
> > > total_jobs = 0
> > > state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0
> > > Exiting:0
> > > resources_max.cput = 48:00:00
> > > resources_max.walltime = 96:00:00
> > > acl_group_enable = True
> > > acl_groups = +phicos
> > > enabled = True
> > > started = True
> > > Qmgr: list node <node1>
> > > Node <node1>
> > > state = free
> > > np = 2
> > > properties = glite
> > > ntype = cluster
> > > status = arch=linux,
> > > uname=Linux <node1> 2.4.21-47.ELsmp #1 SMP Thu Jul 20
> > > 09:54:04 CDT 2006 i686,
> > > sessions=? 0,nsessions=? 0,nusers=0,idletime=3792,
> > >
> > > totmem=4194303kb,availmem=4194303kb,physmem=8195020kb,ncpus=2,
> > > loadave=0.00,rectime=1160134871
> > > Qmgr: list node <node2>
> > > Node <node2>
> > > state = free
> > > np = 2
> > > properties = glite
> > > ntype = cluster
> > > status = arch=linux,
> > > uname=Linux <node2> 2.4.21-47.ELsmp #1 SMP Thu Jul 20
> > > 09:54:04 CDT 2006 i686,
> > > sessions=? 0,nsessions=? 0,nusers=0,idletime=246860,
> > >
> > > totmem=4194303kb,availmem=4194303kb,physmem=8195020kb,ncpus=2,
> > > loadave=0.00,rectime=1160134937
> > >
> > > #qstat -q
> > > server: <node>
> > > Queue Memory CPU Time Walltime Node Run Que Lm State
> > > ---------------- ------ -------- -------- ---- --- --- -- -----
> > > phicos -- 48:00:00 96:00:00 -- 0 0 -- E R
> > >
> > > # qstat -Q phicos
> > > Queue Max Tot Ena Str Que Run Hld Wat Trn Ext Type
> > > ---------------- --- --- --- --- --- --- --- --- --- --- ----------
> > > phicos 0 0 yes yes 0 0 0 0 0 0 Execution
> > >
> > > #qstat -f
> > > <nothing>
> > >
> > >
> > > LHC Computer Grid - Rollout <[log in to unmask]> wrote on
> > > 06-10-2006 11:28:32:
> > >
> > > > Hi
> > > >
> > > > the EstRespTime 77777 is usually indicative of an inconsistency
> > > > in the batch system. One problem might be that the user under
> > > > which the info provider is running can do the equivalent of
> > > > 'pbsnodes' but not 'qstat -f'. Another is if, for some reason,
> > > > the jobs on the WNs are in a funny state, so that they appear to
> > > > be running but you can't see their wall or cpu times. A final one
> > > > is if, for example, the system can't detect that you have any WNs
> > > > at all.
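> > > > A quick check for the first case is to run the equivalent
> > > > commands as the info provider's user, e.g. something like (with
> > > > 'edginfo' standing in for your GIP user):
> > > >
> > > >    su - edginfo -s /bin/sh -c 'pbsnodes -a; qstat -f'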
> > > >
> > > > Can you run
> > > >
> > > > /opt/lcg/libexec/lrmsinfo-pbs
> > > >
> > > > and attach the output? That would help. Two notes:
> > > >
> > > > 1) Louis Poncet showed me yesterday a gLite-CE site that had the
> > > > same problem. It may be a problem specific to the gLite CE.
> > > >
> > > > 2) I have the day off today, so it might be a while before I
> > > > answer :)
> > > >
> > > > JT
> > > >
> > > >
> > > > Serge Vrijaldenhoven wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > we want to test our fresh gLite site.
> > > > > Is there already something similar to
> > > > > http://grid-deployment.web.cern.ch/grid-deployment/documentation/LCG2-Site-Testing/
> > > > > ?
> > > > >
> > > > > Already found the first problem:
> > > > > #glite-job-status <jobid>
> > > > > Current Status: Waiting
> > > > > Status Reason: BrokerHelper: Problems during rank evaluation
> > > > > (e.g. GRISes down, wrong JDL rank expression, etc.)
> > > > >
> > > > > #lcg-info --list-ce --attrs EstRespTime
> > > > > - EstRespTime 77777
> > > > >
> > > > > So it seems that information is not published correctly?
> > > > >
> > > > > Grtz,
> > > > > Serge
|