I suspect this is the case: look inside the directory
/opt/lcg/var/gip/tmp - the files vary in ownership depending on
whether rgma or edginfo last ran the plugins. They overwrite each
other's files! When they are run and owned by rgma the result should
be okay, but things get messed up when edginfo owns them.
Try this in there: watch -n 1 'ls -l'
and compare the result.
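Or, as a one-off with GNU stat, list just the owners (a plain sketch,
nothing site-specific assumed):

stat -c '%U %G %n' /opt/lcg/var/gip/tmp/*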
That was the case for me, and I later figured out that
/opt/lcg/libexec/lcg-info-wrapper wasn't sourcing my batch system
(condor) environment properly, and somehow only edginfo was affected.
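In case it helps, the shape of my fix was simply to source the
environment near the top of the wrapper. The profile path below is
only an example - point it at wherever your batch system's setup
script actually lives:

#!/bin/sh
# make the batch system environment visible to the plugins
# (example path -- adjust to your condor installation)
[ -f /opt/condor/condor.sh ] && . /opt/condor/condor.sh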
Cheers,
Santanu
On 14 Jul 2008, at 14:30, Adam Padee <[log in to unmask]> wrote:
> Hi Cristina, Eygene, Jason and all
>
> Jason Shih wrote:
>> Hi all,
>>
>>
>>>> Failed execution is accompanied by two lines in /var/log/messages:
>>>>
>>>> Jul 14 11:57:55 ce lcg-info-dynamic-scheduler: VO max jobs backend command returned nonzero exit status
>>>> Jul 14 11:57:55 ce lcg-info-dynamic-scheduler: Exiting without output, GIP will use static values
>>>>
>>>> But when I run the command a couple of times, or execute a qmgr
>>>> command manually, it starts to behave properly:
>>>>
>>>> [root@ce ~]# /opt/glite/libexec/glite-info-wrapper | grep GlueCEStateWaitingJobs | uniq
>>>> GlueCEStateWaitingJobs: 0
>>>> [root@ce ~]#
>>>>
>>> You're trying to run it as root, but GIP is executed by the
>>> 'edginfo' user. Try running it under that account -- possibly this
>>> is a permission problem. But if you have some periods of
>>> properly-behaving dynamic values, then that won't be the case and
>>> the root of the problem is somewhere else.
>>>
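>>> For example, something like this (wrapper path as in your test;
>>> adjust if your installation differs):
>>>
>>> su -s /bin/sh -c '/opt/glite/libexec/glite-info-wrapper' edginfo
>>>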
>>
>>
>> Indeed, the account expected to execute the wrapper should be
>> edguser. Adam also gets the waiting-jobs error when executing
>> glite-info-dynamic-scheduler-wrapper -- could that arise from
>> instability of the maui scheduler? Did you also read the maui
>> logfile? It might hold a clue if the query to the scheduler fails
>> and returns errors.
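>>
>> For example (the log path depends on the LOGFILE setting in your
>> maui.cfg, so adjust accordingly):
>>
>> grep -E 'ERROR|WARNING' /var/spool/maui/log/maui.log | tail -n 20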
>>
>> BR,
>> J
>>
> I checked the behavior of the wrapper under the edginfo account,
> but it's exactly the same.
> The permissions in /var/spool/maui seem to be OK:
> [root@ce ~]# ls -l /var/spool/maui/
> total 148
> -rw-r--r-- 1 root root 3879 Jul 10 15:37 maui.cfg
> -rw-r----- 1 root root 42935 Jul 14 15:00 maui.ck
> -rw-r----- 1 root root 42935 Jul 14 14:55 maui.ck.1
> -rw------- 1 root root 5 Jul 10 15:37 maui.pid
> -rw------- 1 root root 0 Jan 25 14:52 maui-private.cfg
> drwxrwxrwt 2 root root 4096 Jan 25 14:52 spool
> drwxr-xr-x 2 root root 4096 Jul 14 14:07 stats
> drwxr-xr-x 2 root root 4096 Jan 25 14:52 tools
> drwxr-xr-x 2 root root 4096 Jan 25 14:52 traces
> [root@ce ~]#
> I also tried re-creating the default, stupid maui configuration
> with yaim, but it didn't help at all.
> The strange thing is that, as far as I can see, both
> /opt/lcg/libexec/lrmsinfo-pbs and /opt/lcg/libexec/vomaxjobs-maui,
> which are specified in lcg-info-dynamic-scheduler.conf as backend
> commands for lcg-info-dynamic-scheduler, use just 'diagnose -g',
> while the maui client tools behave perfectly stably and I wasn't
> able to detect any anomalies with them. I only have problems with
> pbs commands.
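> For the record, the exit status of the underlying commands can be
> checked directly under the info-provider account, e.g. (edginfo
> here; edguser if that's what your site expects):
>
> su -s /bin/sh -c 'diagnose -g; echo "exit: $?"' edginfo
> su -s /bin/sh -c 'qstat; echo "exit: $?"' edginfo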
>
> Below I attach the relevant part of maui.log (from the time I
> executed the tests described in my first mail):
>
> 07/14 11:56:18 MPBSClusterQuery(base,RCount,SC)
> 07/14 11:56:18 __MPBSGetNodeState(Name,State,PNode)
> 07/14 11:56:18 INFO: PBS node wn080.polgrid.pl set to state Idle (free)
> 07/14 11:56:18 MPBSLoadQueueInfo(base,wn080.polgrid.pl,SC)
> 07/14 11:56:18 __MPBSGetNodeState(Name,State,PNode)
> 07/14 11:56:18 INFO: PBS node wn081.polgrid.pl set to state Idle (free)
> 07/14 11:56:18 MPBSLoadQueueInfo(base,wn081.polgrid.pl,SC)
> 07/14 11:56:18 __MPBSGetNodeState(Name,State,PNode)
> 07/14 11:56:18 INFO: PBS node wn082.polgrid.pl set to state Idle (free)
> 07/14 11:56:18 MPBSLoadQueueInfo(base,wn082.polgrid.pl,SC)
> 07/14 11:56:18 INFO: 3 PBS resources detected on RM base
> 07/14 11:56:18 INFO: resources detected: 3
> 07/14 11:56:18 MPBSWorkloadQuery(base,JCount,SC)
> 07/14 11:56:18 INFO: 0 PBS jobs detected on RM base
> 07/14 11:56:18 WARNING: no workload detected
> 07/14 11:56:18 INFO: current util[5929]: 0/3 (0.00%) PH: 0.47% active jobs: 0 of 0 (completed: 2093)
> 07/14 11:56:18 INFO: scheduling complete. sleeping 60 seconds
> 07/14 11:57:19 MPBSClusterQuery(base,RCount,SC)
> 07/14 11:57:19 __MPBSGetNodeState(Name,State,PNode)
> 07/14 11:57:19 INFO: PBS node wn080.polgrid.pl set to state Idle (free)
> 07/14 11:57:19 MPBSLoadQueueInfo(base,wn080.polgrid.pl,SC)
> 07/14 11:57:19 __MPBSGetNodeState(Name,State,PNode)
> 07/14 11:57:19 INFO: PBS node wn081.polgrid.pl set to state Idle (free)
> 07/14 11:57:19 MPBSLoadQueueInfo(base,wn081.polgrid.pl,SC)
> 07/14 11:57:19 __MPBSGetNodeState(Name,State,PNode)
> 07/14 11:57:19 INFO: PBS node wn082.polgrid.pl set to state Idle (free)
> 07/14 11:57:19 MPBSLoadQueueInfo(base,wn082.polgrid.pl,SC)
> 07/14 11:57:19 INFO: 3 PBS resources detected on RM base
> 07/14 11:57:19 INFO: resources detected: 3
> 07/14 11:57:19 MPBSWorkloadQuery(base,JCount,SC)
> 07/14 11:57:19 INFO: 0 PBS jobs detected on RM base
> 07/14 11:57:19 WARNING: no workload detected
> 07/14 11:57:19 INFO: current util[5930]: 0/3 (0.00%) PH: 0.47% active jobs: 0 of 0 (completed: 2093)
> 07/14 11:57:19 INFO: scheduling complete. sleeping 60 seconds
> 07/14 11:58:20 MPBSClusterQuery(base,RCount,SC)
|