Hi *
OK, this is progress, thanks Adam. So let's dig one level deeper:
what do you get as output of this command if you run it as an administrator on the Torque server? And do you get the same output if you run the command as the user running the info providers (usually edguser, but it may be different on an EMI machine)?
qstat -f | awk '/walltime/ {print $1}' | sort | uniq -c
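For reference, a minimal sketch of the sanity check discussed in this thread: guarding the estimated-response-time calculation against jobs whose 'maxwalltime' field is missing (the cause of the TypeError in Adam's traceback below). The names `estimated_wait` and `DEFAULT_MAXWALLTIME` are hypothetical, and plain dicts stand in for the lrms.Job objects; this is not the actual lcg-info-dynamic-scheduler code.

```python
# Hypothetical fallback when 'maxwalltime' is absent: 0, or the queue's
# default max walltime in seconds (the open question in this thread).
DEFAULT_MAXWALLTIME = 0

def estimated_wait(waiting_jobs, nwait):
    """Compute wrt = qwt * nwait, tolerating a missing 'maxwalltime' key."""
    if not waiting_jobs:
        return 0
    qwt = waiting_jobs[0].get('maxwalltime')
    if qwt is None:  # this is the case that raises TypeError in the traceback
        qwt = DEFAULT_MAXWALLTIME
    return qwt * nwait

# A job record shaped like the lrmsinfo-pbs output quoted below,
# with no 'maxwalltime' field at all:
job = {'group': 'cms', 'name': 'cream_293243727', 'queue': 'cms',
       'state': 'queued', 'user': 'cms013'}
print(estimated_wait([job], 417))  # falls back instead of raising
```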
JT
On Aug 30, 2011, at 24:54 , Adam Padee wrote:
> Hi Jeff,
>
> Thanks a lot for your reply and for pointing me to that ticket.
> I myself thought about adding a sanity check and setting this
> variable to an arbitrary value, but I don't know what it should be: 0,
> or rather the default maxwalltime for my queues?
>
> I checked the lrmsinfo-pbs output on the affected CE and none of the
> lines has a maxwalltime field. They look like this:
> {'group': 'cms', 'name': 'cream_293243727', 'qtime': 1314637978.0,
> 'jobid': '497964.ce2.polgrid.pl', 'queue': 'cms', 'state': 'queued',
> 'user': 'cms013'}
> {'group': 'cms', 'name': 'cream_984834080', 'qtime': 1313363866.0,
> 'jobid': '487325.ce2.polgrid.pl', 'queue': 'cms', 'state': 'queued',
> 'cpucount': 1, 'user': 'cmsprd'}
>
> And the record counts are:
> [root@ce2 ~]# /opt/lcg/libexec/lrmsinfo-pbs |grep queue |grep
> maxwalltime |wc -l
> 0
> [root@ce2 ~]# /opt/lcg/libexec/lrmsinfo-pbs |grep queue |wc -l
> 393
> [root@ce2 ~]#
>
>
> Apart from that one, I have 2 other CEs (one CREAM with torque 3.0.0 and
> one LCG with torque 2.3.6).
> On the lcg-CE:
> [root@ce3 ~]# /opt/lcg/libexec/lrmsinfo-pbs |grep queue |grep
> maxwalltime |wc -l
> 390
> [root@ce3 ~]# /opt/lcg/libexec/lrmsinfo-pbs |grep queue |wc -l
> 390
> [root@ce3 ~]#
>
> On the other CREAM-CE:
> [root@ce ~]# /opt/lcg/libexec/lrmsinfo-pbs |grep queue |wc -l
> 35
> [root@ce ~]# /opt/lcg/libexec/lrmsinfo-pbs |grep queue |grep maxwalltime
> |wc -l
> 32
> [root@ce ~]#
>
> So 3 jobs there are also missing the maxwalltime attribute, but strangely
> glite-info-dynamic-scheduler-wrapper works without problems on that machine.
>
> Best regards,
> Adam
>
>
> On 2011-08-29 23:36, Jeff Templon wrote:
>> Hi Adam,
>>
>> It sounds like the same as this issue:
>>
>> https://ggus.eu/tech/ticket_show.php?ticket=71830
>>
>> so now there are TWO sites that see this issue. That piece of code is very old, and it's not clear why it's a problem now. In the past, all waiting jobs ALWAYS had a maxwalltime field. If you could look at that ticket, maybe you could help debug the issue. I cannot reproduce it so far, which makes it very difficult to test.
>>
>> I can always add a check on the value; it might make it into EMI release 2 :) However, this is a 'can't happen' case, so something is still wrong somewhere else!
>>
>> JT
>>
>> On Aug 29, 2011, at 23:13 , Adam Padee wrote:
>>
>>> Hi,
>>>
>>> I've got a problem with 444444 waiting jobs on my Cream-CE.
>>> I checked /opt/glite/etc/gip/plugin/glite-info-dynamic-scheduler-wrapper
>>> manually, and when I try to run it, I get the following error:
>>>
>>> [root@ce2 ~]# /opt/lcg/libexec/lcg-info-dynamic-scheduler -c
>>> lcg-info-dynamic-scheduler.conf
>>> dn:
>>> GlueVOViewLocalID=cms,GlueCEUniqueID=ce2.polgrid.pl:8443/cream-pbs-cms,mds-vo-name=resource,o=grid
>>> GlueVOViewLocalID: cms
>>> GlueCEStateRunningJobs: 7
>>> GlueCEStateWaitingJobs: 417
>>> GlueCEStateTotalJobs: 424
>>> GlueCEStateFreeJobSlots: 0
>>> GlueCEStateEstimatedResponseTime: 1325951
>>> Traceback (most recent call last):
>>> File "/opt/lcg/libexec/lcg-info-dynamic-scheduler", line 435, in ?
>>> wrt = qwt * nwait
>>> TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
>>> [root@ce2 ~]#
>>>
>>> I located the source of the problem, which is in this line:
>>> qwt = waitingJobs[0].get('maxwalltime')
>>> waitingJobs is a list of lrms.Job objects. It seems to me that the
>>> object constructor somehow cannot find all the required fields in the
>>> job records, so the get() method returns None because the "maxwalltime"
>>> key is absent.
>>> But why this happens, I don't know. It is a little bit too complicated
>>> for me.
>>> The funny thing is that I can't recall any recent update that could have
>>> caused this, and I use the same torque version (3.0.0) as on another
>>> cluster, where the command works fine.
>>> Has anyone encountered a similar problem? I would be very grateful for
>>> any hints.
>>>
>>> Best regards,
>>> Adam
>