Hi Jeff,
Thanks a lot for your reply and for pointing me to that ticket.
I myself thought about adding a sanity check and setting this
variable to an arbitrary value, but I don't know what it should be: 0,
or rather the default maxwalltime for my queues?
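To make the idea concrete, here is a minimal sketch of the kind of guard I mean (the names qwt and nwait are taken from the traceback quoted below; the fallback value is exactly the open question, so 0 here is just a placeholder):

```python
# Hypothetical guard around the failing multiplication in
# lcg-info-dynamic-scheduler. DEFAULT_MAXWALLTIME is a placeholder;
# whether it should be 0 or the queue's default walltime is the question.
DEFAULT_MAXWALLTIME = 0

def estimate_waiting_time(waiting_jobs, nwait):
    """Compute qwt * nwait, tolerating jobs without a 'maxwalltime' key."""
    qwt = waiting_jobs[0].get('maxwalltime')
    if qwt is None:  # the case that currently raises TypeError
        qwt = DEFAULT_MAXWALLTIME
    return qwt * nwait

print(estimate_waiting_time([{'queue': 'cms'}], 417))        # -> 0
print(estimate_waiting_time([{'maxwalltime': 3600}], 2))     # -> 7200
```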
I checked the lrmsinfo-pbs output on the affected CE and none of the
lines has a maxwalltime field. They look like this:
{'group': 'cms', 'name': 'cream_293243727', 'qtime': 1314637978.0,
'jobid': '497964.ce2.polgrid.pl', 'queue': 'cms', 'state': 'queued',
'user': 'cms013'}
{'group': 'cms', 'name': 'cream_984834080', 'qtime': 1313363866.0,
'jobid': '487325.ce2.polgrid.pl', 'queue': 'cms', 'state': 'queued',
'cpucount': 1, 'user': 'cmsprd'}
And the record counts are:
[root@ce2 ~]# /opt/lcg/libexec/lrmsinfo-pbs |grep queue |grep maxwalltime |wc -l
0
[root@ce2 ~]# /opt/lcg/libexec/lrmsinfo-pbs |grep queue |wc -l
393
[root@ce2 ~]#
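Since each lrmsinfo-pbs record is printed as one Python dict literal per line, the incomplete records can also be counted more robustly than with grep. A small sketch (the sample holds the two records quoted above; in practice the input would be the piped output of /opt/lcg/libexec/lrmsinfo-pbs):

```python
import ast

# Sample: two job records as printed by lrmsinfo-pbs (one dict per line).
sample = """\
{'group': 'cms', 'name': 'cream_293243727', 'qtime': 1314637978.0, 'jobid': '497964.ce2.polgrid.pl', 'queue': 'cms', 'state': 'queued', 'user': 'cms013'}
{'group': 'cms', 'name': 'cream_984834080', 'qtime': 1313363866.0, 'jobid': '487325.ce2.polgrid.pl', 'queue': 'cms', 'state': 'queued', 'cpucount': 1, 'user': 'cmsprd'}
"""

# Collect the jobids of records that lack the 'maxwalltime' key.
missing = [ast.literal_eval(line)['jobid']
           for line in sample.splitlines()
           if 'maxwalltime' not in ast.literal_eval(line)]
print(len(missing), missing)   # -> 2 ['497964.ce2.polgrid.pl', '487325.ce2.polgrid.pl']
```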
Apart from that one, I have 2 other CEs (one CREAM with torque 3.0.0 and
one lcg with torque 2.3.6).
On the lcg-CE:
[root@ce3 ~]# /opt/lcg/libexec/lrmsinfo-pbs |grep queue |grep maxwalltime |wc -l
390
[root@ce3 ~]# /opt/lcg/libexec/lrmsinfo-pbs |grep queue |wc -l
390
[root@ce3 ~]#
On the other CREAM:
[root@ce ~]# /opt/lcg/libexec/lrmsinfo-pbs |grep queue |wc -l
35
[root@ce ~]# /opt/lcg/libexec/lrmsinfo-pbs |grep queue |grep maxwalltime |wc -l
32
[root@ce ~]#
So 3 jobs are also missing the maxwalltime attribute, but strangely
glite-info-dynamic-scheduler-wrapper works without problems on that machine.
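One possible explanation, judging from the line qwt = waitingJobs[0].get('maxwalltime') quoted below: only the first waiting job of a queue is inspected, so a CE can have some incomplete records and still not crash, as long as that first record carries the field. A toy illustration of this assumption (not the actual scheduler code):

```python
# Toy model of the failing pattern: only waitingJobs[0] is inspected,
# so an incomplete record crashes the computation only when it comes first.
def qwt_times_count(waiting_jobs):
    qwt = waiting_jobs[0].get('maxwalltime')
    return qwt * len(waiting_jobs)   # TypeError when qwt is None

ok  = [{'maxwalltime': 3600}, {}]    # incomplete record present, but not first
bad = [{}, {'maxwalltime': 3600}]    # incomplete record first

print(qwt_times_count(ok))           # -> 7200
try:
    qwt_times_count(bad)
except TypeError as e:
    print('TypeError:', e)           # the error from the traceback below
```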
Best regards,
Adam
On 2011-08-29 23:36, Jeff Templon wrote:
> Hi Adam,
>
> It sounds like the same as this issue:
>
> https://ggus.eu/tech/ticket_show.php?ticket=71830
>
> so now there are TWO sites which see this issue. That piece of code is very old, it's not clear why it's a problem now. In the past, all waiting jobs ALWAYS had a maxwalltime field. If you could look at that ticket, maybe you could help debug the issue. I cannot reproduce it so far which makes it very difficult to test.
>
> I can always add a check on the value, it might make it in EMI release 2 :) However this is a 'can't happen' case, something is still wrong somewhere else!
>
> JT
>
> On Aug 29, 2011, at 23:13 , Adam Padee wrote:
>
>> Hi,
>>
>> I've got a problem with 444444 waiting jobs on my Cream-CE.
>> I checked /opt/glite/etc/gip/plugin/glite-info-dynamic-scheduler-wrapper
>> manually, and when I try to run it, I get the following error:
>>
>> [root@ce2 ~]# /opt/lcg/libexec/lcg-info-dynamic-scheduler -c
>> lcg-info-dynamic-scheduler.conf
>> dn:
>> GlueVOViewLocalID=cms,GlueCEUniqueID=ce2.polgrid.pl:8443/cream-pbs-cms,mds-vo-name=resource,o=grid
>> GlueVOViewLocalID: cms
>> GlueCEStateRunningJobs: 7
>> GlueCEStateWaitingJobs: 417
>> GlueCEStateTotalJobs: 424
>> GlueCEStateFreeJobSlots: 0
>> GlueCEStateEstimatedResponseTime: 1325951
>> Traceback (most recent call last):
>> File "/opt/lcg/libexec/lcg-info-dynamic-scheduler", line 435, in ?
>> wrt = qwt * nwait
>> TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
>> [root@ce2 ~]#
>>
>> I located the source of the problem, which is in the line:
>> qwt = waitingJobs[0].get('maxwalltime')
>> waitingJobs is a list of lrms.Job objects. It seems to me that the
>> object constructor somehow cannot find all the required fields in the
>> job records, so the get() method returns None because the "maxwalltime"
>> key is absent.
>> But why this happens, I don't know; it is a bit too complicated
>> for me.
>> The funny thing is that I can't recall any recent update that could have
>> caused that, and I use the same torque version (3.0.0) as on another
>> cluster, where the command works fine.
>> Has anyone encountered a similar problem? I would be very grateful for
>> any hints.
>>
>> Best regards,
>> Adam