Hi,
and it looks like the same problem I opened a ticket about, though it
shows up a bit differently:
https://ggus.eu/tech/ticket_show.php?ticket=73384
I noticed it because gstat sometimes shows 0 running jobs although
there are jobs running. We have a CREAM CE with Torque 2.5.5, and
glite-info-dynamic-scheduler-wrapper crashes at the same line of code.
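For reference, a guard around the crashing statement could look roughly like this. This is only a sketch based on the traceback quoted below, not the real source of lcg-info-dynamic-scheduler; the function name, the fallback value and the surrounding context are my own assumptions.

```python
# Hypothetical sketch of a defensive guard for the crashing statement
# "wrt = qwt * nwait" in lcg-info-dynamic-scheduler. Variable names are
# taken from the quoted traceback; everything else is assumed.

def estimate_worst_response_time(waiting_jobs, nwait):
    """Return qwt * nwait, falling back to a site default when the first
    waiting job record carries no 'maxwalltime' field."""
    DEFAULT_MAXWALLTIME = 72 * 3600  # assumed site default, in seconds

    qwt = waiting_jobs[0].get('maxwalltime') if waiting_jobs else None
    if qwt is None:
        # 'maxwalltime' missing from the LRMS job record: the
        # "can't happen" case discussed in this thread.
        qwt = DEFAULT_MAXWALLTIME
    return qwt * nwait

# A job record without 'maxwalltime' no longer raises TypeError:
print(estimate_worst_response_time([{'state': 'queued'}], 417))
```

Of course, as Jeff says, this only masks the symptom; the real question is why the maxwalltime field is missing from the job records in the first place.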
Regards,
Ralph Müller-Pfefferkorn
--
Dr. rer. nat. Ralph Müller-Pfefferkorn
Head of the department "Distributed and Data Intensive Computing"
Center for Information Services and High Performance Computing
Technische Universität Dresden
01062 Dresden, Germany
Office : Zellescher Weg 12, Willers-Bau room A208
Phone: +49 351 463 39280 Fax: +49 351 463 3773
E-Mail: [log in to unmask]
Jeff Templon wrote on 29.08.2011 23:36:
> Hi Adam,
>
> It sounds like the same as this issue:
>
> https://ggus.eu/tech/ticket_show.php?ticket=71830
>
> so now there are TWO sites that see this issue. That piece of code is very old, so it is not clear why it is a problem now; in the past, all waiting jobs ALWAYS had a maxwalltime field. If you could look at that ticket, perhaps you could help debug the issue. I cannot reproduce it so far, which makes it very difficult to test.
>
> I can always add a check on the value; it might make it into EMI release 2 :) However, this is a 'can't happen' case, so something is still wrong somewhere else!
>
> JT
>
> On Aug 29, 2011, at 23:13 , Adam Padee wrote:
>
>> Hi,
>>
>> I've got a problem with 444444 waiting jobs on my Cream-CE.
>> I checked /opt/glite/etc/gip/plugin/glite-info-dynamic-scheduler-wrapper
>> manually, and when I try to run it, I get the following error:
>>
>> [root@ce2 ~]# /opt/lcg/libexec/lcg-info-dynamic-scheduler -c
>> lcg-info-dynamic-scheduler.conf
>> dn:
>> GlueVOViewLocalID=cms,GlueCEUniqueID=ce2.polgrid.pl:8443/cream-pbs-cms,mds-vo-name=resource,o=grid
>> GlueVOViewLocalID: cms
>> GlueCEStateRunningJobs: 7
>> GlueCEStateWaitingJobs: 417
>> GlueCEStateTotalJobs: 424
>> GlueCEStateFreeJobSlots: 0
>> GlueCEStateEstimatedResponseTime: 1325951
>> Traceback (most recent call last):
>> File "/opt/lcg/libexec/lcg-info-dynamic-scheduler", line 435, in ?
>> wrt = qwt * nwait
>> TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
>> [root@ce2 ~]#
>>
>> I located the source of the problem, which is this line:
>> qwt = waitingJobs[0].get('maxwalltime')
>> waitingJobs is a list of lrms.Job objects. It seems to me that the
>> object constructor somehow cannot find all the required fields in the
>> job records, so the get() method returns None because the
>> "maxwalltime" key is absent.
>> Why this happens, I don't know - it is a little bit too complicated
>> for me.
>> The funny thing is that I can't recall any recent update that could
>> have caused this, and I use the same Torque version (3.0.0) as on
>> another cluster, where the command works fine.
>> Has anyone encountered a similar problem? I would be very grateful
>> for any hints.
>>
>> Best regards,
>> Adam
>