Hi there,
I found the problem in our case.
As mentioned before, we use an external Torque server that is not
configured by YAIM but by hand.
The queues had no "resources_default.walltime" set, only
"resources_max.walltime".
Since we added "resources_default.walltime",
glite-info-dynamic-scheduler-wrapper works again, and so does gstat.
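For anyone with a similar hand-configured queue, the fix amounts to something like the following (a sketch only: the queue name "cms" and the 48-hour value are illustrative, not our actual settings):

```shell
# Set a default walltime on a hand-configured Torque queue.
# Queue name and value below are examples only.
qmgr -c "set queue cms resources_default.walltime = 48:00:00"

# Verify that both the default and the max are now present:
qmgr -c "print queue cms" | grep walltime
```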
Cheers,
Ralph
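In case it helps others who hit this before their queues are fixed, the kind of guard Jeff mentions below might look roughly like this. This is a sketch only: the function, the DEFAULT_WALLTIME fallback, and the dict-style job records are my assumptions, not the actual lcg-info-dynamic-scheduler code; only the variable names come from the traceback.

```python
# Assumed fallback when a queue publishes no default walltime; not part of the tool.
DEFAULT_WALLTIME = 0

def estimate_wait(waiting_jobs, nwait):
    """Mimic the failing line `wrt = qwt * nwait`, tolerating a missing walltime."""
    qwt = waiting_jobs[0].get('maxwalltime') if waiting_jobs else None
    if qwt is None:
        # Queue had no resources_default.walltime, so the job record has no
        # 'maxwalltime' key; fall back instead of raising TypeError.
        qwt = DEFAULT_WALLTIME
    return qwt * nwait

# A job record without 'maxwalltime' would previously have crashed the script:
jobs = [{'state': 'queued'}]
print(estimate_wait(jobs, 417))  # -> 0 instead of TypeError
```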
Ralph Mueller-Pfefferkorn wrote on 30.08.2011 08:44:
> Hi,
>
> and it looks like the same problem I opened a ticket about, though it
> shows up a bit differently.
> https://ggus.eu/tech/ticket_show.php?ticket=73384
>
> I noticed it because gstat sometimes shows 0 running jobs although
> there are jobs running. We have a CREAM CE with Torque 2.5.5, and
> glite-info-dynamic-scheduler-wrapper crashes at the same line of code.
>
> Regards,
> Ralph Müller-Pfefferkorn
>
> --
> Dr. rer. nat. Ralph Müller-Pfefferkorn
> Head of the department "Distributed and Data Intensive Computing"
> Center for Information Services and High Performance Computing
> Technische Universität Dresden
> 01062 Dresden, Germany
> Office : Zellescher Weg 12, Willers-Bau room A208
> Phone: +49 351 463 39280 Fax: +49 351 463 3773
> E-Mail: [log in to unmask]
>
>
>
> Jeff Templon wrote on 29.08.2011 23:36:
>> Hi Adam,
>>
>> It sounds like the same as this issue:
>>
>> https://ggus.eu/tech/ticket_show.php?ticket=71830
>>
>> so now there are TWO sites which see this issue. That piece of code is very old; it's not clear why it has become a problem now. In the past, all waiting jobs ALWAYS had a maxwalltime field. If you could look at that ticket, maybe you could help debug the issue. I cannot reproduce it so far, which makes it very difficult to test.
>>
>> I can always add a check on the value; it might make it into EMI release 2 :) However, this is a 'can't happen' case, so something is still wrong somewhere else!
>>
>> JT
>>
>> On Aug 29, 2011, at 23:13 , Adam Padee wrote:
>>
>>> Hi,
>>>
>>> I've got a problem with 444444 waiting jobs on my Cream-CE.
>>> I checked /opt/glite/etc/gip/plugin/glite-info-dynamic-scheduler-wrapper
>>> manually, and when I try to run it, I get the following error:
>>>
>>> [root@ce2 ~]# /opt/lcg/libexec/lcg-info-dynamic-scheduler -c
>>> lcg-info-dynamic-scheduler.conf
>>> dn:
>>> GlueVOViewLocalID=cms,GlueCEUniqueID=ce2.polgrid.pl:8443/cream-pbs-cms,mds-vo-name=resource,o=grid
>>> GlueVOViewLocalID: cms
>>> GlueCEStateRunningJobs: 7
>>> GlueCEStateWaitingJobs: 417
>>> GlueCEStateTotalJobs: 424
>>> GlueCEStateFreeJobSlots: 0
>>> GlueCEStateEstimatedResponseTime: 1325951
>>> Traceback (most recent call last):
>>> File "/opt/lcg/libexec/lcg-info-dynamic-scheduler", line 435, in ?
>>> wrt = qwt * nwait
>>> TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
>>> [root@ce2 ~]#
>>>
>>> I located the source of the problem, which is in this line:
>>>   qwt = waitingJobs[0].get('maxwalltime')
>>> waitingJobs is a list of lrms.Job objects. It seems that the object
>>> constructor somehow cannot find all the required fields in the job
>>> records, so the get() method returns None because the "maxwalltime"
>>> key is absent.
>>> Why this happens, I don't know; it is a little too complicated for me.
>>> The odd thing is that I can't recall any recent update that could have
>>> caused this, and I use the same Torque version (3.0.0) as on another
>>> cluster, where the command works fine.
>>> Has anyone encountered a similar problem? I would be very grateful for
>>> any hints.
>>>
>>> Best regards,
>>> Adam
>>
>