This (Kashif's error below) turns out to be another symptom of the
problem I encountered, and Jeff's fix works for me too:
qmgr -c "s q atlas resources_default.walltime = 72:00:00"
qmgr -c "s q atlas resources_default.cput = 48:00:00"
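For anyone applying the same fix, one way to confirm the defaults took
effect (queue name "atlas" assumed, as in the commands above):

qmgr -c "list queue atlas" | grep resources_default

The output should include the resources_default.walltime and
resources_default.cput values just set.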
Now "showq" reports sane "remaining time" numbers, and I hope
overrunning jobs will be terminated in the future.
Cheers,
Ben
On 17/11/11 16:25, Jeff Templon wrote:
> Hi,
>
> I think you have a problem that was discussed around August /
> September. Newer installations of Torque for some reason do not set the
> parameter
>
> resources_default.walltime
>
> and this causes that bug (qwt is not defined and hence the multiply
> operation fails).
>
> Give that parameter a value in Torque and the error should go away.
>
> JT
>
> On 16 Nov 2011, at 13:45, Kashif Mohammad wrote:
>
>> Hi
>>
>> I am seeing this error in /var/log/bdii/bdii-update.log on all of our CEs:
>> Traceback (most recent call last):
>> File "/opt/lcg/libexec/lcg-info-dynamic-scheduler", line 435, in ?
>> wrt = qwt * nwait
>>
>> /opt/lcg/libexec/lcg-info-dynamic-scheduler belongs to
>> lcg-info-dynamic-scheduler-generic-2.3.4-1, which hasn't changed for a
>> long time, and I am not able to correlate this problem with any other change.
>> rpm -qa | grep bdii
>> bdii-5.0.8-1
>>
>> The end result is that the CE is publishing only default dynamic values.
>> The last update was almost two weeks back:
>>
>> Nov 02 10:01:39 Updated: torque-2.5.7-2.el5.1.x86_64
>> Nov 02 10:01:39 Updated: libtorque-2.5.7-2.el5.1.x86_64
>> Nov 02 10:01:40 Updated: torque-client-2.5.7-2.el5.1.x86_64
>> Nov 02 10:01:40 Updated: glite-apel-core-2.0.13-8.noarch
>> Nov 02 10:01:40 Updated: glite-version-3.2.3-1.noarch
>> Nov 02 10:01:40 Updated: glite-yaim-torque-utils-4.1.0-2.sl5.noarch
>> Nov 02 10:01:40 Updated: freetype-2.2.1-28.el5_7.1.x86_64
>> Nov 02 10:01:40 Updated: glite-TORQUE_utils-3.2.4-2.sl5.x86_64
>> Nov 02 10:01:53 Installed: kernel-2.6.18-274.7.1.el5.x86_64
>> Nov 02 10:01:53 Updated: rpm-libs-4.4.2.3-22.el5_7.2.x86_64
>> Nov 02 10:01:57 Updated: rpm-4.4.2.3-22.el5_7.2.x86_64
>> Nov 02 10:01:59 Updated: rpm-python-4.4.2.3-22.el5_7.2.x86_64
>> Nov 02 10:02:05 Updated: torque-client-2.5.7-2.el5.1.x86_64
>>
>> There is a chance that the problem started after this update and we
>> haven't noticed, as most of the big VOs do direct submission.
>> Any suggestions, please?
>>
>> Thanks
>> Kashif
On 11/11/11 12:25, Ben Waugh wrote:
> Thanks for your reply Arnau. I have not compiled Maui myself but have
> installed it from the glite-TORQUE_server_ext repository, and the
> version I have is maui-3.2.6p21-snap.1234905291.5.el5, along with
> torque-2.5.7-2.el5.1.
>
> Can someone who has also installed these versions from the gLite
> repository check whether they see the same effect? I would be a little
> surprised if this distribution did not have the appropriate
> configuration options, but as I said, I would probably not have noticed
> this myself if not for an unrelated problem leading to jobs using much
> more wall- than CPU time.
>
> Cheers,
> Ben
>
>
>
> On 11/10/2011 04:15 PM, Arnau Bria wrote:
>> On Thu, 10 Nov 2011 16:03:09 +0000
>> Ben Waugh wrote:
>>
>>> Hi All,
>> Hi,
>>
>>> I suspect this problem might have arisen from upgrading Torque/Maui
>>> as part of the recent gLite changes, without draining the farm.
>> have you compiled Torque yourself?
>>
>> I had a similar issue some time ago...
>> take a look at:
>> http://www.supercluster.org/pipermail/torqueusers/2010-June/010740.html
>>
>> solved by adding --enable-maxdefault at configure time
>>
>> You could check whether this is the same issue by doing a qstat -f and
>> seeing whether the default resource time limits are missing.
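>>
>> For example (the job id 12345 is hypothetical; substitute one of your
>> own jobs):
>>
>> qstat -f 12345 | grep -E "Resource_List.(walltime|cput)"
>>
>> If nothing comes back for jobs submitted without explicit limits, the
>> queue is not applying any default time limits.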
>>
>> HTH,
>> Arnau
--
Dr Ben Waugh Tel. +44 (0)20 7679 7223
Dept of Physics and Astronomy Internal: 37223
University College London
London WC1E 6BT