Hi,
One tip I have is that if you get sick of having a small percentage of jobs failing with munge authentication problems (due to a bug in the torque version which is in EPEL), just build a more recent version of torque from source. We're using 2.5.10.
And if you start getting occasional jobs aborting with errors like this in the CE logs:
BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Error (15033 - Batch protocol error
then increase PBS_NET_MAX_CONNECTIONS from the default value of 10240 in src/include/server_limits.h
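A minimal sketch of that header edit, in Python. It assumes the header contains a single line of the form `#define PBS_NET_MAX_CONNECTIONS 10240`. The helper name `bump_max_connections` and the replacement value 20480 are illustrative only, not values recommended in this thread.

```python
import re

# Matches a line like "#define PBS_NET_MAX_CONNECTIONS 10240" (assumed form).
_DEFINE_RE = re.compile(
    r"^(\s*#\s*define\s+PBS_NET_MAX_CONNECTIONS\s+)(\d+)", re.MULTILINE
)


def bump_max_connections(header_text: str, new_value: int) -> str:
    """Return header_text with the PBS_NET_MAX_CONNECTIONS value replaced."""
    patched, count = _DEFINE_RE.subn(
        lambda m: m.group(1) + str(new_value), header_text
    )
    if count != 1:
        # Refuse to guess if the define is missing or appears more than once.
        raise ValueError(f"expected exactly one define, found {count}")
    return patched


if __name__ == "__main__":
    # Hypothetical sample text standing in for src/include/server_limits.h.
    sample = "#define PBS_NET_MAX_CONNECTIONS 10240\n"
    print(bump_max_connections(sample, 20480), end="")
```

Since this is a compile-time constant, torque has to be rebuilt from the patched source for the new limit to apply.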
Regards,
Andrew.
-----Original Message-----
From: Testbed Support for GridPP member institutes [mailto:[log in to unmask]] On Behalf Of Stephen Jones
Sent: 13 April 2012 09:33
To: [log in to unmask]
Subject: Re: Setting up new CREAM CE/Torque failing
Hi Mark,
I've put my notes on our EMI/UMD CREAM/TORQUE experience here:
https://www.gridpp.ac.uk/wiki/Middleware_upgrades
It has no reference to puppet in it, and it covers the majority (all?) of the obstacles I encountered when installing the new software stack.
I've put in a special note about "authorized_users" in the torque config.
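For reference, the pattern behind that note (one `authorized_users` entry per submission host, as in Chris's reply quoted further down) can be sketched in Python. The helper name `authorized_users_lines` and the host names are hypothetical. The output is plain qmgr directive text, with `=` for the first host and `+=` for the rest.

```python
def authorized_users_lines(hosts):
    """Build qmgr 'set server authorized_users' directives, one per host."""
    lines = []
    for i, host in enumerate(hosts):
        op = "=" if i == 0 else "+="  # first entry assigns, later ones append
        lines.append(f"set server authorized_users {op} *@{host}")
    return lines


if __name__ == "__main__":
    # Hypothetical submission hosts; substitute your own CEs and UIs.
    for line in authorized_users_lines(["ce01.example.org", "ce02.example.org"]):
        print(line)
```

Each generated line can then be given to `qmgr -c` on the Torque server.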
Are there any other things that you ran into?
Cheers,
Steve
On 04/11/2012 08:21 AM, Mark Slater wrote:
> Hi Steve,
>
> After a bit more targeted Googling, I got a little further but I
> think the main issue is related to the munge requirement that was
> introduced about a year ago. This has changed a number of things
> (several of which I've already had problems with!) but almost all the
> documentation I can find is for previous versions of torque that
> *don't* need munge, thus making them a bit pointless :)
>
> After I've got through the installs of the new kit (which is nearly
> there now) I'll try to write up everything I've learnt in a Twiki
> somewhere - this should be quite easy as well seeing as everything I
> know is now encoded in a large number of puppet scripts :)
>
> Thanks,
>
> Mark
>
> On 10/04/2012 17:32, Stephen Jones wrote:
>> Hi Mark,
>>
>> We ran into this too, at Liverpool, and found the answer after
>> googling like mad, and trying lots of things out. But you're right -
>> it didn't spring out at me - I had to drag it out with chains.
>>
>> Did we miss something in the documentation that should be made much
>> clearer?
>> Any thoughts on that?
>>
>> Steve
>>
>>
>> On 04/10/2012 11:29 AM, Mark Slater wrote:
>>> Hi Chris,
>>>
>>> You're a star!!! Now why couldn't Google find this? :)
>>>
>>> Thanks,
>>>
>>> Mark
>>>
>>> On 10/04/2012 10:31, Chris Brew wrote:
>>>> Hi Mark,
>>>>
>>>> Think you need lines like:
>>>>
>>>> set server authorized_users = *@heplnx109.pp.rl.ac.uk
>>>> set server authorized_users += *@heplnx208.pp.rl.ac.uk
>>>> set server authorized_users += *@heplnx108.pp.rl.ac.uk
>>>> set server authorized_users += *@heplnc108.pp.rl.ac.uk
>>>> set server authorized_users += *@heplnx207.pp.rl.ac.uk
>>>>
>>>>
>>>> For each submission host.
>>>>
>>>> Yours,
>>>> Chris.
>>>>
>>>> On 10/04/2012 09:48, "Mark Slater"<[log in to unmask]> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I have a problem with the new Torque server that I'm setting up
>>>>> for Bham. Everything's OK from the server side and I can submit
>>>>> jobs from there, but I can't seem to submit jobs from the new
>>>>> CREAM CE (just basic qsub) or any other remote host for that
>>>>> matter. I can qstat from them fine, but whenever I try a qsub I get:
>>>>>
>>>>> [atl001@epgr02 ~]$ echo "echo 'hello'" | qsub -q long
>>>>> qsub: Bad UID for job execution MSG=could not authorize user atl001 from epgr02.ph.bham.ac.uk
>>>>>
>>>>>
>>>>> and in the logs:
>>>>>
>>>>> 04/10/2012 09:39:39;0080;PBS_Server;Req;req_reject;Reject reply code=15025(Bad UID for job execution MSG=could not authorize user atl001 from epgr02.ph.bham.ac.uk), aux=0, type=QueueJob, from [log in to unmask]
>>>>> 04/10/2012 09:39:57;0080;PBS_Server;Req;req_reject;Reject reply code=15021(Invalid credential), aux=0, type=StatusJob, from [log in to unmask]
>>>>> 04/10/2012 09:40:17;0080;PBS_Server;Req;req_reject;Reject reply code=15021(Invalid credential), aux=0, type=StatusJob, from [log in to unmask]
>>>>>
>>>>>
>>>>> I've added the host to /etc/hosts.equiv and (after that didn't
>>>>> work) added it in acl_hosts:
>>>>>
>>>>> set server scheduling = False
>>>>> set server acl_host_enable = True
>>>>> set server acl_hosts = epgr13.ph.bham.ac.uk
>>>>> set server acl_hosts += localhost
>>>>> set server acl_hosts += epgr02.ph.bham.ac.uk
>>>>> set server managers = [log in to unmask]
>>>>> set server operators = [log in to unmask]
>>>>> set server operators += [log in to unmask]
>>>>>
>>>>>
>>>>> I'm guessing I'm missing some security setting somewhere but
>>>>> Google isn't helping me find it!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mark
>>
>>
--
Steve Jones [log in to unmask]
System Administrator office: 220
High Energy Physics Division tel (int): 42334
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
University of Liverpool http://www.liv.ac.uk/physics/hep/