Andrew,
I've added your tips in a new section. Please feel
free to add your own content if you wish.
https://www.gridpp.ac.uk/wiki/Example_Build_of_an_EMI-UMD_Cluster
Cheers,
Steve
On 04/13/2012 09:48 AM, Andrew Lahiff wrote:
> Hi,
>
> One tip I have is that if you get sick of having a small percentage of jobs failing with munge authentication problems (due to a bug in the torque version which is in EPEL), just build a more recent version of torque from source. We're using 2.5.10.
>
> And if you start getting occasional jobs aborting with errors like this in the CE logs:
>
> BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Error (15033 - Batch protocol error
>
> then increase PBS_NET_MAX_CONNECTIONS from the default value of 10240 in src/include/server_limits.h
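
A minimal sketch of that header edit, assuming the limit is defined on a single line of the form `#define PBS_NET_MAX_CONNECTIONS <number>`. The function name `set_max_connections` and the value 20480 in the usage comment are illustrative only and are not recommendations from this thread; torque still has to be rebuilt and reinstalled after the change.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the PBS_NET_MAX_CONNECTIONS value in a header.
# Assumes a line of the form "#define PBS_NET_MAX_CONNECTIONS <number>".
set_max_connections() {
    header=$1
    value=$2
    # Refuse anything that is not a plain non-negative integer.
    case $value in
        ''|*[!0-9]*) echo "value must be a number" >&2; return 1 ;;
    esac
    tmp=$(mktemp) || return 1
    # Replace only the numeric part; every other line passes through unchanged.
    if sed "s/^\(#define[[:space:]]\{1,\}PBS_NET_MAX_CONNECTIONS[[:space:]]\{1,\}\)[0-9]\{1,\}/\1${value}/" "$header" > "$tmp"; then
        cat "$tmp" > "$header"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return $status
}
# Example, run from the top of the unpacked torque source tree
# (20480 is an arbitrary illustrative value):
#   set_max_connections src/include/server_limits.h 20480
```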
>
> Regards,
> Andrew.
>
>
> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:[log in to unmask]] On Behalf Of Stephen Jones
> Sent: 13 April 2012 09:33
> To: [log in to unmask]
> Subject: Re: Setting up new CREAM CE/Torque failing
>
> Hi Mark,
>
> I've put my notes on our EMI/UMD CREAM/TORQUE experience here:
>
> https://www.gridpp.ac.uk/wiki/Middleware_upgrades
>
> It has no reference to puppet in it, and it covers the majority (all?) of the obstacles I encountered when installing the new software stack.
>
> I've put in a special note about "authorized_users" in the torque config.
> Are there any other things that you ran into?
>
> Cheers,
>
>
> Steve
>
>
>
>
> On 04/11/2012 08:21 AM, Mark Slater wrote:
>> Hi Steve,
>>
>> After a bit more targeted Googling, I got a little further but I
>> think the main issue is related to the munge requirement that was
>> introduced about a year ago. This has changed a number of things
>> (several of which I've already had problems with!) but almost all the
>> documentation I can find is for previous versions of torque that
>> *don't* need munge, thus making them a bit pointless :)
>>
>> After I've got through the installs of the new kit (which is nearly
>> there now) I'll try to write up everything I've learnt in a Twiki
>> somewhere - this should be quite easy as well seeing as everything I
>> know is now encoded in a large number of puppet scripts :)
>>
>> Thanks,
>>
>> Mark
>>
>> On 10/04/2012 17:32, Stephen Jones wrote:
>>> Hi Mark,
>>>
>>> We ran into this too, at Liverpool, and found the answer after
>>> googling like mad, and trying lots of things out. But you're right -
>>> it didn't spring out at me - I had to drag it out with chains.
>>>
>>> Did we miss something in the documentation that should be made much
>>> clearer?
>>> Any thoughts on that?
>>>
>>> Steve
>>>
>>>
>>> On 04/10/2012 11:29 AM, Mark Slater wrote:
>>>> Hi Chris,
>>>>
>>>> You're a star!!! Now why couldn't Google find this? :)
>>>>
>>>> Thanks,
>>>>
>>>> Mark
>>>>
>>>> On 10/04/2012 10:31, Chris Brew wrote:
>>>>> Hi Mark,
>>>>>
>>>>> Think you need lines like:
>>>>>
>>>>> set server authorized_users = *@heplnx109.pp.rl.ac.uk
>>>>> set server authorized_users += *@heplnx208.pp.rl.ac.uk
>>>>> set server authorized_users += *@heplnx108.pp.rl.ac.uk
>>>>> set server authorized_users += *@heplnc108.pp.rl.ac.uk
>>>>> set server authorized_users += *@heplnx207.pp.rl.ac.uk
>>>>>
>>>>>
>>>>> For each submission host.
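
A minimal sketch of generating those directives, one per submission host. The function name `print_authorized_users` is hypothetical; it only prints text, so the output can be reviewed before it is handed to qmgr on the torque server.

```shell
#!/bin/sh
# Hypothetical helper: print one authorized_users directive per submission host.
# The first host uses "=", which replaces the list; the rest use "+=" to append.
print_authorized_users() {
    op="="
    for host in "$@"; do
        printf 'set server authorized_users %s *@%s\n' "$op" "$host"
        op="+="
    done
}
# Example (host names taken from the message above); review the output, then
# pipe it to qmgr on the torque server:
#   print_authorized_users heplnx109.pp.rl.ac.uk heplnx208.pp.rl.ac.uk | qmgr
```

Because the first line uses `=`, applying the output replaces any existing authorized_users list rather than extending it.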
>>>>>
>>>>> Yours,
>>>>> Chris.
>>>>>
>>>>> On 10/04/2012 09:48, "Mark Slater"<[log in to unmask]> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have a problem with the new Torque server that I'm setting up
>>>>>> for Bham. Everything's OK from the server side and I can submit
>>>>>> jobs from there, but I can't seem to submit jobs from the new
>>>>>> CREAM CE (just basic qsub) or any other remote host for that
>>>>>> matter. I can qstat from them fine, but whenever I try a qsub I get:
>>>>>>
>>>>>> [atl001@epgr02 ~]$ echo "echo 'hello'" | qsub -q long
>>>>>> qsub: Bad UID for job execution MSG=could not authorize user atl001 from epgr02.ph.bham.ac.uk
>>>>>>
>>>>>>
>>>>>> and in the logs:
>>>>>>
>>>>>> 04/10/2012 09:39:39;0080;PBS_Server;Req;req_reject;Reject reply code=15025(Bad UID for job execution MSG=could not authorize user atl001 from epgr02.ph.bham.ac.uk), aux=0, type=QueueJob, from [log in to unmask]
>>>>>> 04/10/2012 09:39:57;0080;PBS_Server;Req;req_reject;Reject reply code=15021(Invalid credential), aux=0, type=StatusJob, from [log in to unmask]
>>>>>> 04/10/2012 09:40:17;0080;PBS_Server;Req;req_reject;Reject reply code=15021(Invalid credential), aux=0, type=StatusJob, from [log in to unmask]
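
Reject entries like these can be tallied by code to see which failure dominates. This is a rough sketch that assumes each record sits on a single line in the server log file; the function name `count_rejects` is hypothetical and the log location is site-specific.

```shell
#!/bin/sh
# Hypothetical helper: count req_reject entries per reply code.
# Reads the named log files, or standard input when no file is given.
# Prints "<code> <count>" lines, sorted by code.
count_rejects() {
    grep 'req_reject' "$@" \
        | sed -n 's/.*code=\([0-9][0-9]*\).*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{print $2, $1}'
}
# Example: count_rejects /path/to/server_logs/20120410
```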
>>>>>>
>>>>>>
>>>>>> I've added the host to /etc/hosts.equiv and (after that didn't
>>>>>> work) added it in acl_hosts:
>>>>>>
>>>>>> set server scheduling = False
>>>>>> set server acl_host_enable = True
>>>>>> set server acl_hosts = epgr13.ph.bham.ac.uk
>>>>>> set server acl_hosts += localhost
>>>>>> set server acl_hosts += epgr02.ph.bham.ac.uk
>>>>>> set server managers = [log in to unmask]
>>>>>> set server operators = [log in to unmask]
>>>>>> set server operators += [log in to unmask]
>>>>>>
>>>>>>
>>>>>> I'm guessing I'm missing some security setting somewhere but
>>>>>> Google isn't helping me find it!
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Mark
>>>
>
--
Steve Jones [log in to unmask]
System Administrator office: 220
High Energy Physics Division tel (int): 42334
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 2334
University of Liverpool http://www.liv.ac.uk/physics/hep/