Hi,

This problem has been solved in torque 2.5.10.
It is caused by a read() call being interrupted by delivery of a SIGCHLD
signal, which is why the log shows "Interrupted system call".

It is easy to fix but requires patching the server.
I am attaching the fix we provided to Adaptive Computing. Small
modifications will be needed to adapt it for 2.5.7.

I'm CCing Stever Trylen, who is responsible for maintaining the torque
packages in EPEL.

Between 2.5.9 and 2.5.10 the unmunge procedures on the server side were
rewritten to use popen instead of fork+exec calls, so please use the
patch as a reference.

To fix it, SIGCHLD must be blocked (masked) for the duration of the
while(read()) loop.
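
A minimal sketch of that idea, for illustration only. This is not the
attached patch: the helper name read_all_sigchld_blocked and its
signature are hypothetical, and it assumes a single-threaded server
(in a threaded program pthread_sigmask would be the call to use).
It blocks SIGCHLD with sigprocmask, reads until EOF, and then restores
the previous signal mask. It also retries on EINTR in case some other
signal interrupts the read.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical helper (not the actual TORQUE code): read from fd until
 * EOF or until buf is full, with SIGCHLD blocked so that a child exiting
 * cannot interrupt read(). Returns bytes read, or -1 on error. */
ssize_t read_all_sigchld_blocked(int fd, char *buf, size_t cap)
{
    sigset_t block, saved;
    size_t total = 0;
    ssize_t n = 0;
    int saved_errno;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);

    /* Block SIGCHLD and remember the previous mask. */
    if (sigprocmask(SIG_BLOCK, &block, &saved) != 0)
        return -1;

    while (total < cap) {
        n = read(fd, buf + total, cap - total);
        if (n > 0) {
            total += (size_t)n;
            continue;
        }
        if (n == 0)
            break;              /* EOF */
        if (errno == EINTR)
            continue;           /* interrupted by another signal: retry */
        break;                  /* real error, reported below */
    }

    /* Restore the old mask; a pending SIGCHLD is delivered after this. */
    saved_errno = errno;
    sigprocmask(SIG_SETMASK, &saved, NULL);
    errno = saved_errno;

    return (n < 0) ? -1 : (ssize_t)total;
}
```

Because the signal is only blocked and not ignored, the SIGCHLD handler
still runs once the mask is restored, so child reaping is delayed rather
than lost.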

I hope that helps.

Cheers
--
Lukasz Flis
ACC Cyfronet AGH



On 08.02.2012 10:38, Andrew Lahiff wrote:
> Hi,
>
> When running qsub multiple times manually (or when qsub is run by CEs),
> occasionally I get:
>
> qsub: Invalid credential
>
> and in the log on the batch server is this:
>
> 02/08/2012 06:36:38;0080;PBS_Server;Req;req_reject;Reject reply
> code=15012(PBS_Server System error: Interrupted system call MSG=error
> reading unmunge data), aux=0, type=AlternateUserAuthentication, from
> [log in to unmask]
>
> Similarly, worker nodes also randomly have the same problem:
>
> 02/08/2012 06:35:32;0080;PBS_Server;Req;req_reject;Reject reply
> code=15012(PBS_Server System error: Interrupted system call MSG=error
> reading unmunge data), aux=0, type=AlternateUserAuthentication, from
> [log in to unmask]
>
> Is this a known or expected problem with torque 2.5.7-7? It's a UMD
> torque server currently with 112 glite 3.2 worker nodes, all with the
> same version of torque and munge 0.5.8-8.el5.
>
> I'm just using the default munge configuration. Should I try increasing
> the number of munge threads on the torque server, or is that not likely
> to be the cause of the problem?
>
> Regards,
>
> Andrew.
>