Hello *,
I'm aware of another "interrupted system call" problem which probably
affects interactive jobs:
pbs_mom[]: LOG_ERROR::Interrupted system call (4) in TMomFinalizeChild,
cannot get termtype
This message comes from the /var/log/messages log file.
So my question to all LCG-ROLLOUT members:
Have you noticed any problems with interactive jobs on your clusters?
Best Regards,
--
Lukasz Flis
> This problem has been solved in torque 2.5.10.
> The problem is caused by a read() call being interrupted by delivery of
> a SIGCHLD signal.
>
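To illustrate the failure mode described above: a blocking read() that is interrupted by a signal handler returns -1 with errno set to EINTR, which is error 4, the "Interrupted system call" seen in the logs. One common way to tolerate this is simply to retry the call. The sketch below shows that general technique only; it is not taken from the attached patch or from the torque sources, and the function name read_retry_eintr is made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not torque code): read from fd until EOF or until
 * the buffer is full, retrying whenever read() is interrupted by a signal.
 * Returns the number of bytes read, or -1 on a real error.
 */
ssize_t read_retry_eintr(int fd, char *buf, size_t cap)
{
    size_t total = 0;

    assert(buf != NULL);

    while (total < cap) {
        ssize_t n = read(fd, buf + total, cap - total);

        if (n > 0) {
            total += (size_t)n;   /* got data, keep reading */
        } else if (n == 0) {
            break;                /* EOF */
        } else if (errno == EINTR) {
            continue;             /* interrupted by a signal: retry */
        } else {
            return -1;            /* genuine I/O error */
        }
    }

    return (ssize_t)total;
}
```

Retrying leaves signal delivery untouched, so a SIGCHLD handler would still run on time; whether that is preferable to masking the signal depends on the rest of the server code.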
> It is easy to fix but requires patching the server.
> I am attaching the fix we provided to Adaptive Computing. Small
> modifications will be needed to adapt it for 2.5.7.
>
> I'm CCing Steve Traylen, who's responsible for maintaining the torque
> packages in EPEL.
>
> Between 2.5.9 and 2.5.10, the unmunge procedures on the server side were
> rewritten to use popen instead of fork+exec calls, so please use the
> patch as a reference only.
>
> To fix it, one should mask (block) SIGCHLD for the duration of the
> while(read()) loop.
>
> I hope that helps
>
> Cheers
> --
> Lukasz Flis
> ACC Cyfronet AGH
>
>
>
> On 08.02.2012 10:38, Andrew Lahiff wrote:
>> Hi,
>>
>> When running qsub multiple times manually (or when qsub is run by CEs),
>> occasionally I get:
>>
>> qsub: Invalid credential
>>
>> and in the log on the batch server is this:
>>
>> 02/08/2012 06:36:38;0080;PBS_Server;Req;req_reject;Reject reply
>> code=15012(PBS_Server System error: Interrupted system call MSG=error
>> reading unmunge data), aux=0, type=AlternateUserAuthentication, from
>> [log in to unmask]
>>
>> Similarly, worker nodes also randomly have the same problem:
>>
>> 02/08/2012 06:35:32;0080;PBS_Server;Req;req_reject;Reject reply
>> code=15012(PBS_Server System error: Interrupted system call MSG=error
>> reading unmunge data), aux=0, type=AlternateUserAuthentication, from
>> [log in to unmask]
>>
>> Is this a known or expected problem with torque 2.5.7-7? It's a UMD
>> torque server currently with 112 glite 3.2 worker nodes, all with the
>> same version of torque and munge 0.5.8-8.el5.
>>
>> I'm just using the default munge configuration. Should I try increasing
>> the number of munge threads on the torque server, or is that not likely
>> to be the cause of the problem?
>>
>> Regards,
>>
>> Andrew.
>>
>
>