This ?
https://ggus.eu/index.php?mode=ticket_info&ticket_id=113745
raul
On 04/06/15 14:55, Gareth Roy wrote:
> Hi All,
>
> We’ve recently built a new ARC v5.0 CE fronting our HTCondor farm and have run into a really strange issue that I’m wondering if anyone has seen or could shed some light on.
>
> Everything seems reasonably okay, but at present we’re failing Steve Lloyd’s tests because the jobs keep being held by Condor with an error that looks like:
>
>> "Error from slot1@node128: STARTER at 10.141.0.128 failed to send file(s) to <10.141.255.19:41710>; SHADOW at 10.141.255.19 failed to write to file /var/spool/arc/grid01/46NMDmT1iKmnbbfC3pqhhxZmABFKDmABFKDmkDQKDmABFKDmKUkTam.comment: (errno 13) Permission denied"
>
> Effectively, the condor_shadow process loses permission to write stdout to the ARC comment file. Initially we thought this might be a poor interaction between the new ARC and the WMS, but we’ve been unable to replicate the error through our own job submission. Looking in more depth, it really appears to be the condor_shadow process itself that is changing the permissions on the file. By strace’ing the process we can see (byte transfers have been removed for clarity):
>
>> [pid 23223] recvfrom(6, "_condor_stderr\0", 15, 0, NULL, NULL) = 15
>
> Receiving _condor_stderr from the WN
>
>> [pid 23223] open("/var/spool/arc/grid04/FoaNDmz3hKmnbbfC3pqhhxZmABFKDmABFKDm4hGKDmABFKDmpvOUln.comment", O_WRONLY <unfinished ...>
>> [pid 23223] <... open resumed> ) = 10
>> [pid 23223] fstat(10, {st_mode=S_IFREG|0755, st_size=0, ...}) = 0
> Opening the file for writing with permission of 0755
>
>> [pid 23223] ioctl(10, SNDCTL_TMR_TIMEBASE or TCGETS <unfinished ...>
>> [pid 23223] <... ioctl resumed> , 0x7fff8a6518e0) = -1 ENOTTY (Inappropriate ioctl for device)
> Not finding the file to be present, so creating the file and writing data
>
>> [pid 23223] <... close resumed> ) = 0
>
> Closing the file
>
>> [pid 23223] chmod("/var/spool/arc/grid04/FoaNDmz3hKmnbbfC3pqhhxZmABFKDmABFKDm4hGKDmABFKDmpvOUln.comment", 0400 <unfinished ...>
> For some reason it then decides that the file permissions should now be 0400. I have no idea why!
>
>> [pid 23223] <... recvfrom resumed> "_condor_stdout\0", 15, 0, NULL, NULL) = 15
> Receiving _condor_stdout from the WN
>
>> [pid 23223] open("/var/spool/arc/grid04/FoaNDmz3hKmnbbfC3pqhhxZmABFKDmABFKDm4hGKDmABFKDmpvOUln.comment", O_WRONLY <unfinished ...>
>> [pid 23223] <... open resumed> ) = -1 EACCES (Permission denied)
>
> Attempting to open the file for writing, but since it was previously set to 0400 the open fails. The result is a hung job in the batch system, which we end up having to kill because it stays stuck no matter how we try to interact with it. For some reason it appears to happen only with Steve’s jobs: we’ve tried submitting a set of tests as the epic VO via svr022 and that all seems to work fine, as does normal direct job submission. My concern is that there is something here that our simple WMS tests are missing but that could cause issues for other small VOs (especially as we are looking to retire our last CREAM-CE).
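
The sequence in the trace (write, close, chmod to 0400, reopen for writing) can be replayed outside Condor to confirm that the chmod alone explains the EACCES. This is only a minimal sketch: the function name and the temporary file are made up for illustration, and it says nothing about why condor_shadow issues the chmod. When run as root the second open succeeds, because root bypasses the permission check.

```python
import errno
import os
import stat
import tempfile


def reproduce_shadow_sequence(path):
    """Replay the write/close/chmod/open sequence seen in the strace.

    Returns (mode_after_chmod, errno_of_second_open); errno is 0 on success.
    """
    # Step 1: write the first stream (stands in for _condor_stderr).
    with open(path, "w") as f:
        f.write("stderr payload\n")

    # Step 2: the chmod seen in the trace, leaving the file owner read-only.
    os.chmod(path, 0o400)
    mode = stat.S_IMODE(os.stat(path).st_mode)

    # Step 3: second open for writing (stands in for _condor_stdout).
    try:
        fd = os.open(path, os.O_WRONLY)
    except PermissionError as exc:
        return mode, exc.errno
    os.close(fd)
    return mode, 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        mode, err = reproduce_shadow_sequence(os.path.join(tmp, "job.comment"))
        print(oct(mode), errno.errorcode.get(err, "OK"))
```

As an unprivileged user the second element should be errno 13 (EACCES), matching the "Permission denied" in the hold reason.
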
>
> It must be some sort of interaction between ARC and the rest of the components, as the condor packages are the same on all three of our CEs; the only difference is that this one runs ARC version 5.0 rather than 4.2. Perhaps some file permissions are being propagated back from the WN? I have to admit I’m completely confused as to why this is happening. Potentially I’m missing something simple, but permissions etc. all look fine, and we only seem to be seeing it with this particular payload and only on this CE.
>
> If anyone has any suggestions it would be greatly appreciated.
>
> Thanks,
>
> Gareth
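
To see which jobs are affected, one option is to scan the spool directory for .comment files that have lost their owner-write bit. The sketch below is only an illustration: the function name is hypothetical, and the /var/spool/arc path is taken from the error message quoted above, so adjust it to the local session directory layout.

```python
import fnmatch
import os
import stat


def find_readonly_comment_files(spool_dir):
    """Return sorted (path, octal mode) pairs for *.comment files
    whose owner-write bit is cleared."""
    hits = []
    for root, _dirs, files in os.walk(spool_dir):
        for name in fnmatch.filter(files, "*.comment"):
            path = os.path.join(root, name)
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            # A file left at 0400 (as in the strace) has no S_IWUSR bit.
            if not mode & stat.S_IWUSR:
                hits.append((path, oct(mode)))
    return sorted(hits)


if __name__ == "__main__":
    # Path assumed from the quoted error message.
    for path, mode in find_readonly_comment_files("/var/spool/arc"):
        print(mode, path)
```

The scan only calls lstat, so it changes nothing on disk and can be run while jobs are in flight.
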