complement... dc2-grid-21 is also ARC 5. It's been in production running
CMS without problems for several weeks. Yet Steve Lloyd's tests fail,
as does anything that I submit through the WMS servers at IC. Same behaviour
that you reported.
raul
On 04/06/15 15:12, RAUL H C LOPES wrote:
> This ?
>
> https://ggus.eu/index.php?mode=ticket_info&ticket_id=113745
>
> raul
> On 04/06/15 14:55, Gareth Roy wrote:
>> Hi All,
>>
>> We’ve recently built a new ARC v5.0 CE fronting our HTCondor farm and
>> have run into a really strange issue that I’m wondering if anyone has
>> seen or could shed some light on.
>>
>> Everything seems reasonably okay but at present we’re failing Steve
>> Lloyd’s tests as the jobs keep being held by Condor due to an error
>> that looks like:
>>
>>> "Error from slot1@node128: STARTER at 10.141.0.128 failed to send
>>> file(s) to <10.141.255.19:41710>; SHADOW at 10.141.255.19 failed to
>>> write to file
>>> /var/spool/arc/grid01/46NMDmT1iKmnbbfC3pqhhxZmABFKDmABFKDmkDQKDmABFKDmKUkTam.comment:
>>> (errno 13) Permission denied"
>>
>> Effectively the condor_shadow process stops having permission to
>> write stdout to the ARC comment file. Initially we thought this might
>> be a poor interaction with the new ARC and the WMS but we’ve been
>> unable to replicate the error through our own job submission. Looking
>> in more depth, it really appears to be the condor_shadow process that
>> is changing the permissions on the file. By strace’ing the
>> process we can see (byte transfers have been removed for clarity):
>>
>>> [pid 23223] recvfrom(6, "_condor_stderr\0", 15, 0, NULL, NULL) = 15
>>
>> Receiving _condor_stderr from the WN
>>
>>> [pid 23223]
>>> open("/var/spool/arc/grid04/FoaNDmz3hKmnbbfC3pqhhxZmABFKDmABFKDm4hGKDmABFKDmpvOUln.comment",
>>> O_WRONLY <unfinished ...>
>>> [pid 23223] <... open resumed> ) = 10
>>> [pid 23223] fstat(10, {st_mode=S_IFREG|0755, st_size=0, ...}) = 0
>> Opening the file for writing; fstat reports its mode as 0755
>>
>>> [pid 23223] ioctl(10, SNDCTL_TMR_TIMEBASE or TCGETS <unfinished ...>
>>> [pid 23223] <... ioctl resumed> , 0x7fff8a6518e0) = -1 ENOTTY
>>> (Inappropriate ioctl for device)
>> Checking whether the descriptor is a terminal (it is not, hence ENOTTY),
>> then writing data to the already-open file
>>
>>> [pid 23223] <... close resumed> ) = 0
>>
>> Closing the file
>>
>>> [pid 23223]
>>> chmod("/var/spool/arc/grid04/FoaNDmz3hKmnbbfC3pqhhxZmABFKDmABFKDm4hGKDmABFKDmpvOUln.comment",
>>> 0400 <unfinished ...>
>> For some reason deciding that the file permissions should now be
>> 0400, I have no idea why!
>>
>>> [pid 23223] <... recvfrom resumed> "_condor_stdout\0", 15, 0, NULL,
>>> NULL) = 15
>> Receiving _condor_stdout from the WN
>>
>>> [pid 23223]
>>> open("/var/spool/arc/grid04/FoaNDmz3hKmnbbfC3pqhhxZmABFKDmABFKDm4hGKDmABFKDmpvOUln.comment",
>>> O_WRONLY <unfinished ...>
>>> [pid 23223] <... open resumed> ) = -1 EACCES (Permission denied)
>>
>> Attempting to open the file to write, but since it was previously set
>> to 0400 we can’t and we fail. The result is that we now have
>> a hung job in the batch system which we end up having to kill, as no
>> matter how much we try to interact with it, it’s stuck. It appears
>> that it’s only happening with Steve’s jobs for some reason: we’ve
>> tried submitting a set of tests as the epic VO via svr022 and that
>> all seems to work fine, as does normal direct job submission. My
>> concern is that there is something here that our simple WMS tests are
>> missing but that will potentially cause issues for other small VOs
>> (especially as we are looking to retire our last CREAM-CE).
>>
>> It must be some sort of interaction between the ARC and the rest of
>> the components, as the condor packages are the same on all three of
>> our CEs; the only difference is that this is ARC version 5.0 rather than
>> 4.2. Perhaps some file permissions are being back propagated from the
>> WN? I have to admit I’m completely confused as to why this is
>> happening. Potentially I’m missing something simple, but permissions
>> etc. all look fine and we only seem to be seeing it with this
>> particular payload and only on this CE.
>>
>> If anyone has any suggestions it would be greatly appreciated.
>>
>> Thanks,
>>
>> Gareth
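
For anyone who wants to see the failure mode outside Condor, the traced
sequence (open, write, close, chmod to 0400, open again) can be replayed on a
scratch file. This is only a minimal sketch in Python: the function name
replay_shadow_sequence and the file name job.comment are made up for
illustration, and it says nothing about why condor_shadow issues the chmod.

```python
import errno
import os
import stat
import tempfile


def replay_shadow_sequence(directory):
    """Replay the traced open/close/chmod/open sequence on a scratch file.

    Returns (mode_after_chmod, errno_of_second_open_or_None).
    """
    # Hypothetical stand-in for the ARC ".comment" file.
    path = os.path.join(directory, "job.comment")

    # Pre-create the file with mode 0755, as fstat reported in the trace.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o755)
    os.close(fd)
    os.chmod(path, 0o755)  # make the mode independent of the umask

    # First transfer (_condor_stderr): open for writing, write, close.
    fd = os.open(path, os.O_WRONLY)
    os.write(fd, b"stderr payload\n")
    os.close(fd)

    # The unexpected step seen in the trace: chmod to 0400.
    os.chmod(path, 0o400)
    mode = stat.S_IMODE(os.stat(path).st_mode)

    # Second transfer (_condor_stdout): the same open is attempted again.
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as exc:
        return mode, exc.errno
    os.close(fd)
    # Only reached when the caller bypasses mode bits (for example root).
    return mode, None


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        mode, err = replay_shadow_sequence(scratch)
        print("mode after chmod: %04o" % mode)
        if err is None:
            print("second open succeeded")
        else:
            print("second open failed:", errno.errorcode.get(err, err))
```

Run as an ordinary user, the second open should fail with EACCES (errno 13),
matching the trace; run as root it succeeds, because root bypasses the mode
bits.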