On Mon, 10 Jan 2011, David Berry wrote:
> On 7 January 2011 18:32, Peter W. Draper <[log in to unmask]> wrote:
>> On Fri, 7 Jan 2011, Mark Taylor wrote:
>>
>>> On Fri, 7 Jan 2011, Peter W. Draper wrote:
>>>
>>>>> Having done all that ... running the JNIAST unit tests is now giving
>>>>> me occasional core dumps :-(. This is probably a threading issue -
>>>>> the SEGVs are in uk.ac.starlink.ast.ThreadTest. I've attached an
>>>>> example dump file. Difficult to reproduce, currently happening
>>>>> something like one time in ten, but in the way of these things it
>>>>> seems to go away if you just run the test ten times. I don't
>>>>> *think* this problem is present in the build before these changes.
>>>>> It looks to me unlikely that the changes I've just made to JNIAST
>>>>> would cause this (though I can't say it's impossible), which
>>>>> suggests it's to do with how it's built. The builds I've done have
>>>>> been against hawaiki (AST 5.3-1) or nanahope (AST 5.2-0) - I don't
>>>>> have a 5.1-0 to hand. So, it could be version issues. I suspect
>>>>> however that it might go away if it's built against an AST built
>>>>> without threading. Although I believe/believed that JNIAST ought to
>>>>> work with a multithreaded AST, it doesn't actually benefit much from
>>>>> multithreaded AST (see comments in jniast.h for the sorry tale), so
>>>>> this wouldn't have much of a downside (and the previous build might
>>>>> well have been against a threadless AST, though I'm not sure if it
>>>>> was or not). If we can't resolve this we should put the changes
>>>>> back on a branch or otherwise roll them back - it is not safe to
>>>>> leave them without a native library rebuild, since the java and C
>>>>> sides of the JNI code do not quite match when using the old native
>>>>> libs.
>>>>
>>>> Hi Mark & David,
>>>>
>>>> OK, I've rebuilt JNIAST against the latest AST (5.4) and this problem
>>>> is still there, and it's repeatable as in I get to the same line of
>>>> code each time, and it seems likely to be the same one that Mark's
>>>> dump shows:
>>>>
>>>> 1225: MAKE_RESAMPLEX(I,int,jint,Int,I)
>>>>
>>>> in Mapping.c.
>>>>
>>>> Must be worth one of you looking at that macro for any shared
>>>> resources that are not being locked (there are some structs malloc
>>>> outside of any locks I can see, or David worrying about astResample,
>>>> but I imagine that is heavily used?).
>>>>
>>>> I'd prefer not to require a non-threaded AST library as that will
>>>> mean we cannot build against a standard tree (that is bound to come
>>>> back as a problem).
>>>
>>> I did a bit more investigation, and it turns out that contrary to my
>>> assessment above, the same issue can show up in the existing JNIAST,
>>> so this has been lurking for a while, i.e. before the FitsChan-related
>>> fixes suggested by David. I'm not sure why I haven't seen the failure
>>> before, but likely it's just that the last time I was doing JNIAST
>>> work I had a different machine and it just never hit the relevant race
>>> condition. On the bright side, as far as I know nobody has come
>>> across this issue in real life - probably nobody is doing massively
>>> multithreaded JNIAST.
>>>
>>> So, the new build is (most likely) no worse than the one we had
>>> before. It's quite possible that a non-threaded AST wouldn't even
>>> help.
>>
>> Sorry should have at least tested that idea, but it seems getting a
>> non-threaded AST isn't possible:
>>
>> ./configure --without-pthreads
>> make
>> .....
>> channel.c: In function 'ReadDouble':
>> channel.c:2796: error: 'AST__BAD' undeclared (first use in this function)
>> channel.c:2796: error: (Each undeclared identifier is reported only once
>> channel.c:2796: error: for each function it appears in.)
>> channel.c: In function 'WriteDouble':
>> channel.c:3989: error: 'AST__BAD' undeclared (first use in this function)
>> make[1]: *** [libast_la-channel.lo] Error 1
>> make[1]: Leaving directory
>> `/loc/pwda/pdraper/starlink_git/starlink/libraries/ast'
>> make: *** [all] Error 2
>
> A few missing includes - should be OK now.
And the result is that this error is also present with an unthreaded AST
build as well. So this problem is probably not related to AST threads.
Given that I'd suggest that we just build against threaded AST and Mark
takes a note to look at this sometime.
Peter.
|