Print

Print


On 10 January 2011 12:45, Peter W. Draper <[log in to unmask]> wrote:
> On Mon, 10 Jan 2011, David Berry wrote:
>
>> On 7 January 2011 18:32, Peter W. Draper <[log in to unmask]> wrote:
>>>
>>> On Fri, 7 Jan 2011, Mark Taylor wrote:
>>>
>>>> On Fri, 7 Jan 2011, Peter W. Draper wrote:
>>>>
>>>>>> Having done all that ... running the JNIAST unit tests is now giving
>>>>>> me occasional core dumps :-(.  This is probably a threading issue - the
>>>>>> SEGVs are in uk.ac.starlink.ast.ThreadTest.  I've attached an example dump
>>>>>> file. Difficult to reproduce, currently happening something like one time in
>>>>>> ten, but in the way of these things it seems to go away if you just run the
>>>>>> test ten times.  I don't *think* this problem is present in the build before
>>>>>> these changes.  It looks to me unlikely that the changes I've just made to
>>>>>> JNIAST would cause this (though I can't say it's impossible), which suggests
>>>>>> it's to do with how it's built. The builds I've done have been against
>>>>>> hawaiki (AST 5.3-1) or nanahope (AST 5.2-0) - I don't have a 5.1-0 to hand.
>>>>>> So, it could be version issues.  I suspect however that it might go away if
>>>>>> it's built against an AST built without threading.  Although I
>>>>>> believe/believed that JNIAST ought to work with a multithreaded AST, it
>>>>>> doesn't actually benefit much from multithreaded AST (see comments in
>>>>>> jniast.h for the sorry tale), so this wouldn't have much of a downside (and
>>>>>> the previous build might well have been against a threadless AST, though I'm
>>>>>> not sure if it was or not).  If we can't resolve this we should put the
>>>>>> changes back on a branch or otherwise roll them back - it is not safe to
>>>>>> leave them without a native library rebuild, since the java and C sides of
>>>>>> the JNI code do not quite match when using the old native libs.
>>>>>
>>>>> Hi Mark & David,
>>>>>
>>>>> OK, I've rebuilt JNIAST against the latest AST (5.4) and this problem
>>>>> is still there, and it's repeatable as in I get to the same line of code
>>>>> each time, and it seems likely to be the same one that Mark's dump shows:
>>>>>
>>>>>  1225: MAKE_RESAMPLEX(I,int,jint,Int,I)
>>>>>
>>>>> in Mapping.c.
>>>>>
>>>>> Must be worth one of you looking at that macro for any shared resources
>>>>> that are not being locked (there are some structs malloc outside of any
>>>>> locks I can see, or David worrying about astResample, but I imagine that is
>>>>> heavily used?).
>>>>>
>>>>> I'd prefer not to require a non-threaded AST library as that will mean
>>>>> we cannot build against a standard tree (that is bound to come back as a
>>>>> problem).
>>>>
>>>> I did a bit more investigation, and it turns out that contrary to my
>>>> assessment above, the same issue can show up in the existing JNIAST, so this
>>>> has been lurking for a while, i.e. before the FitsChan-related fixes
>>>> suggested by David.  I'm not sure why I haven't seen the failure before, but
>>>> likely it's just that the last time I was doing JNIAST work I had a
>>>> different machine and it just never hit the relevant race condition.  On the
>>>> bright side, as far as I know nobody has come across this issue in real life
>>>> - probably nobody is doing massively multithreaded JNIAST.
>>>>
>>>> So, the new build is (most likely) no worse than the one we had before.
>>>> It's quite possible that a non-threaded AST wouldn't even help.
>>>
>>> Sorry should have at least tested that idea, but it seems getting a
>>> non-threaded AST isn't possible:
>>>
>>>  ./configure --without-pthreads
>>>  make
>>>    .....
>>>  channel.c: In function 'ReadDouble':
>>>  channel.c:2796: error: 'AST__BAD' undeclared (first use in this
>>> function)
>>>  channel.c:2796: error: (Each undeclared identifier is reported only once
>>>  channel.c:2796: error: for each function it appears in.)
>>>  channel.c: In function 'WriteDouble':
>>>  channel.c:3989: error: 'AST__BAD' undeclared (first use in this
>>> function)
>>>  make[1]: *** [libast_la-channel.lo] Error 1
>>>  make[1]: Leaving directory
>>>  `/loc/pwda/pdraper/starlink_git/starlink/libraries/ast'
>>>  make: *** [all] Error 2
>>
>> A few missing includes - should be OK now.
>
> And the result is that this error is also present with an unthreaded AST
> build as well. So this problem is probably not related to AST threads.
>
> Given that I'd suggest that we just build against threaded AST and Mark
> takes a note to look at this sometime.

Fine with me.

David