On 10 January 2011 12:45, Peter W. Draper <[log in to unmask]> wrote:
> On Mon, 10 Jan 2011, David Berry wrote:
>
>> On 7 January 2011 18:32, Peter W. Draper <[log in to unmask]> wrote:
>>>
>>> On Fri, 7 Jan 2011, Mark Taylor wrote:
>>>
>>>> On Fri, 7 Jan 2011, Peter W. Draper wrote:
>>>>
>>>>>> Having done all that ... running the JNIAST unit tests is now giving
>>>>>> me occasional core dumps :-(. This is probably a threading issue - the
>>>>>> SEGVs are in uk.ac.starlink.ast.ThreadTest. I've attached an example dump
>>>>>> file. Difficult to reproduce, currently happening something like one time in
>>>>>> ten, but in the way of these things it seems to go away if you just run the
>>>>>> test ten times. I don't *think* this problem is present in the build before
>>>>>> these changes. It looks to me unlikely that the changes I've just made to
>>>>>> JNIAST would cause this (though I can't say it's impossible), which suggests
>>>>>> it's to do with how it's built. The builds I've done have been against
>>>>>> hawaiki (AST 5.3-1) or nanahope (AST 5.2-0) - I don't have a 5.1-0 to hand.
>>>>>> So, it could be version issues. I suspect however that it might go away if
>>>>>> it's built against an AST built without threading. Although I
>>>>>> believe/believed that JNIAST ought to work with a multithreaded AST, it
>>>>>> doesn't actually benefit much from multithreaded AST (see comments in
>>>>>> jniast.h for the sorry tale), so this wouldn't have much of a downside (and
>>>>>> the previous build might well have been against a threadless AST, though I'm
>>>>>> not sure if it was or not). If we can't resolve this we should put the
>>>>>> changes back on a branch or otherwise roll them back - it is not safe to
>>>>>> leave them without a native library rebuild, since the java and C sides of
>>>>>> the JNI code do not quite match when using the old native libs.
>>>>>
>>>>> Hi Mark & David,
>>>>>
>>>>> OK, I've rebuilt JNIAST against the latest AST (5.4) and this problem
>>>>> is still there, and it's repeatable as in I get to the same line of code
>>>>> each time, and it seems likely to be the same one that Mark's dump shows:
>>>>>
>>>>> 1225: MAKE_RESAMPLEX(I,int,jint,Int,I)
>>>>>
>>>>> in Mapping.c.
>>>>>
>>>>> Must be worth one of you looking at that macro for any shared resources
>>>>> that are not being locked (there are some structs malloc outside of any
>>>>> locks I can see, or David worrying about astResample, but I imagine that is
>>>>> heavily used?).
>>>>>
>>>>> I'd prefer not to require a non-threaded AST library as that will mean
>>>>> we cannot build against a standard tree (that is bound to come back as a
>>>>> problem).
>>>>
>>>> I did a bit more investigation, and it turns out that contrary to my
>>>> assessment above, the same issue can show up in the existing JNIAST, so this
>>>> has been lurking for a while, i.e. before the FitsChan-related fixes
>>>> suggested by David. I'm not sure why I haven't seen the failure before, but
>>>> likely it's just that the last time I was doing JNIAST work I had a
>>>> different machine and it just never hit the relevant race condition. On the
>>>> bright side, as far as I know nobody has come across this issue in real life
>>>> - probably nobody is doing massively multithreaded JNIAST.
>>>>
>>>> So, the new build is (most likely) no worse than the one we had before.
>>>> It's quite possible that a non-threaded AST wouldn't even help.
>>>
>>> Sorry should have at least tested that idea, but it seems getting a
>>> non-threaded AST isn't possible:
>>>
>>> ./configure --without-pthreads
>>> make
>>> .....
>>> channel.c: In function 'ReadDouble':
>>> channel.c:2796: error: 'AST__BAD' undeclared (first use in this
>>> function)
>>> channel.c:2796: error: (Each undeclared identifier is reported only once
>>> channel.c:2796: error: for each function it appears in.)
>>> channel.c: In function 'WriteDouble':
>>> channel.c:3989: error: 'AST__BAD' undeclared (first use in this
>>> function)
>>> make[1]: *** [libast_la-channel.lo] Error 1
>>> make[1]: Leaving directory
>>> `/loc/pwda/pdraper/starlink_git/starlink/libraries/ast'
>>> make: *** [all] Error 2
>>
>> A few missing includes - should be OK now.
>
> And the result is that this error is also present with an unthreaded AST
> build as well. So this problem is probably not related to AST threads.
>
> Given that I'd suggest that we just build against threaded AST and Mark
> takes a note to look at this sometime.
Fine with me.
David
|