On 10 January 2011 12:45, Peter W. Draper <[log in to unmask]> wrote: > On Mon, 10 Jan 2011, David Berry wrote: > >> On 7 January 2011 18:32, Peter W. Draper <[log in to unmask]> wrote: >>> >>> On Fri, 7 Jan 2011, Mark Taylor wrote: >>> >>>> On Fri, 7 Jan 2011, Peter W. Draper wrote: >>>> >>>>>> Having done all that ... running the JNIAST unit tests is now giving >>>>>> me occasional core dumps :-(. This is probably a threading issue - the >>>>>> SEGVs are in uk.ac.starlink.ast.ThreadTest. I've attached an example dump >>>>>> file. Difficult to reproduce, currently happening something like one time in >>>>>> ten, but in the way of these things it seems to go away if you just run the >>>>>> test ten times. I don't *think* this problem is present in the build before >>>>>> these changes. It looks to me unlikely that the changes I've just made to >>>>>> JNIAST would cause this (though I can't say it's impossible), which suggests >>>>>> it's to do with how it's built. The builds I've done have been against >>>>>> hawaiki (AST 5.3-1) or nanahope (AST 5.2-0) - I don't have a 5.1-0 to hand. >>>>>> So, it could be version issues. I suspect however that it might go away if >>>>>> it's built against an AST built without threading. Although I >>>>>> believe/believed that JNIAST ought to work with a multithreaded AST, it >>>>>> doesn't actually benefit much from multithreaded AST (see comments in >>>>>> jniast.h for the sorry tale), so this wouldn't have much of a downside (and >>>>>> the previous build might well have been against a threadless AST, though I'm >>>>>> not sure if it was or not). If we can't resolve this we should put the >>>>>> changes back on a branch or otherwise roll them back - it is not safe to >>>>>> leave them without a native library rebuild, since the java and C sides of >>>>>> the JNI code do not quite match when using the old native libs. >>>>> >>>>> Hi Mark & David, >>>>> >>>>> OK, I've rebuilt JNIAST against the latest AST (5.4) and this problem >>>>> is still there, and it's repeatable as in I get to the same line of code >>>>> each time, and it seems likely to be the same one that Mark's dump shows: >>>>> >>>>> 1225: MAKE_RESAMPLEX(I,int,jint,Int,I) >>>>> >>>>> in Mapping.c. >>>>> >>>>> Must be worth one of you looking at that macro for any shared resources >>>>> that are not being locked (there are some structs malloc outside of any >>>>> locks I can see, or David worrying about astResample, but I imagine that is >>>>> heavily used?). >>>>> >>>>> I'd prefer not to require a non-threaded AST library as that will mean >>>>> we cannot build against a standard tree (that is bound to come back as a >>>>> problem). >>>> >>>> I did a bit more investigation, and it turns out that contrary to my >>>> assessment above, the same issue can show up in the existing JNIAST, so this >>>> has been lurking for a while, i.e. before the FitsChan-related fixes >>>> suggested by David. I'm not sure why I haven't seen the failure before, but >>>> likely it's just that the last time I was doing JNIAST work I had a >>>> different machine and it just never hit the relevant race condition. On the >>>> bright side, as far as I know nobody has come across this issue in real life >>>> - probably nobody is doing massively multithreaded JNIAST. >>>> >>>> So, the new build is (most likely) no worse than the one we had before. >>>> It's quite possible that a non-threaded AST wouldn't even help. >>> >>> Sorry should have at least tested that idea, but it seems getting a >>> non-threaded AST isn't possible: >>> >>> ./configure --without-pthreads >>> make >>> ..... >>> channel.c: In function 'ReadDouble': >>> channel.c:2796: error: 'AST__BAD' undeclared (first use in this >>> function) >>> channel.c:2796: error: (Each undeclared identifier is reported only once >>> channel.c:2796: error: for each function it appears in.) >>> channel.c: In function 'WriteDouble': >>> channel.c:3989: error: 'AST__BAD' undeclared (first use in this >>> function) >>> make[1]: *** [libast_la-channel.lo] Error 1 >>> make[1]: Leaving directory >>> `/loc/pwda/pdraper/starlink_git/starlink/libraries/ast' >>> make: *** [all] Error 2 >> >> A few missing includes - should be OK now. > > And the result is that this error is also present with an unthreaded AST > build as well. So this problem is probably not related to AST threads. > > Given that I'd suggest that we just build against threaded AST and Mark > takes a note to look at this sometime. Fine with me. David