On Fri, 7 Jan 2011, Mark Taylor wrote:
> On Fri, 7 Jan 2011, Peter W. Draper wrote:
>
>>> Having done all that ... running the JNIAST unit tests is now giving
>>> me occasional core dumps :-(. This is probably a threading issue -
>>> the SEGVs are in uk.ac.starlink.ast.ThreadTest. I've attached an
>>> example dump file. Difficult to reproduce, currently happening
>>> something like one time in ten, but in the way of these things it
>>> seems to go away if you just run the test ten times. I don't *think*
>>> this problem is present in the build before these changes. It looks
>>> to me unlikely that the changes I've just made to JNIAST would cause
>>> this (though I can't say it's impossible), which suggests it's to do
>>> with how it's built. The builds I've done have been against hawaiki
>>> (AST 5.3-1) or nanahope (AST 5.2-0) - I don't have a 5.1-0 to hand.
>>> So, it could be version issues. I suspect however that it might go
>>> away if it's built against an AST built without threading. Although I
>>> believe/believed that JNIAST ought to work with a multithreaded AST,
>>> it doesn't actually benefit much from multithreaded AST (see comments
>>> in jniast.h for the sorry tale), so this wouldn't have much of a
>>> downside (and the previous build might well have been against a
>>> threadless AST, though I'm not sure if it was or not). If we can't
>>> resolve this we should put the changes back on a branch or otherwise
>>> roll them back - it is not safe to leave them without a native library
>>> rebuild, since the java and C sides of the JNI code do not quite match
>>> when using the old native libs.
>>
>> Hi Mark & David,
>>
>> OK, I've rebuilt JNIAST against the latest AST (5.4) and this problem
>> is still there, and it's repeatable as in I get to the same line of
>> code each time, and it seems likely to be the same one that Mark's dump
>> shows:
>>
>> 1225: MAKE_RESAMPLEX(I,int,jint,Int,I)
>>
>> in Mapping.c.
>>
>> Must be worth one of you looking at that macro for any shared resources
>> that are not being locked (there are some structs malloc outside of any
>> locks I can see, or David worrying about astResample, but I imagine
>> that is heavily used?).
>>
>> I'd prefer not to require a non-threaded AST library as that will mean
>> we cannot build against a standard tree (that is bound to come back as
>> a problem).
>
> I did a bit more investigation, and it turns out that contrary to my
> assessment above, the same issue can show up in the existing JNIAST, so
> this has been lurking for a while, i.e. before the FitsChan-related
> fixes suggested by David. I'm not sure why I haven't seen the failure
> before, but likely it's just that the last time I was doing JNIAST work
> I had a different machine and it just never hit the relevant race
> condition. On the bright side, as far as I know nobody has come across
> this issue in real life - probably nobody is doing massively
> multithreaded JNIAST.
>
> So, the new build is (most likely) no worse than the one we had before.
> It's quite possible that a non-threaded AST wouldn't even help.
Sorry should have at least tested that idea, but it seems getting a
non-threaded AST isn't possible:
./configure --without-pthreads
make
.....
channel.c: In function 'ReadDouble':
channel.c:2796: error: 'AST__BAD' undeclared (first use in this function)
channel.c:2796: error: (Each undeclared identifier is reported only once
channel.c:2796: error: for each function it appears in.)
channel.c: In function 'WriteDouble':
channel.c:3989: error: 'AST__BAD' undeclared (first use in this function)
make[1]: *** [libast_la-channel.lo] Error 1
make[1]: Leaving directory
`/loc/pwda/pdraper/starlink_git/starlink/libraries/ast'
make: *** [all] Error 2
Naturally Ok without "--without-pthreads".
> Another point: the test in question calls astResample in loads of
> threads at once just to see if there's trouble. I picked astResample
> only since it's an AST function that takes a reasonable amount of CPU
> time to run. So, the problem may not be specifically to do with
> the MAKE_RESAMPLEX macro, it may be to do with the JNIAST framework
> in general (which was really what that test was designed to look for).
>
> Mark
>
> --
> Mark Taylor Astronomical Programmer Physics, Bristol University, UK
> [log in to unmask] +44-117-928-8776 http://www.star.bris.ac.uk/~mbt/
>
--
Peter W. Draper, http://astro.dur.ac.uk/~pdraper
|