On Jan 7, 2011, at 8:08 AM, Mark Taylor wrote:

> On Fri, 7 Jan 2011, Peter W. Draper wrote:
> 
>>> Having done all that ... running the JNIAST unit tests is now giving me
>>> occasional core dumps :-(.  This is probably a threading issue - the SEGVs
>>> are in uk.ac.starlink.ast.ThreadTest.  I've attached an example dump file.
>>> Difficult to reproduce, currently happening something like one time in ten,
>>> but in the way of these things it seems to go away if you just run the test
>>> ten times.  I don't *think* this problem is present in the build before
>>> these changes.  It looks to me unlikely that the changes I've just made to
>>> JNIAST would cause this (though I can't say it's impossible), which suggests
>>> it's to do with how it's built. The builds I've done have been against
>>> hawaiki (AST 5.3-1) or nanahope (AST 5.2-0) - I don't have a 5.1-0 to hand.
>>> So, it could be version issues.  I suspect however that it might go away if
>>> it's built against an AST built without threading.  Although I
>>> believe/believed that JNIAST ought to work with a multithreaded AST, it
>>> doesn't actually benefit much from multithreaded AST (see comments in
>>> jniast.h for the sorry tale), so this wouldn't have much of a downside (and
>>> the previous build might well have been against a threadless AST, though I'm
>>> not sure if it was or not).  If we can't resolve this we should put the
>>> changes back on a branch or otherwise roll them back - it is not safe to
>>> leave them without a native library rebuild, since the java and C sides of
>>> the JNI code do not quite match when using the old native libs.
>> 
>> Hi Mark & David,
>> 
>> OK, I've rebuilt JNIAST against the latest AST (5.4) and this problem is still
>> there, and it's repeatable as in I get to the same line of code each time, and
>> it seems likely to be the same one that Mark's dump shows:
>> 
>> 1225: MAKE_RESAMPLEX(I,int,jint,Int,I)
>> 
>> in Mapping.c.
>> 
>> Must be worth one of you looking at that macro for any shared resources that
>> are not being locked (there are some structs malloc'd outside of any locks, as
>> far as I can see), or David worrying about astResample (but I imagine that is
>> heavily used?).
>> 
>> I'd prefer not to require a non-threaded AST library as that will mean we
>> cannot build against a standard tree (that is bound to come back as a
>> problem).
> 
> I did a bit more investigation, and it turns out that contrary to my
> assessment above, the same issue can show up in the existing JNIAST, 
> so this has been lurking for a while, i.e. before the FitsChan-related
> fixes suggested by David.  I'm not sure why I haven't
> seen the failure before, but likely it's just that the last time I was 
> doing JNIAST work I had a different machine and it just never hit 
> the relevant race condition.  On the bright side, as far as 
> I know nobody has come across this issue in real life - probably 
> nobody is doing massively multithreaded JNIAST.
> 
> So, the new build is (most likely) no worse than the one we had before.
> It's quite possible that a non-threaded AST wouldn't even help.  
> 
> Another point: the test in question calls astResample in loads of 
> threads at once just to see if there's trouble.  I picked astResample 
> only since it's an AST function that takes a reasonable amount of CPU 
> time to run.  So, the problem may not be specifically to do with 
> the MAKE_RESAMPLEX macro, it may be to do with the JNIAST framework
> in general (which was really what that test was designed to look for).
> 

Is it possible to run the test with valgrind?

The macros can also be expanded prior to compilation (David has a script for that in the AST tree), so that a more useful line number turns up than the single line of the macro invocation.
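
To illustrate why the reported line is not very informative, here is a minimal sketch using a made-up macro, MAKE_SUM (hypothetical, not the real MAKE_RESAMPLEX, which is far larger). Everything generated by one invocation is attributed to the single source line holding that invocation, which __LINE__ makes visible:

```c
#include <stddef.h>

/* Hypothetical function-generating macro, standing in for something like
   MAKE_RESAMPLEX. It defines sum<Suffix>(), which adds up n values and
   returns the source line the compiler attributes the body to. */
#define MAKE_SUM(Suffix, Type) \
    static int sum##Suffix(const Type *data, size_t n, Type *total) { \
        *total = 0; \
        for (size_t i = 0; i < n; i++) *total += data[i]; \
        return __LINE__; \
    }

/* Each invocation sits on one source line, so every statement of the
   generated function maps back to that one line. */
MAKE_SUM(I, int)
MAKE_SUM(D, double)
```

A crash anywhere inside sumI would therefore be reported against the "MAKE_SUM(I, int)" line, much as the dump points at line 1225. Assuming gcc, one way to get per-statement line numbers is to preprocess the file first (for example with "gcc -E -P"), reformat the output so each statement is on its own line, and compile that instead. Whether David's script works this way is not something I know.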

-- 
Tim Jenness
Joint Astronomy Centre