On Tue, 24 Feb 2009, Mark Taylor wrote:
>> On Tue, 24 Feb 2009, Mark Taylor wrote:
>>
>>>> Don't see any updates to JNIAST, did you mean to merge them from the
>>>> development branch?
>>>
>>> ack, you're right. Try again (8537).
>>
>> Well, that error message has gone away, but there's clearly something else
>> going on. What I occasionally see is the SPLAT plot windows going blank,
>> that is, no visible UI at all, with no CPU being used. So you've got
>> to suspect some kind of deadlock. Running up a debugger and attaching to the
>> process I can see two likely threads that show:
>>
>> #0 0x00002b33f4b24b04 in __lll_lock_wait () from /lib/libpthread.so.0
>> #1 0x00002b33f4b201a0 in _L_lock_102 () from /lib/libpthread.so.0
>> #2 0x00002b33f4b1faae in pthread_mutex_lock () from /lib/libpthread.so.0
>> #3 0x00002aaac2d754cf in ManageLock (this=0x2aaac0413d98, mode=1,
>> extra=1, fail=0x412a4400, status=0x412a4524) at object.c:2354
>> #4 0x00002aaac2baa833 in ManageLock (this_object=0x2aaac0413df8,
>> mode=1, extra=1, fail=0x412a4400, status=0x412a4524) at frame.c:6162
>> #5 0x00002aaac2b40d69 in ManageLock (this_object=0x2aaac0413df8,
>> mode=1, extra=1, fail=0x412a4400, status=0x412a4524) at cmpframe.c:3936
>> #6 0x00002aaac2bbc55a in ManageLock (this_object=0x2aaac59edc38,
>> mode=1, extra=1, fail=0x412a4400, status=0x412a4524) at frameset.c:5623
>> #7 0x00002aaac2d74e7f in astLockId_ (this_id=0x148ee1, wait=1,
>> status=0x412a4524) at object.c:6060
>> #8 0x00002aaac2aed7b4 in jniastLock (ast_objs=0x2aaabd191740) at
>> jniast.c:441
>> #9 0x00002aaac2b0a598 in Java_uk_ac_starlink_ast_Mapping_tran2
>> (env=0x2aaac83bad98, this=0x412a4610, npoint=4, jXin=0x412a4600,
>> jYin=0x412a45f8, forward=0 '\0') at Mapping.c:407
>> #10 0x00002aaaab7a16cf in ?? ()
>> #11 0x00000000412a45a0 in ?? ()
>> #12 0x00002b33f59f0950 in typeArrayKlass::allocate () from
>> /loc/pwdc/pdraper/jvms/jdk1.6.0_07/jre/lib/amd64/server/libjvm.so
>> #13 0x00002aaaab795009 in ?? ()
>> #14 0x0000000000000000 in ?? ()
>>
>> and:
>>
>> #0 0x00002b33f4b24b04 in __lll_lock_wait () from /lib/libpthread.so.0
>> #1 0x00002b33f4b201a0 in _L_lock_102 () from /lib/libpthread.so.0
>> #2 0x00002b33f4b1faae in pthread_mutex_lock () from /lib/libpthread.so.0
>> #3 0x00002aaac2d75499 in ManageLock (this=0x2aaac0413d98, mode=1,
>> extra=1, fail=0x41724d90, status=0x41724e78) at object.c:2332
>> #4 0x00002aaac2baa833 in ManageLock (this_object=0x2aaac0413dd0,
>> mode=1, extra=1, fail=0x41724d90, status=0x41724e78) at frame.c:6162
>> #5 0x00002aaac2b40d69 in ManageLock (this_object=0x2aaac0413dd0,
>> mode=1, extra=1, fail=0x41724d90, status=0x41724e78) at cmpframe.c:3936
>> #6 0x00002aaac2d74e7f in astLockId_ (this_id=0x101b80, wait=1,
>> status=0x41724e78) at object.c:6060
>> #7 0x00002aaac2aed7b4 in jniastLock (ast_objs=0x2aaac565f030) at
>> jniast.c:441
>> #8 0x00002aaac2aed358 in jniastMakeObject (env=0x2aaabc614d98,
>> objptr=0x101b80) at jniast.c:364
>> #9 0x00002aaac2affd65 in Java_uk_ac_starlink_ast_FrameSet_getFrame
>> (env=0x2aaabc614d98, this=0x41725010, iframe=-1) at FrameSet.c:91
>> #10 0x00002aaaab7a16cf in ?? ()
>> #11 0x0000000041724fc0 in ?? ()
>> #12 0x0000000041724f98 in ?? ()
>> #13 0x0000000041724fa0 in ?? ()
>> #14 0x0000000041724fa8 in ?? ()
>> #15 0x0000000000000000 in ?? ()
>>
>> Sure looks like a deadlock. If so, I can't say I'm surprised: having a
>> thread that loads data and another that re-draws the UI in response to new
>> data running concurrently seems likely to cause this.
>
> In principle it should (I believe) be OK. At the JNIAST level, every
> time I lock objects I do it in a defined order: ascending order of
> AstObject * pointer. All my locking is done like this:
>
> (no objects locked)
> work out which objects will be required
> lock them all in order
> do AST stuff
> unlock them all
> (no objects locked)
>
> from the discussion of deadlocks at
>
> http://en.wikipedia.org/wiki/Dining_philosophers_problem#Resource_hierarchy_solution
>
> I believe that this is in principle sufficient to ensure that no deadlocks
> occur. However, thinking about it, there may be complications:
>
> 1. Does the AstObject * pointer uniquely identify an object to be
> locked? I know AST is tricky, but I'm not quite sure how tricky.
> If the AstObject * is not actually the thing which identifies
> the locked object (e.g. different AstObject * values can point to
> the same lockable object) this won't do the trick.
>
> 2. I think that astLock(ptr,1), as well as acquiring a lock on the
> object ptr, also acquires locks on all the objects contained within
> it. If those locks are not acquired in order, and moreover in
> the same order that I'm using outside of AST, the scheme will fail.
>
> My guess is that fixing up these 1 or 2 points is somewhere between
> difficult and impossible (but it's worth David bearing these
> considerations in mind in case other threaded-AST users attempt to
> do something similar).
>
> So, unless David says different:
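Maybe there's a bug in the ordering of locks, but given that AST deals
with graphs of objects to lock (the backtraces show ManageLock recursing
down through contained objects), the order in which those get locked might
be a bit fuzzy, so point 2 looks the likely cause.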
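
For the archive, here is a minimal sketch of the ordered-locking scheme
Mark describes, assuming plain pthread mutexes rather than the real astLock
calls. The names obj_t, lock_all and unlock_all are made up for
illustration and are not JNIAST or AST code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an AST object: nothing but a mutex. */
typedef struct {
    pthread_mutex_t mutex;
} obj_t;

/* Compare two obj_t pointers by address, ascending. */
static int cmp_by_address(const void *a, const void *b) {
    uintptr_t pa = (uintptr_t)*(obj_t *const *)a;
    uintptr_t pb = (uintptr_t)*(obj_t *const *)b;
    return (pa > pb) - (pa < pb);
}

/* Sort the array in place, then lock each distinct object in ascending
 * address order. Returns the number of distinct objects locked. */
static size_t lock_all(obj_t **objs, size_t n) {
    size_t locked = 0;
    qsort(objs, n, sizeof *objs, cmp_by_address);
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && objs[i] == objs[i - 1]) continue; /* skip duplicate */
        pthread_mutex_lock(&objs[i]->mutex);
        locked++;
    }
    return locked;
}

/* Unlock in reverse order; expects the array as sorted by lock_all. */
static void unlock_all(obj_t **objs, size_t n) {
    for (size_t i = n; i-- > 0;) {
        if (i > 0 && objs[i] == objs[i - 1]) continue; /* skip duplicate */
        pthread_mutex_unlock(&objs[i]->mutex);
    }
}
```

This only avoids deadlock if each lock call takes exactly one mutex. If a
single call also locks contained objects in some other order, as in Mark's
point 2, the outer ordering no longer gives that guarantee.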
>> Maybe it's back to plan A?
>
> ... I give in. I'll re-instate the per-AST monolithic lock
> (#define JNIAST_THREADS 0).
>
> Given that, do we have a working SPLAT?
Yes, with that change SPLAT now passes all the tests that previously
produced the deadlock.
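
As a rough illustration of why a single lock sidesteps the problem, here
is a sketch of a monolithic lock using plain pthreads. It is an assumption
about the general idea only, not what JNIAST_THREADS 0 actually compiles
to, and every name in it (global_lock, with_global_lock, bump, worker,
run_two_workers) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* One process-wide mutex guarding every call into the library. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Shared state that is only touched while global_lock is held. */
static long counter = 0;

/* Run one piece of work while holding the single global lock. */
static void with_global_lock(void (*work)(void *), void *arg) {
    pthread_mutex_lock(&global_lock);
    work(arg);
    pthread_mutex_unlock(&global_lock);
}

/* Stand-in for "do AST stuff": increment the shared counter. */
static void bump(void *arg) {
    (void)arg;
    counter++;
}

/* Each worker performs 100000 guarded calls. */
static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        with_global_lock(bump, arg);
    }
    return NULL;
}

/* Start two workers, wait for both, and return the final count. */
static long run_two_workers(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

With only one lock there is no lock ordering to get wrong, so this kind of
deadlock cannot arise. The cost is that the loader and redraw threads can
never be inside the library at the same time.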
> Note, I'm still doing JNIAST updates for the AST 4.1->5.1 upgrades, so
> it's not worth doing lib rebuilds and super-exhaustive tests just yet.
Understood.
Cheers,
Peter.