----------
> Most readers of this list are probably not interested in this, but I wanted to
> follow up my last posting:
>
> First, after I sent my message I realized that I had not noticed a couple of
> Robin Vowel's comments that I want to respond to:
>
> > But the code *was* changed for Ariane 5. Protection was applied to
> > other similar conversions in the vicinity of the conversion that actually
> > overflowed.
>
> The report does not state when the protection was applied. I have always
> assumed that these decisions were made when the code was developed for the
> Ariane 4.
Unlikely to be a valid assumption.
> Any thoughtful person, as part of the development for the Ariane 4,
> would have examined the code for potential overflows; no mention was made of
> having performed this review twice, and the decisions made were obviously more
> applicable to the Ariane 4 than to the Ariane 5.
You'll find that the Report refers to decisions made about the code
in relation to Ariane 5.
> Of course my assumption might
> be incorrect and if there is any contrary information I would be glad to hear
> of it.
>
> > The addition of protection of overflow (a single instruction) could not have
> > made any significant difference to the running time of the SRI computer.
>
> By itself this argument is not conclusive. The impact of an instruction
> depends on how often that context is executed
The SRI computers had to process other data besides the particular one that
overflowed. Recall that there are about 8 such conversions
(from double precision floating-point to integer) in the segment of code
that failed. That should give some inkling of the amount of work that was
being carried out. Details are in the Report.
> and the computational resources
> on rockets are more limited than one might expect. While I would be surprised
> that adding this instruction would increase the computational requirements by
> more than 1%, there were two other unprotected conversions, and many other
> computational changes to decrease computer load. Any increase in the load
> means either the potential of deadlock, or more difficulty in adding other
> functionality later.
> A better solution would have been to shut down this part of the software upon
> launch, and to have the default handler check whether the other computer was
> still operating before shutting down the computer that generated the
> conversion error.
Even better would be to have both an interrupt handler and
a specific protection, to ensure fail-safe operation.
This is a basic requirement of real-time fail-safe programming,
which the Ariane team failed to grasp.
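
To make the combination concrete, here is a minimal sketch in Python (not
the flight code, which was not written in Python). All names in it are
hypothetical. It assumes a conversion that traps on overflow, a handler
that first checks whether the other computer is still running, and a
clamped conversion as the last-resort fallback.

```python
from typing import Optional, Tuple

INT16_MIN, INT16_MAX = -32768, 32767


def convert_protected(value: float) -> int:
    """Specific protection: clamp the value into the 16-bit range."""
    return int(max(INT16_MIN, min(INT16_MAX, value)))


def convert_unprotected(value: float) -> int:
    """Stand-in for a conversion that traps when the value does not fit."""
    if not (INT16_MIN <= value <= INT16_MAX):
        raise OverflowError(f"{value} does not fit in 16 bits")
    return int(value)


def run_cycle(value: float, partner_alive: bool) -> Tuple[Optional[int], str]:
    """Run one conversion cycle and return (output, status)."""
    try:
        return convert_unprotected(value), "ok"
    except OverflowError:
        # Handler: shut down only if the other computer can take over.
        if partner_alive:
            return None, "shutdown"
        # No backup left: keep running on the clamped value instead.
        return convert_protected(value), "degraded"
```

With these assumptions, run_cycle(40000.0, partner_alive=False) keeps the
unit alive and returns (32767, "degraded") rather than stopping it.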
> > In any case, error trapping and recovery needed to be provided for.
>
> The potential for recovery on a launch system is very limited. If the system
> is performing a useful function having it latch to a fixed value means that it
> is getting only the same incorrect value.
Yes, but the alternative was total catastrophe.
In that particular case, with the value exceeding the maximum
(32767), probably the maximum value would have been close enough
to the actual value not to have made any significant difference.
The other option was to use a 32-bit word to hold the integer value,
which would have needed re-programming -- probably requiring
more cycles than the one or two extra instructions
needed for the protection test.
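
As an illustration of what such a protection test amounts to, here is a
small Python sketch of a saturating conversion to a 16-bit signed integer.
The function name is hypothetical, and truncation toward zero for in-range
values is an assumption, not something the Report specifies.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_saturating(value: float) -> int:
    """Convert a float to a 16-bit signed integer, clamping at the limits."""
    if value >= INT16_MAX:
        return INT16_MAX  # too large: substitute the maximum
    if value <= INT16_MIN:
        return INT16_MIN  # too small: substitute the minimum
    # In range: truncate toward zero. A NaN fails both tests above and
    # makes int() raise ValueError, so it is not silently accepted.
    return int(value)
```

The two comparisons are the whole cost of the protection; an out-of-range
input yields the nearest representable value instead of a trap.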
> It is likely if such a value is
> generated as part of the guidance system that the system will continue off
> course (more gradually) and still have to be destroyed.
The greater acceleration of Ariane 5 only occurred in the early parts
of the takeoff (compared to Ariane 4). If the rocket got as far
as it did without exceeding the capacity of the 16-bit word,
the rest of the flight probably would have been handled OK too,
just using the maximum value instead of the actual value
(which would have needed a 32-bit word).
We are not told what alterations were made to the program for subsequent
launches.
> In fact in this case
> ignoring the error and not trying to recover something better might have been
> appropriate as the system was not performing a useful function at the time the
> error was generated. What must be done is analyze the behavior of the system
> in sufficient detail that no overflow can occur for the properly functioning
> system.
These were the same mistakes made by the Ariane team.
All code needs to have error-handling to provide a fail-safe fall-back
operation -- just in case something was overlooked.
All code needs to have protection.
It is absolutely essential to test the program with data.
No amount of analysis is a substitute for good test runs.
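
A test run of this kind can be very simple. The Python sketch below feeds a
range of input values through an unprotected conversion and records which
ones overflow. The function names and the input ramp are hypothetical; the
numbers are made up for illustration and are not taken from the Report.

```python
from typing import Callable, Iterable, List

INT16_MIN, INT16_MAX = -32768, 32767


def convert_unprotected(value: float) -> int:
    """Stand-in for a 16-bit conversion that traps on overflow."""
    if not (INT16_MIN <= value <= INT16_MAX):
        raise OverflowError(f"{value} does not fit in 16 bits")
    return int(value)


def find_overflows(convert: Callable[[float], int],
                   samples: Iterable[float]) -> List[float]:
    """Run every sample through convert and collect the ones that trap."""
    failures = []
    for sample in samples:
        try:
            convert(sample)
        except OverflowError:
            failures.append(sample)
    return failures


# Hypothetical input ramp, deliberately extended past the 16-bit limit.
ramp = [float(v) for v in range(0, 50001, 5000)]
failures = find_overflows(convert_unprotected, ramp)
```

Any non-empty result means that the data representing the new vehicle
takes the conversion out of range, which is the thing a test run is meant
to reveal before launch.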
> I agree with Ladkin. The real problem with the mission was a problem of
> requirements. There was no review to determine whether and how the Ariane 4
> requirements might conflict with the Ariane 5 requirement.
See the Report. The analysis was done.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%