When you wish to keep track of the average and the 'estimated' standard
deviation in a stream of data, you need to carry only:
  n        = count of observations so far (often simply i, the number
             of the latest trial)
  sum_x    = sum(x(i))
  sum_x_sq = sum(x(i)^2)

At any time, with the count at n (and n at least 2):

  x_bar     = sum_x/n
  est_var   = (sum_x_sq - sum_x^2/n)/(n-1)
  est_stdev = sqrt(est_var) = sqrt[(sum_x_sq - sum_x^2/n)/(n-1)]
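
As a rough illustration, the three running totals can be carried in a small Python class. This is only a sketch: the class name RunningStats and its method names are made up for this example, and the sample data are arbitrary.

```python
import math


class RunningStats:
    """Carries only n, sum_x and sum_x_sq (hypothetical helper class)."""

    def __init__(self):
        self.n = 0           # count of observations so far
        self.sum_x = 0.0     # running sum of x(i)
        self.sum_x_sq = 0.0  # running sum of x(i)^2

    def add(self, x):
        """Fold one new observation into the running totals."""
        self.n += 1
        self.sum_x += x
        self.sum_x_sq += x * x

    def average(self):
        return self.sum_x / self.n

    def est_var(self):
        # Computational formula; needs n >= 2 because of the (n-1) divisor.
        return (self.sum_x_sq - self.sum_x ** 2 / self.n) / (self.n - 1)

    def est_stdev(self):
        # Clamp at zero: rounding can push a tiny variance slightly negative.
        return math.sqrt(max(0.0, self.est_var()))


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # arbitrary sample data
    stats.add(x)

print(stats.n, stats.average(), stats.est_var(), stats.est_stdev())
```

Nothing but the three totals is stored, so memory use does not grow with the number of trials. When the average is very large compared with the spread, the subtraction inside est_var can lose precision in floating point.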
This equation is usually given in intro stat books as the 'computational
formula' or 'direct formula.' It is mathematically equal to the
'definitional formula,' est_var = sum((x(i) - x_bar)^2)/(n-1), which
requires that you know the average before computing the standard
deviation.
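
The quoted question asks for an update that uses only X[N+1], Xbar[N] and Xvar[N]. One possible sketch is below. It uses the update commonly attributed to Welford, which is not the sum-of-squares formula given in this message. It assumes Xvar is the (n-1)-denominator estimate, and the function name update is made up for this example.

```python
def update(n, mean, var, x_new):
    """Return (n+1, Xbar[N+1], Xvar[N+1]) from n, Xbar[N], Xvar[N], X[N+1].

    var is assumed to be the estimated variance with an (n-1) divisor.
    """
    n_new = n + 1
    # Same running mean as (N*Xbar[N] + X[N+1])/(N+1), written as a correction.
    mean_new = mean + (x_new - mean) / n_new
    if n == 0:
        # One observation only: report 0.0 as a placeholder, since the
        # estimated variance is not defined until there are two.
        return n_new, mean_new, 0.0
    # Recover the sum of squared deviations, then add the new term.
    m2 = (n - 1) * var + (x_new - mean) * (x_new - mean_new)
    var_new = m2 / (n_new - 1)
    return n_new, mean_new, var_new


n, mean, var = 0, 0.0, 0.0
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # arbitrary sample data
    n, mean, var = update(n, mean, var, x)

print(n, mean, var)
```

This form never subtracts two large sums, so it tends to be less sensitive to rounding than carrying sum_x_sq. In exact arithmetic it gives the same est_var as the computational formula.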
Now, a minor point of language:
Why do I say 'average' instead of 'mean', 'estimated variance' instead
of 'variance,' and 'estimated standard deviation' instead of simply,
'standard deviation'?
Because I believe you mean the defined 'population' to be all the
possible simulation trials that you could run, under whatever 'boundary'
conditions you select. This includes the simulations you _will_ run, or
_could_ run, and the ones that you want to talk about, or predict. This
population is infinite.
If I am right on this, then you are calculating the average and est.
stdev on the (finite) sample you collected, and you will use these
values to _estimate_ the mean and (true) standard deviation of the total
population.
Obviously, if you have lots and lots of simulation data, those estimates
will be pretty darned close :) But not exact.
My apologies for belaboring such a basic point to you and the list, but
in my experience a little care in terminology here will save a lot of
misinformed carping later - especially from managers.
Cheers,
Jay
Trevor Carpenter wrote:
>May I first of all point out that I am not a statistician so I am sorry if my
>question is not in the usual notation or has an obvious answer.
>
>I am running many millions of simulations and want to calculate the mean and
>variance of the population after each of the separate trials. It is not
>practical to store all the results and then calculate the mean and standard
>deviation from the entire population. I have derived a formula to calculate the
>mean after iteration N+1, Xbar[N+1], from the new value X[N+1] and the mean
>after the previous iteration Xbar[N]:
>
>Xbar[N+1]=(N*Xbar[N]+X[N+1])/(N+1)
>
>I believe this is called the running mean.
>
>I have been trying to solve the same problem for the variance of the
>population, that is to find a formula for the running variance. I have
>attempted to derive it myself, and have been to the medical library to consult
>the statistics books but to no avail.
>Is there a method for obtaining the variance at point N+1 given that I know:
>
>X[N+1];
>Xbar[N];
>Xbar[N+1];
>Xvar[N];
>
>I have thought about storing the sum of squares etc but this does not seem very
>elegant and may not be computationally stable (rounding error).
>
>I would be grateful if someone could provide me with an answer to
>this question even if it is that it has no solution.
>
>Trevor Carpenter
>
--
Jay Warner
Principal Scientist
Warner Consulting, Inc.
4444 North Green Bay Road
Racine, WI 53404-1216
USA
Ph: (262) 634-9100
FAX: (262) 681-1133
web: http://www.a2q.com
The A2Q Method (tm) -- What do you want to improve today?