Dear all,

I want to share the answers I received to my questions regarding data given as sufficient statistics (that is, the sample mean, sample variance and number of data points n for normal data, or GM, GSD and n for log normal data). I first summarize my questions and update you on my a posteriori thoughts, conditioned on the feedback you gave me :)
In short, my first question was: how do I express the likelihood functions for the sample mean and sample variance in BUGS syntax?
I first tried to set a N(mu, sigma^2/n) likelihood on the sample mean yhat and an Inverse-Chi^2(n-1, sigma^2) likelihood on the sample variance S^2. After giving this some more thought, that distribution for the sample variance S^2 is obviously wrong (I had swapped the roles of the sample variance and the true variance). Given the known result for normal samples:

(n-1) S^2 / sigma^2 ~ Chi^2(n-1)
I arrive at

1/S^2 ~ (n-1) / (sigma^2 * Chi^2(n-1)), i.e. 1/S^2 ~ Scale-Inv-Chi^2(n-1, 1/sigma^2), or equivalently S^2 ~ Gamma((n-1)/2, (n-1)/(2*sigma^2)), which in BUGS syntax is S2 ~ dgamma((n-1)/2, (n-1)/(2*sigma2)).
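That dgamma parameterization can be sanity-checked by simulation. Below is a small sketch in Python (not BUGS, which I cannot check this way; numpy and all variable names here are my own choices): simulated sample variances of normal data should match draws from Gamma((n-1)/2, (n-1)/(2*sigma^2)).

```python
import numpy as np

# Check: for normal data, S^2 should follow Gamma(shape=(n-1)/2, rate=(n-1)/(2*sigma^2)),
# which is what S2 ~ dgamma((n-1)/2, (n-1)/(2*sigma2)) says (BUGS uses a rate parameter).
rng = np.random.default_rng(1)
n, sigma2 = 10, 4.0
reps = 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2 = samples.var(axis=1, ddof=1)          # sample variances, one per replicate

shape = (n - 1) / 2                        # Gamma shape
rate = (n - 1) / (2 * sigma2)              # Gamma rate
gamma_draws = rng.gamma(shape, 1 / rate, size=reps)  # numpy takes scale = 1/rate

# Both Monte Carlo means should be close to E[S^2] = sigma^2
print(s2.mean(), gamma_draws.mean())
```

Comparing the two empirical distributions (means, variances, quantiles) is a quick way to catch exactly the kind of shape/rate mix-up described above.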
I then set priors as usual on mu and sigma^2. This, however, produces the wrong result for sigma^2 (though the result for mu is correct). In short, most of the responses I got only confirmed that it should theoretically be possible to use only the sufficient statistics, not how this is done in BUGS. If someone finds any flaws in my reasoning or has any other ideas, please get in touch with me :) In the meantime I'll stick to my own Gibbs sampler.
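For concreteness, here is a minimal sketch of the kind of Gibbs sampler that works from sufficient statistics only. This is my own illustration, not the author's actual code, and it assumes a non-hierarchical normal model with a N(mu0, tau0^2) prior on mu and an Inverse-Gamma(a0, b0) prior on sigma^2. The key identity is sum_i (y_i - mu)^2 = (n-1)*s2 + n*(ybar - mu)^2, so the full conditionals need only ybar, s2 and n.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs(ybar, s2, n, mu0=0.0, tau0sq=100.0, a0=0.01, b0=0.01,
          iters=20_000, burn=2_000):
    """Gibbs sampler for (mu, sigma2) seeing only the sufficient statistics."""
    mu, sigma2 = ybar, s2                  # simple starting values
    mus, sig2s = [], []
    for t in range(iters):
        # mu | sigma2, data: conjugate normal update (depends on ybar, n only)
        prec = 1 / tau0sq + n / sigma2
        mean = (mu0 / tau0sq + n * ybar / sigma2) / prec
        mu = rng.normal(mean, np.sqrt(1 / prec))
        # sigma2 | mu, data: inverse-gamma update, using
        # sum (y_i - mu)^2 = (n-1)*s2 + n*(ybar - mu)^2
        a = a0 + n / 2
        b = b0 + 0.5 * ((n - 1) * s2 + n * (ybar - mu) ** 2)
        sigma2 = 1 / rng.gamma(a, 1 / b)   # Inv-Gamma draw via reciprocal Gamma
        if t >= burn:
            mus.append(mu)
            sig2s.append(sigma2)
    return np.array(mus), np.array(sig2s)

mus, sig2s = gibbs(ybar=5.0, s2=4.0, n=50)
print(mus.mean(), sig2s.mean())
```

With vague priors the posterior means land near ybar and s2, as expected; the point is simply that no raw data ever enters the sampler.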
Thank you, Mr Hahn, for suggesting the Bayes Linear approach. I am only a little familiar with the topic. It seems like quite a different methodology, but I will definitely look more into its applications.
My second question was: if my data (assumed log normally distributed) is given as Mean, SD and n, do I lose any information when transforming these statistics to GM, GSD using formulas (which again assume log normally distributed data)? Thanks to Mr Parkhurst for an excellent article on the use of the biased statistics GM, GSD when summarizing concentrations where mass balance is an issue (you could contact him to get the article). As for the conversion of Mean, SD to GM, GSD using the formulas, I now figure it is OK since we assume log normal data anyway.
My original mail and some of the answers I got:
Dear All,
In my field (radioecological risk assessment) we encounter parameters in the environment that are highly variable. In addition, there is often a lack of data for specific situations. I am using Bayesian methods, and especially hierarchical models, to compensate for the lack of data and the large uncertainty by using prior information.
Most of the data we encounter (taken from e.g. the literature) is given as geometric mean GM, geometric standard deviation GSD and number of data points n (assuming log normality). That is, we often don't have the raw measurements.
I have written a Gibbs sampler for updating a normal hierarchical model (without regression variables) that takes these sufficient statistics (i.e. GM, GSD and n) as input. My question is whether there is any way to define a normal model accepting only sufficient statistics in the BUGS language.
I experimented with a variation (here explained non-hierarchically) where I assigned yhat.data (the sample mean) a N(mu, sigma^2/n) likelihood and s2.data (the sample variance) an Inverse Chi-square(n-1, sigma^2) likelihood (both taken from the theoretical sampling distributions of the sample mean and variance). The mean mu and variance sigma^2 are then assigned priors as usual. This method seems to produce reasonable results (though I have not assessed it extensively yet), but the results still differ somewhat from my semi-analytical approach (the Gibbs sampler using analytically derived expressions for the conditional posteriors, depending on just yhat, s2 and n). Is there any major flaw in my approach of assigning "two likelihoods" for the sample mean and sample variance? I also understand there is a way to define custom distributions in BUGS (using e.g. the "zeros trick"). I may be talking through my hat here, but is it also possible to define custom likelihoods (depending only on sufficient statistics)?
My second question is not specific to Bayesian methods but becomes relevant when analysing log normal data: sometimes my data is given in terms of Mean, SD (i.e. not GM, GSD). I then need to calculate GM, GSD using formulas which assume that the data are log normal:
GM* = Mean / sqrt(1 + CV^2)
GSD* = exp(sqrt(ln(1 + CV^2))), where CV = SD/Mean is the coefficient of variation
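These two formulas are exact for a true lognormal distribution, which is easy to verify numerically. A sketch in Python (function and variable names are mine, for illustration only):

```python
import math

def to_gm_gsd(mean, sd):
    """Convert Mean, SD to GM, GSD under a lognormal assumption."""
    cv2 = (sd / mean) ** 2                          # squared coefficient of variation
    gm = mean / math.sqrt(1 + cv2)                  # GM* = Mean / sqrt(1 + CV^2)
    gsd = math.exp(math.sqrt(math.log(1 + cv2)))    # GSD* = exp(sqrt(ln(1 + CV^2)))
    return gm, gsd

# Check against exact lognormal moments: if log(y) ~ N(mu, sigma^2), then
# Mean = exp(mu + sigma^2/2) and CV^2 = exp(sigma^2) - 1, and the formulas
# should recover GM = exp(mu) and GSD = exp(sigma) exactly.
mu, sigma = 0.3, 0.8
mean = math.exp(mu + sigma**2 / 2)
sd = mean * math.sqrt(math.exp(sigma**2) - 1)
gm, gsd = to_gm_gsd(mean, sd)
print(gm, gsd)   # -> exp(0.3), exp(0.8) up to floating point
```

So any discrepancy seen in practice comes from plugging in *sample* Mean and SD (and from data that is not exactly lognormal), not from the formulas themselves.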
I figure that I lose a substantial amount of information about the sample when doing this transformation. When doing some simulations in Matlab I noticed that the values of GM* and GSD* are often very different from the GM, GSD computed from the actual sample, even when the sample size is extremely large. Are there any alternative approaches to this problem?
Best regards,
Kristofer Stenberg
Facilia AB
***
Hi Kristofer,
It is well known that the posterior distribution depends on the data only
through the sufficient statistics. In other words, if X=(X_1,...,X_n) denotes
the raw data and S=S(X) is a vector of (minimal) sufficient statistics, then p(theta|X)=p(theta|S), i.e., the posterior density of the parameter theta given the entire raw data is the same as the posterior density of theta given only the sufficient statistic S.
So if you specify the model (in BUGS) using only the sufficient statistics, the inference will be identical to that of the model specified using the raw data, under the assumed statistical model.
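This sufficiency property is easy to see concretely in the conjugate known-variance normal model. The sketch below (my own illustrative setup: known sigma^2, N(mu0, tau0^2) prior on mu) computes the posterior twice, once from the raw data and once from (ybar, n), and gets identical parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2, mu0, tau0sq = 2.0, 0.0, 10.0     # known variance, prior mean/variance
y = rng.normal(3.0, np.sqrt(sigma2), size=25)

def posterior_from_raw(y):
    """Posterior (mean, variance) of mu computed from the full raw sample."""
    n = len(y)
    prec = 1 / tau0sq + n / sigma2
    mean = (mu0 / tau0sq + y.sum() / sigma2) / prec
    return mean, 1 / prec

def posterior_from_stats(ybar, n):
    """Same posterior, computed from the sufficient statistic (ybar, n) only."""
    prec = 1 / tau0sq + n / sigma2
    mean = (mu0 / tau0sq + n * ybar / sigma2) / prec
    return mean, 1 / prec

print(posterior_from_raw(y))
print(posterior_from_stats(y.mean(), len(y)))  # same numbers
```

The raw data enters the posterior only through y.sum() = n*ybar, which is exactly the sufficiency statement p(theta|X) = p(theta|S).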
**
Hi Kristofer,
Below are my thoughts, which may or may not help. I hope they do.
Question 1:
I don't see any reason why you should not be able to do the analysis using
WinBUGS with the sufficient statistics as input. After all, the sufficient
statistics should contain all the information available from the data about the
parameters in the model.
Perhaps one difference between your Gibbs sampler and your WinBUGS implementation is an assumption of independence. I do not quite understand your Gibbs sampler, but I assume it uses a JOINT distribution (likelihood) of GM and GSD. I wonder if your WinBUGS implementation may be producing two independent MARGINAL distributions for GM and GSD.
Question 2:
In principle it seems you could partition your likelihood into two components.
When you have GM&GSD you can use a normal likelihood. When you have
Mean&SD you can use a lognormal likelihood. The product of these 2
likelihoods should be the likelihood for the full data set. That way you do not
have to use your approximations.
Best regards,
Dave
**
Be very cautious if you're using geometric means in mass-balance (conservation of mass) applications. See the attached paper.
**
Dear Kristofer,
This does not quite answer the questions you posed below, but perhaps you have heard of Bayes Linear methods? This approach, in essence, leans heavily on summary statistics and various functions thereof. So you might be able to obtain a Bayes Linear posterior analytically, and then sample from it (via Gibbs or even an independence sampler) to answer your questions of interest.
Apologies in advance if you already knew about this.
Best,
Gene