Dear all,

I want to share the answers I received to my questions regarding data given as sufficient statistics (that is, the sample mean, sample variance and number of data points n for normal data, or GM, GSD and n for log normal data). I first summarize my questions and update you on my a posteriori thoughts, conditioned on the feedback you gave me :)
In short, my first question was: how do I express the likelihood functions for the sample mean and sample variance in BUGS syntax?
I first tried to set a N(mu, sigma^2/n) likelihood on the sample mean yhat and an Inverse-Chi^2(n-1, sigma^2) likelihood on the sample variance S^2. After giving this some more thought, that distribution for the sample variance S^2 is obviously wrong (I had swapped the roles of the sample variance and the true variance). Given the known result for normal samples:

(n-1) S^2 / sigma^2 ~ Chi^2(n-1)
I arrive at

1/S^2 ~ (n-1) / (sigma^2 * Chi^2(n-1)), i.e. 1/S^2 ~ Scale-Inv-Chi^2(n-1, 1/sigma^2), or equivalently S^2 ~ Gamma((n-1)/2, (n-1)/(2*sigma^2)), which in BUGS syntax is S2 ~ dgamma((n-1)/2, (n-1)/(2*sigma2)).
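That dgamma parameterization can be sanity-checked by simulation. Below is a small sketch in Python (not BUGS, which I cannot check this way; numpy and all variable names here are my own choices): simulated sample variances of normal data should match draws from Gamma((n-1)/2, (n-1)/(2*sigma^2)).

```python
import numpy as np

# Check: for normal data, S^2 should follow Gamma(shape=(n-1)/2, rate=(n-1)/(2*sigma^2)),
# which is what S2 ~ dgamma((n-1)/2, (n-1)/(2*sigma2)) says (BUGS uses a rate parameter).
rng = np.random.default_rng(1)
n, sigma2 = 10, 4.0
reps = 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2 = samples.var(axis=1, ddof=1)          # sample variances, one per replicate

shape = (n - 1) / 2                        # Gamma shape
rate = (n - 1) / (2 * sigma2)              # Gamma rate
gamma_draws = rng.gamma(shape, 1 / rate, size=reps)  # numpy takes scale = 1/rate

# Both Monte Carlo means should be close to E[S^2] = sigma^2
print(s2.mean(), gamma_draws.mean())
```

Comparing the two empirical distributions (means, variances, quantiles) is a quick way to catch exactly the kind of shape/rate mix-up described above.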
I then set priors as usual on mu and sigma^2. This, however, produces the wrong result for sigma^2 (though the result for mu is correct). In short, most of the responses I got only confirmed that it should theoretically be possible to use only the sufficient statistics, not how this is done in BUGS. If someone finds any flaws in my reasoning or has any other ideas, please get in touch with me :) In the meantime I'll stick to my own Gibbs sampler.
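For concreteness, here is a minimal sketch of the kind of Gibbs sampler that works from sufficient statistics only. This is my own illustration, not the author's actual code, and it assumes a non-hierarchical normal model with a N(mu0, tau0^2) prior on mu and an Inverse-Gamma(a0, b0) prior on sigma^2. The key identity is sum_i (y_i - mu)^2 = (n-1)*s2 + n*(ybar - mu)^2, so the full conditionals need only ybar, s2 and n.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs(ybar, s2, n, mu0=0.0, tau0sq=100.0, a0=0.01, b0=0.01,
          iters=20_000, burn=2_000):
    """Gibbs sampler for (mu, sigma2) seeing only the sufficient statistics."""
    mu, sigma2 = ybar, s2                  # simple starting values
    mus, sig2s = [], []
    for t in range(iters):
        # mu | sigma2, data: conjugate normal update (depends on ybar, n only)
        prec = 1 / tau0sq + n / sigma2
        mean = (mu0 / tau0sq + n * ybar / sigma2) / prec
        mu = rng.normal(mean, np.sqrt(1 / prec))
        # sigma2 | mu, data: inverse-gamma update, using
        # sum (y_i - mu)^2 = (n-1)*s2 + n*(ybar - mu)^2
        a = a0 + n / 2
        b = b0 + 0.5 * ((n - 1) * s2 + n * (ybar - mu) ** 2)
        sigma2 = 1 / rng.gamma(a, 1 / b)   # Inv-Gamma draw via reciprocal Gamma
        if t >= burn:
            mus.append(mu)
            sig2s.append(sigma2)
    return np.array(mus), np.array(sig2s)

mus, sig2s = gibbs(ybar=5.0, s2=4.0, n=50)
print(mus.mean(), sig2s.mean())
```

With vague priors the posterior means land near ybar and s2, as expected; the point is simply that no raw data ever enters the sampler.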
Thank you, Mr Hahn, for suggesting the Bayes Linear approach. I am only a little familiar with the topic. It seems like quite a different methodology, but I will definitely look more into its applications.
My second question was: if my data (assumed log normally distributed) is given as Mean, SD and n, do I lose any information when transforming these statistics to GM, GSD using formulas (which again assume log normally distributed data)? Thanks to Mr Parkhurst for an excellent article on the use of the biased statistics GM, GSD when summarizing concentrations where mass balance is an issue (you could contact him to get the article). As for the conversion of Mean, SD to GM, GSD using the formulas, I now figure it is OK since we assume log normal data anyway.
My original mail and some of the answers I got:
Dear All,
In my field (radioecological risk assessment) we encounter parameters in the environment that are highly variable. In addition, there is often a lack of data for specific situations. I am using Bayesian methods, and especially hierarchical models, to compensate for the lack of data and the large uncertainty by using prior information.
Most of the data we encounter (taken from e.g. the literature) is given as geometric mean GM, geometric standard deviation GSD and number of data points n (assuming log normality). That is, we often don't have the raw measurements.
I have written a Gibbs sampler for updating a normal hierarchical model (without regression variables) that takes these sufficient statistics (i.e. GM, GSD and n) as input. My question is whether there is any way to define a normal model accepting only sufficient statistics in the BUGS language.
I experimented with a variation (here explained non-hierarchically) where I assigned yhat.data (the sample mean) a N(mu, sigma^2/n) likelihood and s2.data (the sample variance) an Inverse Chi-square(n-1, sigma^2) likelihood (both taken from the theoretical sampling distributions of the sample mean and variance). The mean mu and variance sigma^2 are then assigned priors as usual. This method seems to produce reasonable results (though I have not assessed it extensively yet), but the results still differ somewhat from my semi-analytical approach (the Gibbs sampler using analytically derived expressions for the conditional posteriors, depending on just yhat, s2 and n). Is there any major flaw in my approach of assigning "two likelihoods" for the sample mean and sample variance? I also understand there is a way to define custom distributions in BUGS (using e.g. the "zeros trick"). I may be talking through my hat here, but is it also possible to define custom likelihoods (depending only on sufficient statistics)?
My second question is not specific to Bayesian methods but becomes relevant when analysing log normal data: sometimes my data is given in terms of Mean, SD (i.e. not GM, GSD). I then need to calculate GM, GSD using formulas which assume that the data are log normal:
GM* = Mean / sqrt(1 + CV^2)
GSD* = exp(sqrt(ln(1 + CV^2))), where CV = SD/Mean is the coefficient of variation
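These two formulas are exact for a true lognormal distribution, which is easy to verify numerically. A sketch in Python (function and variable names are mine, for illustration only):

```python
import math

def to_gm_gsd(mean, sd):
    """Convert Mean, SD to GM, GSD under a lognormal assumption."""
    cv2 = (sd / mean) ** 2                          # squared coefficient of variation
    gm = mean / math.sqrt(1 + cv2)                  # GM* = Mean / sqrt(1 + CV^2)
    gsd = math.exp(math.sqrt(math.log(1 + cv2)))    # GSD* = exp(sqrt(ln(1 + CV^2)))
    return gm, gsd

# Check against exact lognormal moments: if log(y) ~ N(mu, sigma^2), then
# Mean = exp(mu + sigma^2/2) and CV^2 = exp(sigma^2) - 1, and the formulas
# should recover GM = exp(mu) and GSD = exp(sigma) exactly.
mu, sigma = 0.3, 0.8
mean = math.exp(mu + sigma**2 / 2)
sd = mean * math.sqrt(math.exp(sigma**2) - 1)
gm, gsd = to_gm_gsd(mean, sd)
print(gm, gsd)   # -> exp(0.3), exp(0.8) up to floating point
```

So any discrepancy seen in practice comes from plugging in *sample* Mean and SD (and from data that is not exactly lognormal), not from the formulas themselves.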
I figure that I lose a substantial amount of information about the sample when doing this transformation. When doing some simulations in Matlab I noticed that the values of GM* and GSD* are often very different from the GM, GSD computed from the actual sample, even when the sample size is extremely large. Are there any alternative approaches to this problem?
Best regards,
Kristofer Stenberg
Facilia AB
***
Hi Kristofer,
It is well known that the posterior distribution depends on the data only
through the sufficient statistics. In other words, if X=(X_1,...,X_n) denotes
the raw data and S=S(X) is a vector of (minimal) sufficient statistics, then p(theta|X)=p(theta|S), i.e., the posterior density of the parameter theta given the entire raw data is the same as the posterior density of theta given only the sufficient statistic S.
So if you specify the model (in BUGS) using only the sufficient statistics, the inference will be identical to that of the model specified using the raw data, under the assumed statistical model.
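This sufficiency property is easy to see concretely in the conjugate known-variance normal model. The sketch below (my own illustrative setup: known sigma^2, N(mu0, tau0^2) prior on mu) computes the posterior twice, once from the raw data and once from (ybar, n), and gets identical parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2, mu0, tau0sq = 2.0, 0.0, 10.0     # known variance, prior mean/variance
y = rng.normal(3.0, np.sqrt(sigma2), size=25)

def posterior_from_raw(y):
    """Posterior (mean, variance) of mu computed from the full raw sample."""
    n = len(y)
    prec = 1 / tau0sq + n / sigma2
    mean = (mu0 / tau0sq + y.sum() / sigma2) / prec
    return mean, 1 / prec

def posterior_from_stats(ybar, n):
    """Same posterior, computed from the sufficient statistic (ybar, n) only."""
    prec = 1 / tau0sq + n / sigma2
    mean = (mu0 / tau0sq + n * ybar / sigma2) / prec
    return mean, 1 / prec

print(posterior_from_raw(y))
print(posterior_from_stats(y.mean(), len(y)))  # same numbers
```

The raw data enters the posterior only through y.sum() = n*ybar, which is exactly the sufficiency statement p(theta|X) = p(theta|S).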
**
Hi Kristofer,
Below are my thoughts, which may or may not help. I hope they do.
Question 1:
I don't see any reason why you should not be able to do the analysis using
WinBUGS with the sufficient statistics as input. After all, the sufficient
statistics should contain all the information available from the data about the
parameters in the model.
Perhaps one difference between your Gibbs sampler and your WinBUGS implementation is an assumption of independence. I do not quite understand your Gibbs sampler, but I assume it uses a JOINT distribution (likelihood) of GM and GSD. I wonder if your WinBUGS implementation may be producing two independent MARGINAL distributions for GM and GSD.
Question 2:
In principle it seems you could partition your likelihood into two components.
When you have GM&GSD you can use a normal likelihood. When you have
Mean&SD you can use a lognormal likelihood. The product of these 2
likelihoods should be the likelihood for the full data set. That way you do not
have to use your approximations.
Best regards,
Dave
**
Be very cautious if you're using geometric means in mass-balance (conservation of mass) applications. See the attached paper.
**
Dear Kristofer,
This does not quite answer the questions you posed below, but perhaps you have heard of Bayes Linear methods? This approach, in essence, leans heavily on summary statistics and various functions thereof. So you might be able to obtain a Bayes Linear posterior analytically, and then sample from it (via Gibbs or even an independence sampler) to answer your questions of interest.
Apologies in advance if you already knew about this.
Best,
Gene