Thank you very much to all who responded to my query on failures of
independence in the analysis of unit sizes in a data set comprising clonal
organisms made up of different numbers of different types of units. I've
summarised (to varying degrees) most of the suggestions offered, and I hope
I'm not misrepresenting anyone:
Dick Beldin:
If the sizes are not independent, you need to think about what the
connection could be physiologically, and then develop (or find) a suitable
mathematical model for the dependence.
Null model: The different types of cells could simply have different size
characteristics, regardless of the process which generated them or how they
are aggregated. So you might have homogeneous families of cells that
differ from each other. This would show up as differences in the mean
size by type, with very low variability within type.
Alternative model 1: Large cells of one type associate preferentially with
small cells of another type.
Alternative model 2: Large cells of one type associate preferentially with
large cells of another type.
Alternative model 3: Cells are associated at random but only those
aggregates which satisfy some criteria can survive to be observed.
Laura Thompson:
If you sampled "independently" using the two-stage sampling procedure you
described, and compared types using ANOVA, it could be that some types are
declared larger than others because they are influenced by other
types--perhaps the types usually reside next to each other within an organism.
So, type 7 could be significantly larger than type 1 due to the fact that
type 7 appears next to type 6 in the majority of organisms. This suggests
that there might be a spatial dependence among units within an organism.
In the comparison of types, one would want to account
for these spatial influences. The ANOVA you described, at best, throws
away a lot of potentially important information regarding this.
Miland Joshi:
Two suggested approaches:
1. Statistical and biological independence are not the same thing. You may
wish to carry out a preliminary survey to see whether the size of the
organism is uncorrelated with the size of a particular type of unit; a
scatter plot should look random. Then, if you have good reason to believe
that the size of a type of unit does not depend on the individual organism,
you could treat physiological connection as irrelevant.
2. You could replace the sample and re-sample the population, taking a
larger sample this time. The fact that you sampled the population before and
some individuals might be sampled again doesn't matter so long as the
probability of individual organisms being selected is unchanged (i.e. if
you haven't 'tagged' them in any way which might enhance their chance of
selection the second time round).
Roger Newson:
Another two suggestions:
Suggested using the statistical package STATA, which allows comparison of
sizes of different types of unit in the same organism by doing a regression
analysis with Huber variances, using the "units" as data points, the
organisms as clusters, and the type as a predictive factor.
A low-tech alternative is to compare Type A and Type B units by restricting
the analysis to organisms with at least one unit of each of the two types
and calculating, for each of these organisms, the mean size of the Type A
units in the organism and the mean size of the Type B units in the
organism. These paired within-organism mean sizes can then be compared
using a paired t-test, to derive a confidence interval for the mean
difference between mean sizes.
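
The low-tech alternative above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the record layout
(organism id, unit type, size) and the function name
paired_type_difference are hypothetical, not taken from any reply, and the
sketch stops at the mean difference, its standard error and the t
statistic.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def paired_type_difference(records, type_a, type_b):
    """Paired comparison of within-organism mean sizes for two unit types.

    records: iterable of (organism_id, unit_type, size) tuples
             (a hypothetical layout chosen for this sketch).
    """
    # Group sizes by organism, then by unit type.
    sizes = defaultdict(lambda: defaultdict(list))
    for organism, unit_type, size in records:
        sizes[organism][unit_type].append(size)

    # Keep only organisms that contain at least one unit of each type,
    # and take the difference of the two within-organism means.
    diffs = [
        mean(by_type[type_a]) - mean(by_type[type_b])
        for by_type in sizes.values()
        if type_a in by_type and type_b in by_type
    ]
    if len(diffs) < 2:
        raise ValueError("need at least two organisms containing both types")

    n = len(diffs)
    mean_diff = mean(diffs)
    std_err = stdev(diffs) / sqrt(n)
    # The t statistic is undefined when every difference is identical.
    t_stat = mean_diff / std_err if std_err > 0 else float("nan")
    return {"n": n, "mean_diff": mean_diff, "std_err": std_err,
            "t": t_stat, "df": n - 1}
```

A confidence interval for the mean difference would then be mean_diff plus
or minus std_err times the critical value of the t distribution on df
degrees of freedom, looked up from tables or a statistics package.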
Dan Altmann:
Suggested a multilevel modelling approach. Suppose you randomly sample
your organisms and record all the units in each one. The units are then
the level 1 observations and the organisms that own them are level 2.
You allow random variation not only at the level of the units but also at
the level of the organisms, and this gets over the problem of dependence
between units belonging to one organism because you're actually modelling
the dependence.
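
One common way to write such a two-level model is the random-intercept
form below. This is a sketch under assumptions, not necessarily the exact
model intended in the reply: the symbols, the normal distributions and the
single random intercept per organism are all introduced here for
illustration.

```latex
% y_{ij}: size of unit i in organism j;  t(i,j): the type of that unit
% \beta_t: fixed effect of unit type t (one type serves as the reference)
% u_j: organism-level random effect;  e_{ij}: unit-level residual
\begin{align*}
  y_{ij} &= \beta_0 + \beta_{t(i,j)} + u_j + e_{ij}, \\
  u_j &\sim N(0, \sigma_u^2), \qquad e_{ij} \sim N(0, \sigma_e^2), \\
  \operatorname{Corr}(y_{ij}, y_{i'j}) &= \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
  \quad (i \neq i', \text{ units of the same type}).
\end{align*}
```

Under these assumptions, two units in the same organism share u_j, which
is what induces the within-organism correlation in the last line.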
Ellen Hertz (via bionet.info-theory):
And finally two suggestions here:
If you have access to SUDAAN, it can handle clustered data even if the
clustering does not arise from a survey; it is explained in the manual
using an example of animals in litters which is clearly a similar
situation.
Another suggestion is to use dummy variables for the organisms. If there
are n of them, you number them and then let I_i = 1 if the unit is in the
ith organism and 0 otherwise, i = 1,..,n-1. If it belongs to the nth
organism, all of the I_i's are zero; that is to prevent a singular matrix.
If there are J unit types and you assign dummy variables V_i for J-1 of
them in the same way and then model
SIZE = I_1..I_(n-1) V_1..V_(J-1)
(where at most 2 of these covariates are non-zero for any given unit), the
coefficient of V_i represents the estimated average difference in size
between a type i and a type J after controlling for the organism.
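
The dummy-variable coding described above can be sketched as follows,
using only the Python standard library. Several things here are
assumptions made for illustration rather than part of the suggestion: the
record layout (organism id, unit type, size), the function names, the
inclusion of an intercept column, and the choice of the last organism and
last type in sorted order as the reference levels.

```python
def design_row(organism, unit_type, organisms, types):
    """One row of the design matrix: intercept, I_1..I_(n-1), V_1..V_(J-1)."""
    row = [1.0]  # intercept column (an assumption of this sketch)
    # The last organism and last type get all-zero dummies (reference levels).
    row += [1.0 if organism == o else 0.0 for o in organisms[:-1]]
    row += [1.0 if unit_type == t else 0.0 for t in types[:-1]]
    return row


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; check the dummy coding")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_type_effects(records):
    """Least-squares fit of SIZE on organism and type dummies.

    records: list of (organism_id, unit_type, size) tuples.
    Returns {type: estimated mean difference from the reference type}.
    """
    organisms = sorted({o for o, _, _ in records})
    types = sorted({t for _, t, _ in records})
    x = [design_row(o, t, organisms, types) for o, t, _ in records]
    y = [size for _, _, size in records]
    p = len(x[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(p)]
    beta = solve(xtx, xty)
    first_type_col = 1 + (len(organisms) - 1)
    return dict(zip(types[:-1], beta[first_type_col:]))
```

Solving the normal equations directly is adequate for a small
illustration; with many organisms a statistics package or a numerically
stabler least-squares routine would be the more sensible choice.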
_________
So there you have it; a sample of respondents and their approaches to
solving a problem. I'm currently going down the SUDAAN road, so if there
are any users out there I have a question to put to you, so PLEASE reply.
Thank you,
David Oatway
----------------------------------------------------------------------------
David Oatway
Centre for Land Use and Water Resources Research
Porter Building
University of Newcastle
Newcastle upon Tyne
NE1 7RU
United Kingdom
Tel: +44 191 222 5956
Fax: +44 191 222 6563
URL: http://www.cluwrr.ncl.ac.uk
----------------------------------------------------------------------------