Hi Allstat,
The response to my question exceeded my expectations. While different
authors offered different advice and opinions, what is clear is that this
question is a contentious one. In particular, I want to highlight the two
later replies given by Blaise Egan and Duncan Hedderly, which I think
address my concern best. Both of the articles they point to favoured the
view that no multiple testing procedure is needed for the usual type of
epidemiological research, but do look at the response to the BMJ article,
which argued that statistical adjustments are mandatory!
I think that the issue may be less complicated than it seems if we
consider the background of the authors. Epidemiologists tend to favour
not using these multiple testing procedures, while clinical trialists tend
to favour them. Epidemiologists tend to ask many loosely related questions
in a study, so using multiple testing procedures tends to add to the
confusion by somehow suggesting that the questions all belong to one
experiment. Clinical trialists, on the other hand, usually have one major
goal in mind in a study, and using multiple comparison procedures helps to
reduce false positives when such results are used for clinical decisions.
Lastly, if one does decide to adjust for multiple testing, there now seem
to be clearly better alternatives to the Bonferroni correction. I haven't
had time to read about the False Discovery Rate or the Holm method, but
the references are here if you want to look further.
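For anyone curious how these compare, here is a rough sketch in Python (my
own illustration rather than anything from the replies; it assumes the
statsmodels package is available, and the p-values are made up):

    # Compare Bonferroni, Holm and Benjamini-Hochberg (FDR) adjustments
    # on a set of invented p-values (for illustration only).
    from statsmodels.stats.multitest import multipletests

    raw_p = [0.001, 0.008, 0.020, 0.041, 0.090, 0.300]

    for method, label in [("bonferroni", "Bonferroni"),
                          ("holm", "Holm"),
                          ("fdr_bh", "Benjamini-Hochberg FDR")]:
        reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method=method)
        print(label)
        print("  adjusted p:", [round(p, 3) for p in adj_p])
        print("  rejected:  ", list(reject))

Holm rejects at least as many hypotheses as Bonferroni on the same data,
and the Benjamini-Hochberg step-up is usually more liberal still, at the
price of controlling the false discovery rate rather than the family-wise
error rate.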
Thanks again to all who responded. Given below is my original query
followed by all relevant answers.
Original query:
When I learnt statistics, I learnt that if you are going to test more than
once in your experiment, you should adjust for multiple testing, usually
by means of a Bonferroni correction. However, having actually done
statistics for psychiatric research for a year, I have found that in
practice one can't really do it. That's because in the work that I do,
people generally want to test many things in a single paper, not to
mention quality control tests such as testing for age differences at
baseline. I'm sure other medical statisticians will have seen these
papers, littered all over with p-values.
My question is: how have more experienced medical statisticians come to
terms with this? We usually collect massive amounts of information per
project. Each score has sub-scores, and the sub-scores are made up of
individual questions. Are the individual questions really not of interest?
But if we look at each individual question separately, then no doubt we'll
end up with a plethora of tests per paper. After all, is there really no
value in fishing for significant results? If we don't do this, how are we
going to discover something new?
Thanks for any comments.
******************************
Michael Meyners:
In brief, you might want to use the False Discovery Rate (FDR). To start
with, see Benjamini & Hochberg, J R Stat Soc B, 57, 1995, 289-300. Also,
you might want to browse a little through the literature on the analysis
of gene expression data, as they face a similar problem (I'd say that
their hypotheses are less "dependent" than yours might be, but it might
still give you some ideas).
*********************
Allan Reese:
I used to deal with many student surveys, generally in social science or
education. The advice I offered was that individual questions were
generally not of interest for testing as the questionnaire had been
designed with groups of related questions and often an expectation of
observing certain interactions. P-values should therefore be interpreted
in relation to what the researcher expected (an informal Bayesian
approach) and patterns of p-values should be looked for. In particular,
since most student studies have small samples subject to biases of
accessibility, having a set of questions that showed non-significant
effects but all in the expected direction should *not* be reported simply
as "no significant effects were found". In practice, I observed that
effects were generally nowhere near significance or were highly
significant, even for the small samples (generally about 100 cases). I
attributed this to the influence of researchers' prior knowledge - i.e.
they were demonstrating effects they anticipated, not looking at random
for correlations.
It seems to me that statistics should more commonly be presented as used
in two contexts: (1) exploratory, where a set of data is examined for
pattern and a reasonable question is to ask how often one is being misled
by chance coincidences, and (2) as a quality assurance technique for
measurements in the known presence of variation. Researchers too often
assume that ideas relevant to the latter (sample size, power) can be
arbitrarily applied to the former.
A final thought is to suggest that too many papers stop short at the
p-value. Authors should be coerced to take the next step and explain
*what* the (significant) effect is and *why* it is important. That would,
for example, put many claims of relative risk into clinical perspective.
**********************
Roger Newson:
The issue of multiple comparisons is a fast-moving field at this point in
the early 21st century, and there is no consensus regarding the best
approach, even amongst statisticians. However, I have written a paper on
the subject in The Stata Journal, summarizing other people's thoughts and
adding a few of my own, and have implemented a few multiple-test
procedures in Stata (Newson, 2003). A preprint of this reference can be
downloaded
from my website, where you can also download a presentation on the subject
that I gave at the 2003 UK Stata User Meeting.
********************
Tzippy:
The newest method is Benjamini and Hochberg's False Discovery Rate (FDR).
It controls the expected proportion of false significances among the
results declared significant, rather than the alpha for each individual
test.
SAS's PROC MULTTEST offers several options for multiple testing that are
less conservative than Bonferroni.
*********************
Allan White:
I, too, am concerned about this. One glaring discrepancy has been
bothering me for some time. If we conduct a one-way ANOVA which yields a
significant F value, we often follow it up with a Tukey test for all
possible pairwise comparisons. This gives p values which are adjusted to
allow for the fact that we are doing a number of tests, so that the
experiment-wise p value is at the desired level. That is fair enough.
However, this is in marked contrast to what is typically done when we do,
say, a 4-way ANOVA, which yields 15 effects (4 main effects, 6 2-way
interactions, 4 3-way interactions and a 4-way interaction). The p values
for these effects are NEVER adjusted for the fact that we are looking at
15 effects, i.e. we are giving ourselves 15 chances of finding something
significant!
However, in spite of the inconsistencies that we have noted, the problem
of multiple tests is real enough. In the example that I just quoted, the
chance of getting one or more of the 15 effects significant at a nominal
5 per cent level is approximately 50:50. We really need to be far more
rigorous and consistent in dealing with this type of problem than we
currently are.
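(As a quick check of that 50:50 figure, a rough calculation, assuming for
simplicity that the 15 tests are independent, which in practice they will
not be exactly:

    # P(at least one "significant" result at the 5% level among 15 null effects)
    p_any = 1 - (1 - 0.05) ** 15
    print(round(p_any, 3))    # about 0.54, i.e. roughly 50:50
)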
Nevertheless, you do have a point about the legitimacy of "fishing
expeditions". If we are too rigorous in correcting p values for multiple
tests, then we run the risk of missing something which is really there.
One solution that occurs to me (but which I have never seen used in
practice) is to split the data set in two on a random basis and to carry
out the same analysis on each half of the data. The chances are that only
effects that are really there will appear as significant in both analyses.
Effects that are significant in one half as a result of pure chance will
only rarely be significant in the other half. Of course, there is a loss
of power in splitting your data in two in this way but, with a large
dataset, this may matter a lot less than the benefit gained.
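A rough sketch of that split-half idea in Python (my own illustration of
the suggestion, not a tested procedure; it assumes a pandas DataFrame with
a binary 'group' column and several outcome columns, and all names here
are made up):

    # Split-half screening: an effect is only taken seriously if it is
    # significant in BOTH randomly chosen halves of the data.
    import numpy as np
    from scipy.stats import ttest_ind

    def split_half_screen(df, group_col, outcome_cols, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(df))
        half_a = df.iloc[idx[: len(df) // 2]]
        half_b = df.iloc[idx[len(df) // 2:]]

        replicated = []
        for col in outcome_cols:
            p_values = []
            for half in (half_a, half_b):
                g0 = half.loc[half[group_col] == 0, col].dropna()
                g1 = half.loc[half[group_col] == 1, col].dropna()
                p_values.append(ttest_ind(g0, g1).pvalue)
            if max(p_values) < alpha:   # significant in both halves
                replicated.append(col)
        return replicated

As Allan notes, each half has less power, so this is more of a screening
device than a formal procedure.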
**********************
Sue Richards:
I think the basic principle we work to is:
1. Pre-specify, before looking at the data, a limited set of 'primary'
analyses. Hopefully these are not too many, and if they are clearly
stated, then results can be viewed bearing in mind the multiplicity of
tests.
2. All other analyses should be regarded as 'hypothesis generating' only.
In papers, it should be made clear what tests were done, again so that
multiple testing can be borne in mind.
There remains the problem of over-interpretation by those who do not
understand the issue, and we all need to add 'health warnings'.
The most frequent problem is not what is reported, but the lack of detail
on what has been done and NOT reported, meaning that we are unaware of the
multiple testing.
***********************
Blaise Egan:
I suggest you read this excellent discussion in the British Medical
Journal:
http://bmj.bmjjournals.com/cgi/content/full/316/7139/1236?view=full&pmid=9553006
**********************
Duncan Hedderly:
I probably worry about this less than I ought. You might find the
articles by Schulz & Grimes in the Lancet (2005, vol 365, pp 1591-95 and
pp 1657-61) interesting.