Re: Web site
Dear Susan,
Your example of two or more measurements on the same subject has two facets that should not be confused.  These problems are 1) statistical power, and 2) the potential lack of independence of two or more measurements on the same subject.  Your interpretations of your mock data set statistics are not exactly proper.  There is a difference between treated and controls looking at  the right foot alone (t = 1.83), the left foot alone (t = 1.62), or the average of the two (t = 1.82).  These represent differences between the treatment and control means of 1.83, 1.62 and 1.82 standard deviations.  These are substantial and probably clinically significant effect sizes.  The p-values just above statistical significance (presumably 0.05) do not indicate "no difference," but merely reflect the fact that 30 measurements provide statistical power too low to detect with statistical significance even the fairly large differences observed in your data, much less other smaller differences that might be within the range of clinical significance.  The proper statistical protocol is to do a power analysis, which will show that the statistics here do not prove "no difference," but instead prove the result is inconclusive due to low power.  Effect sizes indicating a true "no difference" would be close to zero and would fluctuate to positive and negative values with different sample sets.  Researchers unfamiliar with the meaning and behavior of effect sizes should not be using p-values and statistical significance tests as crutches.  Researchers unfamiliar with the essential concept of statistical power, and unable to carry out and understand the results of power analyses should not be using p-values and statistical significance tests.  Unfortunately these points are not well made in the statistics education most of us receive.  
These problems are now even more egregious with the ability of anyone to hit a computer key and turn the smallest or largest data set into a single p-value, with all its room for misinterpretation.
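As a sketch of the kind of power analysis meant here (assuming the common convention for a paired design that the standardized effect is dz = t/sqrt(n); the message itself does not spell out how the calculation would run):

```python
# Post-hoc power sketch for a paired t-test via the noncentral t distribution.
# NOTE: dz = t / sqrt(n) is one common convention for the standardized paired
# effect; it is an assumption here, not something stated in the message.
from math import sqrt
from scipy import stats

n, alpha = 30, 0.05
t_obs = 1.83                         # right-foot result from the mock data
dz = t_obs / sqrt(n)                 # standardized effect, about 0.33
df = n - 1
t_crit = stats.t.ppf(1 - alpha / 2, df)

# Power = chance a noncentral t (ncp = dz * sqrt(n)) lands outside +/- t_crit
ncp = dz * sqrt(n)
power = (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
print(f"dz = {dz:.2f}, post-hoc power at n = {n}: {power:.2f}")
```

With power well under the conventional 0.8, a non-significant p-value cannot be read as "no difference"; the study simply could not reliably detect an effect of this size.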
 
Now if you count all the right and left feet separately, you have doubled your sample size, and voilà! Now there is adequate power, and the new effect size of t = 2.44 is statistically significant with room to spare. Probably any of the other effect sizes would also reach significance at that sample size, since for a fixed effect size t grows with the square root of n (check a t table, or change the sample size in your computerized mock data set to see). This is merely an example of the too poorly understood fact that it is difficult to get statistical significance even for very large and important differences if the sample is small (low power), and easy to get statistical significance for even trivial differences if the sample is large enough (high power; one might say over-powered). This is why p-values and statistical significance tests are frankly misleading in judging clinical significance, which is what we should all be interested in.
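The sample-size point can be illustrated numerically: holding the standardized effect fixed, the expected t statistic scales like sqrt(n), while the two-sided 0.05 critical value barely moves (a sketch using scipy; the choice of n values is illustrative):

```python
# Same standardized effect, growing sample size: t scales like sqrt(n),
# the significance threshold stays near 2.
from math import sqrt
from scipy import stats

dz = 1.83 / sqrt(30)   # standardized effect held fixed, about 0.33
results = {}
for n in (30, 60, 120):
    t_expected = dz * sqrt(n)            # expected t for this effect at size n
    t_crit = stats.t.ppf(0.975, n - 1)   # two-sided 0.05 critical value
    results[n] = bool(t_expected > t_crit)
    print(f"n={n:3d}  expected t={t_expected:.2f}  critical t={t_crit:.2f}  "
          f"significant: {results[n]}")
```

The identical underlying effect is non-significant at n = 30 and comfortably significant at n = 60 and beyond, which is exactly why significance alone says little about clinical importance.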
 
But your example also illustrates the other problem with multiple measurements on the same subject. The effect size itself changed, from t values around 1.7 or 1.8 to a t of 2.44, a jump of 0.6 to 0.8 in the statistic. It seems unlikely that so large a jump results merely from the chance fluctuations of the effect size in the smaller sample sets stabilizing at this new higher value as the sample size increases. Instead it looks as though the results for the two feet within the same subject are not independent. This violates one of the assumptions upon which the t-test is based: the individual units sampled must be independent of each other. The statistics may or may not give an obvious indication of such a violation; this is a matter for biological and clinical logic, and additional data. For example, if one is measuring rates of wound healing, it makes a huge difference whether the two feet are on a diabetic or not. I am not familiar enough with the medical problem in your example to understand the interplay between the lack of independence between two feet in the same subject and the effectiveness of the intervention.
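The within-subject dependence can actually be checked from the mock data quoted later in the thread: correlate each subject's right-foot change with their left-foot change. A sketch (Pearson correlation is my choice of screen here, not something proposed in the messages):

```python
import numpy as np

# Mock data transcribed from the table in the original message below
# (WOOR/WOOL = without orthosis right/left, WOR/WOL = with orthosis right/left).
woor = [2,4,6,8,4,5,6,3,2,4,5,7,4,2,2,2,4,6,8,4,5,6,3,2,4,5,7,4,2,2]
wool = [1,4,6,7,4,5,6,3,1,3,5,7,3,1,2,1,4,6,7,4,5,6,3,1,3,5,7,3,1,2]
wor  = [3,2,4,2,1,4,5,7,5,3,2,4,3,1,1,3,6,4,2,6,4,5,7,5,3,2,4,3,1,1]
wol  = [2,2,3,2,1,3,4,7,5,3,2,2,3,1,1,2,2,3,2,5,3,4,7,5,3,2,2,3,5,1]

d_right = np.array(woor) - np.array(wor)   # per-subject change, right foot
d_left  = np.array(wool) - np.array(wol)   # per-subject change, left foot
r = np.corrcoef(d_right, d_left)[0, 1]
print(f"within-subject correlation of changes: r = {r:.2f}")
```

A correlation this strong means each additional foot contributes far less than one subject's worth of independent information, so treating 30 subjects as 60 units overstates the effective sample size.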
 
Basically, in your example, if the effect size (the t score) had not changed much, but the increased power had simply brought that effect size to statistical significance, there might be nothing wrong with counting the feet separately.  However, if the effect size changes depending on how one counts the feet, there is a problem that must be understood before proceeding further.
 

David L. Doggett, Ph.D.
Senior Medical Research Analyst
Health Technology Assessment and Information Services
ECRI, a nonprofit health services research organization
5200 Butler Pike
Plymouth Meeting, PA 19462, USA
Phone: (610) 825-6000 x5509
FAX: (610) 834-1275
e-mail: [log in to unmask]

-----Original Message-----
From: Susan Stacpoole Shea [mailto:[log in to unmask]]
Sent: Thursday, September 12, 2002 3:06 AM
To: [log in to unmask]
Subject: Two feet or one person

 
Hello,
 
This message was originally posted to the Podiatry mailbase. The author and I would appreciate any advice forthcoming from members concerned with evidence-based health.
 
Thank you in anticipation,
 
Susan Stacpoole-Shea
Ballarat, Victoria, Australia
 
 Dear all,
I'd like to pose a thorny question for the researchers on the mailbase. This
question was initially brought to my attention by Lloyd Reed from QUT a
couple of years ago and has stimulated much debate among my colleagues at
UWS. My apologies for the length of this posting (the text is taken from a
paper I've started writing).
In many fields of biomedical research, information is collected on multiple
joints or organs from the same subject. For example, many ophthalmology
studies record data from both eyes, and in the case of foot and ankle
research, data is often collected from both feet. This raises a significant,
yet largely overlooked problem when it comes to statistical analysis. One of
the fundamental requirements of statistics is that each data point must
represent an independent observation to justify being considered a "unit".
In most cases, the unit of measurement is the subject, so if, for example,
50 subjects are enrolled in the study, each observation recorded from each
subject counts as a single unit, i.e. n=50. However, if data is recorded from
both feet, a major problem arises. What is the unit of measurement – a
subject, or a foot? Do we have a sample of n=50 people, or a sample of n=100
feet? A cursory examination of the foot and ankle literature reveals dozens
of examples of statements like "We recruited thirty subjects (sixty feet)".
From a conceptual viewpoint, it does seem a little odd to conduct research
into individual feet rather than people, as clearly the way an individual
foot functions is dependent on the person attached to it. For example, the
healing rate of a surgical wound is strongly dependent on its blood supply,
the pressure distribution under a foot is strongly dependent on the gait
pattern of the individual, and the pain experienced following local
anaesthetic injection is strongly dependent on the individual’s pain threshold.
In each of these examples, it is likely that the degree of association
between right and left feet in the same subject would be far greater than
the association between different subjects. Therefore, if both right and
left feet were counted as single independent observations, the researcher is
essentially "double-dipping" their data, i.e. counting each subject twice.
Doing so will increase sample size and decrease variability in the data,
thereby increasing the power of the study and increasing the likelihood of
detecting statistical differences. But are these "significant" differences
real?
In order to demonstrate how the decision to pool or not pool right and left
foot data can influence results, I have developed a dataset of "dummy" data
for 30 subjects (see below). For the purpose of discussion, the data can be
considered to represent rearfoot motion values (in degrees) for 30 subjects
with and without foot orthoses for both right and left feet. Paired t-tests
were then used to compare the "without orthosis" and "with orthosis"
conditions for the right foot only, the left foot only, the average of the
right and left feet, and with right and left foot data combined.
Key to table:
WOOR - without orthosis right foot, WOOL - without orthosis left foot, WOR -
with orthosis right foot, WOL - with orthosis left foot
Subject WOOR WOOL WOR WOL
1 2 1 3 2
2 4 4 2 2
3 6 6 4 3
4 8 7 2 2
5 4 4 1 1
6 5 5 4 3
7 6 6 5 4
8 3 3 7 7
9 2 1 5 5
10 4 3 3 3
11 5 5 2 2
12 7 7 4 2
13 4 3 3 3
14 2 1 1 1
15 2 2 1 1
16 2 1 3 2
17 4 4 6 2
18 6 6 4 3
19 8 7 2 2
20 4 4 6 5
21 5 5 4 3
22 6 6 5 4
23 3 3 7 7
24 2 1 5 5
25 4 3 3 3
26 5 5 2 2
27 7 7 4 2
28 4 3 3 3
29 2 1 1 5
30 2 2 1 1
For right foot data, there was no difference in rearfoot motion with or
without foot orthoses (t29=1.83, p=0.077). Similarly, for left foot data,
there was no difference in rearfoot motion with or without foot orthoses
(t29=1.68, p=0.104). For the averaged data, there was no difference in
rearfoot motion with or without foot orthoses (t29=1.82, p=0.079). For
pooled right and left foot data (thereby increasing the sample size from 30
to 60), the paired t-test revealed a significant reduction in rearfoot
motion when wearing foot orthoses compared to the without orthosis condition
(t59=2.44, p=0.018).
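The four comparisons can be reproduced with scipy's paired t-test (a sketch; the per-foot and averaged statistics match the figures quoted, while the pooled statistic computed from the table as emailed comes out slightly above the 2.44 quoted, possibly a transcription slip, without changing the qualitative picture):

```python
import numpy as np
from scipy import stats

# Mock data transcribed from the table above (columns WOOR, WOOL, WOR, WOL).
woor = [2,4,6,8,4,5,6,3,2,4,5,7,4,2,2,2,4,6,8,4,5,6,3,2,4,5,7,4,2,2]
wool = [1,4,6,7,4,5,6,3,1,3,5,7,3,1,2,1,4,6,7,4,5,6,3,1,3,5,7,3,1,2]
wor  = [3,2,4,2,1,4,5,7,5,3,2,4,3,1,1,3,6,4,2,6,4,5,7,5,3,2,4,3,1,1]
wol  = [2,2,3,2,1,3,4,7,5,3,2,2,3,1,1,2,2,3,2,5,3,4,7,5,3,2,2,3,5,1]

t_r, p_r = stats.ttest_rel(woor, wor)               # right foot only
t_l, p_l = stats.ttest_rel(wool, wol)               # left foot only
t_a, p_a = stats.ttest_rel(
    (np.array(woor) + np.array(wool)) / 2,
    (np.array(wor) + np.array(wol)) / 2,
)                                                   # feet averaged per subject
t_p, p_p = stats.ttest_rel(woor + wool, wor + wol)  # feet pooled, n "doubled"

for name, t, p in [("right", t_r, p_r), ("left", t_l, p_l),
                   ("averaged", t_a, p_a), ("pooled", t_p, p_p)]:
    print(f"{name:8s} t = {t:5.2f}  p = {p:.3f}")
```

Only the pooled analysis crosses p < 0.05, reproducing the "two feet or one person" discrepancy the example is built to show.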
The results of this simple example clearly highlight the problems inherent
in analysing pooled data from paired limbs. In the example provided, foot
orthoses had no effect on rearfoot motion on the right foot when analysed in
isolation, no effect on the left foot when analysed in isolation, and no
effect when the two feet were averaged. However, when the right and left
data was pooled, a significant reduction in rearfoot motion was apparent in
the orthosis condition. Although the difference was small in absolute terms,
there is little doubt that such a difference would be reported as a
"significant" finding. Thus, depending on whether data is pooled or not, it
could be concluded that foot orthoses either do influence rearfoot motion
when walking or they do not.
So my questions are as follows:
What is the best approach for the statistical analysis of paired data?
If we decide to analyse one foot only, which foot do we pick (and why)?
Are there any situations in which analysing paired data is justifiable?
Kind regards,
Hylton
Hylton B. Menz