On 24-Jan-11 10:41:27, Ronan Conroy wrote:
> On 23 Jan 2011, at 14:03, Cristian Baicus wrote:
>
> > NONparametric techniques are based on ranks/signs, because mean and
> > standard deviation have no sense (they don’t describe the population,
> > because the distribution is not normal=Gaussian).
>
> This is a misconception. So-called nonparametric techniques often
> estimate useful parameters. For example, the Mann-Whitney
> Wilcoxon test estimates the probability that an observation in one
> group will be greater than an observation in the other group (with an
> allowance for ties). This corresponds to the probability that a person
> in a clinical trial, for instance, will have a better outcome with one
> treatment rather than with the other.
>
> Means and standard deviations can be used to describe any distribution,
> so I do not understand how they can have no sense in describing, say, a
> uniform distribution. The mean is a useful descriptor of any
> distribution (and it is the only descriptor necessary for a Poisson
> distribution). And the standard deviation, while it can be calculated
> validly on any distribution, has a limited role in communication, since
> most people don't understand what it is -- and this applies to normally
> distributed data as well.
>
> Ronán Conroy
Well stated (granted Stephen Senn's later comment).
I would add to Ronan's final statement that, quite apart from "not
understanding what a standard deviation is", few people know just
what it is that the SD tells you about a distribution. There is
a common misconception that you can interpret an SD in terms of
what it tells you about a Normal distribution (which of course is
highly specific) -- e.g. that 68% of the distribution/data are
within 1 SD of the mean, 90% within 1.65 SDs, 95% within 1.96 SDs,
99% within 2.58 SDs, etc.
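
As a quick check on those Normal-only figures, here is a minimal Python sketch. It assumes nothing beyond the standard library; the helper name normal_coverage is made up for illustration.

```python
import math

def normal_coverage(k: float) -> float:
    """Fraction of a Normal distribution lying within k SDs of its mean."""
    # For a Normal distribution, P(|X - mean| <= k*SD) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2.0))

for k in (1.0, 1.65, 1.96, 2.58):
    print(f"within {k:4.2f} SDs: {100 * normal_coverage(k):.1f}%")
```

These percentages hold only for the Normal distribution; nothing in the calculation carries over to other shapes.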
However, when you move into a "non-parametric" context -- i.e. not
making any presumptions about what particular kind of distribution
(Normal, log-Normal, whatever) is involved -- then the most that the
SD can tell you is as in the following examples.
1. At most 1/2 = 50% of the distribution/data lies beyond sqrt(2) = 1.41 SDs
away from the mean (compare with 16% for Normal distribution)
2. At most 1/3 = 33% lies beyond sqrt(3) = 1.73 SDs away from the mean
(compare with 8% for Normal distribution)
3. At most 1/4 = 25% lies beyond sqrt(4) = 2 SDs away from the mean
(compare with 4.6% for Normal distribution)
and so on: in general, at most 1/k^2 lies beyond k SDs away from the
mean. You can turn these round into the form:
1. At least 50% lies within 1.41 SDs either side of the mean
2. At least 67% lies within 1.73 SDs either side of the mean
3. At least 75% lies within 2 SDs either side of the mean
And there is the trivial case:
At most 100% lies beyond 1 SD away from the mean
or
At least 0% lies within 1 SD either side of the mean
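
The distribution-free bound behind statements of this kind is Chebyshev's inequality: at most 1/k^2 of any distribution lies k or more SDs from its mean. A small Python sketch comparing that bound with the Normal tail follows; the function names are illustrative only.

```python
import math

def chebyshev_tail_bound(k: float) -> float:
    """Largest possible fraction lying k or more SDs from the mean (any distribution)."""
    # Chebyshev: P(|X - mean| >= k*SD) <= 1/k^2; a fraction cannot exceed 1.
    return min(1.0, 1.0 / k**2)

def normal_tail(k: float) -> float:
    """Fraction of a Normal distribution lying more than k SDs from the mean."""
    return 1.0 - math.erf(k / math.sqrt(2.0))

for k in (1.0, math.sqrt(2.0), math.sqrt(3.0), 2.0, 3.0):
    print(f"k = {k:4.2f}: bound {100 * chebyshev_tail_bound(k):5.1f}%, "
          f"Normal {100 * normal_tail(k):5.1f}%")
```

At k = 1 the bound is 100%, which is the trivial case above.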
Furthermore, it is possible to determine a probability distribution
(or a set of artificial data) such that "at least"/"at most" becomes
"exactly".
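
One way to see this, sketched in Python under the assumption that "beyond" is read as "at or beyond": a three-point distribution with mass 1/(2k^2) at each of mean - k*SD and mean + k*SD, and the remaining mass at the mean, attains the 1/k^2 bound exactly. The construction and the helper names are illustrative, not taken from the discussion above.

```python
import math

def extremal_distribution(k: float, mean: float = 0.0, sd: float = 1.0):
    """Three-point distribution with exactly 1/k^2 of its mass k SDs from the mean."""
    if k < 1.0:
        raise ValueError("k must be at least 1")
    tail = 1.0 / (2.0 * k**2)            # mass placed at each extreme point
    values = [mean - k * sd, mean, mean + k * sd]
    probs = [tail, 1.0 - 2.0 * tail, tail]
    return values, probs

def mean_and_sd(values, probs):
    """Mean and SD of a discrete distribution given as values and probabilities."""
    mu = sum(p * v for v, p in zip(values, probs))
    var = sum(p * (v - mu) ** 2 for v, p in zip(values, probs))
    return mu, math.sqrt(var)

k = 2.0
values, probs = extremal_distribution(k)
mu, sd = mean_and_sd(values, probs)
# Mass lying k or more SDs from the mean (small tolerance for rounding).
tail_mass = sum(p for v, p in zip(values, probs) if abs(v - mu) >= k * sd - 1e-12)
print(mu, sd, tail_mass)
```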
This shows how small is the information contained in the value of
the SD alone, without assuming something about properties of the
distribution! It therefore uncovers the implicit invitation to
readers to make such assumptions when summary data are routinely
presented as "mean (SD)". Such tabulations should be inspected
carefully.
For example, in an article in the latest JRSS(A), 174(1), Jan. 2011:
"Biases in the healthcare luxury good hypothesis?:
a meta-regression analysis", Costa-Font & Gemmill (pp. 95-107),
from Table 1 (p. 100) one can read that for the variable:
  panel (Indicates whether the study used panel data
         or cross-section time series techniques)

    Mean:             0.174 (0.029)
    Median:           0.000
    10th percentile:  0.000
    90th percentile:  1.000

Since this is clearly a 0/1 indicator variable, the fact that
the mean itself is given as 0.174 tells that 17.4% of the
values are 1, and 82.6% of them are 0. From this information,
along with the SD of 0.029, we could infer (to within rounding
error) that the number N of data satisfies
sqrt(0.174*0.826/N) = 0.029
N = 0.174*0.826/(0.029^2) = 171
However, the Table header says that N=167, so that calculation
(2.4% out, compatible with the 0.029 being given to within 1/30)
is superfluous. Indeed, given that N=167 and Mean=0.174, we don't
need to be told the SD -- we can work it out from Mean and N.
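
The arithmetic can be checked with a few lines of Python. This is only a sketch using the three figures quoted from the table; the variable and function names are made up here. Note that sqrt(p*(1-p)/N) is the usual formula for the standard error of a proportion, which is how the tabulated 0.029 is being treated in this calculation.

```python
import math

p = 0.174          # tabulated mean of the 0/1 variable
reported = 0.029   # tabulated value in parentheses
n_table = 167      # N given in the table header

def implied_n(p: float, spread: float) -> float:
    """Solve sqrt(p*(1-p)/N) = spread for N."""
    return p * (1.0 - p) / spread**2

def implied_spread(p: float, n: int) -> float:
    """Evaluate sqrt(p*(1-p)/N) for a known N."""
    return math.sqrt(p * (1.0 - p) / n)

n_est = implied_n(p, reported)
spread_from_n = implied_spread(p, n_table)
print(f"N implied by 0.029:   {n_est:.1f}")
print(f"value implied by N=167: {spread_from_n:.4f}")
```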
Then the information about the 10th and 90th percentiles tells
us that 100% of the data lie outside the mean +/- k SDs for any
k < 0.174/0.029 = 6 SDs; and that at least 10% (in fact 17.4%) lie
outside mean +/- k SDs for any k < 0.826/0.029 = 28.5 SDs!
This is grossly different from the "off the top of the head"
interpretation that many readers might unthinkingly give to
the statement
Variable "panel": Mean = 0.174, SD = 0.029
e.g. that ("assuming approximate Normality") 95% of the data
are within 0.174 +/- 1.96*0.029, i.e. the interval (0.117, 0.231);
which is clearly nonsense, since all data are either 0 or 1.
Such a gross difference is of course due to the fact that the
data are nowhere remotely near to being Normally distributed.
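
To make the point concrete, here is a Python sketch with artificial data. It assumes 29 ones and 138 zeros, a split chosen only because 29/167 rounds to the tabulated 0.174; the actual data are not available here.

```python
n, ones = 167, 29                      # assumed split: 29/167 is about 0.174
data = [1] * ones + [0] * (n - ones)   # artificial 0/1 indicator data

mean = sum(data) / n
sd_reported = 0.029                    # value tabulated as "(0.029)"

# The naive "Normal" 95% interval: mean +/- 1.96 * SD.
lo = mean - 1.96 * sd_reported
hi = mean + 1.96 * sd_reported
inside = sum(lo <= x <= hi for x in data) / n

# Distance of each possible data value from the mean, in units of 0.029.
dist_zero = (mean - 0) / sd_reported
dist_one = (1 - mean) / sd_reported

print(f"interval ({lo:.3f}, {hi:.3f}) contains {100 * inside:.0f}% of the data")
print(f"0 is {dist_zero:.1f} units below the mean, 1 is {dist_one:.1f} units above")
```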
And, while I am at it: While the Table gives the 10th percentile
as 0, and the 90th percentile as 1, these values are in fact
the 82.6th and 100th percentiles (as well as being the 10th and 90th).
So there.
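
For anyone who wants to check the percentile remark, a short Python sketch on the same kind of artificial data (again assuming 138 zeros and 29 ones; the helper name is made up): each value of a 0/1 variable occupies a whole range of percentiles, not a single one.

```python
def percentile_span(value, data):
    """Range of percentiles (from, to) occupied by `value` in `data`."""
    n = len(data)
    below = sum(x < value for x in data) / n         # fraction strictly below
    at_or_below = sum(x <= value for x in data) / n  # fraction at or below
    return 100 * below, 100 * at_or_below

data = [0] * 138 + [1] * 29   # assumed split giving a mean of about 0.174

span_zero = percentile_span(0, data)
span_one = percentile_span(1, data)
print(f"0 occupies percentiles {span_zero[0]:.1f} to {span_zero[1]:.1f}")
print(f"1 occupies percentiles {span_one[0]:.1f} to {span_one[1]:.1f}")
```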
Best wishes to all,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding)
Fax-to-email: +44 (0)870 094 0861
Date: 24-Jan-11 Time: 13:05:58
------------------------------ XFMail ------------------------------