OK, brace yourselves people. This is the sort of thing I get excited about.

James said:

On 12 February 2013 08:18, James Alvarez wrote:
>
> Speaking generally, there are often good practical and theoretical reasons to reduce the amount of information in data.  Too much information is often harder to understand and is liable to just be noise - AKA to not see the wood for the trees.  It's probably better to address this at the point of collection rather than later, but with a good a priori rationale then there is no reason not to.
>
> 'More is better' when applied to variables and levels within variables is not always the case and is probably often wrong.  I would also say it's misleading to compare it to having more participants, as one is a question of power and the other of resolution.
>
And Rosemary said:

"In short, transformation and loss do not equate to the same thing - and, of course, there are all kinds of reasons why we might make reasoned decisions to discard data too."


There might be reasons to discard data for interpretational purposes, but it's rare to see a case for analytic purposes. (Although one of the references at the end addresses this).

It's pretty easy to demonstrate that transformation and data loss do equate to the same thing. Here's a baby simulation written in R.

I generate 1000 samples of 100 people each. The mean in the population is 0.3 (and the SD is 1, so that's an effect size of 0.3). For each sample, I run a t-test and see if the mean is significantly different from zero.


# Run 1000 simulated experiments, each with n = 100 and a true population mean of 0.3
x <- data.frame(x = 1:1000, V2 = NA)
tt <- function(n) t.test(rnorm(n) + 0.3)$p.value

for(loop in 1:1000) {
    x$V2[loop] <- tt(100)
}

# Proportion of experiments giving p < 0.05, i.e. the empirical power
mean(x$V2 < 0.05)


Here's the result:

> mean(x$V2 < 0.05)
[1] 0.858


This means that I'm getting a statistically significant result 86% of the time - in other words, my empirical power is about 86%.


Because all assumptions were satisfied, I can also use a simple power analysis to do this:


power.t.test(n=100, delta=0.3, sd=1, type="one.sample")



     One-sample t test power calculation

              n = 100
          delta = 0.3
             sd = 1
      sig.level = 0.05
          power = 0.8439467
    alternative = two.sided


So the analytic approach says that I have 84% power.  Pretty close.


As soon as we start to violate assumptions, we can't use standard power analysis any more, and we're stuck with simulations. (It's not hard to run these in SPSS either, but I don't have it, so I can't. It's easier in R anyway.)


So then I categorize my variable. I'll cut it at -1.5, -0.5, 0.5 and 1.5 - that gives me a 5-point scale, which is still pretty close to normally distributed.
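The categorization step looks something like this - a sketch rather than the exact code, and scoring the categories 1 to 5 and then testing against 3 (the category mean under the null) is one reasonable choice rather than the only one:

# Categorized version: cut each observation into a 5-point scale at
# -1.5, -0.5, 0.5 and 1.5, score the categories 1 to 5, and test the
# mean against 3 (the category mean when the true population mean is zero)
tt.cat <- function(n) {
    y <- rnorm(n) + 0.3
    y.cat <- cut(y, breaks = c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf), labels = FALSE)
    t.test(y.cat, mu = 3)$p.value
}

for(loop in 1:1000) {
    x$V2[loop] <- tt.cat(100)
}

mean(x$V2 < 0.05)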


> mean(x$V2 < 0.05)
[1] 0.737



Now my power is 0.74 - I lost some information when I categorized the scale, and hence I lost some power.  The more coarsely I categorize, the more information I lose, and the more power I lose.  (It's also worth noting that I haven't lost a great deal of power - people sometimes argue that you should not use parametric methods on a 5-point scale, but it hasn't really hurt us a lot - but it did hurt us.)

How many people is that equivalent to losing? We can use the power analysis to find out.

> power.t.test(power=0.74, delta=0.3, sd=1, type="one.sample")

     One-sample t test power calculation

              n = 77.24355
          delta = 0.3
             sd = 1
      sig.level = 0.05
          power = 0.74
    alternative = two.sided


So by categorizing, we get the same power that we would have got if we hadn't categorized and had 77 people. In other words, categorizing is the equivalent of losing 23% of our sample.


It's useful to think in terms of sample size equivalence. For example, it's most powerful to assign people equally to groups in experiments, but if one condition is more expensive or difficult than the other, you can get more power for your money (as it were) by randomizing unequally.  You need to think about sample size equivalents when you do that.

Here's a paper on that, http://www.jeremymiles.co.uk/mestuff/publications/p39.pdf
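To make that concrete, here's a toy example (the effect size, budget and costs are all made up, and it uses the pwr package): suppose each treatment participant costs four times as much as each control participant. With a fixed budget, putting roughly twice as many people in the cheap arm as in the expensive arm - the ratio of the square roots of the costs - buys more power than splitting 1:1.

library(pwr)

d      <- 0.3   # made-up effect size
budget <- 500   # made-up budget, in cost units
c.ctrl <- 1     # cost per control participant
c.trt  <- 4     # cost per treatment participant

# Equal allocation: each pair of participants costs c.ctrl + c.trt
n.equal <- floor(budget / (c.ctrl + c.trt))      # 100 per arm
pwr.t2n.test(n1 = n.equal, n2 = n.equal, d = d)$power

# Unequal allocation: sqrt(c.trt / c.ctrl) = 2 controls per treatment participant
r      <- sqrt(c.trt / c.ctrl)
n.trt  <- floor(budget / (r * c.ctrl + c.trt))   # 83 in the expensive arm
n.ctrl <- floor(r * n.trt)                       # 166 in the cheap arm
pwr.t2n.test(n1 = n.ctrl, n2 = n.trt, d = d)$power

The second call comes out a few percentage points higher, for the same money.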


If you do group randomized studies (for example, kids in schools) you also need to think about power in terms of sample size equivalence.  If you have kids in schools and you don't randomize by groups, you'll lose some power because of contamination - some kids will get the wrong treatment. If you do randomize by groups, you lose power because of the design effect (and this email is already long and boring enough, so I won't go into it).  Here's an extremely long and tedious report on that: http://www.hta.ac.uk/execsumm/summ1143.htm
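(OK, the quick version, because it's only a couple of lines - the cluster size and ICC below are made-up numbers:)

# Design effect for cluster randomization: deff = 1 + (m - 1) * icc
m    <- 20     # made-up: 20 kids per school
icc  <- 0.05   # made-up intraclass correlation
deff <- 1 + (m - 1) * icc    # 1.95
1000 / deff                  # 1000 clustered kids behave like ~513 independent ones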


If you're interested in this sort of thing, here are some papers worth reading which are written by people much cleverer than me:

Bollen, K. A., & Barb, K. H. (1981). Pearson's r and coarsely categorized measures. American Sociological Review, 46, 232-239.

Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement, 7, 249-253.

DeCoster, J., Iselin, A.-M. R., & Gallucci, M. (2009). A conceptual and empirical examination of justifications for dichotomization. Psychological Methods, 14(4), 349-366. 

MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19-40. 

McClelland, G. H. (1997). Optimal design in psychological research. Psychological Methods, 2(1), 3-19.

Muthén, B. O. (2006). Should substance use disorders be considered as categorical or dimensional? Addiction, 101, 6-16.

Muthén, B. (2001). Second-generation structural equation modeling with a combination of continuous and categorical latent variables: new opportunities for latent class-latent growth modeling. In L. M. Collins & A. G. Sayer (Eds.), New methods for the analysis of change. Washington DC: APA.

Senn, S. J. (2003). Dichotomania. British Medical Journal, rapid response, 327. http://bmj.bmjjournals.com/cgi/content/full/327/7428/0-h

Streiner, D. L. (2002). Breaking up is hard to do: the heartbreak of dichotomizing continuous data. Canadian Journal of Psychiatry, 47, 262-266.

They are mostly talking about dichotomization, but it's all true (just less so) for categorization.

And incidentally, simulations are great. I just ran 2000 experiments. If you can sneak a simulation into your thesis, you've got a chapter without having to collect any data, or, you know, talk to anyone.

Jeremy