Hi All,
I have 2 extremely large samples (2 million+ observations each) for which I
would like to compare the means. The data are non-Normal (highly skewed), so
a classic t-test would be questionable due to its Normality assumption.
I have thought about trying the following:
1. Transform the data, for example by taking logs, to "Normalize" them, and
then apply the t-test to the logged data. However, with such large sample
sizes the standard errors will be tiny and the confidence intervals very
narrow, so even the smallest difference in the means would be flagged as
significant (at alpha = 5%). Can this be remedied by choosing a much smaller
alpha (i.e. << 5%)?
2. Select a number of non-overlapping subsamples from each original sample
and generate a sampling distribution of their averages. By the central limit
theorem, although the original sample is skewed, the distribution of the
subsample averages is approximately Normal. I can then apply a t-test to the
2 distributions of sample means. For example:
Sample A (2 million observations)
Sample B (2 million observations)
Divide sample A into 100 bins, each of size 20,000, and compute the mean of
each bin to get a distribution of sample means (call it sample A') from
sample A. Repeat this for sample B to get sample B'.
Now samples A' and B' are distributions of sample means (where n, the size of
each bin, is 20,000) and are approximately Normally distributed. Apply the
t-test to samples A' and B' to establish whether or not the means of the
original sample A and sample B are statistically different.
3. Use a non-parametric test such as the Wilcoxon rank-sum (Mann-Whitney)
test on the raw data from the 2 original samples, since the samples are
independent rather than paired. Again, how is this test affected by such a
large sample size?
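
To make options 1 and 2 concrete, here is a minimal sketch in Python
(standard library only). Everything in it is an assumption for illustration:
the data are simulated lognormal draws rather than my real samples, n is
scaled down to 200,000 per sample, the helper names (mean_var, welch_z,
bin_means) are made up, and the p-values use a Normal approximation instead
of the t distribution. Note that a test on the logged data compares the means
of the logs, which is not the same thing as comparing the raw means.

```python
import math
import random
from statistics import NormalDist, fmean


def mean_var(x):
    """Return the sample mean and the unbiased sample variance of x."""
    m = fmean(x)
    v = math.fsum((xi - m) ** 2 for xi in x) / (len(x) - 1)
    return m, v


def welch_z(a, b):
    """Welch-type statistic for mean(a) - mean(b).

    The two-sided p-value uses a Normal approximation, which is only
    reasonable when both samples are large.
    """
    mean_a, var_a = mean_var(a)
    mean_b, var_b = mean_var(b)
    diff = mean_a - mean_b
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    z = diff / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return diff, se, z, p


def bin_means(x, n_bins):
    """Split x into n_bins equal, non-overlapping bins; return each bin's mean."""
    size = len(x) // n_bins
    return [fmean(x[i * size:(i + 1) * size]) for i in range(n_bins)]


# Simulated stand-ins for samples A and B (assumption: lognormal, tiny shift).
random.seed(1)
n = 200_000  # scaled down from 2 million so the sketch runs quickly
a = [random.lognormvariate(0.00, 1.0) for _ in range(n)]
b = [random.lognormvariate(0.02, 1.0) for _ in range(n)]

# Option 1: test on the logged data, plus a standardized effect size.
log_a = [math.log(v) for v in a]
log_b = [math.log(v) for v in b]
diff_log, se_log, z_log, p_log = welch_z(log_a, log_b)
pooled_sd = math.sqrt((mean_var(log_a)[1] + mean_var(log_b)[1]) / 2.0)
cohen_d = diff_log / pooled_sd  # difference in units of standard deviations

# Option 2: 100 non-overlapping bins per sample, then test the bin means.
a_means = bin_means(a, 100)
b_means = bin_means(b, 100)
diff_bin, se_bin, z_bin, p_bin = welch_z(a_means, b_means)

# For comparison: the same statistic computed directly on all the raw data.
diff_raw, se_raw, z_raw, p_raw = welch_z(a, b)

print(f"logged data: diff={diff_log:.4f} z={z_log:.2f} p={p_log:.2g} d={cohen_d:.4f}")
print(f"bin means:   diff={diff_bin:.4f} se={se_bin:.4f} z={z_bin:.2f}")
print(f"raw data:    diff={diff_raw:.4f} se={se_raw:.4f} z={z_raw:.2f}")
```

With equal-sized bins that use every observation, the mean of the bin means
is just the overall mean, so diff_bin equals diff_raw up to rounding. I would
expect the two standard errors to be close as well, in which case option 2
would not obviously be any less sensitive than a direct test on all the data.
The standardized difference (cohen_d here) may be a more useful guide to
practical importance than the p-value.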
Any thoughts, references or suggestions would be most welcome.
Best Regards,
Richard.