Dear all,
As a new user, I wasn't aware that answers to queries should be posted
back to the mailing list. Someone kindly advised me to do so, so here
goes:
PROBLEM:
--------
I had problems running a chi-square test to check the normality of
rounded data. A Kolmogorov-Smirnov test could not give usable p-values
because of ties (replicated values) in the data. The data appeared
quite normal on the QQ plot, but the chi-square test strongly
rejected the hypothesis of normality. That test was performed on
standardized data.
SOLUTION 1: Thanks (mostly) to M. Roberts and Dr. D. Chanter
---------
The first solution was NOT to standardize the data. Because of the
rounding, the data are already naturally binned (see the QQ plot), and
this provides a good way to create bins. The rounding produces integer
values, so bins were created at +/- 0.5 around each value. Tail bins were
pooled so that observed frequencies remained acceptable (>5).
With this design, the chi-square test produced a p-value of 0.26, thus
not rejecting the hypothesis of normality at the 0.05 level, which is
consistent with the QQ plot.
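For anyone who wants to try this, here is a rough R sketch of the procedure
described above. It is not the exact code I ran: the data are simulated
(a hypothetical stand-in for the yield variable), the function name
chisq_rounded_normal and its min_count argument are made up for
illustration, and opening the two end bins to -Inf/+Inf is my own
assumption so that the expected probabilities sum to 1.

```r
# Chi-square normality test on integer-rounded data, using the natural bins.
chisq_rounded_normal <- function(x, min_count = 5) {
  # One bin per integer value, with edges at value +/- 0.5
  vals   <- seq(min(x), max(x))
  breaks <- c(vals - 0.5, max(vals) + 0.5)
  counts <- tabulate(x - min(x) + 1, nbins = length(vals))

  # Pool the lower tail until its bin holds more than min_count observations
  while (length(counts) > 4 && counts[1] <= min_count) {
    counts <- c(counts[1] + counts[2], counts[-(1:2)])
    breaks <- breaks[-2]
  }
  # Pool the upper tail in the same way
  while (length(counts) > 4 && counts[length(counts)] <= min_count) {
    k      <- length(counts)
    counts <- c(counts[seq_len(k - 2)], counts[k - 1] + counts[k])
    breaks <- breaks[-k]
  }

  # Assumption: the two end bins are opened to -Inf / +Inf so that the
  # expected probabilities sum to 1
  breaks[1] <- -Inf
  breaks[length(breaks)] <- Inf

  # Mean and sd are estimated on the original (unstandardized) scale
  probs    <- diff(pnorm(breaks, mean = mean(x), sd = sd(x)))
  expected <- length(x) * probs

  stat <- sum((counts - expected)^2 / expected)
  df   <- length(counts) - 1 - 2   # bins - 1 - two estimated parameters
  list(observed = counts, expected = expected, probs = probs,
       statistic = stat, df = df,
       p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

set.seed(1)
x   <- round(rnorm(300, mean = 50, sd = 3))  # hypothetical rounded data
res <- chisq_rounded_normal(x)
res$p.value
```

The degrees of freedom follow the same convention as in my original post
(number of bins minus 1, minus the 2 estimated parameters).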
SOLUTION 2: idea from some contributors as a "test"
-----------
This involves undoing the rounding by simulating noise between
-0.5 and 0.5 from a uniform distribution. This noise is then added
to the original data to rebuild a "continuous" dataset without replicates,
so that a Kolmogorov-Smirnov normality test can be computed.
This also led to not rejecting the normality hypothesis on my data.
However, this is only an experiment (not statistically rigorous), and it
gives different p-values depending on the simulated noise values. But I
think it's a nice idea too!
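A minimal R sketch of this idea, again on simulated data rather than my
actual dataset. The helper name jitter_ks is hypothetical, and fitting the
mean and sd from the jittered sample is an assumption on my part.
Repeating the jitter many times shows how much the p-value moves with the
simulated noise.

```r
set.seed(42)
x <- round(rnorm(300, mean = 50, sd = 3))  # hypothetical rounded data

# Add Uniform(-0.5, 0.5) noise to break the ties, then run a one-sample
# KS test against a normal with the sample mean and sd (assumption)
jitter_ks <- function(x) {
  x_cont <- x + runif(length(x), min = -0.5, max = 0.5)
  ks.test(x_cont, "pnorm", mean = mean(x_cont), sd = sd(x_cont))$p.value
}

# The p-value changes from one noise draw to the next
p_values <- replicate(200, jitter_ks(x))
summary(p_values)
```

Looking at the spread in summary(p_values) gives a feel for how unstable
a single jittered test is.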
Thanks to everyone who contributed and took an interest in this topic.
Kind regards,
Aziz Chaouch
On Fri, 13 Nov 2009 10:51:03 +0100, Martin Roberts
<[log in to unmask]> wrote:
> Hi Aziz,
> One problem with using the chi-square test in this way is that the
> result depends on an arbitrary decision about binning (how many bins,
> cut points). However, I would suggest that first standardising the data
> is a bad idea as you are ignoring the binning that already exists in the
> data as a result of the rounding, i.e. if your values are rounded to
> say 1 decimal place, then a measurement of 1.5, for example, actually lies
> anywhere in the bin 1.45 <= x < 1.55. Try first binning the data in
> accord with this using the original scale of the measurements and then
> calculate the expected counts and see if that changes your result.
> HTH
> Martin Roberts
>
>
> Martin Roberts,
> Research Fellow - MMC
> Anaesthesia Recruitment Validation Group
> Peninsula Medical School / SW Peninsula Deanery
> Plymouth, UK
>
> ________________________________________
> From: A UK-based worldwide e-mail broadcast system mailing list
> [[log in to unmask]] On Behalf Of Aziz Chaouch
> [[log in to unmask]]
> Sent: 13 November 2009 06:48
> To: [log in to unmask]
> Subject: Chisquare test for normality of rounded data
>
> Hi,
>
> I'm using R to analyze the normality of a variable (yield) which is
> supposed to be continuous. However (supposedly) due to the precision of
> the measuring device, the data are rounded and thus many replicated
> values appear. In fact, when looking at the normal qqplot of the data,
> they appear quite normal, but the qqplot has a "stairs-like" shape
> because the rounding makes the data look discrete.
>
> I first tried a KS test but got a warning that the presence of ties
> (replicated values) makes the calculation of p-values impossible. OK, so I
> thought about using a chi-square goodness-of-fit test to check whether
> those "discrete-like" data can be assumed normal. First I standardized the
> data and cut it into 12 bins of approximately equal width. The observed
> count in each bin was computed, and each bin contains at least 7
> observations. Then I computed the expected counts for these bins under the
> null hypothesis (normality). Then I computed the chi-square statistic and
> got a value of 72.4. Using 12-(2+1)=9 degrees of freedom (12 bins, and I
> estimated 2 parameters when standardizing the data), this gave a p-value
> of the order of 10^-12, thus strongly rejecting the hypothesis of
> normality.
>
> I assume there was something wrong in how I did this because the qqplot
> appeared really close to normal (see it here:
> http://img18.imageshack.us/img18/7787/yieldh.jpg).
>
> I guess there is an issue on how the bins were created or simply that a
> Chi-square test is not appropriate in this situation. Therefore a few
> questions:
>
> - Is there a proper way to cut a "continuous BUT discrete-like" variable
> (due to rounding) to build a chisquare test for normality?
>
> - What should I care for when creating the bins?
>
> - Is this an issue related to the bins containing the same value
> replicated X times?
>
> - Is there any other goodness of fit test for normality that would be
> helpful in such circumstances (rounded data) and would provide accurate
> p-values?
>
> Thanks a lot!
>
> Aziz