Here's a problem that has arisen on the hi-fi mailing list of which I'm
listowner. I haven't received a response to my thoughts from anyone on that
list as yet, so I thought I'd run it up the flagpole here.
AB-X testing is a form of blind testing of hi-fi components in which two
known pieces of equipment, A and B, are demonstrated for a listener,
followed by a randomly chosen unknown, X, that is one or the other. A
chi-square statistic is then used to test the null hypothesis that
listeners cannot distinguish between these pieces of equipment.
AB-X testing has been criticised for its possible susceptibility to type II
error if the true detectability of the difference is not near unity. That
is, if the difference is subtle enough that it is not always apparent,
given normal fluctuations in music content and listener attentional state,
the AB-X paradigm will require a huge number of trials to demonstrate the
effect. A table from Levenson (1985) illustrates the point:
______________________________________________________________________________
                Type I Error             Type II Error
                ____________   _________________________________
  N     r          alpha        p=.6     p=.7     p=.8     p=.9
______________________________________________________________________________
 16    13         0.0106       0.9349   0.7541   0.4019   0.0684
 16    12         0.0384       0.8334   0.5501   0.2018   0.0170
 16    11         0.1051       0.6712   0.3402   0.0817   0.0033
 16    10         0.2272       0.4728   0.1753   0.0267   0.0005
 16     9         0.4018       0.2839   0.0744   0.0070   0.0001
 50    32         0.0325       0.6644   0.1406   0.0025   0.0000
 50    31         0.0595       0.5535   0.0848   0.0009   0.0000
100    59         0.0443       0.3775   0.0072   0.0000   0.0000
100    58         0.0666       0.3033   0.0040   0.0000   0.0000
______________________________________________________________________________
N is the number of trials, and r is the number of correct identifications
required to reach significance; the resulting type I error (alpha) is shown
in the third column. Type II error can be found in the four columns on the
right, depending on the value that one assigns to p, the true detectability
of the difference - that is, the proportion of trials on which an observer
would detect this difference, given a very large set of samplings. It is
apparent from the table that any effect that isn't evident most of the time
will not be confirmed by a reasonably sized AB-X study.
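For anyone who wants to check these figures, they come straight from the
binomial distribution. A few lines of Python (my own sketch, not from
Levenson; the row tested is the first one in the table) reproduce them:

```python
from math import comb

def binom_tail(n, r, p):
    """P(X >= r) for X ~ Binomial(n, p): the probability of at least
    r correct identifications in n trials when the per-trial
    probability of a correct call is p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

N, r = 16, 13
# Type I error: a pure guesser (p = 0.5) reaches the criterion by luck.
alpha = binom_tail(N, r, 0.5)
# Type II error: a listener with true detectability p = .6 misses it.
beta = 1 - binom_tail(N, r, 0.6)

print(round(alpha, 4), round(beta, 4))  # 0.0106 0.9349, matching row one
```

The same function with p = .7, .8, .9 gives the remaining columns.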
A further property of this paradigm is that effects which are apparent half
of the time or less can never be confirmed because the worst possible
performance is chance, or 50%, and subjects are effectively forced to guess
each time. I'd be interested in any solutions to this problem that members
might be able to suggest. One that occurs to me is to allow a third choice,
"Don't Know", which effectively removes the forced-choice contingency and
the element of guessing. If a subject had no strong feeling about which
test piece he had heard that time, he could effectively pass. If you
recorded passes as well as choices, estimates of the detectability of the
effect, as well as the accuracy of choosing, could be easily computed. This
is not a perfect solution, as it depends on the subject's insight as to how
certain he ought to be at a given moment, but I think it might go some way
toward redressing the sensitivity problem. Any other thoughts on this?
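To make the "Don't Know" idea concrete, here is one way the two quantities
might be estimated from a run of trials (a sketch only; the counts are
invented for illustration, and this treats every pass as an inaudible trial):

```python
# Hypothetical results from a run of AB-X trials with a "Don't Know" option.
trials = 40    # total presentations of X
passes = 15    # trials on which the subject declined to choose
correct = 22   # correct identifications among the decided trials

decided = trials - passes
# Detectability: fraction of trials on which the subject heard enough
# of a difference to commit to a choice.
detectability = decided / trials
# Accuracy: how often the subject was right when he did commit.
accuracy = correct / decided

print(f"detectability = {detectability:.3f}, accuracy = {accuracy:.3f}")
```

With these numbers the subject committed on 25 of 40 trials (detectability
0.625) and was right on 22 of those 25 (accuracy 0.88) - two figures that
the standard forced-choice design collapses into one.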
david