Hi all who are interested in this discussion.
I guess that there are some difficulties with the conceptual
distinction between measurement and assessment, and I think it is
worthwhile to try to solve these difficulties, as it is possible that
we share the same concerns despite speaking different languages.
Let me take the analogy of a special quantitative attribute, the price
of objects.
Test scores work as prices of responses because test item scores work
as prices. The price of a given m-tuple of responses is the sum (or
another arithmetic function) of the prices of each elementary
response. So, from this perspective, item responses are additive,
since we compute (not measure) the value of the whole test response as
a function of its elements (from now on I replace "price" with the
technical word "value").
But the ontology of the value of the test response is not of the same
kind as the ontology of a measurable, and hence quantitative,
attribute in the classical sense. The difference has been detailed by
Searle (1995) -- thank you to Paul B, who drew my attention to this
book a few years ago -- when he opposes "brute facts" to "social
facts" -- that is, roughly, institutions (like money and, I would add,
test scores).
The evaluative space developed by psychologists who are committed to
measurement in Paul N's sense serves to evaluate, I mean to give
values to, people (or to their observed behaviors). But these values
do not take their meanings from scientific discovery (I mean the
discovery of experimental laws like a miraculous Guttman scale, in the
sense that measurability is an empirical, not merely conventional,
issue); they take their meanings from non-scientific language, from
‘common-or-garden’ words like intelligence, anxiety and so on, and
from numerical bricolage (the art of numerical aggregation, which
allows one to speak the language of low, high, decrease, increase,
etc., and to avoid concepts such as partial order, non-additivity,
etc.).
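As a small illustration of that last parenthesis (the response patterns below are hypothetical): aggregation can assign the same value to m-tuples that are incomparable under the natural componentwise (dominance) order, so the language of low/high erases structure that the partial order preserves:

```python
# Componentwise (dominance) order on response m-tuples:
# a <= b iff every component of a is <= the matching component of b.
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b))

def comparable(a, b):
    return dominates(a, b) or dominates(b, a)

# Two hypothetical response patterns with the same aggregate score...
a = (1, 0, 1)  # items 1 and 3 passed
b = (0, 1, 1)  # items 2 and 3 passed
print(sum(a) == sum(b))   # True: aggregation says "equal"
print(comparable(a, b))   # False: the partial order says "incomparable"
```

The total score collapses a partial order into a total one, and that collapse is a descriptive choice, not an empirical finding.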
And, of course, once an institution is built, it works causally. In
this perspective, I understand Paul N's attachment to "fuzzy
measurement" as the result of his need to intervene. Test scores (or
other "fuzzy measurements") may be useful to intervene (according to
which rules?) or at least to monitor macro-level phenomena.
Let me take an example: IQ scores are decreasing. OK, we have a
macro-level signal. But we are pretty sure that this is not because
the amount of intelligence in people is decreasing, since intelligence
is not defined as a quantity. Scores decrease because fewer test items
are correctly answered, but the phenomenon that causes this
statistical fact is not correctly described, and hence not understood
or known, by using the language of IQ scores or the language of a
"latent trait". The concept of a latent trait is clearly false from
the standpoint of scientific truth.
To conclude, I maintain that it would be wise to preserve the
distinction between measurements and numerical indices, since it has
ontological, conceptual, and methodological consequences that would be
lost if the distinction were blurred. And I believe there is room to
build an applied psychology concerned with useful indices, which
brings me back to my earlier point: psychological assessment is a
matter of science if you want, but a science that is driven by
decision-making problems in specific social and political settings.
Maybe psychologists committed to assessment should also learn to
analyse decision-making problems in such settings... to be able to
explain the social and political usefulness of assessment -- a huge
task.
Best,
Stephane
Paul Newton <[log in to unmask]> a écrit :
> … although, if what you’re essentially saying is that adding item scores
> generally results in poor measurement, because it is (implicitly or
> explicitly) based on a poor measurement model, even though we often put up
> with poor measurement for largely practical reasons, then I certainly do
> agree with that!
>
> p
>
>
>
> From: IDANET (Individual Differences and Assessment Network)
> [mailto:[log in to unmask]] On Behalf Of Paul Newton
> Sent: 15 November 2017 18:31
> To: [log in to unmask]
> Subject: Re: Undergraduate readings?
>
>
>
> Hi Stephane
>
>
>
> Would I agree? Well, possibly yes and possibly no!
>
>
>
> I do agree that measurement is a purpose-driven activity. That’s another of
> the messages from Mari and colleagues in the metrology literature.
>
>
>
> So, I do agree that the decision as to whether to model at the
> macro-attribute-level rather than at the micro-attribute-level ought to be
> purpose-driven.
>
>
>
> In the educational measurement literature, a good example is the distinction
> between a diagnostic/formative purpose (low-level) and a placement purpose
> (high-level). The low-level purpose needs low-level measurement, to inform a
> more specific intervention (a suitable instructional unit, say). The
> high-level purpose needs high-level measurement, to inform a more general
> intervention (a suitable instructional group).
>
>
>
> I guess it’s probably true that we’ll normally be able to model/measure the
> lower level(s) with greater precision than the higher level. But I wouldn’t
> conclude, from that, that the higher level model/measurement is inherently
> less real, or less scientific, or less useful. It’s just more fuzzy
> (implying greater definitional uncertainty).
>
>
>
> On the other hand, I don’t agree (if this is what you mean) that
> purpose-driven measurement ought to be characterised/modelled/evaluated
> non-scientifically, i.e. purely instrumentally. If there’s no science behind
> the measurement, then what kind of activity is it?
>
>
>
> Again, following Messick, I think that ‘blind empiricism’ (a.k.a. ‘pure
> prediction’) is, and ought to be, a thing of the past. We can’t (properly)
> justify any measurement/assessment on purely instrumental grounds; even when
> we accept that measurement is fundamentally a pragmatic activity. (If the
> model doesn’t work against evidence, then it shouldn’t be used in an
> instrumental way. Indeed, if it doesn’t work against evidence, then why
> would it work in an instrumental way?)
>
>
>
> But maybe I’ve not quite got the subtlety of your response?
>
>
>
> Best wishes
>
>
>
> Paul
>
>
>
> From: IDANET (Individual Differences and Assessment Network)
> [mailto:[log in to unmask]] On Behalf Of Stéphane Vautier
> Sent: 15 November 2017 16:55
> To: [log in to unmask] <mailto:[log in to unmask]>
> Subject: Re: Undergraduate readings?
>
>
>
> Hi Paul,
>
>
>
> I think we may be convergent if we distinguish two goals, namely to test a
> model against evidence, and to use a model in an instrumentalist way, that
> is, to solve a given (decision) problem with the help of a model, even if
> this model is known to be an oversimplification of known (manifest)
> phenomena in a given descriptive system.
>
>
>
> For instance, if you test that multivariate descriptions are ordinal
> measurements, your goal is to check that the observed descriptions behave as
> simply ordered objects. And the refutation (or, synonymously, the
> falsification) of the model can be a real contribution to the current state
> of knowledge. In this perspective, I believe that the scientific task of
> psychologists is to spread the fact that scores are poor descriptions of the
> observations they are trained to deal with, not that scores are measurements
> of … no one knows what is measured, thanks to verbal magic and reification
> – e.g., intelligence is what my test does measure.
>
>
>
> Now, the assessment problem based on test scores. My question is: why simply
> order learners on a score scale when we know perfectly well that the scores
> are not correct descriptions of their performances (which are multivariate)?
> Why do we need to replace the multivariate definition/construction of
> performance (made by the test developers, who carefully selected distinct
> test items) with a one-dimensional compression? Sometimes it may be useful,
> for instance for selection purposes when we seek, say, the 10 “best”
> performers; sometimes not, for instance when we want to provide some
> feedback to the learners on their performances.
>
>
>
> My point is that test users are responsible for the relevance of their
> descriptive choices when they are engaged in any problem solving. Test
> scores are not mandatory. One additional example: sometimes, assessment of
> anxiety does not require treating anxiety as a one-dimensional construct,
> as clinicians may be interested in a detailed description of an m-tuple of
> clinical signs.
>
>
>
> Conclusion: we have to be clear on whether we, in such and such a context,
> pursue scientific vs. utilitarian purposes. Not the same.
>
>
>
> Would you agree?
>
>
>
> Stéphane
>
>
>
>
>
> De : IDANET (Individual Differences and Assessment Network)
> [mailto:[log in to unmask]] De la part de Paul Newton
> Envoyé : mercredi 15 novembre 2017 17:17
> À : [log in to unmask] <mailto:[log in to unmask]>
> Objet : Re: Undergraduate readings?
>
>
>
> Hi Stephane
>
>
>
> As I see it: the principle of science is to model reality; while the
> practice of science is to attempt to reach consensus over the
> accuracy/usefulness of any particular model, on the basis of empirical
> evidence and logical analysis. The same applies for measurement.
>
>
>
> A critical point, here, is that the models that we use in science – i.e. the
> concepts through which we measure – are (over-)simplified abstractions of
> reality. So, for instance, when we say that we measure the diameter of a
> hypodermic needle, we are not actually measuring a single thing. It’s a
> single thing in the model – sure – but not in reality (when you look at the
> bore of a needle under a high power magnifier, it’s more like a cave with
> stalactites and stalagmites). This ambiguity is embodied in the (measurement
> science) idea of definitional uncertainty – which is a really important
> concept for social science, incidentally. That’s why hard-science
> metrologists tend not to advocate the idea of a ‘true score’ as an
> organising concept, nowadays, appreciating that there is always a range of
> scores that will be compatible with the model/definition that’s being used.
>
>
>
> In other words, science, including measurement, is about modelling, and
> consensus building. The critical issue is whether there is sufficient
> (empirical and logical) reason to believe that the model provides a
> plausible account of reality.
>
>
>
> Using my educational measurement example, the reality (that no-one doubts)
> is that some people graduate with far more knowledge/skill/understanding
> than others. The tricky bit is how to build a measurement model that
> distinguishes (with sufficient precision) between students at different
> points on that scale. The first port of call [but only the first port] in
> validating any such model (and its measuring procedure) is, of course, the
> judgement of experts, e.g. maths teachers. If the rank order from the
> measuring procedure is completely contradicted by the judgement of experts,
> then either the model is wrong, or the measuring procedure, or both. Thus,
> educational measurement is falsifiable; in essentially the same way as for
> other kinds of measurement.
>
>
>
> Having said all of this, I don’t think that we have many good models in
> educational measurement. And we’ve been led astray by people who know loads
> about statistics, yet don’t have a clue about what their statistics are
> meant to represent. And, since this is the more general point that we’ve
> been discussing, I do agree!
>
>
>
> However, in my view, the right kinds of model are very unlikely to be
> unidimensional in any way, shape or form. (Isn’t it obvious (from analysis
> of the concept itself) that the real structure of attainment in maths is not
> unidimensional?) Attainment in education involves a progression in
> capability from novice to expert (a la Glaser, Dreyfus and Dreyfus, etc.) so
> that’s the kind of measurement model that we need to be thinking in terms
> of. That’s probably not about specifying logical relationships. It’s
> probably more about identifying underlying patterns. That, incidentally, is
> why I quite like the baldness scale analogy for education.
>
>
>
> We agree on the problem of judges. And I’d say that the lack of agreement
> between judges sets an upper limit on the validity of any measurement
> model/measuring procedure in education. But I’d also say that’s true for any
> model/procedure, whatever the discipline.
>
>
>
> Again, my main concern is that, when you put ‘mere assessments’ in the
> ‘not-science’ bucket, you imply that they should be judged according to
> less-than-scientific standards… and that makes a bad problem even worse!
>
>
>
> Sorry if I’m just waffling now!
>
>
>
> Paul
>
>
>
>
>
> From: IDANET (Individual Differences and Assessment Network)
> [mailto:[log in to unmask]] On Behalf Of Stéphane Vautier
> Sent: 15 November 2017 09:20
> To: [log in to unmask] <mailto:[log in to unmask]>
> Subject: Re: Undergraduate readings?
>
>
>
> Hi Paul,
>
> I insert my comments in your text.
>
> Best.
>
>
>
> Stephane
>
>
>
> De : IDANET (Individual Differences and Assessment Network)
> [mailto:[log in to unmask]] De la part de Paul Newton
> Envoyé : mercredi 15 novembre 2017 09:38
> À : [log in to unmask] <mailto:[log in to unmask]>
> Objet : Re: Undergraduate readings?
>
>
>
> Hi Paul and Stephane
>
>
>
> Lots to talk about here, and not enough time, so apologies if I don’t cover
> everything sensibly; but here’s a few thoughts…
>
>
>
> The idea of ‘redefining’ the word measurement (Paul B) isn’t quite right,
> since there isn’t (currently) a generally accepted, universal definition.
> It’s work in progress.
>
>
>
> However, I am arguing for a wider definition. That’s because what we ought
> to be doing, in educational measurement, is to adopt a scientific approach
> (see Paul B below - agreed) to obtain information about ‘real’ properties of
> human behaviour, which can properly be described in terms of amounts. I’m not
> assuming that educational measurement has a unit – so it doesn’t necessarily
> have a Quantity under a strong definition of that term – but it’s
> essentially measurement in all other respects, I would argue. So it makes
> sense (to me) to design educational assessments on a measurement model and,
> most importantly, to evaluate them on a measurement model.
>
> SV: So, the problem is very simple. We have to define (i) the set of the
> possible states of our target (hypothetical) attribute, say the set A, (ii)
> the set of our observables, say the set O, and (iii) the relation R from A
> to O, in such a way that, observing an o, we can know its antecedents in A.
> And (iv), we have to define the falsifiers of R, that is, to be creative in
> designing a test experiment. If R has no falsifiers, given what we can
> imagine as a test of a theory, R is not a scientific theory, because we do
> not know how to test it against experience, which can be made in O or O^2,
> etc. But R can serve as a convention, which allows one to assess, not to
> measure, the state of someone in A given his o (DSM practice). And,
> consequently, assessment is not based on scientific – that is, falsifiable –
> knowledge; it is based on a social consensus enabling one to speak the
> language of A starting from the language of O.
>
>
>
> Note I say ‘measurement model’ not a ‘psychometric model’ – as psychometric
> models often prejudge a certain kind of structure, without actually testing
> it effectively. (Testing it is both a logical/linguistic and an empirical
> matter.) Josh McGrane and Andy Maul have been talking about this recently.
>
>
>
> Both Michell and Maraun make a splendid case against assuming that certain
> psychological attributes have a certain kind of structure (Quantitative).
> But neither of them (to my mind) presents a strong case that
> psychological/educational attributes have no quantitative structure, and
> therefore that it’s illegitimate to model them as measurements, in a wider
> sense. So the question is open. Simply to trivialise and dismiss
> non-Quantitative [assessment] by calling it merely ‘assessment’ makes no
> positive case for what this ‘assessment’ thing actually is. Is it purely
> ‘description’ or ‘representation’? Maybe… but maybe it’s actually closer to
> our conventional usage of ‘measurement’ than say ‘description’… albeit
> adopting a wider definition.
>
>
>
> In educational measurement, I find the case for educational attainment
> attributes having a ‘real’ quantitative structure (of some sort) hard to
> reject. How many of us would be prepared to argue that there isn’t a really
> important and meaningful sense in which some students come out of their
> studies having attained at a far higher level than others?
>
> SV: yes of course. But the concept of level, here, is a qualitative concept,
> as the descriptive device we use to specify a given state is a Cartesian
> product formed by combining several criteria, not a segment.
>
>
>
> Similarly, to accept this, but to describe the attribute as somehow not
> ‘real’ seems equally implausible. So that’s the (common-sense,
> logical/linguistic) starting point for me. Or, to put it another way,
> teachers don’t routinely say that their graduating students practice (say)
> maths like a Bear, or like a Fox, or some other purely qualitative
> description/representation. The (primary) classification that teachers (and
> the rest of us) make regarding educational attributes is quantitative, with
> a small ‘q’, so that’s what we (measurement professionals) ought somehow to
> be recognising. IMHO.
>
>
>
> Right, now to the issue of usage! I’ve spent a lot of time arguing (and
> publishing) on the problems that measurement professionals have with the
> word ‘validity’ which is even more contested than the word ‘measurement’.
> I’ve argued strongly (I hope!) that the case for or against any particular
> usage is not a matter of logical necessity, but of consequences.
>
> SV: Consequences are a matter of logical necessity, or not?
>
>
>
> I’ve also argued that the consequential argument for or against a narrow vs.
> broad definition of validity is probably quite evenly balanced. Which is a
> problem.
>
>
>
> In relation to the use of ‘measurement’ for educational and psychological
> attributes, the recent debate seems only to acknowledge the consequential
> case against (see, e.g., Stephane, in another email). I work in a context in
> which lots of people have argued a different case for ‘educational
> assessment’ over ‘educational measurement’. And one of the key arguments,
> here, seems to be that educational ‘attributes’ are just pure ‘social
> constructions’ and (I’m simplifying, but not unreasonably, I think) it’s
> therefore fine for different judges to evaluate the same people in different
> ways, or, in other words, reliability is seriously over-rated. If that is
> your view of ‘assessment’ then it really isn’t measurement, in any sense. By
> reclaiming the concept of ‘educational measurement’ we stress that it’s a
> special kind of description/representation in which the inter-subjectivity
> aspect (calibration to a common reference standard) is critical (which seems
> to be what Mari puts at the heart of his epistemological conception). That’s
> important, I think. Very important.
>
> SV: I think I agree. The issue is not measurement, but description, and the
> interchangeability of the “judges” is a serious problem – this is why tests
> are interesting descriptive techniques because the observations can be said
> to be objective in the weak sense of judge interchangeability.
>
>
>
> Too little time to continue! But let me just also reference Eran Tal on
> measurement as modelling. Read that work and you’ll realise that all of the
> obvious fuzziness in educational and psychological measurement is also there
> in the rock-hard sciences too. Again, the differences are more of scale than
> of kind.
>
>
>
> Oh yeh… I’m firmly with Samuel Messick that justifying any test or
> assessment on purely pragmatic grounds is extremely dangerous, for all sorts
> of reasons. First, we have never been able to crack the ‘criterion problem’
> and we never will. Second, if we don’t know the explanation for score
> variance, then we’re likely to do harm unconsciously, e.g. by reinforcing
> (and never questioning) pre-existing biases in our tests/assessments.
>
> SV: :).
>
>
>
> Must go. Interesting debate!
>
>
>
> Best wishes
>
>
>
> Paul
>
>
>
>
>
> From: IDANET (Individual Differences and Assessment Network)
> [mailto:[log in to unmask]] On Behalf Of Paul Barrett
> Sent: 14 November 2017 20:08
> To: [log in to unmask] <mailto:[log in to unmask]>
> Subject: Re: Undergraduate readings?
>
>
>
> Hello again Paul N!
>
>
> This leaves open the possibility that, for instance, ordinal ‘measurement’
> counts as measurement; perhaps even that nominal ‘measurement’ counts as
> measurement. Similarly, it leaves open the possibility of measuring all
> sorts of phenomena, like hardness, or storms, or baldness.
>
> Not a problem - all you are doing is redefining the word ‘measurement’ to
> include any kind of phenomenal observation scaling/classification.
>
>
>
> I work in educational measurement. And I would suggest that attainment in
> mathematics (for instance) is just as measurable as baldness; possibly more
> so.
>
> Again, perfectly reasonable given you take care to align the methods you use
> to manipulate magnitudes of attainment and baldness with the properties of
> the attributes in question. I.e., if mathematical attainment is indexed as
> the number of items solved correctly on a particular test, then use of
> integer arithmetic/frequency/count-based analysis methods is fine. You may
> also construct a more complex ordered-class assessment which takes into
> account a composite of indicators - teacher ratings as well as test
> performance etc. However, if you wish to assert that mathematical attainment
> varies as a quantity, you need to provide empirical evidence that it does so
> before using methods of analysis whose validity depends upon that
> assertion being ‘just so’. Same as with ‘baldness’. The latter is the
> domain of scientific exploration of a phenomenon.
>
>
>
> I reiterate: it is about the properties of the attribute required to
> substantiate a claim of whatever you choose to call a ‘measurement’ of it,
> and the capacity of the methods/computations you use to manipulate those
> ‘measurements’ to possess a certain validity, given those attribute
> properties.
>
>
>
> Regards .. Paul
>
>
>
> Chief Research Scientist
>
> Cognadev Ltd.
>
> __________________________________________________________________________________
>
> W: <http://www.pbarrett.net/> www.pbarrett.net
>
> E: <mailto:[log in to unmask]> [log in to unmask]
>
> M: +64-(0)21-415625
>
>
>
> From: IDANET (Individual Differences and Assessment Network)
> [mailto:[log in to unmask]] On Behalf Of Paul Newton
> Sent: Wednesday, November 15, 2017 7:09 AM
> To: [log in to unmask] <mailto:[log in to unmask]>
> Subject: Re: Undergraduate readings?
>
>
>
> Hi Stephane
>
>
>
> Is ‘measurement science’ free of logical necessity? I don’t think that the
> meaning of words is essentially a matter of logical necessity – e.g. the
> word ‘measure’ – if that’s what you’re asking.
>
>
>
> In terms of Mari’s work, I’d definitely recommend reading it, because the
> following recollection won’t do it justice! But, as I recall it, he
> describes measurement in relation to: (a) objectivity in extracting
> information on the thing you’re measuring; (b) intersubjectivity in
> interpreting that information, such that the information means the same
> thing to every measurement user. This leaves open the possibility that, for
> instance, ordinal ‘measurement’ counts as measurement; perhaps even that
> nominal ‘measurement’ counts as measurement.
>
>
>
> Similarly, it leaves open the possibility of measuring all sorts of
> phenomena, like hardness, or storms, or baldness.
>
>
>
> I work in educational measurement. And I would suggest that attainment in
> mathematics (for instance) is just as measurable as baldness; possibly more
> so.
>
>
>
> Incidentally, I don’t see the point in insisting on using the word
> ‘assessment’ for measurement that’s not like length. I don’t see what it
> buys.
>
>
>
> Best wishes
>
>
>
> Paul
>
>
>
>
----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.