
Hi Paul B

 

Damn, you’ve sucked me in again, when I really just wanted to go to bed!

 

I get what Stephane and Joel are trying to do. But the underlying premise
seems to be that psychology ‘just needs to come into line’ with what the
real measurement folk mean by ‘measurement’. What I’ve been trying to say is
that there are plenty of real measurement folk who just don’t buy into the
really strong (and narrow) conception of measurement. They are looking for
alternative technical definitions that can, as you indicate, apply to any
use of the word ‘measurement’ in science.

 

I’ve recently edited a special issue of Assessment in Education on how best
to define the word ‘validity’, so I get how emotive this stuff is. I don’t
mean to denigrate the desire for technical clarity and useful consensus
definitions, which I’ve argued strongly in favour of in relation to validity
(although, in my heart of hearts, I’m pretty sure we’ll never reach
consensus there).

 

If you want to use ‘measurement’ in the classic, technical sense, that’s
fine. All I’m saying is that it’s not without potential negative
consequences. Equally, there are also potential positive consequences (for
the work I do) arising from adopting ‘measurement’ (broadly defined) as a
guiding framework. 

 

Incidentally, not all science is about seeking causes. Lots of science
doesn’t involve that. More to the point, I would argue that the things that
we need to measure, in education and psychology, are not internal structures
that somehow cause external behaviour. That’s not the level at which
concepts like knowledge, skill and understanding apply. For me, science
comes into play when we take seriously the challenge of understanding these
concepts, and their proper application, and then work out methods for
accurately extracting and representing information about the people whom we
measure in terms of these concepts.

 

Again, you might say that’s not real science, I guess. But science isn’t
quite so easily circumscribed either! 

 

Again, though, I have no problem in condemning the unthinking use of unduly
simplistic psychometric models, which simply presume that attributes have a
certain kind of structure. Often these presumptions seem entirely
implausible to me (e.g. Quantitative structure, with a capital Q). But that
doesn’t mean that educational and psychological attributes have no
quantitative (small q) structure. And, if they do, then shouldn’t we be
trying to find ways to represent that structure, scientifically?

 

I sense we’re now going in circles. But it’s been an interesting discussion
all the same!

 

Cheers

 

Paul

 

From: IDANET (Individual Differences and Assessment Network)
[mailto:[log in to unmask]] On Behalf Of Paul Barrett
Sent: 15 November 2017 19:59
To: [log in to unmask]
Subject: Re: Undergraduate readings?

 

Hello Paul N.

 

You say, without qualification:

If there’s no science behind the measurement

 

Science (its practice) is about attempting to detect phenomena through
systematic observation, and to understand how and why those phenomena occur.

 

So, if you propose to investigate mathematical attainment as a scientist,
you first ‘detect’ the phenomenon itself (whatever it is that seems to
distinguish students based upon their performance in that domain of
mathematics), then you try to investigate what is causal for the phenomenal
observations and the empirically observed properties of that variation (e.g.
are magnitudes varying additively?).

 

The alternative is to detect the phenomenon, and make instrumental use of
the phenomenal observations for pragmatic purposes, where the goal is not to
investigate what is causal for them except at a speculative but meaningful
level of discourse. The primary purpose of what you are doing is to enable
you to order students reliably in something you call ‘mathematical
attainment’, where its observed variations can be described meaningfully
and with some consensus among those using them.

 

Stéphane is providing us with the formal specifications (and the properties
that must obtain) for a quantitative measurement. Michell does the same in a
text-based definition:

“the discovery or estimation of the ratio of a magnitude of a quantity to a
unit of the same quantity” (p. 222).

Michell, J. (1999). Measurement in Psychology: A Critical History of a
Methodological Concept. Cambridge University Press. ISBN 0-521-62120-8.
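
In symbols (an illustrative gloss, not Michell’s own notation): for a
magnitude q of a quantity and a unit magnitude u of the same quantity,
measurement is the discovery or estimation of the ratio r = q/u
(equivalently, the r such that q = r·u); e.g. a rod whose length is 1.8
times the standard metre has r = 1.8 when the unit is the metre.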

 

So, that is a crystal clear technical definition, which applies to any use
of the word ‘measurement’ in a quantitative science.

 

Psychology is not a quantitative science, because there is no empirical
evidence that any of the attributes it seeks to understand vary as
‘quantities’.

 

So, as with education, we are investigating/working in an area which may
best be described as a non-quantitative science. For clarity only, some,
like myself, will refrain from using the word ‘measurement’ because it has a
very specific technical definition; we use ‘assessment’ or ‘evaluation’
instead. Likewise, I refrain from using the word ‘variable’, preferring
‘attribute’, as ‘variable’ also has a very specific technical definition in
a quantitative science.

 

There is nothing more or less “scientific” in that use of terminology; it
just reflects the recognition that we are not ‘measuring’, manipulating, or
‘observing’ quantities, but, given the lack of evidence for anything else,
we are assessing/evaluating orders or classes of things.

 

However, others do not accept that the word ‘measurement’ possesses a
precise, technical definition. So, following Stevens, anything can become a
measurement if a number can be assigned by some rule.

 

So, if someone says “I’m measuring mathematical attainment”, I’ll accept it
as a ‘common-or-garden’ use of that term (Maraun, 1998), where the user is
not employing its technical definition. But if the full panoply of
quantitative methods is applied to its magnitudes, where the numbers
representing the magnitudes are now assumed to vary as quantities, and
claims are made about attainment based upon its varying as a quantity, then
that belief or assumption of quantity might be challenged in a court of law,
should someone choose to question an adverse decision made on the basis of
that claim. 

Maraun, M. D. (1998). Measurement as a normative practice: Implications of
Wittgenstein's philosophy for measurement in psychology. Theory &
Psychology, 8(4), 435-461.

 

Regards .. Paul

 

Chief Research Scientist

Cognadev Ltd.

__________________________________________________________________________________

W: www.pbarrett.net

E: [log in to unmask]

M: +64-(0)21-415625

 

From: IDANET (Individual Differences and Assessment Network)
[mailto:[log in to unmask]] On Behalf Of Paul Newton
Sent: Thursday, November 16, 2017 7:31 AM
To: [log in to unmask] <mailto:[log in to unmask]> 
Subject: Re: Undergraduate readings?

 

Hi Stephane

 

Would I agree? Well, possibly yes and possibly no!

 

I do agree that measurement is a purpose-driven activity. That’s another of
the messages from Mari and colleagues in the metrology literature.

 

So, I do agree that the decision as to whether to model at the
macro-attribute-level rather than at the micro-attribute-level ought to be
purpose-driven.

 

In the educational measurement literature, a good example is the distinction
between a diagnostic/formative purpose (low-level) and a placement purpose
(high-level). The low-level purpose needs low-level measurement, to inform a
more specific intervention (a suitable instructional unit, say). The
high-level purpose needs high-level measurement, to inform a more general
intervention (a suitable instructional group).

 

I guess it’s probably true that we’ll normally be able to model/measure the
lower level(s) with greater precision than the higher level. But I wouldn’t
conclude, from that, that the higher level model/measurement is inherently
less real, or less scientific, or less useful. It’s just more fuzzy
(implying greater definitional uncertainty).

 

On the other hand, I don’t agree (if this is what you mean) that
purpose-driven measurement ought to be characterised/modelled/evaluated
non-scientifically, i.e. purely instrumentally. If there’s no science behind
the measurement, then what kind of activity is it?

 

Again, following Messick, I think that ‘blind empiricism’ (a.k.a. ‘pure
prediction’) is, and ought to be, a thing of the past. We can’t (properly)
justify any measurement/assessment on purely instrumental grounds; even when
we accept that measurement is fundamentally a pragmatic activity. (If the
model doesn’t work against evidence, then it shouldn’t be used in an
instrumental way. Indeed, if it doesn’t work against evidence, then why
would it work in an instrumental way?)

 

But maybe I’ve not quite got the subtlety of your response?

 

Best wishes

 

Paul

 

From: IDANET (Individual Differences and Assessment Network)
[mailto:[log in to unmask]] On Behalf Of Stéphane Vautier
Sent: 15 November 2017 16:55
To: [log in to unmask] <mailto:[log in to unmask]> 
Subject: Re: Undergraduate readings?

 

Hi Paul,

 

I think we may converge if we distinguish two goals, namely to test a model
against evidence, and to use a model in an instrumentalist way, that is, to
solve a given (decision) problem with the help of a model, even if this
model is known to be an oversimplification of known (manifest) phenomena in
a given descriptive system.

 

For instance, if you test whether multivariate descriptions are ordinal
measurements, your goal is to check that the observed descriptions behave as
simply ordered objects. And the refutation (or, synonymously, the
falsification) of the model can be a real contribution to the current state
of knowledge. In this perspective, I believe that the scientific task of
psychologists is to spread the fact that scores are poor descriptions of the
observations they are trained to deal with, not that scores are measurements
of … no one knows what is measured, thanks to verbal magic and reification
– e.g., “intelligence is what my test measures”.

 

Now, the assessment problem based on test scores. My question is: why simply
order learners on a score scale when we know perfectly well that the scores
are not correct descriptions of their performances (which are multivariate)?
Why do we need to replace the multivariate definition/construction of
performance (which is made by the test developers, who carefully selected
distinct test items) with a one-dimensional compression? Sometimes it may be
useful, for instance for selection purposes when we seek, say, the 10 “best”
performers; sometimes not, for instance when we want to provide some
feedback to the learners on their performances.
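
For concreteness, a minimal sketch (illustrative only; the item domains are
hypothetical) of how a one-dimensional compression discards the multivariate
detail:

  # Illustrative sketch (hypothetical item domains): two learners with
  # different response profiles receive the same total score, so the
  # one-dimensional compression loses the multivariate information.
  profile_a = {"fractions": 1, "algebra": 0, "geometry": 1, "proof": 0}
  profile_b = {"fractions": 0, "algebra": 1, "geometry": 0, "proof": 1}

  score_a = sum(profile_a.values())   # 2
  score_b = sum(profile_b.values())   # 2

  print(score_a == score_b)       # True: identical scores...
  print(profile_a == profile_b)   # False: ...from quite different performances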

 

My point is that test users are responsible for the relevance of their
descriptive choices when they are engaged in any problem solving. Test
scores are not mandatory. One additional example: sometimes the assessment
of anxiety does not require treating anxiety as a one-dimensional construct,
as clinicians may be interested in a detailed description of an m-tuple of
clinical signs.

 

Conclusion: we have to be clear on whether, in such and such a context, we
are pursuing scientific or utilitarian purposes. They are not the same.

 

Would you agree?

 

Stéphane

 

 

From: IDANET (Individual Differences and Assessment Network)
[mailto:[log in to unmask]] On Behalf Of Paul Newton
Sent: Wednesday, 15 November 2017 17:17
To: [log in to unmask] <mailto:[log in to unmask]> 
Subject: Re: Undergraduate readings?

 

Hi Stephane

 

As I see it: the principle of science is to model reality, while the
practice of science is to attempt to reach consensus over the
accuracy/usefulness of any particular model, on the basis of empirical
evidence and logical analysis. The same applies to measurement.

 

A critical point, here, is that the models that we use in science – i.e. the
concepts through which we measure – are (over-)simplified abstractions of
reality. So, for instance, when we say that we measure the diameter of a
hypodermic needle, we are not actually measuring a single thing. It’s a
single thing in the model – sure – but not in reality (when you look at the
bore of a needle under a high power magnifier, it’s more like a cave with
stalactites and stalagmites). This ambiguity is embodied in the (measurement
science) idea of definitional uncertainty – which is a really important
concept for social science, incidentally. That’s why hard-science
metrologists tend not to advocate the idea of a ‘true score’ as an
organising concept, nowadays, appreciating that there is always a range of
scores that will be compatible with the model/definition that’s being used.

 

In other words, science, including measurement, is about modelling, and
consensus building. The critical issue is whether there is sufficient
(empirical and logical) reason to believe that the model provides a
plausible account of reality. 

 

Using my educational measurement example, the reality (which no-one doubts)
is that some people graduate with far more knowledge/skill/understanding
than others. The tricky bit is how to build a measurement model that
distinguishes (with sufficient precision) between students at different
points on that scale. The first port of call [but only the first port] in
validating any such model (and its measuring procedure) is, of course, the
judgement of experts, e.g. maths teachers. If the rank order from the
measuring procedure is completely contradicted by the judgement of experts,
then either the model is wrong, or the measuring procedure is, or both.
Thus, educational measurement is falsifiable, in essentially the same way as
other kinds of measurement.

 

Having said all of this, I don’t think that we have many good models in
educational measurement. And we’ve been led astray by people who know loads
about statistics, yet don’t have a clue about what their statistics are
meant to represent. And, since this is the more general point that we’ve
been discussing, I do agree!

 

However, in my view, the right kinds of model are very unlikely to be
unidimensional in any way, shape or form. (Isn’t it obvious (from analysis
of the concept itself) that the real structure of attainment in maths is not
unidimensional?) Attainment in education involves a progression in
capability from novice to expert (à la Glaser, Dreyfus and Dreyfus, etc.), so
that’s the kind of measurement model that we need to be thinking in terms
of. That’s probably not about specifying logical relationships. It’s
probably more about identifying underlying patterns. That, incidentally, is
why I quite like the baldness scale analogy for education.

 

We agree on the problem of judges. And I’d say that the lack of agreement
between judges sets an upper limit on the validity of any measurement
model/measuring procedure in education. But I’d also say that’s true for any
model/procedure, whatever the discipline.

 

Again, my main concern is that, when you put ‘mere assessments’ in the
‘not-science’ bucket, you imply that they should be judged according to
less-than-scientific standards… and that makes a bad problem even worse!

 

Sorry if I’m just waffling now!

 

Paul

 

 

From: IDANET (Individual Differences and Assessment Network)
[mailto:[log in to unmask]] On Behalf Of Stéphane Vautier
Sent: 15 November 2017 09:20
To: [log in to unmask] <mailto:[log in to unmask]> 
Subject: Re: Undergraduate readings?

 

Hi Paul,

I insert my comments in your text.

Best.

 

Stephane

 

From: IDANET (Individual Differences and Assessment Network)
[mailto:[log in to unmask]] On Behalf Of Paul Newton
Sent: Wednesday, 15 November 2017 09:38
To: [log in to unmask] <mailto:[log in to unmask]> 
Subject: Re: Undergraduate readings?

 

Hi Paul and Stephane

 

Lots to talk about here, and not enough time, so apologies if I don’t cover
everything sensibly; but here’s a few thoughts…

 

The idea of ‘redefining’ the word ‘measurement’ (Paul B) isn’t quite right,
since there isn’t (currently) a generally accepted, universal definition.
It’s work in progress.

 

However, I am arguing for a wider definition. That’s because what we ought
to be doing, in educational measurement, is to adopt a scientific approach
(see Paul B below - agreed) to obtain information about ‘real’ properties of
human behaviour, which can properly be described in terms of amounts. I’m
not assuming that educational measurement has a unit – so it doesn’t
necessarily have a Quantity under a strong definition of that term – but
it’s essentially measurement in all other respects, I would argue. So it
makes sense (to me) to design educational assessments on a measurement model
and, most importantly, to evaluate them on a measurement model.

SV: So, the problem is very simple. We have to define (i) the set of the
possible states of our target (hypothetical) attribute, say the set A, (ii)
the set of our observables, say the set O, and (iii) the relation R from A
to O, in such a way that, observing an o, we can know its antecedents in A.
And (iv), we have to define the falsifiers of R, that is, be creative in
designing a test experiment. If R has no falsifiers, given what we can
imagine as a test of the theory, then R is not a scientific theory, because
we do not know how to test it against experience, which can be made in O or
O^2, etc. But R can serve as a convention, which allows one to assess, not
to measure, the state of someone in A given his o (DSM practice).
Consequently, assessment is not based on scientific – that is, falsifiable –
knowledge; it is based on a social consensus enabling one to speak the
language of A starting from the language of O.
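
A minimal sketch of this A/O/R schema (illustrative only; the states,
observables, and relation below are hypothetical):

  # Illustrative sketch: A = attribute states, O = observables,
  # R = a relation from A to O (a set of (a, o) pairs).
  A = {"low", "high"}
  O = {0, 1, 2, 3}                 # e.g. number of items passed
  R = {("low", 0), ("low", 1), ("high", 2), ("high", 3)}

  def antecedents(o):
      """States in A compatible with observing o under R."""
      return {a for (a, obs) in R if obs == o}

  def falsified_by(o):
      """A single observation falsifies R only if no state in A could yield it."""
      return len(antecedents(o)) == 0

  print(antecedents(2))                      # {'high'}
  print(any(falsified_by(o) for o in O))     # False: no single o can refute this R,
                                             # so a genuine test experiment would need
                                             # richer patterns (e.g. pairs in O^2)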

 

Note I say ‘measurement model’ not a ‘psychometric model’ – as psychometric
models often prejudge a certain kind of structure, without actually testing
it effectively. (Testing it is both a logical/linguistic and an empirical
matter.) Josh McGrane and Andy Maul have been talking about this recently.

 

Both Michell and Maraun make a splendid case against assuming that certain
psychological attributes have a certain kind of structure (Quantitative).
But neither of them (to my mind) presents a strong case that
psychological/educational attributes have no quantitative structure, and
therefore that it’s illegitimate to model them as measurements, in a wider
sense. So the question is open. Simply to trivialise and dismiss
non-Quantitative [assessment] by calling it merely ‘assessment’ makes no
positive case for what this ‘assessment’ thing actually is. Is it purely
‘description’ or ‘representation’? Maybe… but maybe it’s actually closer to
our conventional usage of ‘measurement’ than, say, ‘description’… albeit
adopting a wider definition.

 

In educational measurement, I find the case for educational attainment
attributes having a ‘real’ quantitative structure (of some sort) hard to
reject. How many of us would be prepared to argue that there isn’t a really
important and meaningful sense in which some students come out of their
studies having attained at a far higher level than others? 

SV: Yes, of course. But the concept of level, here, is a qualitative concept,
as the descriptive device we use to specify a given state is a Cartesian
product formed by combining several criteria, not a segment.

 

Similarly, to accept this, but to describe the attribute as somehow not
‘real’, seems equally implausible. So that’s the (common-sense,
logical/linguistic) starting point for me. Or, to put it another way,
teachers don’t routinely say that their graduating students practise (say)
maths like a Bear, or like a Fox, or some other purely qualitative
description/representation. The (primary) classification that teachers (and
the rest of us) make regarding educational attributes is quantitative, with
a small ‘q’, so that’s what we (measurement professionals) ought somehow to
be recognising. IMHO.

 

Right, now to the issue of usage! I’ve spent a lot of time arguing (and
publishing) on the problems that measurement professionals have with the
word ‘validity’, which is even more contested than the word ‘measurement’.
I’ve argued strongly (I hope!) that the case for or against any particular
usage is not a matter of logical necessity, but of consequences.

SV: Consequences are a matter of logical necessity, or not?

 

I’ve also argued that the consequential argument for or against a narrow vs.
broad definition of validity is probably quite evenly balanced. Which is a
problem.

 

In relation to the use of ‘measurement’ for educational and psychological
attributes, the recent debate seems only to acknowledge the consequential
case against (see, e.g., Stephane, in another email). I work in a context in
which lots of people have argued a different case for ‘educational
assessment’ over ‘educational measurement’. And one of the key arguments,
here, seems to be that educational ‘attributes’ are just pure ‘social
constructions’ and (I’m simplifying, but not unreasonably, I think) it’s
therefore fine for different judges to evaluate the same people in different
ways, or, in other words, reliability is seriously over-rated. If that is
your view of ‘assessment’ then it really isn’t measurement, in any sense. By
reclaiming the concept of ‘educational measurement’ we stress that it’s a
special kind of description/representation in which the inter-subjectivity
aspect (calibration to a common reference standard) is critical (which seems
to be what Mari puts at the heart of his epistemological conception). That’s
important, I think. Very important.

SV: I think I agree. The issue is not measurement, but description, and the
interchangeability of the “judges” is a serious problem – this is why tests
are interesting descriptive techniques: the observations can be said to be
objective in the weak sense of judge interchangeability.

 

Too little time to continue! But let me just also reference Eran Tal on
measurement as modelling. Read that work and you’ll realise that all of the
obvious fuzziness in educational and psychological measurement is there in
the rock-hard sciences too. Again, the differences are more of scale than
of kind. 

 

Oh yeh… I’m firmly with Samuel Messick that justifying any test or
assessment on purely pragmatic grounds is extremely dangerous, for all sorts
of reasons. First, we have never been able to crack the ‘criterion problem’
and we never will. Second, if we don’t know the explanation for score
variance, then we’re likely to do harm unconsciously, e.g. by reinforcing
(and never questioning) pre-existing biases in our tests/assessments. 

SV: :).

 

Must go. Interesting debate!

 

Best wishes

 

Paul

 

 

From: IDANET (Individual Differences and Assessment Network)
[mailto:[log in to unmask]] On Behalf Of Paul Barrett
Sent: 14 November 2017 20:08
To: [log in to unmask] <mailto:[log in to unmask]> 
Subject: Re: Undergraduate readings?

 

Hello again Paul N!


This leaves open the possibility that, for instance, ordinal ‘measurement’
counts as measurement; perhaps even that nominal ‘measurement’ counts as
measurement. Similarly, it leaves open the possibility of measuring all
sorts of phenomena, like hardness, or storms, or baldness.

Not a problem - all you are doing is redefining the word ‘measurement’ to
include any kind of phenomenal observation scaling/classification.

 

I work in educational measurement. And I would suggest that attainment in
mathematics (for instance) is just as measurable as baldness; possibly more
so.

Again, perfectly reasonable, provided you take care to align the methods you
use to manipulate magnitudes of attainment and baldness with the properties
of the attributes in question. That is, if mathematical attainment is
indexed as the number of items solved correctly on a particular test, then
the use of integer arithmetic/frequency/count-based analysis methods is
fine. You may also construct a more complex ordered-class assessment which
takes into account a composite of indicators - teacher ratings as well as
test performance, etc. However, if you wish to assert that mathematical
attainment varies as a quantity, you need to provide empirical evidence that
it does so before using methods of analysis the validity of whose results
depends upon that assertion being ‘just so’. Same as with ‘baldness’. The
latter is the domain of scientific exploration of a phenomenon.
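
A small sketch of that distinction (hypothetical scores, for illustration
only):

  # Illustrative sketch (hypothetical scores): operations licensed by a
  # count-based index versus operations that presuppose quantitative structure.
  import statistics

  items_correct = {"A": 8, "B": 4, "C": 6}   # items each student solved correctly

  # Counts and rank order are licensed by the index itself.
  ranking = sorted(items_correct, key=items_correct.get, reverse=True)   # ['A', 'C', 'B']

  # Means and ratios of scores treat attainment as varying quantitatively -
  # the assumption that, on the argument above, needs empirical support first.
  mean_score = statistics.mean(items_correct.values())
  ratio = items_correct["A"] / items_correct["B"]   # "twice the attainment"?

  print(ranking, mean_score, ratio)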

 

I reiterate: it is about the properties of the attribute that are required
to substantiate a claim of whatever you choose to call a ‘measurement’ of
it, and about whether the methods/computations you use to manipulate those
‘measurements’ possess validity, given those attribute properties.

 

Regards .. Paul

 

Chief Research Scientist

Cognadev Ltd.

__________________________________________________________________________________

W: www.pbarrett.net

E: [log in to unmask]

M: +64-(0)21-415625

 

From: IDANET (Individual Differences and Assessment Network)
[mailto:[log in to unmask]] On Behalf Of Paul Newton
Sent: Wednesday, November 15, 2017 7:09 AM
To: [log in to unmask] <mailto:[log in to unmask]> 
Subject: Re: Undergraduate readings?

 

Hi Stephane

 

Is ‘measurement science’ free of logical necessity? I don’t think that the
meaning of words is essentially a matter of logical necessity – e.g. the
word ‘measure’ – if that’s what you’re asking.

 

In terms of Mari’s work, I’d definitely recommend reading it, because the
following recollection won’t do it justice! But, as I recall it, he
describes measurement in relation to: (a) objectivity in extracting
information on the thing you’re measuring; and (b) intersubjectivity in
interpreting that information, such that the information means the same
thing to every measurement user. This leaves open the possibility that, for
instance, ordinal ‘measurement’ counts as measurement; perhaps even that
nominal ‘measurement’ counts as measurement.

 

Similarly, it leaves open the possibility of measuring all sorts of
phenomena, like hardness, or storms, or baldness.

 

I work in educational measurement. And I would suggest that attainment in
mathematics (for instance) is just as measurable as baldness; possibly more
so.

 

Incidentally, I don’t see the point in insisting on using the word
‘assessment’ for measurement that’s not like length. I don’t see what it
buys.

 

Best wishes

 

Paul