Hi Paul B

 

Damn, you’ve sucked me in again, when I really just wanted to go to bed!

 

I get what Stephane and Joel are trying to do. But the underlying premise seems to be that psychology ‘just needs to come into line’ with what the real measurement folk mean by ‘measurement’. What I’ve been trying to say is that there are plenty of real measurement folk who just don’t buy into the really strong (and narrow) conception of measurement. They are looking for alternative technical definitions that can, as you indicate, apply to any use of the word ‘measurement’ in science.

 

I’ve recently edited a special issue of Assessment in Education on how best to define the word ‘validity’, so I get how emotive this stuff is. I don’t mean to denigrate the desire for technical clarity and useful consensus definitions, which I’ve argued for strongly in relation to validity (although, in my heart of hearts, I’m pretty sure we’ll never reach consensus there).

 

If you want to use ‘measurement’ in the classic, technical sense, that’s fine. All I’m saying is that it’s not without potential negative consequences. Equally, there are also potential positive consequences (for the work I do) arising from adopting ‘measurement’ (broadly defined) as a guiding framework.

 

Incidentally, not all science is about seeking causes. Lots of science doesn’t involve that. More to the point, I would argue that the things that we need to measure, in education and psychology, are not internal structures that somehow cause external behaviour. That’s not the level at which concepts like knowledge, skill and understanding apply. For me, science comes into play when we take seriously the challenge of understanding these concepts, and their proper application, and then work out methods for accurately extracting and representing information about the people whom we measure in terms of these concepts.

 

Again, you might say that’s not real science, I guess. But science isn’t quite so easily circumscribed either!

 

Again, though, I have no problem in condemning the unthinking use of unduly simplistic psychometric models, which simply presume that attributes have a certain kind of structure. Often these presumptions seem entirely implausible to me (e.g. Quantitative structure, with a capital Q). But that doesn’t mean that educational and psychological attributes have no quantitative (small q) structure. And, if they do, then shouldn’t we be trying to find ways to represent that structure, scientifically?

 

I sense we’re now going in circles. But it’s been an interesting discussion all the same!

 

Cheers

 

Paul

 

From: IDANET (Individual Differences and Assessment Network) [mailto:[log in to unmask]] On Behalf Of Paul Barrett
Sent: 15 November 2017 19:59
To: [log in to unmask]
Subject: Re: Undergraduate readings?

 

Hello Paul N.

 

You say, without qualification:

If there’s no science behind the measurement

 

Science (its practice) is about attempting to detect phenomena through systematic observation, and to understand how and why those phenomena occur.

 

So, if you propose to investigate mathematical attainment as a scientist, you first detect the phenomenon itself (whatever it is that seems to distinguish students based upon their performance in that domain of mathematics), then you try to investigate what is causal for the phenomenal observations and the empirically observed properties of that variation (e.g. are magnitudes varying additively?).

 

The alternative is to detect the phenomenon, and make instrumental use of the phenomenal observations for pragmatic purposes, where the goal is not to investigate what is causal for them except at a speculative but meaningful level of discourse. The primary purpose of what you are doing is to enable you to order students reliably in something you call mathematical attainment, where its observed variations can be described meaningfully and with some consensus among those using them.

 

Stéphane is providing us with the formal specifications (and the properties which must follow from them) for a quantitative measurement. Michell does the same in a text-based definition:

“the discovery or estimation of the ratio of a magnitude of a quantity to a unit of the same quantity” (p. 222).

Michell, J. (1999). Measurement in Psychology: A Critical History of a Methodological Concept. Cambridge University Press. ISBN: 0-521-62120-8.
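In symbols (a minimal sketch in my own notation, not Michell’s):

```latex
% Sketch of Michell's ratio conception (my notation, not his):
% for a quantity Q with a magnitude q and a chosen unit u, both
% magnitudes of Q, measurement is the discovery or estimation of r:
\[
  r = \frac{q}{u} \in \mathbb{R}_{>0}, \qquad \text{so that} \quad q = r \cdot u .
\]
% Note what this presupposes: the magnitudes of Q must stand in
% additive/ratio relations to one another, i.e. Q must be quantitative.
```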

 

So, that is a crystal clear technical definition, which applies to any use of the word measurement in a quantitative science.

 

Psychology is not a quantitative science, because there is no empirical evidence that any of the attributes it seeks to understand vary as quantities.

 

So, as with education, we are investigating/working in an area which may be best described as a non-quantitative science. For clarity only, some, like myself, refrain from using the word measurement, because it has a very specific technical definition; we use assessment or evaluation instead. Likewise, I use the word attribute rather than variable, as variable also has a very specific technical definition in a quantitative science.

 

There is nothing more or less scientific in that use of terminology; it just reflects the recognition that we are not measuring, manipulating, or observing quantities, but, given the lack of evidence for anything else, we are assessing/evaluating orders or classes of things.

 

However, others do not accept that the word measurement possesses a precise, technical definition. So, following Stevens, anything can become a measurement if a number can be assigned by some rule.

 

So, if someone says “I’m measuring mathematical attainment”, I’ll accept it as a common-or-garden use of that term (Maraun, 1998), where the user is not employing its technical definition. But if the full panoply of quantitative methods is applied to its magnitudes, where the numbers representing the magnitudes are now assumed to vary as quantities, and claims are made about attainment based upon it varying as a quantity, then that belief or assumption of quantity might be challenged in a court of law, should someone choose to question an adverse decision made on the basis of that claim.

Maraun, M. D. (1998). Measurement as a normative practice: Implications of Wittgenstein’s philosophy for measurement in psychology. Theory & Psychology, 8(4), 435-461.

 

Regards .. Paul

 

Chief Research Scientist

Cognadev Ltd.

__________________________________________________________________________________

W: www.pbarrett.net

E: [log in to unmask]

M: +64-(0)21-415625

 

From: IDANET (Individual Differences and Assessment Network) [mailto:[log in to unmask]] On Behalf Of Paul Newton
Sent: Thursday, November 16, 2017 7:31 AM
To: [log in to unmask]
Subject: Re: Undergraduate readings?

 

Hi Stephane

 

Would I agree? Well, possibly yes and possibly no!

 

I do agree that measurement is a purpose-driven activity. That’s another of the messages from Mari and colleagues in the metrology literature.

 

So, I do agree that the decision as to whether to model at the macro-attribute-level rather than at the micro-attribute-level ought to be purpose-driven.

 

In the educational measurement literature, a good example is the distinction between a diagnostic/formative purpose (low-level) and a placement purpose (high-level). The low-level purpose needs low-level measurement, to inform a more specific intervention (a suitable instructional unit, say). The high-level purpose needs high-level measurement, to inform a more general intervention (a suitable instructional group).

 

I guess it’s probably true that we’ll normally be able to model/measure the lower level(s) with greater precision than the higher level. But I wouldn’t conclude, from that, that the higher level model/measurement is inherently less real, or less scientific, or less useful. It’s just more fuzzy (implying greater definitional uncertainty).

 

On the other hand, I don’t agree (if this is what you mean) that purpose-driven measurement ought to be characterised/modelled/evaluated non-scientifically, i.e. purely instrumentally. If there’s no science behind the measurement, then what kind of activity is it?

 

Again, following Messick, I think that ‘blind empiricism’ (a.k.a. ‘pure prediction’) is, and ought to be, a thing of the past. We can’t (properly) justify any measurement/assessment on purely instrumental grounds; even when we accept that measurement is fundamentally a pragmatic activity. (If the model doesn’t work against evidence, then it shouldn’t be used in an instrumental way. Indeed, if it doesn’t work against evidence, then why would it work in an instrumental way?)

 

But maybe I’ve not quite got the subtlety of your response?

 

Best wishes

 

Paul

 

From: IDANET (Individual Differences and Assessment Network) [mailto:[log in to unmask]] On Behalf Of Stéphane Vautier
Sent: 15 November 2017 16:55
To: [log in to unmask]
Subject: Re: Undergraduate readings?

 

Hi Paul,

 

I think we may converge if we distinguish two goals: to test a model against evidence, and to use a model in an instrumentalist way, that is, to solve a given (decision) problem with the help of the model, even if that model is known to be an oversimplification of known (manifest) phenomena in a given descriptive system.

 

For instance, if you test whether multivariate descriptions are ordinal measurements, your goal is to check that the observed descriptions behave as simply ordered objects. And the refutation (or, synonymously, the falsification) of the model can be a real contribution to the current state of knowledge. From this perspective, I believe that the scientific task of psychologists is to spread the fact that scores are poor descriptions of the observations they are trained to deal with, not that scores are measurements of … no one knows what, sustained by verbal magic and reification – e.g., “intelligence is what my test measures”.
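To make this concrete, here is a toy sketch (my own made-up binary response patterns, nothing more): the ordinal model is falsified as soon as two observed patterns are incomparable under coordinatewise dominance.

```python
from itertools import combinations

def dominates(p, q):
    """True if pattern p is at least as high as q on every item."""
    return all(a >= b for a, b in zip(p, q))

def simply_ordered(patterns):
    """Check whether observed multivariate descriptions behave as a
    simple (total) order: every pair of patterns must be comparable.
    Returns (True, None), or (False, the falsifying pair)."""
    for p, q in combinations(sorted(set(patterns)), 2):
        if not dominates(p, q) and not dominates(q, p):
            return False, (p, q)  # an incomparable pair refutes the ordinal model
    return True, None

# Four fabricated response patterns over three items:
observed = [(1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 0, 0)]
print(simply_ordered(observed))  # (False, ((1, 0, 1), (1, 1, 0)))
```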

 

Now, the assessment problem based on test scores. My question is: why simply order learners on a score scale when we know perfectly well that the scores are not correct descriptions of their performances (which are multivariate)? Why do we need to replace the multivariate definition/construction of performance (which is made by the test developers, who carefully selected distinct test items) with a one-dimensional compression? Sometimes it may be useful, for instance for selection purposes, when we seek, say, the 10 “best” performers; sometimes not, for instance when we want to give the learners some feedback on their performances.
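A toy illustration of the compression (invented item labels and data): two learners with different performances receive the same sum score, so the feedback-relevant information is gone.

```python
# Two fabricated performance profiles over three (hypothetical) item domains:
learner_a = {"algebra": 1, "geometry": 0, "fractions": 1}
learner_b = {"algebra": 0, "geometry": 1, "fractions": 1}

# The one-dimensional compression: a sum score.
score_a = sum(learner_a.values())
score_b = sum(learner_b.values())

print(score_a == score_b)      # True  - indistinguishable on the score scale
print(learner_a == learner_b)  # False - yet their performances differ
```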

 

My point is that test users are responsible for the relevance of their descriptive choices when they are engaged in any problem solving. Test scores are not mandatory. One additional example: sometimes the assessment of anxiety does not require treating anxiety as a one-dimensional construct, as clinicians may be interested in a detailed description of an m-tuple of clinical signs.

 

Conclusion: we have to be clear on whether, in such and such a context, we pursue scientific or utilitarian purposes. They are not the same.

 

Would you agree?

 

Stéphane

 

 

From: IDANET (Individual Differences and Assessment Network) [mailto:[log in to unmask]] On Behalf Of Paul Newton
Sent: Wednesday, 15 November 2017 17:17
To: [log in to unmask]
Subject: Re: Undergraduate readings?

 

Hi Stephane

 

As I see it: the principle of science is to model reality, while the practice of science is to attempt to reach consensus over the accuracy/usefulness of any particular model, on the basis of empirical evidence and logical analysis. The same applies to measurement.

 

A critical point, here, is that the models that we use in science – i.e. the concepts through which we measure – are (over-)simplified abstractions of reality. So, for instance, when we say that we measure the diameter of a hypodermic needle, we are not actually measuring a single thing. It’s a single thing in the model – sure – but not in reality (when you look at the bore of a needle under a high power magnifier, it’s more like a cave with stalactites and stalagmites). This ambiguity is embodied in the (measurement science) idea of definitional uncertainty – which is a really important concept for social science, incidentally. That’s why hard-science metrologists tend not to advocate the idea of a ‘true score’ as an organising concept, nowadays, appreciating that there is always a range of scores that will be compatible with the model/definition that’s being used.

 

In other words, science, including measurement, is about modelling, and consensus building. The critical issue is whether there is sufficient (empirical and logical) reason to believe that the model provides a plausible account of reality.

 

Using my educational measurement example, the reality (that no-one doubts) is that some people graduate with far more knowledge/skill/understanding than others. The tricky bit is how to build a measurement model that distinguishes (with sufficient precision) between students at different points on that scale. The first port of call [but only the first port] in validating any such model (and its measuring procedure) is, of course, the judgement of experts, e.g. maths teachers. If the rank order from the measuring procedure is completely contradicted by the judgement of experts, then either the model is wrong, or the measuring procedure, or both. Thus, educational measurement is falsifiable; in essentially the same way as for other kinds of measurement.

 

Having said all of this, I don’t think that we have many good models in educational measurement. And we’ve been led astray by people who know loads about statistics, yet don’t have a clue about what their statistics are meant to represent. And, since this is the more general point that we’ve been discussing, I do agree!

 

However, in my view, the right kinds of model are very unlikely to be unidimensional in any way, shape or form. (Isn’t it obvious (from analysis of the concept itself) that the real structure of attainment in maths is not unidimensional?) Attainment in education involves a progression in capability from novice to expert (à la Glaser, Dreyfus and Dreyfus, etc.), so that’s the kind of measurement model that we need to be thinking in terms of. That’s probably not about specifying logical relationships. It’s probably more about identifying underlying patterns. That, incidentally, is why I quite like the baldness scale analogy for education.

 

We agree on the problem of judges. And I’d say that the lack of agreement between judges sets an upper limit on the validity of any measurement model/measuring procedure in education. But I’d also say that’s true for any model/procedure, whatever the discipline.

 

Again, my main concern is that, when you put ‘mere assessments’ in the ‘not-science’ bucket, you imply that they should be judged according to less-than-scientific standards… and that makes a bad problem even worse!

 

Sorry if I’m just waffling now!

 

Paul

 

 

From: IDANET (Individual Differences and Assessment Network) [mailto:[log in to unmask]] On Behalf Of Stéphane Vautier
Sent: 15 November 2017 09:20
To: [log in to unmask]
Subject: Re: Undergraduate readings?

 

Hi Paul,

I have inserted my comments into your text.

Best.

 

Stephane

 

From: IDANET (Individual Differences and Assessment Network) [mailto:[log in to unmask]] On Behalf Of Paul Newton
Sent: Wednesday, 15 November 2017 09:38
To: [log in to unmask]
Subject: Re: Undergraduate readings?

 

Hi Paul and Stephane

 

Lots to talk about here, and not enough time, so apologies if I don’t cover everything sensibly; but here are a few thoughts…

 

The idea of ‘redefining’ the word measurement (Paul B) isn’t quite right, since there isn’t (currently) a generally accepted, universal definition. It’s work in progress.

 

However, I am arguing for a wider definition. That’s because what we ought to be doing, in educational measurement, is to adopt a scientific approach (see Paul B below - agreed) to obtain information about ‘real’ properties of human behaviour, which can properly be described in terms of amounts. I’m not assuming that educational measurement has a unit – so it doesn’t necessarily have a Quantity under a strong definition of that term – but it’s essentially measurement in all other respects, I would argue. So it makes sense (to me) to design educational assessments on a measurement model and, most importantly, to evaluate them on a measurement model.

SV: So, the problem is very simple. We have to define (i) the set of possible states of our target (hypothetical) attribute, say the set A; (ii) the set of our observables, say the set O; (iii) the relation R from A to O, in such a way that, observing an o, we can know its antecedents in A; and (iv) the falsifiers of R, which requires being creative in designing a test experiment. If R has no falsifiers, given what we can imagine as a test of a theory, then R is not a scientific theory, because we do not know how to test it against experience made in O, or O^2, etc. But R can serve as a convention, which allows one to assess, not to measure, the state of someone in A given his o (DSM practice). Consequently, such assessment is not based on scientific – that is, falsifiable – knowledge; it is based on a social consensus enabling one to speak the language of A starting from the language of O.
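A toy rendering of this structure (invented states and observables, just to fix ideas):

```python
# Hypothetical attribute states A, observables O, and the relation R
# as a set of (state, observable) pairs.
A = {"novice", "competent", "expert"}
O = {"fails_item", "solves_item"}

R = {("novice", "fails_item"),
     ("competent", "solves_item"),
     ("expert", "solves_item")}

def antecedents(o):
    """The states in A compatible with observing o under R (assessment:
    speaking the language of A starting from an observation in O)."""
    return {a for (a, obs) in R if obs == o}

print(antecedents("solves_item"))  # e.g. {'competent', 'expert'}

# Falsifiers: observations (in O, O^2, ...) that R forbids. Here every o
# has some antecedent, so no single observation can contradict R; as it
# stands, R works as a convention rather than a testable theory.
print({o for o in O if not antecedents(o)})  # set() -> no falsifier
```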

 

Note I say ‘measurement model’ not a ‘psychometric model’ – as psychometric models often prejudge a certain kind of structure, without actually testing it effectively. (Testing it is both a logical/linguistic and an empirical matter.) Josh McGrane and Andy Maul have been talking about this recently.

 

Both Michell and Maraun make a splendid case against assuming that certain psychological attributes have a certain kind of structure (Quantitative). But neither of them (to my mind) presents a strong case that psychological/educational attributes have no quantitative structure, and therefore that it’s illegitimate to model them as measurements, in a wider sense. So the question is open. Simply to trivialise and dismiss non-Quantitative measurement by calling it merely ‘assessment’ makes no positive case for what this ‘assessment’ thing actually is. Is it purely ‘description’ or ‘representation’? Maybe… but maybe it’s actually closer to our conventional usage of ‘measurement’ than, say, ‘description’… albeit adopting a wider definition.

 

In educational measurement, I find the case for educational attainment attributes having a ‘real’ quantitative structure (of some sort) hard to reject. How many of us would be prepared to argue that there isn’t a really important and meaningful sense in which some students come out of their studies having attained at a far higher level than others?

SV: yes of course. But the concept of level, here, is a qualitative concept, as the descriptive device we use to specify a given state is a Cartesian product formed by combining several criteria, not a segment.

 

Similarly, to accept this, but to describe the attribute as somehow not ‘real’, seems equally implausible. So that’s the (common-sense, logical/linguistic) starting point for me. Or, to put it another way, teachers don’t routinely say that their graduating students practice (say) maths like a Bear, or like a Fox, or some other purely qualitative description/representation. The (primary) classification that teachers (and the rest of us) make regarding educational attributes is quantitative, with a small ‘q’, so that’s what we (measurement professionals) ought somehow to be recognising. IMHO.

 

Right, now to the issue of usage! I’ve spent a lot of time arguing (and publishing) on the problems that measurement professionals have with the word ‘validity’, which is even more contested than the word ‘measurement’. I’ve argued strongly (I hope!) that the case for or against any particular usage is not a matter of logical necessity, but of consequences.

SV: Consequences are a matter of logical necessity, or not?

 

I’ve also argued that the consequential argument for or against a narrow vs. broad definition of validity is probably quite evenly balanced. Which is a problem.

 

In relation to the use of ‘measurement’ for educational and psychological attributes, the recent debate seems only to acknowledge the consequential case against (see, e.g., Stephane, in other email). I work in a context in which lots of people have argued a different case for ‘educational assessment’ over ‘educational measurement’. And one of the key arguments, here, seems to be that educational ‘attributes’ are just pure ‘social constructions’ and (I’m simplifying, but not unreasonably, I think) it’s therefore fine for different judges to evaluate the same people in different ways, or, in other words, reliability is seriously over-rated. If that is your view of ‘assessment’ then it really isn’t measurement, in any sense. By reclaiming the concept of ‘educational measurement’ we stress that it’s a special kind of description/representation in which the inter-subjectivity aspect (calibration to a common reference standard) is critical (which seems to be what Mari puts at the heart of his epistemological conception). That’s important, I think. Very important.

SV: I think I agree. The issue is not measurement, but description, and the interchangeability of the “judges” is a serious problem – this is why tests are interesting descriptive techniques: the observations can be said to be objective in the weak sense of judge interchangeability.

 

Too little time to continue! But let me just also reference Eran Tal on measurement as modelling. Read that work and you’ll realise that all of the obvious fuzziness in educational and psychological measurement is there in the rock-hard sciences too. Again, the differences are more of scale than of kind.

 

Oh yeh… I’m firmly with Samuel Messick that justifying any test or assessment on purely pragmatic grounds is extremely dangerous, for all sorts of reasons. First, we have never been able to crack the ‘criterion problem’ and we never will. Second, if we don’t know the explanation for score variance, then we’re likely to do harm unconsciously, e.g. by reinforcing (and never questioning) pre-existing biases in our tests/assessments.

SV: ☺

 

Must go. Interesting debate!

 

Best wishes

 

Paul

 

 

From: IDANET (Individual Differences and Assessment Network) [mailto:[log in to unmask]] On Behalf Of Paul Barrett
Sent: 14 November 2017 20:08
To: [log in to unmask]
Subject: Re: Undergraduate readings?

 

Hello again Paul N!


This leaves open the possibility that, for instance, ordinal ‘measurement’ counts as measurement; perhaps even that nominal ‘measurement’ counts as measurement. Similarly, it leaves open the possibility of measuring all sorts of phenomena, like hardness, or storms, or baldness.

Not a problem - all you are doing is redefining the word measurement to include any kind of phenomenal observation scaling/classification.

 

I work in educational measurement. And I would suggest that attainment in mathematics (for instance) is just as measurable as baldness; possibly more so.

Again, perfectly reasonable, given that you take care to align the methods you use to manipulate magnitudes of attainment and baldness with the properties of the attributes in question. I.e., if mathematical attainment is indexed as the number of items solved correctly on a particular test, then the use of integer arithmetic/frequency/count-based analysis methods is fine. You may also construct a more complex ordered-class assessment which takes into account a composite of indicators - teacher ratings as well as test performance, etc. However, if you wish to assert that mathematical attainment varies as a quantity, you need to provide empirical evidence that it does so before using methods of analysis whose results are valid only if that assertion holds. The same applies to baldness. That latter step is the domain of scientific exploration of a phenomenon.
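To illustrate what such empirical evidence could involve, here is a toy sketch (my own code and fabricated numbers) of one standard candidate test, the double cancellation condition from the conjoint-measurement literature that Michell draws upon; systematic failures of this condition count as evidence against additive, quantitative structure.

```python
from itertools import permutations

def double_cancellation_ok(M):
    """Check the double cancellation condition on a table of observed
    magnitudes M[i][j] (rows and columns are ordered factor levels):
    whenever M[a2][b1] >= M[a1][b2] and M[a3][b2] >= M[a2][b3],
    additivity requires M[a3][b1] >= M[a1][b3]."""
    rows, cols = range(len(M)), range(len(M[0]))
    for a1, a2, a3 in permutations(rows, 3):
        for b1, b2, b3 in permutations(cols, 3):
            if (M[a2][b1] >= M[a1][b2] and M[a3][b2] >= M[a2][b3]
                    and not M[a3][b1] >= M[a1][b3]):
                return False  # a falsifying triple: evidence against quantity
    return True

# Fabricated 3x3 table: rows = ability bands, columns = item-difficulty
# bands, cells = illustrative success counts (additive by construction).
table = [[2, 1, 0],
         [4, 3, 2],
         [6, 5, 4]]
print(double_cancellation_ok(table))  # True
```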

 

I reiterate: it is about the properties the attribute must possess to substantiate whatever you choose to call a measurement of it, and about whether the methods/computations you use to manipulate those measurements retain their validity, given those attribute properties.

 

Regards .. Paul

 

Chief Research Scientist

Cognadev Ltd.

__________________________________________________________________________________

W: www.pbarrett.net

E: [log in to unmask]

M: +64-(0)21-415625

 

From: IDANET (Individual Differences and Assessment Network) [mailto:[log in to unmask]] On Behalf Of Paul Newton
Sent: Wednesday, November 15, 2017 7:09 AM
To: [log in to unmask]
Subject: Re: Undergraduate readings?

 

Hi Stephane

 

Is ‘measurement science’ free of logical necessity? I don’t think that the meaning of words is essentially a matter of logical necessity – e.g. the word ‘measure’ – if that’s what you’re asking.

 

In terms of Mari’s work, I’d definitely recommend reading it, because the following recollection won’t do it justice! But, as I recall it, he describes measurement in relation to: (a) objectivity in extracting information on the thing you’re measuring; (b) intersubjectivity in interpreting that information, such that the information means the same thing to every measurement user. This leaves open the possibility that, for instance, ordinal ‘measurement’ counts as measurement; perhaps even that nominal ‘measurement’ counts as measurement.

 

Similarly, it leaves open the possibility of measuring all sorts of phenomena, like hardness, or storms, or baldness.

 

I work in educational measurement. And I would suggest that attainment in mathematics (for instance) is just as measurable as baldness; possibly more so.

 

Incidentally, I don’t see the point in insisting on using the word ‘assessment’ for measurement that’s not like length. I don’t see what it buys.

 

Best wishes

 

Paul