Can we have both reliability and validity?

So it looks like the French were right after all: Lance Armstrong is a cheat, and without the effects of EPO and a corrupt testing regime some unknown rider would have won the Tour de France instead. Unknown, because they all appear to have been doing it; some were simply better cheats than others. But what about the tests? How could they have been so poor, or, in the jargon of assessment, so unreliable and ‘not fit for purpose’?

A little closer to home, we too have questions about the reliability of testing. Not a question of winning the world’s most famous cycle race, but for a teenager seeking to start an A level course, or to apply for a top-quality university course a year or two later, getting the right grade in GCSE English is just as important. The 2012 English grading fiasco has made educationists think again about the reliability of assessments. This time it is not the old argument about the unreliability of teacher assessment (remember the old Mode 3 CSE examinations, with 100% teacher-devised and teacher-assessed coursework?) but questions about the holy grail of assessment: external, professionally marked exam board papers, moderated, statistically proven and seemingly bullet-proof.

Only months before the crisis, leaks from government had made it clear that the country needed more of this reliable, exam-board stuff and less of that wishy-washy teacher-assessed coursework. So what has gone wrong? Why are external examinations suddenly failing the reliability test?

For the past three decades there have been many detailed research studies of the reliability and validity of both internal and external assessments, and they make for interesting reading. The tension between reliability (essentially, ensuring that the right students consistently get the right marks) and validity (ensuring that the assessments cover the knowledge and skills relevant to the particular course of study) is well documented. In simple terms, greater reliability can be achieved through narrowly focused tests and tightly constrained mark schemes. But such tests are less valid, because they inevitably cover fewer skills and a limited range of course elements. Teachers want both reliability and validity, but how can both be achieved?

By way of contrast, internal assessments are much broader in scope, capable of assessing higher-level skills and drawing on a greater range of assessment tools. Furthermore, there is evidence to suggest that internal assessment is more motivating for both teachers and students, whereas external assessment inevitably leads to too much ‘teaching to the test’, especially where the stakes are high and teachers and schools are continually judged by government league and performance tables.

The situation is further complicated by the use of statistical models to compare successive cohorts. It was the realisation that such models had been used to re-grade this summer’s GCSE English papers that prompted so much justifiable outrage from schools and teachers. How could well-documented improvements in performance, brought about by better teaching over a five-year period, be downgraded on the basis of pupils’ earlier Key Stage 2 test scores? To bring this into even sharper focus, one research study carried out in the 1990s concluded that up to a third of Key Stage 3 assessments were wrong anyway! Do the statisticians really factor the potential unreliability of that input data into their sophisticated models? Perhaps not; a prediction built on flawed scores simply inherits their errors.

If this summer’s grading crisis has revealed nothing else, it is that an examination system based on narrowly focused external tests, with restricted mark schemes and an over-reliance on statistical modelling, is far from fit for purpose. Ofqual and the examination boards have offered to undertake a thorough review of the examinations system, but that raises the question: will they seriously investigate the balance between reliability and validity, and the role of the professional teacher compared with the statistical machine?

Calls this week for a new professionalism for teachers are timely and should not be ignored. Part of this new era of professionalism must be a recognition that teachers need to be trusted, not only in their commitment to the welfare of the children in their care, but also as professional educators and assessors, fully capable of making accurate and far-reaching judgements about the performance and abilities of their students.

The USADA investigation into a sophisticated culture of doping, seemingly undetected by the most advanced testing regime in the world, has exposed not only the lengths to which professional athletes will go to cheat their way to success, but also the failure of such tests to be either reliable or valid. In the post-crisis world of public examinations in England, what is needed is a testing regime that rewards a broad range of knowledge, understanding and skills, and one that acknowledges the vital role teachers must play in making such judgements.

Ian Power, Membership Secretary, HMC