If you’re interested in the ongoing debates about the use of student test scores to measure teacher effectiveness, and the closely related question of what to do with that information if it’s valid, then you should really follow the blog VAMboozled, created and managed by Audrey Amrein-Beardsley, an Associate Professor of Education at Arizona State University. She and her co-bloggers do an excellent job of highlighting the myriad problems with value-added measurement, or VAM, when used for evaluating teachers.
This is a topic I’m quite passionate about (see “VAM on Trial” or “VAM Nauseum: Bleeding the Patient” at my former blog, InterACT), though my concern does not arise because anyone has ever suggested evaluating me based on test scores. Instead, the problem is that we overvalue school, district, or state test scores, that we treat the scores as important (or even useful) measures of individual student learning, and that basing teacher evaluation on test scores, to any degree, is a misuse of the tests that will also have negative consequences for schools, for individual teachers, for our profession, and for our students.
VAMboozled consistently features posts that go into the technical details of research, though you shouldn’t let the talk of stats and methods stop you. Here’s one recent example:
Regarding an article titled “Sensitivity of Teacher Value-Added Estimates to Student and Peer Control Variables” (Johnson et al.), Amrein-Beardsley notes:
- Different VAMs produced similar results overall [when using the same data], almost regardless of specifications. “[T]eacher estimates are highly correlated across model specifications. The correlations [they] observe[d] in the state and district data range[d] from 0.90 to 0.99 relative to [their] baseline specification.”
- However, “even correlation coefficients above 0.9 do not preclude substantial amounts of potential misclassification of teachers across performance categories.” The researchers also found that, even with such consistencies, 26% of teachers rated in the bottom quintile were placed in higher performance categories under an alternative model.
Which means? Advocates of VAM might say that it doesn’t much matter which VAM model is used, since the researchers found correlations of 0.90–0.99 across different models. (A correlation of 1.0 would mean total consistency regardless of which value-added formula was used.) But as good as a 0.9 correlation sounds, a teacher rated in the bottom 20% under one model had a 26% chance of moving up under a different model. So if you’re one of those teachers with a low VAM rating, and that rating counts for a lot in your district or state, it would be rather troubling to know there’s a 1-in-4 chance that a different formula would have kept your job out of jeopardy. The likely reason is that in many cases these calculations are being used to make rather fine distinctions.

Let’s put it in baseball terms. If your batting average is .250, you only need one more hit every 40 at bats to raise your average to .275 (10 hits in 40 at bats is .250; 11 hits is .275). That’s one more hit over the course of about 10 games. And if .250 is the lowest average on your team, and a few batters are hitting above .250 but below .275, that slight variation could move you up a few steps from the job insecurity of being the team’s “worst” batter. A small difference in the numbers can be made to look significant when we focus on ranking people.
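If you want to see how this plays out numerically, here’s a quick back-of-the-envelope simulation. This is mine, not from the study: the teacher count, noise level, and “model A”/“model B” setup are all invented for illustration. It generates two sets of VAM-style scores for the same simulated teachers, correlated at roughly 0.9, and then checks how many bottom-quintile teachers under one model escape the bottom quintile under the other:

```python
import random
from statistics import mean

# Illustrative simulation only; none of these numbers come from the
# Johnson et al. study. Each simulated teacher has a "true" effect, and
# two hypothetical VAM specifications each observe it with independent
# noise, tuned so the two sets of estimates correlate near 0.9.
random.seed(42)
N = 10_000
true_effect = [random.gauss(0, 1) for _ in range(N)]
noise_sd = 0.33  # chosen so corr(model_a, model_b) comes out around 0.9
model_a = [t + random.gauss(0, noise_sd) for t in true_effect]
model_b = [t + random.gauss(0, noise_sd) for t in true_effect]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

corr = pearson(model_a, model_b)

# Bottom quintile under each model: the lowest-rated 20% of teachers.
k = N // 5
bottom_a = set(sorted(range(N), key=lambda i: model_a[i])[:k])
bottom_b = set(sorted(range(N), key=lambda i: model_b[i])[:k])

# Of the teachers in the bottom quintile under model A, what share lands
# in a higher performance category under model B?
escaped = len(bottom_a - bottom_b) / k
print(f"correlation: {corr:.2f}  bottom-quintile escape rate: {escaped:.0%}")
```

Even though the two sets of scores agree almost perfectly on the overall ranking, a sizable share of “bottom quintile” teachers end up there under one formula but not the other, which is the study’s point in miniature.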
To see an example of how ridiculous and detrimental this all can be in real life, with real teachers, check out this VAMboozled post on a current trial in New York.