VAM on Trial (Re-posted from InterACT)

In New York right now, a teacher is suing the state over the use of value-added measures in teacher evaluations, claiming that the current system “actually punishes excellence in education through a statistical black box which no rational educator or fact finder could see as fair, accurate or reliable.” It’s unfortunate that we have to have teacher evaluation policies and procedures hashed out in court, but we’ve seen plenty of it in recent years.

This particular case reminded me of a blog post I wrote a few years ago, imagining how difficult it might be to defend VAM in teacher evaluation if the debate took place in a court rather than in a board room or think tank. I’ve reposted that piece here.

Originally posted at InterACT (11/1/11)

Los Angeles Unified School District is embroiled in negotiations over teacher evaluations, and will now face pressure from outside the district intended to force counter-productive teacher evaluation methods into use.  Yesterday, I read this Los Angeles Times article about a lawsuit to be filed by an unnamed “group of parents and education advocates.”  The article notes that, “The lawsuit was drafted in consultation with EdVoice, a Sacramento-based group. Its board includes arts and education philanthropist Eli Broad, former ambassador Frank Baxter and healthcare company executive Richard Merkin.”  While the defendant in the suit is technically LAUSD, the real reason a lawsuit is necessary, according to the article, is that “United Teachers Los Angeles leaders say tests scores are too unreliable and narrowly focused to use for high-stakes personnel decisions.”  Note that, once again, we see a journalist telling us what the unions say and think, without ever, ever bothering to mention why, offering no acknowledgment that the bulk of the research and the three leading organizations for education research and measurement (AERA, NCME, and APA) say the same thing as the union (or rather, the union is saying the same thing as the testing experts).  Upon what research does the other side base arguments in favor of using test scores and “value-added” measurement (VAM) as a legitimate measurement of teacher effectiveness?  They never answer, but the debate somehow continues ad nauseam.

It’s not that the plaintiffs in this case are wrong about the need to improve teacher evaluations.  Accomplished California Teachers has published a teacher evaluation report that has concrete suggestions for improving evaluations as well, and we are similarly disappointed in the implementation of the Stull Act, which has been allowed to become an empty exercise in too many schools and districts.

Over at EdWeek, Stephen Sawchuk picked up the story and wondered if this action is a sign of things to come – litigating education policy.  On one hand, I hate to think that we would resort to the courts to settle matters that can and should be addressed by professional educators based on an understanding of research and best professional practices. On the other hand, I figured a good defense attorney would shred these plaintiffs and the credibility of their cause.  It will be interesting to see the actual language in their lawsuit, but the broader concept is already familiar.  It’s just never been given a courtroom treatment that I know of.  So, I’ve taken the liberty of dreaming up the court transcript ahead of time (using Q: for the defense attorney’s questions and A: for the plaintiff’s answers).   Enjoy this cross-examination.

Q: You are demanding that LAUSD use measures of student growth in teacher evaluations, is that correct?
A: Yes.
Q: And you believe that student test scores are a measure of growth that would reflect teaching quality, correct?
A: Yes.
Q: If LAUSD were to adopt a policy that attributes the growth or lack of growth in student test scores to the student’s teacher, and uses the scores of all students to evaluate the teacher’s effectiveness, you would drop this lawsuit, is that correct?
A: Yes.
Q: How often are these tests administered?
A: Once per year.
Q: And the district has no way of knowing if the student’s performance on that day reflects the student’s ability or perhaps reflects some trauma, distress, boredom, distraction, or rebelliousness?
A: No.
Q: And for students who have changed schools, or changed teachers during the year, there’s no way to factor that into the analysis of data when a student simply shows up on one roster or another, right?
A: That could be adjusted.
Q: There’s no study that would guide you in how to do that with any accuracy, is there?
A: I don’t know.
Q: No evidence that a move at the mid-point of the year gives each teacher half the responsibility for the student’s learning, or that each week has a proportionate effect?
A: None that I know of.
Q: And would the degree of change in a certain classroom affect students in that classroom who had not been part of any change?
A: I don’t know.
Q: Does it seem likely that changing the students in a class would change the class itself and affect some of the students who had been there all along?
A: I guess so.
Q: But you would have no way of knowing which students were affected or how they were affected?
A: Not really, no.
Q: Now, if I were a high school English teacher, I would be responsible for teaching in four standards areas, but would the test cover all four of those areas?
A: No.
Q: How many does it cover?
A: Two.
Q: You’re including writing when you say “two” but in fact there’s no writing on the tests currently used, is there?
A: No.
Q: So more accurately, the test covers one out of the four standards areas?
A: Yes.
Q: Does the test cover every standard in reading?
A: No.
Q: So, you’re proposing basing a significant part of an English teacher’s evaluation, for example, on a test result that covers a small fraction of the standards?
A: It’s the only objective way.
Q: So your answer is yes?
A: Yes.
Q: By objective, you mean it’s the same for every student and teacher?
A: Yes.
Q: Does every teacher have an equal assignment, equal students, classes, and resources?
A: No.
Q: So, you do not concern yourself with objectivity in all of the factors affecting the teacher’s work, but you figure you can evaluate different teachers working with different students and different classes using the same test that covers only a fraction of their standards?
A: Yes.
Q: So is that an objective process for evaluation, or an arbitrary process with an objective element in it?
[Plaintiffs’ counsel objects to argumentative question. Judge sustains the objection.]
Q: Do the words “objective” and “fair” have the same definition?
A: I couldn’t say.
Q: I could give an objective geometry test to every student in an algebra class, but would that be fair?
A: Okay, I see. They have different meanings.
Q: So your claim that the test is objective doesn’t cover the question of fairness, does it?
A: But it is fair!
Q: Please answer the question.  A claim of objectivity is different from a claim of fairness, correct?
A: Yes.
Q: So an objective test may be inappropriate for certain students and therefore unfair, no matter how objective?
A: I would say that the test is fair to everyone.
Q: Like a geometry test for algebra students?
A: Well, no.
Q: Does a student’s linguistic skill relate to their success in a test that requires use of language?
A: Of course.
Q: So a test given in an unfamiliar language might yield a result that reflects linguistic confusion rather than conceptual confusion, or poor teaching?
A: We could adjust for language in a teacher’s evaluation.
Q: In what way?
A: If the student is still learning English their scores could be separated out.
Q: What if a student did well on the test despite being new to the language?
A: Well, we can’t just use the scores that help the teacher.  We have to be fair.
Q: You mean objective?
A: Yes.
Q: Because actually, it would be fair to use the results that are valid and exclude the results that are invalid.  Are you suggesting that such a determination could be made for each student, or that we should come up with a single formula and stick to it?
A: Just use a single formula.
Q: So regardless of the student’s actual linguistic knowledge, you would suggest making assumptions based on a certain number of years for students to learn enough academic English?
A: That would be logical.
Q: No matter the variables in the student’s instruction in English or the amount of time it actually takes them to learn English?
A: It’s the only fair way.
Q: Fair, or objective?
A: Objective.
Q: Objective regarding the student’s knowledge and skill, or objective regarding only measures of time?
A: Time.
Q: Is it fair to use value-added measurements to rank teachers even when numerous studies show that it is a volatile measure with error rates exceeding 25%?
A: It would only be one of multiple measures.
Q: That wasn’t my question.  Is it fair to use an error-prone measure?
A: It’s not fair to exclude student performance from evaluations.
Q: Your Honor, would you instruct the witness to answer the question?
A: I’ll answer.  It may not always be fair in every case, but no method is perfect.
Q: You’re suing the Los Angeles Unified School District to compel them to use a teacher evaluation method that is prone to errors and unfair to perhaps a quarter of the teachers evaluated in this manner, is that correct?
A: Yes!  The alternative is the status quo, which is intolerable.
Q: But there are thriving, high-quality schools around the U.S. and around the world that are not using value-added measures.  Doesn’t that prove that there are alternatives to the LAUSD status quo that are something other than the remedy you seek to impose?
[Plaintiffs’ counsel objects to argumentative question. Judge sustains the objection.]
Q: Have you heard of the National Council on Measurement in Education, the American Psychological Association, and the American Educational Research Association?
A: Yes.
Q: Are you aware of their position on the lack of validity in using tests designed for one purpose and then used for another purpose?
A: More or less.
Q: I’m quoting from their joint position statement on this topic: “Tests valid for one use may be invalid for another. Each separate use of a high-stakes test, for individual certification, for school evaluation, for curricular improvement, for increasing student motivation, or for other uses requires a separate evaluation of the strengths and limitations of both the testing program and the test itself.”  Does that sound familiar to you?
A: More or less.
Q: In other words, you’ve heard this argument before?
A: Yes.
Q: Is it fair to say that these are the three leading organizations for educational measurement and research?
A: I suppose so.
Q: Are you a professional organization for educational research and measurement?
A: No.
Q: Do you think it’s advisable, or even responsible, to ignore the policy position of these leading organizations?
A: But we know that teachers are the most important in-school factor on student performance!
Q: Okay, no argument there.  But you have no basis upon which to argue against the validity issues raised in that quote, do you?
A: No.
Q: Now, taking up your contention that the teacher is the most important in-school factor, could you say most important out of how many factors?
A: No.
Q: You don’t know how many factors influence student performance?
A: No.
Q: If I threw out a number, like five, would you guess that it’s too low, too high, or about right?
A: That sounds too low.
Q: How about ten?
A: I don’t know, that might be right.
Q: Fifteen?
A: Maybe.
Q: Just hypothetically, could we proceed on the assumption there are ten factors in schools, other than teachers, that affect student performance?
A: Okay, yes.
Q: Would you expect every factor to have the same influence on every student, or would some factors have strong influences on one student and almost no influence on another student?
A: It would vary.
Q: If you wanted to design a fair formula, you would take those ten factors into account?
A: Yes.
Q: Even though you can’t say for sure how much each factor affects the student?
A: Yes.
Q: You can’t even say with certainty that a specific factor has any effect on a certain student or group of students?
A: No.
Q: So, let’s assume that each of those ten factors could play out in only two different ways: how many possible combinations do we have for each student?
A: Twenty.
Q: I’m sorry to correct your math, but actually, that would be two to the tenth power, or 1,024 possibilities.
A: Oh, yes, 1,024, I see.
Q: But we don’t know for sure how many factors to consider and what they are.  And if we could actually identify fifteen variables instead of ten, and if each variable could play out in three different ways, would it surprise you to know that there would be more than 14 million possible combinations?
A: That sounds like a lot, but you’re just playing with numbers.
Q: “Just playing with numbers.”  I see.  So just because something is true mathematically or statistically, it doesn’t necessarily translate into an actionable policy?
A: That’s not what I said.
Q: Of course you wouldn’t say that.  Your case is predicated on the idea that because you can make value-added calculations that show some teachers are less effective than others, it therefore makes sense to use the numbers in policy that leads to the outcomes you want.  Though again, the actual experts in educational measurement would warn against that, correct?
[Plaintiffs’ counsel objects to argumentative question. Judge sustains the objection.]
Q: That’s what you need to do if you use test scores and value-added measures in teacher evaluation, isn’t it?  Play with the numbers?  You would need to come up with a formula that makes certain assumptions about the effect of each factor, even though you can’t test your assumptions?
A: They’ve been researched!
Q: But you just said that we can’t assume factors are the same for each student – or did you mean that these students in this hypothetical school will have been researched before any formulas are applied to them?
A: No.
Q: Okay, to be fair, let’s assume that we can come up with a formula for each of these individual factors.  Wouldn’t it also be necessary to know about the interactions of the variables?
A: What do you mean?
Q: Well, perhaps we can apply a statistical control for homelessness, another to control for the time of day that the student studies a certain subject, and another to control for the change from last year’s 50-minute class periods to this year’s 90-minute class periods.  Is it likely that there is any research on the effects for homeless students in longer classes at different times of day?
A: No.
Q: So when we combine factors, we not only make assumptions about each one, but also assume that these factors do not influence each other in any way, is that right?
A: You can’t study every little thing.
Q: So, if this were a medicine, you’d be comfortable saying that we have plenty of science about the ingredients and we don’t need to study them in this particular combination in order to assume the effects the medicine will have?
A: I don’t know anything about medicine.
Q: Have you ever been a teacher?
A: No.
Q: Thank you.  No further questions.
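
A note on the arithmetic behind the question about combinations of factors: when each of k factors can independently take one of n states, the number of distinct combinations is n raised to the power k. The short Python sketch below is only an illustration of that counting rule. The function name and the idea of treating school factors as independent, discrete variables are assumptions made for the example, not part of any actual value-added model.

```python
from itertools import product


def count_combinations(num_factors: int, states_per_factor: int) -> int:
    """Count the ways num_factors independent factors can combine
    when each factor takes one of states_per_factor states."""
    return states_per_factor ** num_factors


# Ten hypothetical factors, two possible states each.
ten_binary = count_combinations(10, 2)

# Fifteen hypothetical factors, three possible states each.
fifteen_ternary = count_combinations(15, 3)

# Cross-check the smaller case by listing every combination explicitly.
enumerated = sum(1 for _ in product(range(2), repeat=10))

print(ten_binary)       # 1024
print(fifteen_ternary)  # 14348907
print(enumerated)       # 1024
```

Even under these simplified assumptions, the count grows exponentially with the number of factors, which is the point the defense attorney is pressing.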
