One number can’t illustrate teacher effectiveness

Imagine opening the morning paper over coffee and spotting your name on a list of fellow nurses or lawyers, musicians or bus drivers. Beside each name rests a stark, lonely number said to gauge the extent to which you advance the growth of your clients or customers.

For the record: A previous version of this article referred to “the RAND statistical procedure” for evaluating teachers on a value-added basis. The analysis was not a Rand project, but was done by a Rand researcher on a private basis for The Times. There was no Rand overview of this work.

Orwellian, perhaps. But 6,000 Los Angeles teachers will soon find their names on such a list.

The Times has already published a few “value added” scores for illustrative teachers, detailing the eye-popping variability in learning curves of third- to fifth-graders spread across the Los Angeles Unified School District. The Times claims these scores can validly peg the discrete effect of each teacher on their students’ growth. These journalists draw on a complicated statistical model built by a single Rand Corp. analyst, Richard Buddin, which has yet to be openly reviewed by scholarly peers.

Meteorologists can accurately estimate the average weather pattern for Sept. 1 over the past century, but their predictions for any specific Sept. 1 are much less reliable. Yet wise editors at The Times apparently believe they can magically set aside confounding factors and pinpoint the discrete effects of individual teachers on students’ learning.

The Times published a simple graph on its Aug. 15 front page as a way to publicly “out” a teacher whom its value-added study deemed ineffective. The graph showed declining raw test scores for the teacher’s students over two years. But this fails to take into account differences in student background, including English proficiency, parents’ education levels or home practices that affect children’s learning. Hospitals wouldn’t fire a doctor or nurse who focused on caring for the elderly or poor because their patients die at higher rates.

This is why The Times rightfully asked a qualified researcher at Rand Corp., the Santa Monica-based think tank, to devise a sophisticated statistical model in an attempt to isolate the discrete effect of pedagogical skills on student growth. But as the National Academy of Sciences pointed out last year, successfully doing so requires exhaustive data on each teacher and the contexts in which instruction occurs.

We know that student learning curves are flattened by lousy teachers. The Times’ analysis usefully illuminates the wide variation in the test scores of students across classrooms and schools. What’s risky is moving from a complicated statistical model to estimating the discrete effect of individual teachers, precisely the leap of faith being made by The Times.

Buddin’s statistical procedure, while competently carried out in general, fails to take into account classroom and school contexts that condition the potency of individual teachers. For example, if a teacher is assigned low-track students — those with weaker reading proficiency in English or lower math skills — negative peer effects will drag down student growth over the year, independent of the teacher’s pedagogical skills.

Or if parents self-select into higher-quality schools, as detailed in one Times story, the presence of students with highly dedicated parents will have a positive impact on student growth, again independent of the individual efforts of a teacher. By setting aside contextual effects, The Times overestimates a teacher’s effects — positive or negative — on student growth.

Furthermore, many students are not taught by a single teacher. Some have special reading instruction or oral language development. What if these activities are strikingly effective or make no difference at all? Under The Times’ model, such effects are attributed to the student’s main teacher.

The Times’ study fails to recognize that test scores across grade levels cannot be compared, given the limitations of California’s standardized tests. For example, third-grade scores across L.A. Unified were largely flat during the period that students were tracked, while fourth- and fifth-grade scores were climbing overall. Even ranking student scores across grades may be driven by differences in test items, not a student’s skill level. So, when The Times tries to control for family background with a third-grade test score, it does so inadequately, again overestimating the discrete effect of the teacher.

Given these analytic weaknesses, the ethical question that arises is whether The Times is on sufficiently firm empirical ground to publish a single number purporting to gauge the sum total of a teacher’s effect on children.

Based on a generation of research, we know bad teachers drag down student learning. Teachers unions continue to protect their poorly performing members in many cases. But this situation calls for careful science and mindful behavior by reformers and civic leaders, including The Times. Imprudent efforts could discourage strong teachers, now judged on simplistic value-added scores, from working with low-achieving students. Dumbing down the public discourse does little to lift teacher quality.

Bruce Fuller and Xiaoxia Newton are professors of education at UC Berkeley. University of Washington professor Dan Goldhaber, UCLA professor Meredith Phillips and UC Berkeley professor Sophia Rabe-Hesketh contributed to this article.