The Masterminds Behind the SAT

SPECIAL TO THE TIMES

Each day that she comes to the sprawling campus on the outskirts of this quintessential leafy college town, Anne Connell has a mission, one that could affect the educational destinies of thousands of students.

“I am the one who selects the ‘Question of the Day’ on the [Educational Testing Service] Web site,” said Connell, her cheery eyes turning somewhat diabolical as she speaks. “When I’m working on that SAT question, I try to make sure it’s just right, something that will challenge.”

Those simple little letters--SAT--strike an awful lot of fear into the souls of prospective college students. And it is Connell and her cohorts in these low-slung buildings at the Educational Testing Service in the rolling central New Jersey countryside who help determine those teenagers’ fate. They are the creators of the SAT, or SAT I (formerly the Scholastic Aptitude Test), the 138-question test most colleges still require for admission. Nearly 2.4 million SAT and 2.2 million PSAT (Preliminary Scholastic Aptitude Test, the “practice” for the SAT) tests were taken in 1999-2000. (Many students take them more than once.)

Recently, University of California President Richard C. Atkinson suggested that his university and others should de-emphasize the use of the SAT in college admissions. But as the SAT comes under renewed fire from administrators, academics, parents and students, the people who make up the test go about their jobs with calm and serious purpose, confident of the test’s staying power.

After all, the recent California uproar is not the first time the SAT as a universal tester has been called into question. “It comes in waves,” said Gretchen W. Rigol, vice president of higher education services at the College Board, which employs the Educational Testing Service to make the SAT and administers the test. “I’ll tell you why it comes in waves. Until every discernible group in this country gets the same scores--Asian Americans, Latinos, males, females, the handicapped--people are going to say, ‘Why is that?’ One of the answers they like to come up with is that the instrument is biased. But we do our best to make that not be the case.

“As Fred Hargadon, dean of admissions at Princeton, told me, ‘If we didn’t have the SAT, we would have to invent it,’ ” said Rigol.

Those who create the SAT at the testing service are primarily middle-aged and middle-class, most with an educational background of some sort--teaching or administrative. Backpacks abound on the 400-acre campus, and often you will see people gazing at the greenery as they walk about, presumably for inspiration. Temperamentally, the test-makers are a bit bookish, but that is understandable, given the serious implications of their work.

“We have children, too,” said Robin O’Callaghan, who creates questions, among other duties, in her job as director of math skills for the SAT. “We know how important the SAT can be in their lives.”

Like members of any team, they have their jargon. For instance, they would call the “Question of the Day” a misnomer.

“They are items, not questions,” said Chancey Jones, the soft-spoken, grandfatherly executive director of the test division, who has been a math specialist at the testing service for 35 years after an initial teaching career. “We are testing aptitude, not making them recall facts. So it’s hard to call them questions.”

These are folks who dicker with pie charts and n² and analogies and reading comprehension passages on a daily basis. They try to think like teenagers--smart teenagers, to be sure--and are dedicated to the proposition that if all kids cannot test equally, they can at least have an equal shot at getting an item solved correctly.

The nonprofit testing service, which employs 2,100 people full-time, also makes up the tests for such standardized exams as TOEFL (Test of English as a Foreign Language) and GRE (Graduate Record Examinations) and the SAT II (often known as “achievement tests,” those college-entrance exams that measure skills in specific academic areas). But since so many students from so many backgrounds take the SAT, a vast number of items is needed--up to 1,500 for more than 20 versions of the test in some years--and it is a major focus of the testing service.

Creating items in their final form for the test is a collaborative process, much like, say, writing a screenplay. Rarely does an item come through unedited, even from the minds of seasoned writers. A typical item can go through as many as six to eight reviews. Committees both inside ETS and outside the campus look over every question--the outside committees comprising academics and, sometimes, laypeople and students from around the country.

“We realize that even if we didn’t come from here, we do think like people from the Northeast,” said O’Callaghan. “We need to have people who live in other places review what we’ve done.”

One item that eventually failed to pass muster, for instance, started out like this: “Blizzard:snowstorm::??”--meaning blizzard is to snowstorm as x is to y. There were five choices of answers, the correct one being “downpour:shower.” But while most young people in the Northeast are extremely familiar with blizzards, those who live in the warmer climates of the country may never have seen or even heard of one, so the question was ditched.

It often takes a question 18 months to wend its way from that first thought to an actual SAT test. In addition to the regular reviews, each question gets two different fairness reviews. The first is straightforward. “Basically, it is this: Will this item offend any particular group?” said Sydell Carlton, chair of the ETS Fairness Steering Committee. You will never see mentions of Hitler or the Holocaust, for instance, said Carlton. Anything that stereotypes a certain race, religion, sex or ethnic group is verboten. Birth control, abortion, evolution, gun control and experimentation on animals are out as well--too controversial. “Anything that will get the adrenaline pumping, we don’t need it,” she said.

Then there is differential item functioning. If this piece of jargon sounds like computerese, well, it is. Each item gets a dry run before it takes an official place on the test. The SAT has seven sections, but only six of them count toward a score. One of the sections--the identity of which is never revealed--is used solely for refining the test, and its results are analyzed by computer for ethnic, racial and gender correlations. If results for one group are particularly out of whack on an item, the item is eliminated.
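The article doesn’t name the statistic behind that computer check, but the standard differential-item-functioning screen in the psychometric literature--one developed by ETS researchers (Holland & Thayer, 1988)--is the Mantel-Haenszel procedure: match examinees on total score, then ask whether one group still misses the item more often than the other. Here is a minimal Python sketch; the data layout, function name and flagging threshold are illustrative assumptions, not ETS’s actual code.

```python
from collections import defaultdict
from math import log

def mh_d_dif(records):
    """Screen one trial item for differential item functioning (DIF).

    `records` holds (group, total_score, got_item_right) tuples, where
    group is "reference" or "focal". Stratifying by total score means
    only similarly able test-takers get compared with one another.
    """
    # Per-stratum 2x2 counts: [ref right, ref wrong, focal right, focal wrong].
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        offset = 0 if group == "reference" else 2
        strata[score][offset + (0 if correct else 1)] += 1

    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        num += a * d / n  # reference right, focal wrong
        den += b * c / n  # reference wrong, focal right
    alpha = num / den     # Mantel-Haenszel common odds ratio

    # Reported on an ETS-style "delta" scale: 0 means no DIF; values
    # far from zero (roughly beyond |1.5|) flag an item for scrutiny.
    return -2.35 * log(alpha)
```

Fed the answer records from the unscored dry-run section, an item like the “obliterate” one described below would land far from zero and get thrown out.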

“It’s not necessary to provide a reason, but we do try to determine why so we don’t do it again,” said Kevin Gonzalez, an ETS spokesman. For instance, there was a test item that used the word “obliterate.” Females who scored similarly to males on the test as a whole missed that item frequently, while the males did not. The item-making gurus determined that males are more accustomed to warlike and weapons-oriented words--perhaps because they play strategic video games more--and thus were predisposed to know what “obliterate” meant. So for the purpose of the test, “obliterate” was . . . obliterated.

But aren’t there just some things everyone should know?

“There really is no ‘should,’ ” said Jane Kupin, one of the math specialists. “We’re not supposed to be testing just plain facts. We’ll sometimes even put in [as a footnote to an item] that a dozen equals 12 or that a penny is 1 cent, especially if tests are used in foreign countries. We’re testing how students reason.”

Then there is the problem of coordinating questions within a test.

The items--60 in math and 78 in verbal in each test--go from easy to hard within each section. And there can’t be, say, too many items dealing with bar charts or mentioning cooking in each section. Each item is rated for difficulty on an arbitrary scale of 0.1 to 0.9.
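Taken together, the easy-to-hard ordering and the topic limits amount to a small selection problem over the rated item pool. Here is a toy Python sketch; the item fields, the per-topic cap and the greedy strategy are all assumptions for illustration, not a description of ETS’s actual assembly software.

```python
from collections import Counter

def assemble_section(pool, length, max_per_topic=2):
    """Greedily pick `length` items from a rated pool, easy to hard.

    Each item is a dict carrying a "difficulty" rating (the 0.1-0.9
    scale the article describes) and a "topic" tag such as
    "bar charts" or "cooking".
    """
    chosen, seen = [], Counter()
    for item in sorted(pool, key=lambda it: it["difficulty"]):
        if seen[item["topic"]] >= max_per_topic:
            continue  # e.g., skip a third bar-chart item
        chosen.append(item)
        seen[item["topic"]] += 1
        if len(chosen) == length:
            break
    return chosen  # the sorted pass leaves the section easy-to-hard
```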

“We generally know that items dealing with percentages are more difficult,” said O’Callaghan. She said that the “Roman numeral items” are considered harder, too, because they require so much concentration. In these, a passage is followed by three summary statements labeled I, II and III. The answer choices then say something like “I only” or “I and II only” or “All three.”

Easy items have simpler vocabulary and simpler concepts like money and time. “But it’s the moderate ones that are harder for us to do,” O’Callaghan said. “Obviously difficult and obviously easy come quickly, but you don’t want to skew the moderate ones too far to either side.”

Though there are 47 people at ETS who create items as part of their jobs, about a third of the original items are contributed by freelancers. Jones said this gives the test greater geographic diversity and frees the staff to concentrate on reviewing and coordinating items into whole tests.

The SAT was developed in 1926 but didn’t come into widespread use as a college-entrance standard until after World War II, when masses of students started applying to colleges and universities. And it has changed remarkably little over the years. Even the fairness reviews are well-entrenched, having been instituted in the late 1960s.

Jones was a teacher before he came to ETS 35 years ago. He admitted that when he arrived, he was a little worried that he wouldn’t come up with proper items on cue. “But one of my mentors here said that I would be surprised, that I would come up with items in the oddest places,” said Jones. “He told me to keep a pad right by the shower. Sure enough, I came up with items when taking a shower, and I wrote them down right away. I still keep Post-Its everywhere. You never know.”

The item-makers don’t think their children have any undue advantage because their parents are at the nexus of testing. Though Princeton area high-schoolers regularly score among the highest in New Jersey on the SAT tests, O’Callaghan was loath to attribute that to some of them being the children of the 2,300 ETS employees.

“This is an area that has a lot of educational institutions,” said O’Callaghan, citing the proximity of Princeton and Rutgers universities, research arms for many pharmaceutical and scientific firms and several top prep schools. “I have six kids between 13 and 25. Two of them took the SAT before I got here. Two took it during my stay. Two will take it in the future. It hasn’t mattered in their scores.”

Nor, she and the other question-makers believe, would test-preparation classes. Though a whole industry has grown up around readying students for the test, the best preparation, they say, is free. They recommend that students study old tests in the brochure available from the testing service. And, of course, do their schoolwork.

“We do realize that the SAT is not the be-all and end-all,” O’Callaghan said. “You’ve got to do well in school, too.”

The item-writers defend their testing that way--that the more seriously a student takes school, the better he or she will do on the tests. “It’s not a complete correlation, but if you take more math courses, if you expose yourself to more reading, you will probably do better on our items,” said Dan Johnson, a test developer for the SAT verbal sections who used to teach at nearby Rutgers.

Johnson spends a lot of his time on what seems to be the bane of many test-takers, the reading comprehension sections. He and other verbal skills folks scour all sorts of printed materials for just the right passages, between 450 and 800 words, that will yield the proper kinds of questions. Though it accounts for only half of the verbal test, reading comprehension takes up three-quarters of the test-taker’s time. Thus, it is important not to skew the passages any particular way.

“We only want to have items that ask about rhetorical strategies or assumptions you might make about the passage,” said Johnson. “We usually have to have 24 test items to get down to the few that we will eventually use in a test.”

Sometimes the reading passages come a cropper for reasons that surprise their selectors. Fairness chieftain Carlton told of one passage she originally thought was going to be grand.

“It dealt with a slave tricking his master,” she said. “We had a visiting scholar program at the time, and one of them was a 60-year-old black teacher from St. Louis named Pearl.” The teacher pointed out that “there was a large body of literature which showed blacks, especially slaves, as being dishonest, and this only served to support that,” Carlton recalled. “Since we don’t want stereotypes of any kind, we dropped the passage.”

There are some kinds of items, though, that still make it into the test despite some statistical biases. Female students, said O’Callaghan, tend to do better on straightforward math questions dealing only with computation. Males tend to do better with practical ones, using everyday language.

“I don’t know whether it’s that boys figure out batting averages or go to the garage more with their parents, but it seems to correlate that way,” she said. Since it would be too hard to get away from using practical language, those questions will stay, with the hope that females will get better at them as time goes on.

When items finally make it through every review, they can linger for a while. Sometimes, though, the item-writers need to rev up for a few extras. “When the Iron Curtain fell, that was tough for us,” said Carlton. “All those passages with the Soviet Union in them had to be thrown away. Time marches on, even at ETS sometimes.”
