If your genome is public, so are you, researchers find
Scouring information available to anyone with an Internet connection, a team of genetic sleuths deduced the names of dozens of supposedly anonymous people who had their DNA analyzed for scientific and medical research.
The snooping feat, which took advantage of genealogy websites that let people compare their DNA to search for relatives, was in full compliance with federal privacy regulations. Experts said it underscored a stark reality about genetic privacy in the age of social media: Don’t count on it.
“Nobody can promise privacy,” said Mildred Cho, who heads up Stanford University’s Center for Integration of Research on Genetics and Ethics, and wasn’t involved with the study.
Whitehead Institute geneticist Yaniv Erlich and his team, who described their work Thursday in the journal Science, didn’t provide a complete recipe that would help others ferret out the identities of research volunteers. Nor did they divulge the names of the people they were able to unmask.
Since the first draft of the human genome was published in 2000, scientists have scrutinized its 3 billion pairs of DNA letters to try to find variants that cause disease, to understand human physiology, and to unravel the evolutionary history of our species.
Toward that end, academic efforts like the 1000 Genomes Project post complete genomes online for public use. The idea is that providing free access to the data will allow scientists to compare DNA from many people and help them discover connections between genes and traits, eventually leading to the development of personalized, targeted treatments for a wide range of disorders.
Keeping genomic data private has been a concern all along. Worries that health insurers or employers might use information about genetic health risks to drop benefits or discriminate against workers inspired the 2008 Genetic Information Nondiscrimination Act, which provides protection against abuse. Last year, the Presidential Commission for the Study of Bioethical Issues recommended a variety of additional measures to further secure genetic data.
Potentially complicating these efforts are the legions of amateur geneticists who want to learn their risk for diseases or gain clues about their ancestry. As sequencing costs have dropped, these enthusiasts have sent vials of saliva, swabs of cheek cells, circles of dried blood or other types of DNA samples to private sequencing companies. Often, they post their test results online, for the world to see.
Erlich has been interested in privacy since he worked as a professional hacker — breaking into corporate networks as a “vulnerability researcher” for a computer security company — to help support himself in college. He started planning the current research after hearing about a 15-year-old boy who had part of his genome sequenced in 2005 in order to find his biological father, a sperm donor.
The boy compared a pattern of repeating DNA letters from his Y chromosome to the corresponding patterns of men who had posted their genetic data on a genealogy website. Finding several men whose pattern matched his led him to his father’s last name. He then used other clues to make contact.
Y chromosomes correlate with surnames because both are passed directly from father to son.
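The matching idea can be sketched in a few lines of Python. This is only an illustration of comparing repeat counts at Y-chromosome markers against surname-tagged records, not the researchers' procedure, which they did not publish in full. The marker names follow the usual naming style for such repeats, but every repeat count, surname, function name and the one-mismatch tolerance below is an invented assumption.

```python
from collections import Counter

# Hypothetical genealogy records: (surname, {marker: repeat count}).
# All repeat counts and surnames are invented for illustration.
RECORDS = [
    ("Alder", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Alder", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Birch", {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 12}),
    ("Cole",  {"DYS19": 14, "DYS390": 23, "DYS391": 11, "DYS393": 14}),
]


def mismatches(a, b):
    """Count the shared markers whose repeat counts differ."""
    shared = a.keys() & b.keys()
    return sum(1 for marker in shared if a[marker] != b[marker])


def candidate_surnames(query, records, max_mismatches=1):
    """Tally surnames of records that nearly match the query profile.

    max_mismatches is an assumed tolerance: a small number of differing
    markers is allowed, since repeat counts can change between generations.
    """
    tally = Counter()
    for surname, profile in records:
        if not (query.keys() & profile.keys()):
            continue  # nothing to compare, so no evidence either way
        if mismatches(query, profile) <= max_mismatches:
            tally[surname] += 1
    return tally.most_common()


query = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
print(candidate_surnames(query, RECORDS))  # [('Alder', 2)]
```

In this toy data, two records sharing one surname fall within the tolerance, which is the kind of agreement that would point to a likely last name.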
Erlich said he thought the boy’s approach was “brilliant,” and he wondered if his lab could do something similar with public genome data.
He and his colleagues started by analyzing the repeat patterns of Y chromosomes in published studies of genomes whose owners were known. They used a free genealogy website to look for surname matches.
In two of the cases, the Y chromosome data lined up with relatively common last names, so the results were of little use. But one of the samples — provided by sequencing pioneer J. Craig Venter — matched the surname “Venter.” From there, the team used a free Web directory and personal information that often accompanies genomes in public databases — age and state of residence — to zero in on the scientist.
Then they moved on to 10 mystery genomes collected from Utah residents who participated in public sequencing projects. They found surname matches for five people, then used those names to look at obituaries, family trees on file with the genomic information and other sources to link nearly 50 related men and women to their DNA.
Analyzing census and genetic data, the team calculated they could find the correct surnames of white, middle- and upper-class men in the U.S. 12% of the time. Conducting a search using last name, year of birth and state of residence produced lists of about a dozen people, a number small enough to investigate in more detail, Erlich said.
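The narrowing step described here amounts to filtering a directory on three fields at once. A minimal sketch follows, assuming a hypothetical in-memory directory; the Person record, the shortlist function and every entry are invented for illustration and do not reflect the team's actual tools or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    surname: str
    birth_year: int
    state: str


# Hypothetical directory entries; every name and value is invented.
DIRECTORY = [
    Person("A. Alder", "Alder", 1950, "UT"),
    Person("B. Alder", "Alder", 1950, "CA"),
    Person("C. Alder", "Alder", 1972, "UT"),
    Person("D. Birch", "Birch", 1950, "UT"),
]


def shortlist(directory, surname, birth_year, state):
    """Keep only entries that match all three search fields."""
    return [
        p for p in directory
        if p.surname == surname
        and p.birth_year == birth_year
        and p.state == state
    ]


matches = shortlist(DIRECTORY, "Alder", 1950, "UT")
print([p.name for p in matches])  # ['A. Alder']
```

Each field alone leaves several candidates in this toy list; combining all three is what shrinks the list to a size that can be checked by hand.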
The discoveries in the new study point to a new level of vulnerability for research subjects who wish to remain private, Cho and others said.
To Laura Lyman Rodriguez, a policy specialist at the National Human Genome Research Institute in Bethesda, Md., the bottom line is that research subjects should be told that their genomic data could be breached.
“It’s important to be clear,” said Lyman Rodriguez, who co-wrote a commentary that accompanied the report in Science.