Scientists use Wikipedia search data to forecast spread of flu


Can public health experts tell that an infectious disease outbreak is imminent simply by looking at what people are searching for on Wikipedia? Yes, at least in some cases.

Researchers from Los Alamos National Laboratory were able to make extremely accurate forecasts about the spread of dengue fever in Brazil and flu in the U.S., Japan, Poland and Thailand by examining three years’ worth of Wikipedia search data. They also came up with moderately successful predictions of tuberculosis outbreaks in Thailand and China, and of dengue fever’s spread in Thailand.

However, their efforts to anticipate cases of cholera, Ebola, HIV and plague by extrapolating from search data left much to be desired, according to a report published Thursday in the journal PLOS Computational Biology. But the researchers believe their general approach could still work if they use more sophisticated statistics and a more inclusive data set.


Accurate data on the spread of infectious diseases can be culled from a variety of sources. Government agencies typically get it from patient interviews and laboratory test results. Other data sources include calls to 911 lines, emergency room admissions and absences from work or school.

The problem with these methods is that they can be time-consuming and costly. By the time the numbers are crunched, an outbreak may be in full swing.

If you want to stop an outbreak before it starts -- and if you want to save lives and money, you certainly do -- what you need is a forecast that is both accurate and timely. And so the Los Alamos researchers turned to the treasure trove that is Wikipedia.

In addition to hosting about 30 million articles on topics ranging from quantum foam to the First English Civil War to Kim Kardashian, Wikipedia collects data on the approximately 850 million search requests it gets each day. In previous studies, researchers have used this publicly available data to predict ticket sales for new movies and the movement of stock prices.

When it comes to health, people have found correlations between interest in certain health topics on Wikipedia and sales of medications. Others have linked searches for flu-related topics by American Wikipedia users to actual flu spread in the U.S.

Five members of LANL’s Defense Systems and Analysis Division thought they could do more. Their goal was to get a read on current and future trends not just for flu in the U.S. but for several diseases in several countries. Ideally, they hoped to come up with a model that could be “trained” with data from a place where it’s available and then applied to another place where it isn’t.


The researchers decided to focus on seven diseases (cholera, dengue fever, Ebola, HIV/AIDS, influenza, plague and tuberculosis) in nine countries (Brazil, China, Haiti, Japan, Norway, Poland, Thailand, Uganda and the U.S.). They mixed and matched to get models for 14 “location-disease contexts.”

The researchers collected publicly available data on Wikipedia searches between March 2010 and February 2014. They zeroed in on articles related to the seven diseases and calculated what fraction of all searches in any given hour were for these articles. The search data didn’t indicate where searches were done, so the researchers used search language as a proxy for country.
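The calculation described above can be sketched in a few lines of Python. This is only an illustration: the records, the article titles, the `DISEASE_ARTICLES` set and the `hourly_share` function are all hypothetical, and the sketch assumes that "all searches" means all requests within the same language edition, which the study may have handled differently.

```python
from collections import defaultdict

# Illustrative counts only, not data from the study.
# Each record: (hour, language edition, article title, requests in that hour).
records = [
    ("2013-01-07T09", "ja", "Influenza", 30),
    ("2013-01-07T09", "ja", "Fever", 10),
    ("2013-01-07T09", "ja", "Tokyo", 160),
    ("2013-01-07T10", "ja", "Influenza", 50),
    ("2013-01-07T10", "ja", "Tokyo", 150),
    ("2013-01-07T09", "pl", "Influenza", 5),
    ("2013-01-07T09", "pl", "Warsaw", 95),
]

# Assumed mapping: the language edition stands in for the country.
LANGUAGE_TO_COUNTRY = {"ja": "Japan", "pl": "Poland"}

# Hypothetical set of articles treated as related to the disease.
DISEASE_ARTICLES = {"Influenza", "Fever"}


def hourly_share(records, language):
    """Return {hour: fraction of that language's requests that hit disease articles}."""
    total = defaultdict(int)
    related = defaultdict(int)
    for hour, lang, article, requests in records:
        if lang != language:
            continue
        total[hour] += requests
        if article in DISEASE_ARTICLES:
            related[hour] += requests
    return {hour: related[hour] / total[hour] for hour in sorted(total)}


for language, country in LANGUAGE_TO_COUNTRY.items():
    print(country, hourly_share(records, language))
```

Dividing by the hourly total, rather than using raw counts, keeps the signal from rising and falling simply because overall Wikipedia traffic does.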

Then they used official disease incidence reports to see whether the patterns of searches predicted current and future disease spread in real life.

Their models had predictive value for eight of their 14 location-disease combinations, as measured by r-squared, a statistic that ranges from 0 to 1 (the closer the value is to 1, the better the correlation between the model’s data and real-life data). For instance, when it came to predicting the spread of the flu in Japan seven days in the future, the Wikipedia searches scored an r-squared of 0.92. For forecasting cases of dengue fever in Brazil two weeks out, the r-squared was 0.77, and for TB cases in Thailand a month in the future, the r-squared was 0.69.
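To make the idea of scoring a forecast with r-squared concrete, here is a minimal Python sketch. It is not the researchers' model: it simply computes the squared Pearson correlation between a search series and the case counts reported a fixed number of periods later. The function names and the numbers are made up for illustration.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


def lagged_r_squared(search_share, cases, lag):
    """Compare searches in period t with cases reported in period t + lag."""
    if len(search_share) != len(cases):
        raise ValueError("series must have the same length")
    if lag < 0 or lag > len(cases) - 2:
        raise ValueError("lag must leave at least two overlapping periods")
    xs = search_share[: len(search_share) - lag]
    ys = cases[lag:]
    return r_squared(xs, ys)


# Made-up weekly series in which cases trail searches by exactly one week.
search_share = [0.10, 0.20, 0.40, 0.30, 0.10, 0.05]
cases = [50, 100, 200, 400, 300, 100]

for lag in (0, 1, 2):
    print(lag, round(lagged_r_squared(search_share, cases, lag), 3))
```

In this toy data the score is highest at a lag of one week, which is the kind of pattern that would suggest searches run ahead of official case counts.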

The most successful cases involved forecasts for flu and dengue fever, and they all had some things in common, the researchers noted. Both diseases are seasonal, so people who are on the lookout for them may do online research in the weeks before the viruses arrive.

More important, both diseases have short incubation periods that are measured in days. People may be seeing their soon-to-be infectors (or those who are a few degrees removed) coming down with a bug and responding by firing up Wikipedia.

The least successful cases had some things in common too, according to the study: The rates at which people become infected with some diseases are either too stable (such as HIV in Japan) or too low (such as plague in the U.S.) for meaningful patterns to emerge from the data.

Another problem was that some signals were overwhelmed by noise. For instance, the model for Ebola in Uganda and the Democratic Republic of Congo failed because most of the people searching for Ebola information on Wikipedia were not actually in those countries, and most people in those countries didn’t have good access to the Internet. The story was similar for the case of cholera outbreaks in Haiti.

Apparently, Google has figured out some of these things already, because it uses proprietary information from the words typed into its search engine to predict near-term outbreaks of flu and dengue fever, but not other diseases. Both of these forecasting tools work well, the Los Alamos researchers acknowledge, but they argue that their model has more potential because it’s based on data from Wikipedia that is available to anyone.

“Our Wikipedia-based approach is sufficiently promising to explore in more detail,” they concluded.
