VISION
We set out to predict the incidence of influenza using online search data and local weather data. Inspired by Google’s original attempt to predict flu cases from its search trends, we wanted to improve on its results. Google’s model consistently overestimated the number of flu cases. Possible explanations that have been offered include the use of search terms that merely correlated with flu season without being directly related to flu, and a failure to account for autocomplete suggestions. To pursue this goal, we selected Connecticut as the region of interest because it has well-maintained health and weather data. Furthermore, the state has a large enough population to provide sufficient data for modeling, yet is small enough geographically that weather is roughly uniform across the state. We modeled weekly flu cases with flu-related search terms and weather data combined as predictors.
Weather data was selected because, according to the literature, low temperatures and low absolute humidity contribute to an increased number of flu cases. Google search trends related to flu correlate well with flu cases, as people often “Google” their symptoms or medication when they are ill, frequently before seeking treatment. Unlike Google, we decided to include only flu-specific terms.
DATA/CLEANING/EDA
Weather (NOAA) and flu incidence (CDC/WHO) data were collected for the state of Connecticut from 2010-2018. Additionally, Google search trends data on flu-related search terms were collected for Connecticut.
Flu incidence and Google search trends data were both in weekly format. Weather data was in daily and hourly format, and thus weekly weather data was estimated by taking weekly medians of daily or hourly values.
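The weekly aggregation described above can be sketched in pandas. This is only an illustration under stated assumptions: the column names (`date`, `temp_c`) and the toy values are hypothetical, and ISO week numbering is used for simplicity, which may differ from the week definition used in the CDC data.

```python
import pandas as pd

def weekly_median(daily: pd.DataFrame, date_col: str = "date") -> pd.DataFrame:
    """Collapse daily (or hourly) rows to one median row per year/week."""
    df = daily.copy()
    df[date_col] = pd.to_datetime(df[date_col])
    iso = df[date_col].dt.isocalendar()  # ISO year/week: an assumption
    df["year"] = iso["year"].astype(int)
    df["week"] = iso["week"].astype(int)
    value_cols = [c for c in df.columns if c not in (date_col, "year", "week")]
    return df.groupby(["year", "week"], as_index=False)[value_cols].median()

# Hypothetical toy data: two full weeks of daily temperatures.
daily = pd.DataFrame({
    "date": pd.date_range("2017-01-02", periods=14, freq="D"),
    "temp_c": [1, 2, 3, 4, 5, 6, 100, -3, -2, -1, 0, 1, 2, 3],
})
weekly = weekly_median(daily)
print(weekly)
```

The median keeps a single extreme reading (the 100 in the first week) from dominating the weekly value, which is one reason to prefer it over the mean here.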
All data was merged on week number and year. A new season feature was also created which defined each flu season as week 40 of the current year to week 39 of the following year. All missing values in the datasets were imputed with the values of the previous week.
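A minimal sketch of the merge, season labeling, and imputation steps follows. The column names and toy values are hypothetical, and labeling each season by the year in which it starts is an assumption; the report only states the week 40 to week 39 boundary.

```python
import numpy as np
import pandas as pd

KEYS = ["year", "week"]

def add_season(df: pd.DataFrame) -> pd.DataFrame:
    """Label each row with its flu season (week 40 through week 39)."""
    out = df.copy()
    # Assumption: a season is named after the year in which it starts.
    out["season"] = np.where(out["week"] >= 40, out["year"], out["year"] - 1)
    return out

def merge_weekly(flu, trends, weather):
    """Merge the three weekly tables, then fill gaps from the previous week."""
    merged = flu.merge(trends, on=KEYS, how="left").merge(weather, on=KEYS, how="left")
    merged = merged.sort_values(KEYS).reset_index(drop=True)
    merged = merged.ffill()  # carry the previous week's value forward
    return add_season(merged)

# Hypothetical toy data; the weather table is missing week 40 of 2016.
flu = pd.DataFrame({"year": [2016, 2016, 2017], "week": [39, 40, 1], "cases": [3, 5, 40]})
trends = pd.DataFrame({"year": [2016, 2016, 2017], "week": [39, 40, 1], "tamiflu": [4, 6, 55]})
weather = pd.DataFrame({"year": [2016, 2017], "week": [39, 1], "median_temp": [15.0, -2.0]})

merged = merge_weekly(flu, trends, weather)
print(merged)
```

Sorting before the forward fill matters: the fill copies whatever row comes immediately before, so the rows must be in chronological order.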
Initial exploratory data analysis was performed with Pearson and Spearman correlation matrices (Figure 1) to identify variables of interest. Time series plots (Figure 2) were then created to compare each variable of interest with the number of flu cases. Variables that tracked flu cases especially well included median temperature, absolute humidity, tamiflu searches, and flu symptoms searches, among others. These variables were advanced to the initial modeling phase.
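One way to screen candidate variables with both correlation measures is sketched below. The column names and values are hypothetical toy data, not the project's dataset.

```python
import pandas as pd

def correlations_with_target(df: pd.DataFrame, target: str = "cases") -> pd.DataFrame:
    """Pearson and Spearman correlation of every numeric column with the target."""
    numeric = df.select_dtypes("number")
    pearson = numeric.corr(method="pearson")[target].drop(target)
    spearman = numeric.corr(method="spearman")[target].drop(target)
    table = pd.DataFrame({"pearson": pearson, "spearman": spearman})
    # Rank by strength of the monotone relationship, ignoring its sign.
    return table.sort_values("spearman", key=abs, ascending=False)

# Hypothetical toy data: searches grow nonlinearly with cases, temperature falls.
toy = pd.DataFrame({
    "cases": [1, 2, 3, 4, 5],
    "tamiflu": [1, 2, 4, 8, 16],
    "median_temp": [5, 4, 3, 2, 1],
})
table = correlations_with_target(toy)
print(table)
```

Looking at both measures is useful because Spearman picks up monotone but nonlinear relationships (the toy `tamiflu` column) that Pearson understates.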


MODELING
Our model aimed to predict weekly flu cases for the 2017 season: data from the 2010 – 2016 seasons was used as training data and data from the 2017 season was used as testing data.
BASELINE MODEL
Our baseline model simply assigned each week the average number of cases for that week over the 2010 – 2016 seasons. The results were decent: the baseline predicted the start and duration of the season well, but it seriously underestimated the peak number of cases. This is most likely because the 2017 season had a historically high number of cases, which an average over earlier seasons cannot capture. The effect was amplified because one of the seasons in our training set (2011) had a historically low number of cases, pulling the average down even further. This suggests that the median might have been a better choice.
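The baseline, and the median variant suggested above, can be sketched as a per-week aggregate over the training seasons. The column names and case counts below are hypothetical.

```python
import pandas as pd

def weekly_baseline(train: pd.DataFrame, stat: str = "mean") -> pd.Series:
    """Predict each week's cases as an aggregate of that week across seasons."""
    return train.groupby("week")["cases"].agg(stat)

# Hypothetical toy data: three seasons, one unusually low (2011), one high (2012).
train = pd.DataFrame({
    "season": [2010, 2010, 2011, 2011, 2012, 2012],
    "week": [1, 2, 1, 2, 1, 2],
    "cases": [10, 20, 2, 4, 30, 90],
})
mean_pred = weekly_baseline(train, "mean")
median_pred = weekly_baseline(train, "median")
print(mean_pred.to_dict(), median_pred.to_dict())
```

Either aggregate is bounded by the values seen in training, so neither can anticipate a season that is higher than every previous one.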
FINAL FEATURE SET
For the models described below, we trimmed our features from the original 31 that we gathered to 11. These 11 were chosen because they had strong correlations with, and non-negligible effects on, flu cases.
- Google search terms: tamiflu, flu, flu symptoms, flu vaccine, flu clinic, flu shot, influenza, cough medicine
- Weather: median temperature, absolute humidity
- Week Number
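Of the features listed above, absolute humidity may need to be derived from temperature and relative humidity. The report does not say how this was done; the sketch below assumes the conversion given in the Carnotcycle post listed under Sources, with temperature in degrees Celsius and relative humidity in percent. The function name is hypothetical.

```python
import math

def absolute_humidity(temp_c: float, rel_humidity_pct: float) -> float:
    """Approximate absolute humidity in grams of water vapor per cubic meter."""
    # Saturation vapor pressure in hPa (Magnus-type approximation).
    saturation_hpa = 6.112 * math.exp(17.67 * temp_c / (temp_c + 243.5))
    # Scale by relative humidity, then convert vapor pressure to density.
    return saturation_hpa * rel_humidity_pct * 2.1674 / (273.15 + temp_c)

ah_mild = absolute_humidity(20.0, 50.0)  # mild, moderately humid day
ah_cold = absolute_humidity(-5.0, 50.0)  # cold day, same relative humidity
print(round(ah_mild, 2), round(ah_cold, 2))
```

At the same relative humidity, cold air holds far less water vapor, which is why absolute humidity rather than relative humidity is the quantity of interest for flu seasonality.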
ADVANCED MODELS
We considered three more advanced models: linear regression, Bayesian ridge regression, and decision tree regression.
Our results show that linear regression is the best of the three models. It had the highest R2 (0.983) and the lowest RMSE (0.184). Bayesian ridge regression had an R2 of 0.959 and an RMSE of 0.285, whereas the decision tree had an R2 of 0.627 and an RMSE of 0.854. These results show that all of our advanced models outperformed our baseline model, but there was a fairly large range in the accuracy of our advanced models.
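A comparison of this kind can be sketched with scikit-learn. This is an illustration on synthetic data, not the project's code: the features, coefficients, and split sizes (7 training seasons and 1 test season of 52 weeks each) are assumptions, and the resulting scores are unrelated to the figures reported above.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

def evaluate_models(X_train, y_train, X_test, y_test):
    """Fit each candidate on the training seasons and score it on the test season."""
    models = {
        "linear": LinearRegression(),
        "bayesian_ridge": BayesianRidge(),
        "decision_tree": DecisionTreeRegressor(max_depth=8, random_state=0),
    }
    scores = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
        scores[name] = {"r2": float(r2_score(y_test, pred)), "rmse": rmse}
    return scores

# Synthetic stand-in data: 8 seasons x 52 weeks, 3 features, linear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(416, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=416)

# Chronological split: first 7 seasons for training, last season for testing.
scores = evaluate_models(X[:364], y[:364], X[364:], y[364:])
for name, s in scores.items():
    print(f"{name}: R2={s['r2']:.3f} RMSE={s['rmse']:.3f}")
```

Splitting by time rather than at random keeps future weeks out of the training set, which matches the train-on-earlier-seasons setup described in this section.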
The depth of the decision tree was set to 8, chosen by comparing R2 and RMSE across various depths. A depth of 8 gave the best balance of high R2, low RMSE, and low complexity.
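A depth search of this kind might look like the sketch below. The report does not say which data the depths were scored on; scoring on a held-out validation set is an assumption here, and the data is synthetic.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

def sweep_depths(X_train, y_train, X_val, y_val, depths=range(1, 13)):
    """Return (depth, R2, RMSE) on the validation set for each tree depth."""
    rows = []
    for depth in depths:
        tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
        pred = tree.fit(X_train, y_train).predict(X_val)
        rmse = float(np.sqrt(mean_squared_error(y_val, pred)))
        rows.append((depth, float(r2_score(y_val, pred)), rmse))
    return rows

# Synthetic stand-in data: one feature with a smooth nonlinear target.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2 * np.pi, size=200)
X, y = x.reshape(-1, 1), np.sin(x)

rows = sweep_depths(X[:150], y[:150], X[150:], y[150:])
for depth, r2, rmse in rows:
    print(f"depth={depth:2d} R2={r2:.3f} RMSE={rmse:.3f}")
```

Scores typically improve quickly at small depths and then flatten, so the shallowest depth near the plateau is a reasonable choice when lower complexity is preferred.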
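The decision tree was the only one of our models that suffered from the same underestimation problem described for the baseline. We again attribute this to the historically high levels of the 2017 season: a regression tree predicts averages of training targets, so it cannot output values above those seen in the training seasons.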
The results for our models are compared in the table and figures below.


FINAL MODEL
We chose the Linear Regression model as our final model since it has the largest R2 value and the smallest RMSE. We can also see in our graphs that the prediction closely matches the actual number of flu cases. It rarely overestimates or underestimates. For Linear Regression, some Google search terms were eliminated from the feature set, as we found that some of these terms reduced the testing accuracy. Thus, only 7 features were used for Linear Regression.
Overall, we made significant progress towards our goal and were able to accurately match the number of flu cases in a given week for the state of Connecticut using Google search queries and weather data. Our work suggests that using only flu-specific search terms and including weather data can lead to better estimates of flu cases than Google’s original method of simply using terms with strong correlations to flu cases, even if they were unrelated.
LIMITATIONS AND FUTURE WORK
Our model can only predict a given week’s number of flu cases from the Google searches and weather of that same week. This means we are currently unable to predict future flu cases, since Google search data and weather data from the future do not exist. However, it may be possible to extend our approach by first forecasting weather conditions and search queries and then using those forecasts to predict the future number of flu cases. Additionally, since there is a slight delay before the CDC releases flu data, we can estimate a given week’s cases before the official figures are released.
Another slight issue in our model is that we used only one weather station in Connecticut as a proxy for the entire state. Despite the state’s small size, there are likely to be slight variations in weather conditions across it.
Since we trained our model only on the years 2010 to 2016 and tested it on 2017, it might not have worked as well had we trained and tested on different years. It is possible that our model only worked well for 2017, although we think that the historically high levels of 2017 may have made that season abnormally hard to predict, so the model might perform better on a more typical year.
In addition to Google search queries, we could also incorporate social media posts, such as tweets and Facebook posts complaining of influenza-like illness, to determine whether they have any ability to predict the number of flu cases in a given time period.
Lastly, Google Trends does not provide absolute search counts; it only provides frequencies normalized over the requested time period. This makes incorporating new data into the model quite difficult, as it would be necessary to regenerate the entire dataset.
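The regeneration problem can be illustrated with a simplified version of this kind of scaling, in which every value is expressed relative to the peak of the requested window on a 0 to 100 scale. The raw counts below are hypothetical, and the real Google Trends pipeline involves additional steps that this sketch ignores.

```python
def trends_style_scale(raw_counts):
    """Rescale counts so the largest value in the window becomes 100."""
    peak = max(raw_counts)
    return [round(100 * count / peak) for count in raw_counts]

# Hypothetical raw weekly search counts (never exposed by Google Trends).
old_window = [8, 20, 40]
new_window = old_window + [80]  # one extra week with a new peak

old_scaled = trends_style_scale(old_window)
new_scaled = trends_style_scale(new_window)
print(old_scaled)  # [20, 50, 100]
print(new_scaled)  # [10, 25, 50, 100]
```

The same three weeks receive different values once a higher week enters the window, so values downloaded for different time ranges are not directly comparable and cannot simply be appended to an existing dataset.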
SOURCES
Fuhrmann, C. (2010), The Effects of Weather and Climate on the Seasonality of Influenza: What We Know and What We Need to Know. Geography Compass, 4: 718-730. doi:10.1111/j.1749-8198.2010.00343.x
Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., and Brilliant, L. (2009), Detecting influenza epidemics using search engine query data. Nature, 457: 1012-1014. doi:10.1038/nature07634
Lazer, D., et al. (2014), The Parable of Google Flu: Traps in Big Data Analysis. Science, 343(6176): 1203-1205. science.sciencemag.org/content/343/6176/1203.full
Carnotcycle – The Classical Blog on Thermodynamics. https://carnotcycle.wordpress.com/2012/08/04/how-to-convert-relative-humidity-to-absolute-humidity/