Summary

VISION

We set out to predict the incidence of influenza using online search data and local weather data. We were inspired by Google’s original attempt to predict flu cases from its search trends, and we wanted to improve on their results. Google’s model consistently overestimated the number of flu cases; the explanations they offered included relying on search terms that merely correlated with flu season without being directly related to the flu, and failing to account for autocomplete suggestions. We selected Connecticut as the region of interest because it has well-maintained health and weather data. Furthermore, the state’s population is large enough to provide sufficient data for accurate modeling, but it is small enough geographically to control for weather across the state. Flu cases were modeled per week, with the combined flu-related search terms and weather data as predictors.

Weather data was selected because, according to the literature, low temperatures and low absolute humidity contribute to an increased number of flu cases. Google search trends related to the flu correlate well with flu cases, as people often “Google” their symptoms or medication when they are ill, frequently before seeking treatment. Unlike Google, we decided to include only flu-specific terms.

DATA/CLEANING/EDA

Weather (NOAA) and flu incidence (CDC/WHO) data were collected for the state of Connecticut from 2010-2018. Additionally, Google search trends data on flu-related search terms were collected for Connecticut.

Flu incidence and Google search trends data were both in weekly format. Weather data was in daily and hourly format, and thus weekly weather data was estimated by taking weekly medians of daily or hourly values.

All data was merged on week number and year. A new season feature was also created, defining each flu season as running from week 40 of one year through week 39 of the following year. All missing values in the datasets were imputed with the previous week’s values.
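As a rough sketch of these wrangling steps (the column names here are assumptions for illustration, not taken from our actual code), the merge, season labeling, and previous-week imputation could look like this in pandas:

```python
import pandas as pd

def build_weekly_frame(flu, trends, weather):
    """Merge the three weekly datasets and add a flu-season label.

    Assumes each frame carries integer 'year' and 'week' columns.
    """
    df = flu.merge(trends, on=["year", "week"]).merge(weather, on=["year", "week"])
    # A season runs from week 40 of one year through week 39 of the next,
    # so weeks 1-39 belong to the season that began the previous year.
    df["season"] = df["year"].where(df["week"] >= 40, df["year"] - 1)
    # Impute missing values with the previous week's value (forward fill).
    return df.sort_values(["year", "week"]).ffill()
```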

Initial exploratory data analysis was performed with Pearson and Spearman correlation matrices (Figure 1) to identify variables of interest. Further visualizations were created for these variables with time series plots (Figure 2) comparing each variable of interest to the number of flu cases. Variables that performed especially well included median temperature, absolute humidity, tamiflu searches, and flu symptoms searches, among others. These variables were advanced to the initial modeling phase.
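A minimal version of this screening step, assuming the merged dataframe uses a hypothetical "cases" column for the target, might be:

```python
import pandas as pd

def correlation_screen(df, target="cases", top_n=10):
    """Rank numeric features by Pearson and Spearman correlation with the target."""
    pearson = df.corr(method="pearson")[target].drop(target)
    spearman = df.corr(method="spearman")[target].drop(target)
    ranked = pd.DataFrame({"pearson": pearson, "spearman": spearman})
    # Sort by absolute Pearson correlation so strong negative predictors
    # (e.g. temperature) rank alongside strong positive ones.
    order = ranked["pearson"].abs().sort_values(ascending=False).index
    return ranked.reindex(order).head(top_n)
```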

 

[Figure 1: Pearson and Spearman correlation matrices]

[Figure 2: Time series plots of variables of interest vs. weekly flu cases]

MODELING

Our model aimed to predict weekly flu cases for the 2017 season by using data from the 2010 – 2016 seasons; data from 2010 – 2016 was used as training data and data from 2017 was used as testing data.

BASELINE MODEL

Our baseline model simply assigned each week the average number of cases for that week over the 2010 – 2016 seasons. The results were decent: it did a good job of predicting the start and duration of the season, but it seriously underestimated the peak number of cases. This is most likely because the 2017 season had a historically high number of cases, which our averaging technique had no way to capture. The effect was magnified by the fact that one of the seasons in our training set (2011) had a historically low number of cases, bringing the average down even further. This suggests that using the median might have been a better approach.
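A sketch of this baseline (with hypothetical column names), including the median variant suggested above, could be:

```python
import pandas as pd

def weekly_baseline(train, test, stat="mean"):
    """Predict each test week's cases as the mean (or median) of that
    week number across the training seasons."""
    week_stat = train.groupby("week")["cases"].agg(stat)
    return test["week"].map(week_stat)
```

Switching `stat` to `"median"` damps the influence of outlier seasons such as 2011.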

FINAL FEATURE SET

We trimmed our features from the original 31 that we gathered down to 11. These 11 were chosen because they had strong correlations with, and non-negligible effects on, flu cases.

  • Google search terms: tamiflu, flu, flu symptoms, flu vaccine, flu clinic, flu shot, influenza, cough medicine
  • Weather: median temperature, absolute humidity
  • Week Number

ADVANCED MODELS

We considered three more advanced models: linear regression, Bayesian ridge regression, and decision tree regression.

Our results show that linear regression is the best model: it had the highest R2 (0.983) and lowest RMSE (0.184) of the three. Bayesian ridge regression had an R2 of 0.959 and an RMSE of 0.285, whereas the decision tree had an R2 of 0.627 and an RMSE of 0.854. All of our advanced models outperformed the baseline, but there was a fairly large range in their accuracy.
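The comparison can be sketched with scikit-learn as below; this reproduces only the overall fit-and-score loop, not our exact hyperparameters:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

def compare_models(X_train, y_train, X_test, y_test):
    """Fit the three candidate regressors and score them on the held-out season."""
    models = {
        "linear": LinearRegression(),
        "bayesian_ridge": BayesianRidge(),
        "decision_tree": DecisionTreeRegressor(max_depth=8, random_state=0),
    }
    scores = {}
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        scores[name] = {"r2": r2_score(y_test, pred),
                        "rmse": np.sqrt(mean_squared_error(y_test, pred))}
    return scores
```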

The depth of the decision tree was chosen to be 8. This was selected by looking at R2 and RMSE for various depths. 8 was found to have the best balance of high R2, low RMSE, and low complexity.

The decision tree was the only one of our advanced models that suffered from the same underestimation problem described for the baseline. We again attribute this to the historically high case counts of the 2017 season.

The results for our models are compared in the table and figures below.

[Table and figures: comparison of model metrics and predictions]

FINAL MODEL

We chose the linear regression model as our final model since it has the largest R2 and the smallest RMSE. We can also see in our graphs that its predictions closely match the actual number of flu cases, rarely overestimating or underestimating. For linear regression, some Google search terms were eliminated from the feature set, as we found that they reduced testing accuracy; thus, only 7 features were used for linear regression.

Overall, we made significant progress toward our goal and were able to accurately match the number of flu cases in a given week for the state of Connecticut using Google search queries and weather data. Our work demonstrates that using only flu-specific search terms and including weather data can lead to better estimates of flu cases than Google’s original method of simply using terms strongly correlated with flu cases, even if they were unrelated.

LIMITATIONS AND FUTURE WORK

Our model can only predict a given week’s number of flu cases from the Google searches and weather of that same week. This means we are currently unable to forecast future flu cases, as we do not have Google search data and weather data from the future. However, it would be possible to adapt our model by first forecasting future weather conditions and search queries and then predicting the number of flu cases from those forecasts. Additionally, since there is a slight delay before the CDC releases flu data, we can estimate a given week’s cases before the official figures are released.

Another slight issue in our model is that we only used one weather station in Connecticut as a proxy for the entire state. Despite its small size, there are likely to be slight variations in weather conditions throughout the state.

Since we only trained our model on the years 2010 to 2016 and tested on 2017, it may not have worked as well with different training and testing years. There is a possibility that our model only worked well for 2017, although we think the historically high case counts of 2017 may have made it abnormally hard to predict, so a more typical year might perform better.

In addition to Google search queries, we could also incorporate social media posts such as tweets and Facebook posts complaining of influenza-like illness to determine whether they have any predictive ability for the number of flu cases in a given time period.

Lastly, Google Trends does not provide absolute number of search terms; it only provides normalized frequencies over the requested time period. This makes incorporating new data into the model quite difficult, as it would be necessary to regenerate the entire dataset.

Sources

Fuhrmann, C. (2010), The Effects of Weather and Climate on the Seasonality of Influenza: What We Know and What We Need to Know. Geography Compass, 4: 718-730. doi:10.1111/j.1749-8198.2010.00343.x

Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski and Larry Brilliant (2009), Detecting influenza epidemics using search engine query data. Nature, 457: 1012-1014. doi:10.1038/nature07634

Lazer, David, et al. (2014), The Parable of Google Flu: Traps in Big Data Analysis. Science, 343(6176): 1203-1205. science.sciencemag.org/content/343/6176/1203.full.

Carnotcycle – The Classical Blog on Thermodynamics. https://carnotcycle.wordpress.com/2012/08/04/how-to-convert-relative-humidity-to-absolute-humidity/

Machine Learning Blog Post

Train/Test Split

In order to split our data into train and test sets, we defined each flu season as running from week 40 of one year through week 39 of the subsequent year. This gives us 8 flu seasons:

  • 2010 Week 40 – 2011 Week 39 (Train)
  • 2011 Week 40 – 2012 Week 39 (Train)
  • 2012 Week 40 – 2013 Week 39 (Train)
  • 2013 Week 40 – 2014 Week 39 (Train)
  • 2014 Week 40 – 2015 Week 39 (Train)
  • 2015 Week 40 – 2016 Week 39 (Train)
  • 2016 Week 40 – 2017 Week 39 (Train)
  • 2017 Week 40 – 2018 Week 39 (Test)

The first 7 seasons were chosen as our training set and the last season as our testing set. We believed this split was better for this model than a random split because it keeps entire seasons intact: the training set contains both off-season weeks with no flu cases and peak weeks with many, so both of these trends are represented.
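Assuming a "season" column labeled as described above (an assumed name, not necessarily the one in our code), the split is a simple mask:

```python
import pandas as pd

def season_split(df, test_seasons=(2017,)):
    """Split a weekly dataframe chronologically by flu season.

    Assumes a 'season' column where season s covers week 40 of year s
    through week 39 of year s + 1.
    """
    test_mask = df["season"].isin(test_seasons)
    return df[~test_mask], df[test_mask]
```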

Features

We incorporated the 11 features listed below into our model, as they were determined to improve our predictions:

  1. Median Temperature (median_t)
  2. Absolute Humidity (absolute_humidity)
  3. Google Searches for Tamiflu (tamiflu)
  4. Google Searches for Flu Symptoms (flu_symptoms)
  5. Google Searches for Flu (flu)
  6. Google Searches for Influenza (influenza)
  7. Google Searches for Flu Vaccine (flu_vaccine)
  8. Google Searches for Flu Clinic (flu_clinic)
  9. Google Searches for Flu Shot (flu_shot)
  10. Google Searches for Cough Medicine (cough_medicine)
  11. Week Number (week)

Normalization

We did not perform any normalization on the feature columns – we found that normalization made all of our predictions worse.

Our final models predicted the number of weekly flu cases in hundreds rather than raw counts – we found that shrinking the scale of the target improved the predictions of some of our models.

Baseline Model

Our initial baseline model simply computed the average number of flu cases for each week from 2010 to 2016 and used these values as the prediction for the corresponding week of 2017. [Figure: baseline predictions vs. actual 2017 weekly flu cases]

The RMSE of our baseline model was 103.718 cases (1.037 in hundreds of cases) with an R2 score of 0.451. As you can see, this baseline model does a fair job of predicting the time frame of flu cases but a poor job of predicting the height of the spike. It underestimates the number of flu cases; nevertheless, it is a good baseline for our problem, as seen in the graph above.

Model Selection

Several models were explored to see which had the best prediction accuracy; all were regression models. The primary benchmark for model accuracy was comparison between the candidate models, with comparison to the baseline model as a secondary benchmark. The models tested were linear regression, Bayesian ridge regression, and decision tree regression.

Final models:

  • Baseline
  • Linear
  • Decision tree (depth = 8)
  • Bayesian ridge regression

Linear Regression Model

For the linear regression model, we removed four of the feature columns (flu_clinic, flu_shot, flu_vaccine, and cough_medicine), as we found that removing them improved our predictions. However, keeping these columns improved predictions for the other regressors, which is why we still included them in our overall feature set. This model gave us an R2 of 0.983 and an RMSE (in hundreds of cases) of 0.184.

[Figure: linear regression predicted vs. actual weekly flu cases]

[Figure: distribution of predicted vs. actual flu cases]

Based on the figures above, the results of our linear regression model are fairly impressive: the predicted weekly flu cases closely match the true weekly flu cases, capturing both the height and the timing of the spike. The model also predicts the distribution of the number of flu cases well.

Decision tree

We used a decision tree with depth = 8 to predict the data. This depth was chosen because it offered the best balance of high R2, low RMSE, and low complexity; a graph of this trade-off can be found below. The results were the worst of all the non-baseline models we considered.
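The depth selection can be sketched as a simple sweep (hyperparameters other than depth left at scikit-learn defaults, which may differ from what we actually used):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

def sweep_tree_depths(X_train, y_train, X_test, y_test, depths=range(2, 16)):
    """Score a decision tree at each depth so R2/RMSE can be plotted against depth."""
    results = []
    for depth in depths:
        tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
        pred = tree.fit(X_train, y_train).predict(X_test)
        results.append((depth, r2_score(y_test, pred),
                        np.sqrt(mean_squared_error(y_test, pred))))
    return results
```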

As you can see in the last figure below, this model failed to predict the height of the actual flu-case peak, underestimating it instead. It also failed to capture the spread of cases along the y-axis of the violin plot, further evidence of this underestimation.

[Figure: R2 and RMSE vs. decision tree depth]

[Figure: decision tree predicted vs. actual weekly flu cases]

[Figure: violin plot of predicted vs. actual case distributions]

Bayesian Ridge Regression

Bayesian ridge regression was the second-best predictor in terms of RMSE, and its performance was comparable to the linear model’s: its RMSE was 0.285 versus the linear model’s 0.184, by far the closest result to the linear model. Its R2 (0.959), however, was lower than the linear model’s, though still well above those of the decision tree and the baseline.

In the second figure below, we can see that this model slightly overestimates the number of flu cases. It matches the week numbers of the flu spikes well but is consistently above the actual line. The overestimation is also visible in the violin plot, where the Bayesian ridge model overstates the spread along the y-axis.

[Figure: Bayesian ridge predicted vs. actual weekly flu cases]

[Figure: violin plot of predicted vs. actual case distributions]

Metrics

Model            R2      RMSE (Hundreds of Cases)
Baseline         0.451   1.037
Linear           0.983   0.184
Decision Tree    0.627   0.854
Bayesian Ridge   0.959   0.285

Final Machine Learning Design

Based on the results in the table above, we chose the linear regression model as our final model, as it had the R2 closest to 1 and the smallest RMSE. We aimed to minimize error in our optimal model.

Exploratory Statistics

After wrangling the data, each of us spent time on individual exploratory data analysis. We attempted to identify important features through correlation coefficients and statistical plots. Our main visualizations were time-series plots, correlation matrices, and violin plots.

Our initial analysis included plotting the weekly number of flu cases, weekly median temperatures, and weekly absolute humidity for each year. As you can see in the three figures below, there is a clear seasonality associated with the flu, temperature, and humidity.

Figure 1a. Number of Flu Cases in Connecticut for Each Week of the Year

Figure 1b. Median Temperature in Connecticut for Each Week of the Year

Figure 1c. Absolute Humidity (g/m³) in Connecticut for Each Week of the Year

 

Temperature and absolute humidity peak in opposite weeks from the flu, suggesting that colder, drier weather may contribute to the spread of flu during these weeks. Additionally, since we only consider data from 2010-2018, we do not account for the effects of climate change in our model, as median temperature does not appear to increase strongly over these years.

Additionally, we plotted histograms of the number of flu tests and positive cases, shown below. For most weeks, fewer than 250 specimens were tested for flu, and less than 5% of these specimens were positive. However, several weeks had a much larger percent-positive value, peaking at around 50%.

Figure 2a and 2b. Histogram of Total Number of Flu Test Specimens (left) and Histogram of Percent Positive Specimens (right)

We constructed a Pearson correlation coefficient matrix to compare the relationships between each variable of interest in our model. This is shown below, with an accompanying variable legend.

Variable legend

  • Cases: Weekly number of influenza cases in Connecticut, estimated by multiplying Total.specimens by Percent.positive
  • Total.prcp: Cumulative weekly precipitation in inches.
  • Median_tmax: Median weekly high temperature in Fahrenheit
  • Median_tmin: Median weekly low temperature in Fahrenheit
  • Median_t: Median weekly average temperature in Fahrenheit. Obtained by averaging median_tmax and median_tmin
  • Week: Week number that data was collected
  • Year: Year that data was collected
  • Humidity: Median weekly relative humidity
  • Absolute_humidity: Median weekly absolute humidity
  • Rate_x_y: Rate of laboratory-confirmed influenza hospitalizations in Connecticut per 100,000 persons, for people in the age category of x to y years old.

Figure 3. Pearson Correlation Coefficient Matrix

As we can see in Figure 3, there is a strong positive correlation between the number of weekly influenza cases and the weekly frequency of Google searches for tamiflu, flu symptoms, and influenza, with correlations of 0.876, 0.866, and 0.741, respectively. There is also a significant negative correlation (-0.538) between the number of weekly influenza cases and median weekly temperature.

Figure 4. Spearman Correlation Coefficient Matrix

The variables that were strongly positively correlated in the Pearson correlation coefficient matrix were less strongly positive in this figure, whereas the negatively correlated variables became stronger under the Spearman coefficient. This may mean that temperature and humidity have a monotonic, but not necessarily linear, relationship with the number of flu cases.

Figure 5.

Figure 6.

Figure 7.

Figure 8.

We constructed scatter plots of variables that had either a fairly positive or negative Pearson correlation coefficient to better illustrate these linear relationships.

Since there is a strong correlation between flu-related Google search trends and weekly influenza cases, we created time-series plots to show the relationships between these variables:

Figure 9.

Figure 10.

Figure 11.

We also constructed a time-series plot of temperature and weekly influenza cases, as shown below.

Figure 12.

Figure 12 shows that the number of weekly flu cases generally peaks at the same time that temperature reaches its lowest values. However, the depth of the temperature troughs does not track the height of the flu peaks.

Additionally, we created violin plots of cases per season, cases per month, several of the search trends per month, and median temperature per month. A given year’s season is defined as week 40 of that year through week 39 of the following year. The violin plots gave us a good idea of the variation of important features within and between seasons and months.

Important correlations and trends are revealed here. There is an enormous amount of variation in the number of cases between and within seasons. The distribution of cases per month shows that the typical month has very few cases, but a few months can have an enormous number.

It is also evident that 2011 and 2017 were fairly abnormal years as far as cases go. 2011 was historically low and 2017 was historically high. The plots show that the number of cases in 2017 was driven up primarily by two months: January and February. Essentially every month in 2011 had an extremely low number of cases. 2011 tends to drive down the values in all plots. There were very few cases, few tests came back positive, and few people were searching for flu remedies (Tamiflu). It is interesting to note that the search trends for “flu vaccine” were not as significantly affected by this year. A likely explanation of this is that “flu vaccine” tends to be searched for before flu season starts. This can also be seen in the above search trend plots. Therefore, searches would have started before it became evident that the 2011 flu season was mild. This does not happen with the searches for “Tamiflu,” likely because people search for remedies once the flu season has begun.

There is also not much variation in monthly median temperature between seasons, which is an expected outcome.

Figure 13. Number of reported flu cases per month. Distribution is over each season.

Figure 14. Number of reported flu cases per season. Distribution is over each month.

Figure 15. Median temperature each month. Distribution is over each season.

Figure 16. Percent of tested samples that were positive for flu. Distribution is over each month.

Figure 17. Search trends for “flu vaccine” each month. Distribution is over each season.

Figure 18. Search trends for “Tamiflu” each month. Distribution is over each season.

Data Acquisition and Cleaning

WRANGLING

Our data was acquired from several sources: Google Trends, Centers for Disease Control and Prevention (CDC) reports, and the National Oceanic and Atmospheric Administration (NOAA). All data covered 2010 to 2018 and was downloaded in csv format. From Google Trends, we gathered weekly Connecticut data on searches for cough medicine, flu clinics, flu shots/vaccinations, flu/influenza, flu medicine (oseltamivir, relenza, tamiflu, zanamivir), and flu symptoms.

Data collected from the United States World Health Organization (WHO) Collaborating Laboratories and National Respiratory and Enteric Virus Surveillance System (NREVSS) laboratories was used for the CDC’s State Level Outpatient Illness and Viral Surveillance. It reports the weekly number of specimens tested and percent positive rate for influenza in Connecticut.

The CDC also has an Influenza Hospitalization Surveillance Network (FluSurv-NET) which identifies weekly laboratory-confirmed influenza hospitalizations in multiple states including Connecticut. It contains the rate of hospitalizations per 100,000 in the total population and also by age category (0-4, 5-17, 18-49, 50-64, 65+).

Information on the number of vaccinations performed up to a given week and vaccine effectiveness were gathered from the CDC’s FluVaxView website which estimates the influenza vaccination coverage nationally within the United States.

Weather data was gathered from the National Oceanic and Atmospheric Administration’s Climate Data Online (CDO) portal: https://www.ncdc.noaa.gov/cdo-web/datasets, which hosts weather and climate data from various US weather stations. Data from the Hartford-Brainard Field station was used as a proxy for Connecticut’s overall weather from 2010 to 2018. Two csv datasets from this station were used: one containing daily summaries of temperature and precipitation, and one containing more detailed climatological data collected on an hourly basis.

CLEANING GOOGLE TRENDS DATA

One challenging aspect of using Google Trends for weekly search data is that weekly data cannot be obtained in a single csv for a time period longer than 5 years. Additionally, Google Trends does not give the user an absolute number of searches for a particular time period; instead, it provides the relative frequency of each search term over the selected period as a value from 0 to 100. Thus, it is impossible to compare frequency values from different time periods without additional information.

To get Google Trends data over the full 2010-2018 time period, weekly trends data was initially downloaded in two separate csv files: one containing 2010-2015 and one containing 2016-2018. The data was downloaded separately for each search term and then merged on year and week number.

Monthly trends data for the full 2010-2018 time period was compiled by downloading csv files of monthly trends for each search term and merging them on month and year. The monthly data was then used to stitch together the two separate weekly files so that values from each could be properly compared. The intuition is that, for any given month, the average weekly search frequency in that month should equal the monthly search frequency. The weekly frequencies were therefore adjusted according to the following formula:

new weekly frequency = weekly frequency × (monthly frequency / average weekly frequency in that week’s month)

The new weekly frequency entries are the final search frequencies used for EDA and prediction.
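A pandas sketch of this adjustment, assuming hypothetical 'year'/'month'/'week' columns on both tables (our actual column names may differ):

```python
import pandas as pd

def rescale_weekly_trends(weekly, monthly, term):
    """Put two weekly Google Trends downloads on a common scale via monthly data.

    weekly has columns ['year', 'month', 'week', term]; monthly has
    ['year', 'month', term]. Each weekly value is multiplied by
    (monthly frequency) / (average weekly frequency in that month).
    """
    week_mean = (weekly.groupby(["year", "month"])[term]
                 .mean().rename("week_mean").reset_index())
    out = (weekly.merge(week_mean, on=["year", "month"])
           .merge(monthly.rename(columns={term: "month_freq"}),
                  on=["year", "month"]))
    out[term] = out[term] * out["month_freq"] / out["week_mean"]
    return out[["year", "month", "week", term]]
```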

CLEANING WHO-NREVSS DATA

Data was initially reported by region, year, and week. The two datasets were cleaned separately before merging. Extraneous columns, such as flu strain, were removed. Missing values were reported as an X in the given field, so these were changed to NA. They were not removed or imputed immediately because we wanted to perform exploratory data analysis before deciding how to handle them. Only data for Connecticut was kept. Following this, the two datasets were merged on year and week.

CLEANING FLU-SURV-NET DATA

Only columns of interest (year, week, age range, and weekly rate) were kept. Data was missing for several weeks for certain age groups, so when merged, these entries were replaced with NaN.

CLEANING NOAA DATA

Relevant fields were selected from the raw dataset: date, daily precipitation (inches), and daily minimum and maximum temperature (℉). New columns for week and year were derived from the date field. The data was then grouped by year and week to calculate weekly total precipitation and weekly median minimum, maximum, and overall temperature. Hourly relative humidity (%) was extracted from another NOAA dataset of hourly readings at the same weather station; week and year were extracted from its date-time field in the same manner. The weekly median relative humidity was then calculated by grouping on year and week. The cleaned weather data was merged into the cleaned WHO-NREVSS data on year and week, with missing entries marked as NA.
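The daily-to-weekly aggregation might look like the following; here the week number is derived from ISO calendar weeks, which is one plausible reading of "calculating the week from the date field" (our original code may have used a different convention):

```python
import pandas as pd

def weekly_weather(daily):
    """Aggregate daily NOAA records (columns 'date', 'prcp', 'tmax', 'tmin')
    into weekly totals and medians."""
    df = daily.copy()
    df["date"] = pd.to_datetime(df["date"])
    iso = df["date"].dt.isocalendar()
    df["year"] = iso["year"]
    df["week"] = iso["week"]
    weekly = df.groupby(["year", "week"]).agg(
        total_prcp=("prcp", "sum"),
        median_tmax=("tmax", "median"),
        median_tmin=("tmin", "median"),
    ).reset_index()
    # Overall temperature is the average of the median high and median low.
    weekly["median_t"] = (weekly["median_tmax"] + weekly["median_tmin"]) / 2
    return weekly
```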

ADDITIONAL FEATURES

All datasets were merged on year and week, and additional features were added to the merged dataset. These included weekly cases, calculated as the number of laboratory tests multiplied by the percent of positive tests, and absolute humidity, calculated as a function of relative humidity (rh) and temperature in Celsius (T). Absolute humidity was chosen as the better measure because it quantifies the moisture in the air regardless of temperature, unlike relative humidity.

absolute humidity (g/m³) = (6.112 × e^(17.67·T / (T + 243.5)) × rh × 2.1674) / (273.15 + T)

https://carnotcycle.wordpress.com/2012/08/04/how-to-convert-relative-humidity-to-absolute-humidity/
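The conversion from the carnotcycle page linked above can be written as a small function, where rh is relative humidity in percent and temp_c is temperature in degrees Celsius:

```python
import math

def absolute_humidity(rh, temp_c):
    """Absolute humidity in g/m^3 from relative humidity (%) and temperature (C),
    using the approximation from the carnotcycle post linked above."""
    return (6.112 * math.exp(17.67 * temp_c / (temp_c + 243.5))
            * rh * 2.1674) / (273.15 + temp_c)
```

For example, air at 25 °C and 50% relative humidity holds roughly 11.5 g/m³ of water vapor.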

 

Project Introduction

The primary goal of this project is to predict the number of influenza cases in the state of Connecticut. This particular state was chosen due to its small size and the availability of its flu data compared to other states. We are focused on building a model that incorporates factors such as weather conditions, Google search trends, vaccination rates, and vaccine effectiveness to predict the weekly number of flu cases in Connecticut.

Literature Review

1. Do weather conditions affect the number of flu cases?

“The Effects of Weather and Climate on the Seasonality of Influenza: What We Know and What We Need to Know” by Christopher Fuhrmann [1] discusses the potential link between weather and the flu. Influenza has a distinct seasonality, usually ranging from November to March in the Northern Hemisphere. This suggests that certain weather factors may affect viral transmission, host susceptibility, and virulence. According to Fuhrmann, temperature and humidity play a strong role in the transmission of flu. Humidity affects the size of viral respiratory particles, with dry air allowing small droplets to remain airborne for long periods. Breathing cold air hinders the body’s ability to filter out pathogens in the nasal passages and upper respiratory tract. Therefore, we have chosen to include temperature and humidity in our model.

2. How can Google search trends predict the flu?

In 2009, researchers at Google published a paper titled “Detecting Influenza Epidemics Using Search Engine Query Data” [2] with the claim that search queries are a valuable source of information about health trends, since many Americans search for medical problems online. A linear model was used to relate the log-odds of an influenza-like illness (ILI) physician visit to the log-odds of an ILI-related search query:

logit(P) = β0 + β1 · logit(Q) + ε

P is the percentage of ILI physician visits, Q is the ILI-related query fraction, β0 is the intercept, β1 is the coefficient, and ε is the error term. Different queries were tested to identify which ones most accurately predicted CDC ILI data, with 45 queries ultimately chosen for the linear model. However, this model failed spectacularly in later years by consistently overestimating the prevalence of flu [3]. It was prone to overfitting on seasonal terms unrelated to the flu, such as “high school basketball”, highlighting terms that correlated strongly with the flu by chance. Additionally, the suggested-search feature caused certain search terms to become more prevalent, which greatly affected Google’s algorithm. Therefore, we have only chosen search terms directly related to influenza and have avoided terms likely to be inflated by search suggestions.
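The published log-odds model can be sketched with a plain least-squares fit in logit space; this is an illustration of the equation above, not a reconstruction of Google's actual pipeline:

```python
import numpy as np

def logit(p):
    """Log-odds transform."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1 - p))

def fit_ili_model(q, p):
    """Fit logit(P) = beta0 + beta1 * logit(Q) + eps by least squares,
    where Q is the ILI-related query fraction and P the ILI visit fraction."""
    beta1, beta0 = np.polyfit(logit(q), logit(p), 1)
    return beta0, beta1

def predict_ili(q, beta0, beta1):
    """Invert the logit to recover the predicted ILI visit fraction."""
    z = beta0 + beta1 * logit(q)
    return 1 / (1 + np.exp(-z))
```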

Methodology

We plan to find a strong relationship between specific weather variables, Google search queries, vaccination rates, and vaccine effectiveness and the number of flu cases within Connecticut. Once we identify this relationship, we plan to predict the number of flu cases in the state for any given week. Data cleaning, munging, and merging will be performed in R and Python. Python packages such as pandas, scikit-learn, matplotlib, and seaborn will be used to apply appropriate machine learning algorithms and graph our results.

After collecting and cleaning our data, we plan to perform exploratory statistics to identify which variables are significantly correlated with the number of flu cases. A combination of these significant variables will later be applied in building our prediction model.

Links

Git Repo

Contact

Spencer Boyum

Sean Flannery

Francesca Lim

Jason Terry

References

1. Fuhrmann, C. (2010), The Effects of Weather and Climate on the Seasonality of Influenza: What We Know and What We Need to Know. Geography Compass, 4: 718-730. doi:10.1111/j.1749-8198.2010.00343.x

2. Jeremy Ginsberg, Matthew H. Mohebbi , Rajan S. Patel , Lynnette Brammer, Mark S. Smolinski and Larry Brilliant (2009), Detecting influenza epidemics using search engine query data. Nature, 457: 1012-1014. doi:10.1038/nature07634

3. Lazer, David, et al. (2014), The Parable of Google Flu: Traps in Big Data Analysis. Science, 343(6176): 1203-1205. science.sciencemag.org/content/343/6176/1203.full.