Train/Test Split
To split our data into train and test sets, we defined each flu season as running from week 40 of one year through week 39 of the following year. This gives us eight flu seasons:
- 2010 Week 40 – 2011 Week 39 (Train)
- 2011 Week 40 – 2012 Week 39 (Train)
- 2012 Week 40 – 2013 Week 39 (Train)
- 2013 Week 40 – 2014 Week 39 (Train)
- 2014 Week 40 – 2015 Week 39 (Train)
- 2015 Week 40 – 2016 Week 39 (Train)
- 2016 Week 40 – 2017 Week 39 (Train)
- 2017 Week 40 – 2018 Week 39 (Test)
The first seven seasons serve as our training set and the last season as our testing set. We believed this chronological split suits the model better than a random split: every training season contains both quiet weeks with few flu cases and peak weeks with many, so both trends are represented in the training set.
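As a concrete illustration, here is a minimal sketch of this season-based split, assuming the weekly data lives in a pandas DataFrame with integer year and week columns (the file name and column names are hypothetical):

```python
import pandas as pd

# Hypothetical weekly dataset with "year", "week", "flu_cases", and the
# feature columns listed in the next section.
df = pd.read_csv("flu_weekly.csv")

def season_start_year(year: int, week: int) -> int:
    """Map a calendar (year, week) to the year its flu season began:
    weeks 40-53 belong to the season starting that year, weeks 1-39
    to the season that began the previous year."""
    return year if week >= 40 else year - 1

df["season"] = [season_start_year(y, w) for y, w in zip(df["year"], df["week"])]

# Seasons 2010-2016 (2010 week 40 through 2017 week 39) form the training
# set; the 2017-2018 season is held out as the test set.
train = df[df["season"].between(2010, 2016)]
test = df[df["season"] == 2017]
```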
Features
We decided to incorporate 11 features into our model, as listed below, since each was found to improve our predictions (a code sketch of the feature list follows the list):
- Median Temperature (median_t)
- Absolute Humidity (absolute_humidity)
- Google Searches for Tamiflu (tamiflu)
- Google Searches for Flu Symptoms (flu_symptoms)
- Google Searches for Flu (flu)
- Google Searches for Influenza (influenza)
- Google Searches for Flu Vaccine (flu_vaccine)
- Google Searches for Flu Clinic (flu_clinic)
- Google Searches for Flu Shot (flu_shot)
- Google Searches for Cough Medicine (cough_medicine)
- Week Number (week)
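In code, the feature set can be captured as a simple list of column names, using the labels given above:

```python
# The 11 feature columns, named as in the list above.
FEATURES = [
    "median_t", "absolute_humidity", "tamiflu", "flu_symptoms", "flu",
    "influenza", "flu_vaccine", "flu_clinic", "flu_shot",
    "cough_medicine", "week",
]
```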
Normalization
We did not normalize the feature columns; in our experiments, normalization made all of our predictions worse.
We did, however, rescale the target: our final predictors estimate the number of weekly flu cases in hundreds rather than the raw count, as shrinking the target values improved the predictions of some of our models.
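A one-line sketch of that target scaling, reusing the train/test frames from the split above (the flu_cases column name is hypothetical):

```python
# Predict hundreds of cases: divide raw weekly counts by 100 before
# fitting; multiply predictions by 100 to report raw case counts.
y_train = train["flu_cases"] / 100.0
y_test = test["flu_cases"] / 100.0
```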
Baseline Model
Our initial baseline model simply computes the average number of flu cases for each week number across the 2010–2016 training seasons and uses those averages as the predictions for the corresponding weeks of the 2017–2018 test season.
The RMSE of our baseline model was 103.718 cases (1.037 in hundreds of cases) with an R2 score of 0.451. The baseline does a fair job of predicting the timing of flu cases but a poor job of predicting the height of the seasonal spike, which it underestimates. Even so, it serves as a reasonable baseline for our problem.
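A minimal sketch of this baseline, continuing with the frames defined earlier:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Average cases for each week number across the seven training seasons.
weekly_mean = train.groupby("week")["flu_cases"].mean()

# Predict each test week with the training average for that week number.
baseline_pred = test["week"].map(weekly_mean)

rmse = np.sqrt(mean_squared_error(test["flu_cases"], baseline_pred))
r2 = r2_score(test["flu_cases"], baseline_pred)
print(f"Baseline: RMSE={rmse:.3f} cases, R2={r2:.3f}")
```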
Model Selection
Several models were explored to see which had the best prediction accuracy; all of them were regression models. The primary benchmark for model accuracy was comparison to null accuracy, with comparison to the baseline model as a secondary benchmark. The models tested were linear regression, Bayesian ridge regression, and decision tree regression; a sketch of the comparison follows the list below.
Final models:
- Baseline
- Linear
- Decision tree (depth = 8)
- Bayesian ridge regression
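A minimal sketch of how this comparison can be run with scikit-learn, reusing the FEATURES list and scaled targets defined earlier:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

models = {
    "Linear": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(max_depth=8, random_state=0),
    "Bayesian Ridge": BayesianRidge(),
}

# Fit each candidate on the training seasons; score on the held-out season.
for name, model in models.items():
    model.fit(train[FEATURES], y_train)
    pred = model.predict(test[FEATURES])
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: R2={r2_score(y_test, pred):.3f}, "
          f"RMSE={rmse:.3f} hundreds of cases")
```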
Linear Regression Model
For the linear regression model, we removed four of the feature columns (flu_clinic, flu_shot, flu_vaccine, and cough_medicine), as we found that removing them improved our predictions. Because keeping these columns improved predictions for other regressors, we still included them in our overall feature set. This model gave us an R2 value of 0.983 and an RMSE (in hundreds of cases) of 0.184.
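A sketch of that reduced feature set and fit, with names carried over from the earlier sketches:

```python
from sklearn.linear_model import LinearRegression

# Columns dropped only for the linear model; the other regressors keep them.
DROPPED = {"flu_clinic", "flu_shot", "flu_vaccine", "cough_medicine"}
linear_features = [f for f in FEATURES if f not in DROPPED]

linear_model = LinearRegression().fit(train[linear_features], y_train)
linear_pred = linear_model.predict(test[linear_features])
```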
[Figures: linear regression predictions vs. true weekly flu cases, and the distribution of weekly case counts.]
Based on the figures above, the results of our linear regression model are fairly impressive: the predicted weekly flu cases closely track the true weekly flu cases, matching both the height and the timing of the seasonal spike. The model also predicts the distribution of weekly case counts well.
Decision Tree
We used a decision tree with depth = 8 to predict the data. This depth was chosen because it gave the best balance of high R2 and low RMSE against low complexity; a graph of this trade-off appears below. Even so, its results were the worst of all the non-baseline models we considered.
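A sketch of the depth sweep behind that choice, scoring each depth on the held-out season:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

# Score trees of increasing depth to find the knee in R2/RMSE vs. complexity.
for depth in range(1, 13):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(train[FEATURES], y_train)
    pred = tree.predict(test[FEATURES])
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"depth={depth:2d}: R2={r2_score(y_test, pred):.3f}, RMSE={rmse:.3f}")
```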
As the last figure below shows, this model fails to predict the height of the actual spike in flu cases, underestimating it instead. It also fails to capture the spread of cases along the y-axis of the violin plot, further evidence that it underestimates the target.
[Figures: decision tree depth vs. R2/RMSE trade-off; decision tree predictions vs. true weekly flu cases; violin plot of the predicted and true case distributions.]
Bayesian Ridge Regression
Bayesian ridge regression was found to be the second-best predictor in terms of RMSE, with performance comparable to the linear model: its RMSE was 0.285 versus the linear model's 0.184, by far the closest of the remaining models. Its R2, however, while high, was still noticeably lower than the linear model's.
In the second figure below, we can see that this model slightly overestimates the number of flu cases: it matches the weeks of the flu spikes well but sits consistently above the actual curve. The violin plot shows the same overestimation, with the Bayesian ridge model's predicted spread along the y-axis exceeding the true spread.
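A sketch of the Bayesian ridge fit on the full 11-column feature set; note that an RMSE of 0.285 hundreds of cases corresponds to roughly 28.5 cases per week:

```python
from sklearn.linear_model import BayesianRidge

# Bayesian ridge uses the full feature set, unlike the trimmed linear model.
ridge = BayesianRidge().fit(train[FEATURES], y_train)
ridge_pred = ridge.predict(test[FEATURES]) * 100  # back to raw case counts
```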
[Figures: Bayesian ridge predictions vs. true weekly flu cases; violin plot of the predicted and true case distributions.]
Metrics
| Model | R2 | RMSE (Hundreds of Cases) |
|---|---|---|
| Baseline | 0.451 | 1.037 |
| Linear | 0.983 | 0.184 |
| Decision Tree | 0.627 | 0.854 |
| Bayesian Ridge | 0.959 | 0.285 |
Final Machine Learning Design
Based on the results in the table above, we chose the linear regression model as our final model: it had the highest R2 (closest to 1) and the lowest RMSE, consistent with our aim of minimizing error.