Machine Learning Blog Post

Train/Test Split

To split our data into training and test sets, we defined each flu season as running from week 40 of one year through week 39 of the following year. This gives us 8 flu seasons:

  • 2010 Week 40 – 2011 Week 39 (Train)
  • 2011 Week 40 – 2012 Week 39 (Train)
  • 2012 Week 40 – 2013 Week 39 (Train)
  • 2013 Week 40 – 2014 Week 39 (Train)
  • 2014 Week 40 – 2015 Week 39 (Train)
  • 2015 Week 40 – 2016 Week 39 (Train)
  • 2016 Week 40 – 2017 Week 39 (Train)
  • 2017 Week 40 – 2018 Week 39 (Test)

The first 7 seasons form our training set, and the last season is our test set. We believe this split suits the model better than a random split: each training season contains both weeks with few or no flu cases and weeks with many, so both trends are represented in the training data.
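As a rough sketch, here is how such a season-based split might be implemented, assuming a pandas DataFrame df with year and week columns (the names df, year, and week are illustrative):

    import pandas as pd

    def season_start_year(row):
        # Weeks 40-52 belong to the season starting that year;
        # weeks 1-39 belong to the season that began the previous year.
        return row["year"] if row["week"] >= 40 else row["year"] - 1

    # df is assumed to hold one row per (year, week) observation.
    df["season"] = df.apply(season_start_year, axis=1)

    # Seasons starting in 2010-2016 form the training set;
    # the season starting in 2017 is held out for testing.
    train = df[df["season"].between(2010, 2016)]
    test = df[df["season"] == 2017]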

Features

We incorporated the 11 features listed below into our model, each of which was found to improve our predictions:

  1. Median Temperature (median_t)
  2. Absolute Humidity (absolute_humidity)
  3. Google Searches for Tamiflu (tamiflu)
  4. Google Searches for Flu Symptoms (flu_symptoms)
  5. Google Searches for Flu (flu)
  6. Google Searches for Influenza (influenza)
  7. Google Searches for Flu Vaccine (flu_vaccine)
  8. Google Searches for Flu Clinic (flu_clinic)
  9. Google Searches for Flu Shot (flu_shot)
  10. Google Searches for Cough Medicine (cough_medicine)
  11. Week Number (week)
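
For reference, these columns might be collected in code as a simple list (a sketch; the names follow the labels above):

    # The 11 feature columns used throughout.
    FEATURES = [
        "median_t", "absolute_humidity", "tamiflu", "flu_symptoms",
        "flu", "influenza", "flu_vaccine", "flu_clinic",
        "flu_shot", "cough_medicine", "week",
    ]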

Normalization

We did not normalize the feature columns; we found that normalization made all of our predictions worse.

Our final predictors estimate the number of weekly flu cases in hundreds rather than as raw counts; we found that shrinking the scale of the target improved the predictions of some of our models.
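
Concretely, a minimal sketch of assembling the unscaled features and the rescaled target, assuming the weekly counts live in a hypothetical cases column:

    # Features are used as-is (no normalization); the target is
    # converted from raw case counts to hundreds of cases.
    X_train = train[FEATURES]
    X_test = test[FEATURES]
    y_train = train["cases"] / 100.0  # "cases" column name is an assumption
    y_test = test["cases"] / 100.0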

Baseline Model

Our initial baseline model simply computes the average number of flu cases for each week number across the 2010–2016 training seasons and uses those averages as the predictions for the corresponding weeks of the test season.

[Figure: baseline weekly-average predictions vs. actual weekly flu cases]
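A minimal sketch of this baseline, under the same assumptions as the earlier snippets:

    # Average cases (in hundreds) for each week number across the
    # training seasons; each average is the prediction for the
    # matching week of the test season.
    weekly_mean = train.groupby("week")["cases"].mean() / 100.0
    baseline_pred = test["week"].map(weekly_mean)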

The RMSE of our baseline model was 103.718 cases (1.037 in hundreds of cases), with an R2 score of 0.451. As the graph above shows, this baseline does a fair job of predicting when flu cases occur but a poor job of predicting the height of the spike; it underestimates the number of flu cases. It is nevertheless a reasonable baseline for our problem.

Model Selection

We explored several models to see which had the best prediction accuracy; all were regression models. Our primary benchmark for model accuracy was comparison against null accuracy, with the baseline model as a secondary benchmark. The models tested were linear regression, Bayesian ridge regression, and decision tree regression.

Final models:

  • Baseline
  • Linear regression
  • Decision tree (depth = 8)
  • Bayesian ridge regression

Linear Regression Model

For the linear regression model, we removed four of the feature columns: flu_clinic, flu_shot, flu_vaccine, and cough_medicine, as we found that removing them improved our predictions. Because keeping these columns improved the predictions of other regressors, we still included them in our overall feature set. This model gave us an R2 value of 0.983 and an RMSE (in hundreds of cases) of 0.184.
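
A sketch of this model with scikit-learn, reusing the names from the earlier snippets (treat it as illustrative rather than our exact training code):

    from sklearn.linear_model import LinearRegression

    # Columns dropped for the linear model only.
    DROPPED = {"flu_clinic", "flu_shot", "flu_vaccine", "cough_medicine"}
    linear_features = [f for f in FEATURES if f not in DROPPED]

    linreg = LinearRegression()
    linreg.fit(X_train[linear_features], y_train)
    linear_pred = linreg.predict(X_test[linear_features])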

[Figure: linear regression predicted vs. true weekly flu cases]

[Figure: distributions of predicted and true weekly flu cases]

Based on the figures above, the results of our linear regression model are impressive: the predicted weekly flu cases closely track the true weekly flu cases, matching both the height and the timing of the spike. The model also reproduces the distribution of weekly case counts well.

Decision tree

We used a decision tree of depth 8 to predict the data. This depth offered the best balance of high R2 and low RMSE against model complexity; a graph of this trade-off appears below. The results were the worst of all the non-baseline models we considered.
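
A sketch of the chosen tree with scikit-learn (the random_state is our addition, for reproducibility):

    from sklearn.tree import DecisionTreeRegressor

    # Depth 8 balanced high R2 and low RMSE against model complexity.
    tree = DecisionTreeRegressor(max_depth=8, random_state=0)
    tree.fit(X_train, y_train)
    tree_pred = tree.predict(X_test)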

As the figures below show, this model fails to predict the height of the actual spike in flu cases, underestimating it instead. It also fails to capture the spread of cases along the y-axis of the violin plot, further evidence that it underestimates the target.

[Figure: decision tree R2 and RMSE as a function of tree depth]

[Figure: decision tree predicted vs. true weekly flu cases]

[Figure: violin plot of predicted vs. true weekly case distributions]

Bayesian Ridge Regression

Bayesian ridge regression had the second-best RMSE, and its performance was comparable to the linear model's: its RMSE was 0.285 versus the linear model's 0.184, by far the closest to the linear model among the remaining models. Its R2, however, was somewhat lower than the linear model's.
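
A corresponding sketch with scikit-learn's BayesianRidge, again reusing the earlier names; we show the default hyperparameters, which is an assumption on our part:

    from sklearn.linear_model import BayesianRidge

    # Bayesian ridge with scikit-learn's default priors on the
    # weight and noise precisions.
    bayes = BayesianRidge()
    bayes.fit(X_train, y_train)
    bayes_pred = bayes.predict(X_test)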

In the figures below, we can see that this model slightly overestimates the number of flu cases: it matches the weeks of the flu spikes well but sits consistently above the actual line. The violin plot shows the same overestimation, with the Bayesian ridge model overstating the spread along the y-axis.

[Figure: Bayesian ridge predicted vs. true weekly flu cases]

[Figure: violin plot of predicted vs. true weekly case distributions]

Metrics

Model             R2      RMSE (Hundreds of Cases)
Baseline          0.451   1.037
Linear            0.983   0.184
Decision Tree     0.627   0.854
Bayesian Ridge    0.959   0.285
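
Given each model's test-set predictions from the sketches above, these metrics can be reproduced with scikit-learn:

    from sklearn.metrics import mean_squared_error, r2_score

    for name, pred in [("Baseline", baseline_pred),
                       ("Linear", linear_pred),
                       ("Decision Tree", tree_pred),
                       ("Bayesian Ridge", bayes_pred)]:
        rmse = mean_squared_error(y_test, pred) ** 0.5  # hundreds of cases
        print(f"{name}: R2 = {r2_score(y_test, pred):.3f}, RMSE = {rmse:.3f}")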

Final Machine Learning Design

Based on the results in the table above, we chose the linear regression model as our final model: it had the highest R2 and the lowest RMSE, and minimizing error was our aim for the optimal model.
