After wrangling the data, each of us carried out individual exploratory data analysis. We used correlation coefficients and statistical plots to get an idea of which features were important. Our main visualization methods were time-series plots, correlation matrices, and violin plots.
Our initial analysis plotted the weekly number of flu cases, weekly median temperature, and weekly absolute humidity for each year. As the three figures below show, flu cases, temperature, and humidity all follow a clear seasonal pattern.
Figure 1a. Number of Flu Cases in Connecticut for Each Week of the Year
Figure 1b. Median Temperature in Connecticut for Each Week of the Year
Figure 1c. Absolute Humidity (%) in Connecticut for Each Week of the Year
Temperature and absolute humidity appear to peak at the opposite time of year from flu cases, suggesting that colder weather and lower humidity may contribute to the spread of flu during the peak flu weeks. Additionally, since we only consider data from 2010-2018 and median temperature does not seem to increase strongly over those years, we will not consider the effects of climate change in our model.
We also plotted histograms of the weekly number of flu tests and of the weekly percentage of positive specimens, shown below. In most weeks, fewer than 250 specimens were tested and fewer than 5% of them were positive. However, in several weeks the percent positive value was much larger, peaking at around 50%.
Figure 2a and 2b. Histogram of Total Number of Flu Test Specimens (left) and Histogram of Percent Positive Specimens (right)
We constructed a Pearson correlation coefficient matrix to compare the relationships between each pair of variables of interest in our model. It is shown below, with an accompanying variable legend.
Variable legend
- Tamiflu, flu, influenza, flu_symptoms, oseltamivir, cough_medicine, flu_shot, flu_vaccine, zanamivir, and relenza: Weekly search frequencies from Google Trends.
- Total.specimens: Weekly number of specimens tested, as determined from CDC’s State Level Outpatient Illness and Viral Surveillance dataset.
- Percent.positive: Weekly percentage of specimens that tested positive for influenza, as determined by CDC’s State Level Outpatient Illness and Viral Surveillance dataset.
- Cases: Weekly number of influenza cases in Connecticut, estimated by multiplying Total.specimens by Percent.positive.
- Total.prcp: Cumulative weekly precipitation in inches.
- Median_tmax: Median weekly high temperature in Fahrenheit.
- Median_tmin: Median weekly low temperature in Fahrenheit.
- Median_t: Median weekly average temperature in Fahrenheit, obtained by averaging Median_tmax and Median_tmin.
- Week: Week number in which the data was collected.
- Year: Year in which the data was collected.
- Humidity: Median weekly relative humidity.
- Absolute_humidity: Median weekly absolute humidity.
- All_rate: Rate of laboratory-confirmed influenza hospitalizations in Connecticut per 100,000 persons, as inferred from the Influenza Hospitalization Surveillance Network (FluSurv-NET) dataset.
- Rate_x_y: Rate of laboratory-confirmed influenza hospitalizations in Connecticut per 100,000 persons, for people in the age category of x to y years old.
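The two derived variables in the legend, Cases and Median_t, can be sketched as follows. This is only an illustration: the function names are hypothetical, the sample numbers are made up, and we assume Percent.positive is stored on a 0-100 scale, so the product is divided by 100 to yield a count.

```python
def estimate_cases(total_specimens: float, percent_positive: float) -> float:
    """Estimate weekly cases from specimens tested and percent positive.

    Assumption: percent_positive is on a 0-100 scale (e.g. 5.0 means 5%).
    """
    return total_specimens * percent_positive / 100.0


def median_t(median_tmax: float, median_tmin: float) -> float:
    """Average the weekly median high and low temperatures (Fahrenheit)."""
    return (median_tmax + median_tmin) / 2.0


# One made-up week of data, keyed by the legend's variable names.
week = {
    "Total.specimens": 200,
    "Percent.positive": 5.0,
    "Median_tmax": 40.0,
    "Median_tmin": 26.0,
}
week["Cases"] = estimate_cases(week["Total.specimens"], week["Percent.positive"])
week["Median_t"] = median_t(week["Median_tmax"], week["Median_tmin"])
print(week["Cases"], week["Median_t"])
```

If Percent.positive were instead stored as a fraction between 0 and 1, the division by 100 would have to be dropped.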
Figure 3. Pearson Correlation Coefficient Matrix
As seen in Figure 3, there is a strong positive correlation between the number of weekly influenza cases and the weekly frequency of Google searches for tamiflu, flu symptoms, and influenza. These have correlations of 0.876, 0.866, and 0.741, respectively. There is also a moderate negative correlation (-0.538) between the number of weekly influenza cases and median weekly temperature.
Figure 4. Spearman Correlation Coefficient Matrix
The variables that were strongly positively correlated in the Pearson correlation coefficient matrix show weaker positive correlations in Figure 4, whereas the negative correlations become stronger under the Spearman correlation coefficient. This may mean that temperature and humidity have a monotonic, but not necessarily linear, relationship with the number of flu cases.
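To illustrate why a monotonic but nonlinear relationship yields a stronger Spearman than Pearson coefficient, here is a minimal sketch in plain Python. The temperature and case values are made up for illustration and are not taken from our dataset; the helper functions are our own, not a library API.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up data: cases fall steeply, not linearly, as temperature rises.
temperature = [20, 30, 40, 50, 60, 70, 80]
cases = [400, 150, 60, 25, 10, 4, 2]

print("Pearson: ", pearson(temperature, cases))
print("Spearman:", spearman(temperature, cases))
```

Because the made-up cases decrease strictly as temperature increases, the ranks are perfectly reversed and the Spearman coefficient reaches -1, while the Pearson coefficient stays weaker because the decline is not a straight line. With the data in a pandas DataFrame, `DataFrame.corr(method="spearman")` computes the same kind of matrix directly.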
Figure 5.
Figure 6.
Figure 7.
Figure 8.
We constructed scatter plots of variables that had a fairly strong positive or negative Pearson correlation coefficient to better illustrate these linear relationships.
Since flu-related Google search trends correlate strongly with weekly influenza cases, we created time-series plots to show the relationships between these variables:
Figure 9.
Figure 10.
Figure 11.
We also constructed a time-series plot of temperature and weekly influenza cases, as shown below.
Figure 12.
Figure 12 shows that the number of weekly flu cases generally peaks at the same time that temperature reaches its lowest values. However, the temperature curve does not match the height of the flu peaks.
Additionally, we created violin plots of cases per season, cases per month, several of the search trends per month, and median temperature per month. The 20XX season is defined as running from the 40th week of 20XX to the 39th week of the following year. The violin plots gave us a good sense of the variation of important features within and between seasons and months.
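The season definition (week 40 of one year through week 39 of the next) can be expressed as a small helper. This is a sketch; the function name `flu_season` is hypothetical and the sample weeks are chosen only to show the boundary.

```python
def flu_season(year: int, week: int) -> int:
    """Return the season label for a (calendar year, week number) pair.

    Weeks 40 and later belong to the season named after the current year;
    weeks 1-39 belong to the season that started the previous year.
    """
    if not 1 <= week <= 53:
        raise ValueError(f"week must be between 1 and 53, got {week}")
    return year if week >= 40 else year - 1


# Weeks around the season boundary.
for year, week in [(2017, 39), (2017, 40), (2018, 5), (2018, 39), (2018, 40)]:
    print(year, week, "-> season", flu_season(year, week))
```

Labeling each row this way keeps a single winter peak, which straddles two calendar years, inside one season when grouping for the violin plots.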
These plots reveal several important trends. There is an enormous amount of variation in the number of cases between and within seasons. The distribution of cases per month shows that the typical month has very few cases, but a few months can have an enormous number.
It is also evident that 2011 and 2017 were abnormal years in terms of cases: 2011 was historically low and 2017 was historically high. The plots show that the number of cases in 2017 was driven up primarily by two months, January and February, whereas essentially every month in 2011 had an extremely low number of cases. As a result, 2011 tends to drive down the values in all plots: there were very few cases, few tests came back positive, and few people were searching for flu remedies (Tamiflu). Interestingly, the search trends for “flu vaccine” were not affected as much by this year. A likely explanation is that “flu vaccine” tends to be searched for before flu season starts, which can also be seen in the search trend plots above. Those searches would therefore have started before it became evident that the 2011 flu season was mild. This does not happen with the searches for “Tamiflu,” likely because people search for remedies once the flu season has begun.
Monthly median temperature also varies little between seasons, as expected.
Figure 13. Number of reported flu cases per month. Distribution is over each season.
Figure 14. Number of reported flu cases per season. Distribution is over each month.
Figure 15. Median temperature each month. Distribution is over each season.
Figure 16. Percent of tested samples that were positive for flu. Distribution is over each month.
Figure 17. Search trends for “flu vaccine” each month. Distribution is over each season.
Figure 18. Search trends for “Tamiflu” each month. Distribution is over each season.