Data Acquisition and Cleaning

WRANGLING

Our data was acquired from several sources: Google Trends, Centers for Disease Control and Prevention (CDC) reports, and the National Oceanic and Atmospheric Administration (NOAA). All data covered 2010 to 2018 and was downloaded in CSV format. From Google Trends, we gathered weekly data on Connecticut searches for cough medicine, flu clinics, flu shots/vaccinations, flu/influenza, flu medicine (oseltamivir, Relenza, Tamiflu, zanamivir), and flu symptoms.

Data collected from the United States World Health Organization (WHO) Collaborating Laboratories and National Respiratory and Enteric Virus Surveillance System (NREVSS) laboratories feeds the CDC's State-Level Outpatient Illness and Viral Surveillance reports. These report the weekly number of specimens tested and the percent positive rate for influenza in Connecticut.

The CDC also has an Influenza Hospitalization Surveillance Network (FluSurv-NET) which identifies weekly laboratory-confirmed influenza hospitalizations in multiple states including Connecticut. It contains the rate of hospitalizations per 100,000 in the total population and also by age category (0-4, 5-17, 18-49, 50-64, 65+).

Information on the number of vaccinations performed up to a given week and on vaccine effectiveness was gathered from the CDC's FluVaxView website, which estimates influenza vaccination coverage nationally within the United States.

Weather data was gathered from the National Oceanic and Atmospheric Administration's Climate Data Online (CDO) portal (https://www.ncdc.noaa.gov/cdo-web/datasets), which provides weather and climate data from US weather stations. Data from the Hartford-Brainard Field station was used as a proxy for Connecticut's overall weather from 2010 to 2018. Two CSV datasets from this station were used: one containing daily weather summaries of temperature and precipitation, and one containing more detailed climatological data collected on an hourly basis.

CLEANING GOOGLE TRENDS DATA

One challenging aspect of using Google Trends for weekly search data is that weekly data cannot be obtained in a single CSV for a time period longer than five years. Additionally, Google Trends does not give the user an absolute number of searches for a particular time period; instead, it provides the relative frequency of the search term over the selected period as a value from 0 to 100. Thus, frequency values from different time periods cannot be compared without additional information.

To cover the full 2010-2018 time period, Google Trends data was initially downloaded in two separate CSV files: one containing trends data from 2010-2015 and one containing trends data from 2016-2018. The data was downloaded in a separate CSV file for each search term and then merged on year and week number.

Monthly trends data for the full 2010-2018 period was compiled by downloading a CSV file for each search term and merging them on month and year. The monthly trends data was then used to put the two separate weekly files on a common scale so that numbers from each could be properly compared. The intuition was that, for any given month, the average weekly search frequency within that month should equal that month's search frequency. The weekly frequencies were therefore adjusted according to the following formula:

\text{new weekly frequency} = \text{weekly frequency}\cdot \frac{\text{monthly frequency}}{\text{average weekly frequency for month of specified week}}

The new weekly frequency entries are the final search frequencies used for EDA and prediction.
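The rescaling step can be sketched in pandas as follows. This is a minimal illustration on toy data; the column names (`year`, `month`, `week`, `freq`, `monthly_freq`) are assumptions, not the actual exported Trends headers.

```python
import pandas as pd

# Toy stand-in for one search term's weekly and monthly Trends exports
# (the real project used one CSV per term; column names are assumed).
weekly = pd.DataFrame({
    "year":  [2015, 2015, 2016, 2016],
    "month": [1, 1, 1, 1],
    "week":  [1, 2, 1, 2],
    "freq":  [40.0, 60.0, 20.0, 30.0],
})
monthly = pd.DataFrame({
    "year":  [2015, 2016],
    "month": [1, 1],
    "monthly_freq": [80.0, 100.0],
})

# Attach each week's monthly frequency, then rescale so the average
# adjusted weekly value in a month equals that month's frequency.
merged = weekly.merge(monthly, on=["year", "month"])
avg = merged.groupby(["year", "month"])["freq"].transform("mean")
merged["new_freq"] = merged["freq"] * merged["monthly_freq"] / avg
```

After this adjustment, weeks from the 2010-2015 file and the 2016-2018 file sit on the same scale, because both are anchored to the single monthly series.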

CLEANING WHO-NREVSS DATA

Data was initially reported by region, year, and week. The two datasets were cleaned separately before merging. Extraneous columns, such as flu strain, were removed. Missing values were reported as an "X" in the given field, so these were recoded as NA. They were not removed or imputed immediately because we wanted to perform exploratory data analysis before deciding how to handle them. Only data for Connecticut was kept. The two datasets were then merged on year and week.
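A sketch of these cleaning steps, using toy data; the column names (`REGION`, `PERCENT POSITIVE`, `STRAIN`) are assumptions for illustration, not the actual WHO-NREVSS headers.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one WHO-NREVSS file (column names assumed).
df = pd.DataFrame({
    "REGION": ["Connecticut", "Connecticut", "New York"],
    "YEAR":   [2015, 2015, 2015],
    "WEEK":   [1, 2, 1],
    "PERCENT POSITIVE": ["5.2", "X", "4.1"],  # "X" marks a missing value
    "STRAIN": ["A", "B", "A"],                # extraneous column
})

df = df[df["REGION"] == "Connecticut"]  # keep Connecticut only
df = df.drop(columns=["STRAIN"])        # drop extraneous columns
df = df.replace("X", np.nan)            # recode "X" as NA
df["PERCENT POSITIVE"] = df["PERCENT POSITIVE"].astype(float)
```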

CLEANING FLU-SURV-NET DATA

Only the columns of interest (year, week, age range, and weekly rate) were retained. Data was missing for several weeks for certain age groups, so those entries were filled with NaN when merged.
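A left merge onto the full set of weeks produces the NaN fill described above. This is a toy sketch with assumed column names, not the actual FluSurv-NET schema.

```python
import pandas as pd

# Toy FluSurv-NET extract with one missing week (column names assumed).
flusurv = pd.DataFrame({
    "year": [2015, 2015],
    "week": [1, 3],
    "age_range": ["65+", "65+"],
    "weekly_rate": [3.1, 4.2],
})
# Full weekly index to merge against.
base = pd.DataFrame({"year": [2015, 2015, 2015], "week": [1, 2, 3]})

# Weeks with no reported rate come through as NaN.
merged = base.merge(flusurv, on=["year", "week"], how="left")
```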

CLEANING NOAA DATA

Relevant fields were selected from the raw dataset: date, daily precipitation (inches), and daily minimum and maximum temperature (°F). New columns for week and year were derived from the date field, and the data was grouped by year and week to calculate weekly total precipitation and weekly median minimum, maximum, and overall temperature.

Hourly relative humidity (%) was extracted from a second NOAA dataset of hourly observations at the same weather station. Week and year were extracted from its date-time field in the same manner, and the weekly median relative humidity was calculated by grouping on those fields. The cleaned data was then merged into the cleaned WHO-NREVSS data on year and week, with missing entries marked as NA.
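The daily-to-weekly aggregation can be sketched as below. The `PRCP`/`TMIN`/`TMAX` column names follow common NOAA daily-summary conventions but are assumptions here, and the toy dates are illustrative only.

```python
import pandas as pd

# Toy daily summaries (real files came from NOAA's CDO portal;
# column names are assumed).
daily = pd.DataFrame({
    "DATE": pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-12"]),
    "PRCP": [0.1, 0.3, 0.0],     # daily precipitation, inches
    "TMIN": [20.0, 22.0, 15.0],  # daily minimum temperature, deg F
    "TMAX": [35.0, 40.0, 30.0],  # daily maximum temperature, deg F
})

# Derive year and week from the date field.
iso = daily["DATE"].dt.isocalendar()
daily["year"], daily["week"] = iso["year"], iso["week"]

# Weekly total precipitation and median temperatures.
weekly = daily.groupby(["year", "week"]).agg(
    total_precip=("PRCP", "sum"),
    med_tmin=("TMIN", "median"),
    med_tmax=("TMAX", "median"),
).reset_index()
```

The hourly humidity file is handled the same way, with `median` applied to the relative-humidity column.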

ADDITIONAL FEATURES

All datasets were merged on year and week, and additional features were added to the merged dataset. These included weekly cases, calculated as the number of laboratory tests multiplied by the percent positive rate. Additionally, absolute humidity was calculated as a function of relative humidity (rh) and temperature in Celsius (T). Absolute humidity was chosen as the better measure since, unlike relative humidity, it measures the moisture in the air regardless of temperature:

\text{ah} = \frac{6.112 \cdot e^{\frac{17.67\,T}{T + 243.5}} \cdot \text{rh} \cdot 2.1674}{273.15 + T}

https://carnotcycle.wordpress.com/2012/08/04/how-to-convert-relative-humidity-to-absolute-humidity/
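As a quick check, the conversion from the carnotcycle page cited above can be written as a small function (rh in percent, T in degrees Celsius, result in g/m³):

```python
import math

def absolute_humidity(rh, T):
    """Absolute humidity (g/m^3) from relative humidity rh (%) and
    temperature T (deg C), per the carnotcycle formula cited above."""
    return (6.112 * math.exp(17.67 * T / (T + 243.5))
            * rh * 2.1674) / (273.15 + T)
```

At 25 °C and 50% relative humidity this gives roughly 11.5 g/m³, in line with the worked example on the cited page.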

 
