01: Getting Data

TL;DR

This section is all about getting data for visualization. You will read “live” COVID data from the NYTimes and join the most current date to a collection of USA county centroids. if you haven’t yet installed R and RStudio - or set up your COVID-19-dashboard project - please revisit here

Introduction

To make our Shiny Application we first need data! In this section we are going to build functions to generate datasets that will be used in the application. These include:

A county-level timeseries of the most up to data COVID-19 cases and deaths
A spatial dataset of the county centroids joined to 'todays' COVID-19 counts.

Streaming COVID-19 time series Data

The New York Times archives a daily record of county-level USA COVID-19 cases and deaths in a GitHub repository here.

In this repository the most current, raw, CSV can be accessed at the following URL:

url ='https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'

R is really cool, and lets read this URL directly using read.csv by treating the string as a connection:

# Here we are reading the URL directly using the url variable created above
covid19 = read.csv(url, stringsAsFactors = FALSE)

and look at it using the head() function to print the first few rows:

head(covid19)

##         date    county      state  fips cases deaths
## 1 2020-01-21 Snohomish Washington 53061     1      0
## 2 2020-01-22 Snohomish Washington 53061     1      0
## 3 2020-01-23 Snohomish Washington 53061     1      0
## 4 2020-01-24      Cook   Illinois 17031     1      0
## 5 2020-01-24 Snohomish Washington 53061     1      0
## 6 2020-01-25    Orange California  6059     1      0

That was easy! 😄

Now lets look at the structure (str) of the data:

str(covid19)

## 'data.frame':    209599 obs. of  6 variables:
##  $ date  : chr  "2020-01-21" "2020-01-22" "2020-01-23" "2020-01-24" ...
##  $ county: chr  "Snohomish" "Snohomish" "Snohomish" "Cook" ...
##  $ state : chr  "Washington" "Washington" "Washington" "Illinois" ...
##  $ fips  : int  53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
##  $ cases : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ deaths: int  0 0 0 0 0 0 0 0 0 0 ...

Things look pretty good but lets change our date values to date objects for nicer plotting.

We will also add a new variable called “name” which will be a text string concatenating the county and state names. Here we use the dplyr keyword mutate which allows us to create or transform variables. Since we are using mutate we need to attach the dplyr package.

library(dplyr)

covid19  = covid19 %>% 
  mutate(date = as.Date(date), # Override date with formated date
         name = paste0(county, " County, ", state)) # Make a new text string

## Lets check!
str(covid19)

## 'data.frame':    209599 obs. of  7 variables:
##  $ date  : Date, format: "2020-01-21" ...
##  $ county: chr  "Snohomish" "Snohomish" "Snohomish" "Cook" ...
##  $ state : chr  "Washington" "Washington" "Washington" "Illinois" ...
##  $ fips  : int  53061 53061 53061 17031 53061 6059 17031 53061 4013 6037 ...
##  $ cases : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ deaths: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ name  : chr  "Snohomish County, Washington" "Snohomish County, Washington" "Snohomish County, Washington" "Cook County, Illinois" ...

Great! Now we see date is in fact a date object, and name gives us the combined state-county name.

Now lets us see how the data works using LA County (FIP = 6037) as an example. Here we use the dplyr filter keyword to return rows matching a prescribed condition. Since dplyr was already attached above, we do not need to re-attach the package. This is also a good time to introduce that R lets you access any column of a data.frame using the $. For example, covid19$date would return all dates in our covid19 data.

# Filter the covid19 data to only those records were fips equals 6037
test = filter(covid19, fips == 6037)

# Plot the data
plot(x = test$date, # define the x axis variable using all the dates in test
     y = test$cases, # define the y axis variable
     type  = "l", # we wnat a line "l" plot
     xlab = "Date", ylab = "Cases", # refine the x and y labels
     main = test$name[1]) # define the plot title.

Yikes LA! That is quite a curve….

Getting Spatial Data for the Counties:

There are many ways to get spatial data into R. I have saved the USA low-resolution county geometries here for you to download. Please do so and save it into the data folder folder you created.

Next, we will attach the sf library, read the downloaded RDS file, and plot the geometry attribute of the data!

library(sf) # Attach sf library

counties = readRDS("./data/counties.rds") # read rds data using the path to your resource

plot(counties$geometry) # plot the county geometries

Getting county centroids and joining to COVID data

Great! Now we have a spatial representation of all continental USA counties and a time series of COVID-19 cases and deaths. Lets create a new dataset that joins the most recent daily statistics to the spatial geometries For faster rendering, we will convert the county polygons into centroids using the following steps:

Filter our covid19_data to the the most current (maximum) date
Join it to the counties using the same-named fips variable,
Remove NA values
scale the cases to determine a standardized case count.

today = filter(covid19, date == max(date)) %>% # filter all data to the maximum datae
  left_join(st_centroid(counties), by = 'fips') %>%  # cast county geometries to centroid, and join by fip
  na.omit() %>% # remove NA values
  mutate(size = abs(cases - mean(cases)) / sd(cases)) # Compute a scaled case count
  
# Plot and size (cex) the cnetroid data
plot(today$geometry, cex = today$size)

I know that was a lot, if you understood what was going on in the code great! If you didn’t, that is totally OK too (it just means you haven’t had as much programming experience).

Adding Data Functions to `helpers.R`

Now lets turn the above processes into functions that we can save to our helpers.R file and latter use in our Shiny Application.

You can either type them in or use the copy icon to put these in your helpers.R file

#Load Spatial Data
counties = readRDS("./data/counties.rds")

# Read in COVID-19 Timesries from URL
read_covid19 = function(){
  url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
  read.csv(url, stringsAsFactors = FALSE) %>% 
    mutate(date = as.Date(date), 
           fips = as.numeric(fips), 
           name = paste0(county, " County, ", state))
}

# Join County Data with 
today_centroids = function(counties, covid_data){
  filter(covid_data, date == max(date)) %>% 
    left_join(st_centroid(counties), by = 'fips') %>% 
    na.omit() %>% 
    mutate(size = abs(cases - mean(cases)) / sd(cases)) %>% 
    st_as_sf()
}

Conclusion

Fantastic, now we have methods and data needed to build our Shiny App, and we have useful functions to let us manipulate that data. If you feel lost, again, it is not a problem! Lets move to the next section where we will make interactive graphs of our time series data using the dygraphs package and we will start to see things coming together