In this lab you will practice data wrangling and visualization skills using real-time COVID-19 data maintained by the New York Times.

Set-up

  1. Create a geog-176A-labs GitHub repository
  2. Clone it to your machine
  3. Create data, docs, and img directories
  4. In docs, create a new .Rmd file called lab-02.Rmd
  5. Populate its YAML with a title, author, subtitle, output type, and theme. For example:
---
title: "Geography 13"
author: "[Mike Johnson](https://mikejohnson51.github.io)"
subtitle: 'Lab 02: COVID-19 Pandemic'
output:
  html_document:
    theme: journal
---

Be sure to associate your name with your personal webpage via a link.


Libraries

You will need a few libraries for this lab. Make sure they are installed and loaded in your Rmd.

  1. tidyverse (data wrangling and visualization)
  2. knitr (make nice tables)
  3. readxl (read excel files)
  4. zoo (rolling averages)
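
A minimal set-up chunk (the install.packages() call is commented out since installation only needs to happen once):

# Run once in the console if any of these are missing:
# install.packages(c("tidyverse", "knitr", "readxl", "zoo"))

library(tidyverse) # data wrangling and visualization
library(knitr)     # nice tables with kable()
library(readxl)    # read Excel files
library(zoo)       # rolling averages with rollmean()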

Data

We are going to practice some data wrangling skills using a real-world dataset about COVID cases curated and maintained by the New York Times. The data have been used to create reports and data visualizations, and are archived in a GitHub repository (https://github.com/nytimes/covid-19-data). Looking at the README in that repository, we read:

“We are providing two sets of data with cumulative counts of coronavirus cases and deaths: one with our most current numbers for each geography and another with historical data showing the tally for each day for each geography … the historical files are the final counts at the end of each day … The historical and live data are released in three files, one for each of these geographic levels: U.S., states and counties.”

For this lab we will use the historical, county-level data, which is stored as an updating CSV at this URL:

https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv

Question 1:

You are a data scientist for the state of California Department of Public Health.

You’ve been tasked with giving a report to Governor Newsom every morning about the most current COVID-19 conditions at the county level.

As it stands, the California Department of Public Health maintains a watch list of counties that are being monitored for worsening coronavirus trends. There are six criteria used to place counties on the watch list:

  1. Doing fewer than 150 tests per 100,000 residents daily (over a 7-day average)

  2. More than 100 new cases per 100,000 residents over the past 14 days…

  3. 25 new cases per 100,000 residents and an 8% test positivity rate

  4. 10% or greater increase in COVID-19 hospitalized patients over the past 3 days

  5. Fewer than 20% of ICU beds available

  6. Fewer than 25% ventilators available

Of these 6 conditions, you are in charge of monitoring condition number 2.

To do this job well, you should set up a reproducible framework to communicate the following in a way that can be updated every time new data is released:

  1. Cumulative cases in the 5 worst counties
  2. Total NEW cases in the 5 worst counties
  3. A list of safe counties
  4. A text report describing the total new cases, total cumulative cases, and number of safe counties.

You should build this analysis in such a way that running it will extract the most current data straight from the NY-Times URL, and the state name is a parameter that can be changed, allowing this report to be run for other states.


Steps:

  1. Start by reading in the data from the NY-Times URL with read_csv (make sure to attach the tidyverse)

  2. This data is considered our “raw data”. Remember to always leave raw data raw and to generate meaningful subsets as you go. Start by making a subset that filters the data to California and adds a new column (mutate) with the daily new cases, computed by county using diff/lag.

(Hint: you will need some combination of filter, group_by, mutate, diff/lag, and ungroup)
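
A minimal sketch of these two steps (the object names covid, state.of.interest, and state_covid are arbitrary choices, not requirements):

# Read the most current data straight from the NY-Times URL
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
covid = read_csv(url)

# Keep the state as a parameter so the report can be rerun for other states
state.of.interest = "California"

state_covid = covid %>%
  filter(state == state.of.interest) %>%
  group_by(county) %>%
  mutate(new_cases = cases - lag(cases)) %>% # daily new cases from cumulative counts
  ungroup()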

  3. Using your subset, generate (2) tables. The first should show the 5 counties with the most cumulative cases, and the second should show the 5 counties with the most NEW cases. Your tables should have clear column names and descriptive captions.

(Hint: Use knitr::kable and the parameters caption and col.names)
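
For example, the cumulative-cases table might look something like this (building on the state_covid sketch above):

state_covid %>%
  filter(date == max(date)) %>% # most recent day only
  slice_max(cases, n = 5) %>%   # 5 counties with the most cumulative cases
  select(county, cases) %>%
  knitr::kable(caption = "Counties with the Most Cumulative Cases",
               col.names = c("County", "Cumulative Cases"))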

  4. To determine the number of cases in the last 14 days per 100,000 people, we need population estimates. Population data is offered by the USDA and can be found here. Please download the data and store it in the data directory of your project.

  5. Load the population data with the “dataset importer” (find the file in your data directory via the file explorer –> click on it –> select “Import Dataset”). Be sure to copy the code preview (ignoring the View(...) call) and insert it in your R Markdown document. This will allow the data to be read every time the file is run!

(Hint: Be careful about how the data is imported (do any rows need to be skipped?))
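
The import call will look roughly like the following; the file name, relative path, and skip value here are assumptions, so mirror whatever the importer preview actually shows:

# File name, relative path, and skip value are assumptions --
# copy the exact call from the "Import Dataset" preview
pop = read_excel("../data/PopulationEstimates.xls", skip = 2)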

  6. Now, explore the data: what attributes does it have, and what are the names of the columns? Do any match the COVID data we have? What are the dimensions?

(Hint: names(), dim(), nrow(), str(), …)

  7. Join the population data to the California COVID data.

  8. Generate (2) new tables. The first should show the 5 counties with the most cumulative cases per capita, and the second should show the 5 counties with the most NEW cases per capita. Your tables should have clear column names and descriptive captions.

(Hint: Use knitr::kable)
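
A sketch of the join and one of the per capita tables. The USDA column names used here (FIPStxt, POP_ESTIMATE_2019) are assumptions; check them against your import:

# Keep only the columns needed for the join and the per capita math
pop_clean = pop %>%
  select(fips = FIPStxt, pop_2019 = POP_ESTIMATE_2019)

covid_pop = state_covid %>%
  left_join(pop_clean, by = "fips")

covid_pop %>%
  filter(date == max(date)) %>%
  mutate(cases_per_cap = cases / pop_2019) %>%
  slice_max(cases_per_cap, n = 5) %>%
  select(county, cases_per_cap) %>%
  knitr::kable(caption = "Counties with the Most Cumulative Cases Per Capita",
               col.names = c("County", "Cases Per Capita"))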

  9. Filter the merged COVID/population data to only include the last 14 days. Remember, this should be a programmatic request and not hard-coded. Then, use the group_by/summarize paradigm to determine the total number of new cases in the last 14 days per 100,000 people.

(Hint: Dates are numeric in R, so operations like max, min, -, +, >, and < work.)
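
One hedged approach, continuing from the covid_pop sketch above (the 100 cases per 100,000 threshold comes from watch list condition 2):

fourteen_day = covid_pop %>%
  filter(date > max(date) - 14) %>% # programmatic window, not hard-coded dates
  group_by(county, pop_2019) %>%
  summarize(new_cases_14d = sum(new_cases, na.rm = TRUE), .groups = "drop") %>%
  mutate(cases_per_100k = new_cases_14d / pop_2019 * 100000)

# Safe counties fall at or below the condition 2 threshold
safe = filter(fourteen_day, cases_per_100k <= 100)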

  10. Write the results of your analysis using inline R chunks to describe (1) the total number of cases, (2) the total number of new cases, and (3) the total number of safe counties, in a sentence or two.
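
In the body of the Rmd (outside a code chunk), inline R looks like this (object names carried over from the sketches above):

As of `r max(covid_pop$date)`, there are
`r sum(covid_pop$cases[covid_pop$date == max(covid_pop$date)])` cumulative cases
across California, and `r nrow(safe)` counties are considered safe.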

Question 2:

In this question, we are going to look at the story of 4 states and the impact that scale can have on data interpretation. The states include: New York, California, Louisiana, and Florida.

Your task is to make a faceted bar plot showing the number of daily, new cases at the state level.


Steps:

  1. First, we need to group/summarize our county level data to the state level, filter it to the four states of interest, and calculate the number of daily new cases (diff/lag) and the 7-day rolling mean.

Hint: You will need two group_by calls and the zoo::rollmean function.
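
A sketch of the two-grouping pattern (object names are arbitrary):

states.of.interest = c("New York", "California", "Louisiana", "Florida")

state_daily = covid %>%
  filter(state %in% states.of.interest) %>%
  group_by(state, date) %>% # first grouping: counties -> states
  summarize(cases = sum(cases, na.rm = TRUE), .groups = "drop") %>%
  group_by(state) %>%       # second grouping: per-state lags and rolling means
  mutate(new_cases = cases - lag(cases),
         roll7 = zoo::rollmean(new_cases, k = 7, fill = NA, align = "right")) %>%
  ungroup()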

  2. Using the modified data, make a facet plot of the daily new cases and the 7-day rolling mean. Your plot should use compelling geoms, labels, colors, and themes.
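
One possible plot, assuming the state_daily sketch above; swap in your own geoms, colors, and theme:

ggplot(state_daily, aes(x = date)) +
  geom_col(aes(y = new_cases), fill = "steelblue") + # daily new cases
  geom_line(aes(y = roll7), color = "darkred") +     # 7-day rolling mean
  facet_wrap(~state, scales = "free_y") +
  theme_minimal() +
  labs(title = "Daily New COVID-19 Cases",
       x = "Date", y = "New Cases",
       caption = "Data: The New York Times")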

  3. The story of raw case counts can be misleading. To understand why, let's explore the cases per capita of each state. To do this, join the state COVID data to the USDA population estimates and calculate \(\text{new cases} / \text{total population}\). Additionally, calculate the 7-day rolling mean of the new cases per capita. This is a tricky task that will (most likely) take some thought, time, and modification of existing code!

Hint: You may need to modify the columns you kept in your original population data. Be creative with how you join data (inner vs outer vs full)!
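
One hedged sketch; it assumes the USDA file marks state rows with a FIPS code ending in "000" and stores the state name in an Area_Name column, so verify both against your import:

state_pop = pop %>%
  filter(str_detect(FIPStxt, "000$")) %>%
  select(state = Area_Name, pop_2019 = POP_ESTIMATE_2019)

state_daily_pc = state_daily %>%
  inner_join(state_pop, by = "state") %>%
  group_by(state) %>%
  mutate(new_cases_pc = new_cases / pop_2019, # daily new cases per capita
         roll7_pc = zoo::rollmean(new_cases_pc, k = 7, fill = NA, align = "right")) %>%
  ungroup()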

  4. Using the per capita data, replicate the previous facet plot with compelling labels, colors, and themes. While similar to the first plot, it should be visually distinct (e.g. choose different colors but keep the same theme).

  5. Briefly describe the influence that scaling by population had on the analysis. Does it make some states look better? Some worse? How so?


Question 3:

This question is extra credit!

  • Here we will explore our first spatial example. In it we will calculate the Weighted Mean Center of the COVID-19 outbreak in the USA to understand the movement of the virus through time.

  • To do this, we need to join the COVID data with location information. I have staged the latitude and longitude of county centers here. For reference, this data was processed like this:

library(sf)    # st_centroid(), st_coordinates(), st_drop_geometry()
library(dplyr)

counties = USAboundaries::us_counties() %>% 
  select(fips = geoid, name, state_name) %>% 
  st_centroid() %>%                       # collapse each county polygon to its center point
  mutate(LON = st_coordinates(.)[,1],     # X coordinate (longitude)
         LAT = st_coordinates(.)[,2]) %>% # Y coordinate (latitude)
  st_drop_geometry()                      # back to a plain data.frame

write.csv(counties, "../docs/data/county-centroids.csv", row.names = FALSE)

Please download the data, place it in your data directory; read it in (readr::read_csv()); and join it to your raw COVID-19 data using the fips attributes.

  • The mean center of a set of spatial points is defined as the average X and Y coordinate. A weighted mean center is found by weighting the X and Y coordinates by another variable:

\[X_{coord} = \frac{\sum{(X_i \cdot w_i)}}{\sum{w_i}} \qquad Y_{coord} = \frac{\sum{(Y_i \cdot w_i)}}{\sum{w_i}}\]

  • For each date, calculate the weighted mean \(X_{coord}\) and \(Y_{coord}\) across all counties, using the daily cumulative cases as the weight \(w_{i}\). In addition, calculate the total cases for each day, as well as the month.

Hint: the month can be extracted from the date column using format(date, "%m")
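
A sketch of the calculation, assuming the centroid file was saved to your data directory and the join key is fips:

centroids = readr::read_csv("../data/county-centroids.csv")

covid_centroids = covid %>%
  inner_join(centroids, by = "fips")

wmc = covid_centroids %>%
  group_by(date) %>%
  summarize(LON = sum(LON * cases, na.rm = TRUE) / sum(cases, na.rm = TRUE),
            LAT = sum(LAT * cases, na.rm = TRUE) / sum(cases, na.rm = TRUE),
            total_cases = sum(cases, na.rm = TRUE)) %>%
  mutate(month = format(date, "%m"))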

  • Plot the weighted mean center (aes(x = LON, y = LAT)), colored by month, and sized by total cases for each day. These points should be plotted over a map of the USA states, which can be added to a ggplot object with:
borders("state", fill = "gray90", colour = "white")

(Feel free to modify fill and colour. Note that the argument must be spelled colour; see the documentation.)
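
Putting it together (borders() is a ggplot2 helper that requires the maps package to be installed):

ggplot(wmc) +
  borders("state", fill = "gray90", colour = "white") +
  geom_point(aes(x = LON, y = LAT, color = month, size = total_cases)) +
  theme_void() +
  labs(title = "Weighted Mean Center of the COVID-19 Outbreak",
       color = "Month", size = "Total Cases")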

  • In a few sentences, describe the movement of the COVID-19 weighted mean throughout the USA and possible drivers of its movement given your knowledge of the outbreak hot spots.

Rubric

  • Question 1 (40)
  • Question 2 (35)
  • Question 3 (30)
  • Well Structured and appealing Rmd (20)
  • Deployed as web page (5)

Total: 100 points (130 possible)

Submission

For this lab you will submit a URL to a webpage deployed with GitHub Pages. To do this:

  • Knit your lab 2 document
  • Stage/commit/push your files
  • Activate GitHub Pages (GitHub –> Settings –> GitHub Pages) and deploy from the docs folder
  • If you followed the naming conventions in the “Set-up”, your lab 2 link will be available at:

https://USERNAME.github.io/geog-176A-labs/lab-02.html

Submit this URL in the appropriate Gauchospace dropbox. Also take a moment to update your personal webpage with this link and some bullet points of what you learned. While not graded as part of this lab, your webpage will be your final!