In this lab you will practice data wrangling and visualization skills using real-time COVID-19 data maintained by the New York Times.
In your geog-176A-labs GitHub repository, make sure you have `data`, `docs`, and `img` directories, and create a new `lab-02.Rmd` file with the following YAML front matter:
```yaml
---
title: "Geography 13"
author: "[Mike Johnson](https://mikejohnson51.github.io)"
subtitle: 'Lab 02: COVID-19 Pandemic'
output:
  html_document:
    theme: journal
---
```
Be sure to associate your name with your personal webpage via a link.
You will need a few libraries for this lab. Make sure they are installed and loaded in your Rmd.
- `tidyverse` (data wrangling and visualization)
- `knitr` (make nice tables)
- `readxl` (read Excel files)
- `zoo` (rolling averages)
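A minimal setup chunk for your Rmd, loading the packages listed above, might look like this:

```r
# Packages used throughout this lab
library(tidyverse)  # data wrangling and visualization
library(knitr)      # nice tables with kable()
library(readxl)     # read Excel files
library(zoo)        # rolling averages with rollmean()
```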
We are going to practice some data wrangling skills using a real-world dataset about COVID cases curated and maintained by the New York Times. The data has been used to create reports and data visualizations like this, and is archived on a GitHub repo here. Looking at the README in this repository we read:
“We are providing two sets of data with cumulative counts of coronavirus cases and deaths: one with our most current numbers for each geography and another with historical data showing the tally for each day for each geography … the historical files are the final counts at the end of each day … The historical and live data are released in three files, one for each of these geographic levels: U.S., states and counties.”
For this lab we will use the historic, county level data which is stored as an updating CSV at this URL:
https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
You are a data scientist for the state of California Department of Public Health.
You’ve been tasked with giving a report to Governor Newsom every morning about the most current COVID-19 conditions at the county level.
As it stands, the California Department of Public Health maintains a watch list of counties that are being monitored for worsening coronavirus trends. There are six criteria used to place counties on the watch list:
Doing fewer than 150 tests per 100,000 residents daily (over a 7-day average)
More than 100 new cases per 100,000 residents over the past 14 days…
25 new cases per 100,000 residents and an 8% test positivity rate
10% or greater increase in COVID-19 hospitalized patients over the past 3 days
Fewer than 20% of ICU beds available
Fewer than 25% of ventilators available
Of these 6 conditions, you are in charge of monitoring condition number 2.
To do this job well, you should set up a reproducible framework to communicate your findings in a way that can be updated every time new data is released. You should build this analysis in such a way that running it will extract the most current data straight from the NY Times URL, and the state name should be a parameter that can be changed, allowing this report to be run for other states.
Start by reading in the data from the NY Times URL with `read_csv()` (make sure to attach the tidyverse first).
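For example (a sketch; the object names `covid` and `my_state` are illustrative, and `my_state` could equally be an Rmd parameter):

```r
# Read the most current data straight from the NY Times URL
url   <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
covid <- read_csv(url)

# Keeping the state of interest in one place makes the report easy to
# re-run for other states
my_state <- "California"
```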
This data is considered our “raw data”. Remember to always leave raw data raw, and to generate meaningful subsets as you go. Start by making a subset that filters the data to California and adds a new column (`mutate`) with the daily new cases (using `diff`/`lag`) by county.
(Hint: you will need some combination of `filter`, `group_by`, `mutate`, `diff`/`lag`, and `ungroup`; one possible pattern is sketched below.)
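A sketch of that pattern, assuming the column names in the NY Times file (`date`, `county`, `state`, `fips`, `cases`, `deaths`) and the illustrative objects from above:

```r
# Subset to the state of interest and compute daily new cases per county
ca <- covid %>%
  filter(state == my_state) %>%
  group_by(county) %>%
  mutate(new_cases = cases - lag(cases)) %>%  # daily change in the cumulative count
  ungroup()
```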
Using this subset, generate two tables: the first showing the 5 counties with the most cumulative cases, and the second showing the 5 counties with the most new cases.
(Hint: Use `knitr::kable` and the parameters `caption` and `col.names`.)
To determine the number of cases in the last 14 days per 100,000 people we need population estimates. Population data is offered by the USDA and can be found here. Please download the data and store it in the `data` directory of your project.
Load the population data with the “dataset importer” (find the file in your `data` directory via the file explorer -> click on it -> select “Import Dataset”). Be sure to copy the code preview (ignoring the `View(...)` call) and insert it in your Rmarkdown. This will allow the data to be re-read every time the file is run!
(Hint: Be careful about how the data is imported (do any rows need to be skipped?))
(Hint: `names()`, `dim()`, `nrow()`, `str()`, …)
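If you use the importer, the generated code will look roughly like the sketch below. The file name and the `skip` value are assumptions here; match them to whatever the importer produced for your download.

```r
# Read the USDA population estimates stored in the data/ directory
# (file name and number of skipped header rows are assumptions)
pop <- read_excel("data/PopulationEstimates.xls", skip = 2)

# Quick checks on what was read in
names(pop)
dim(pop)
```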
Join the population data to the California COVID data.
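A sketch of one way to do the join. The population column names used here (`FIPStxt`, `POP_ESTIMATE_2019`) are assumptions; use the names that appear in your file, and make sure the FIPS columns in both tables have the same type before joining.

```r
# Join county population estimates onto the California COVID data by FIPS code
ca_pop <- ca %>%
  left_join(
    pop %>% select(fips = FIPStxt, population = POP_ESTIMATE_2019),
    by = "fips"
  )
```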
Generate (2) new tables. The first should show the 5 counties with the most cumulative cases per capita, and the second should show the 5 counties with the most NEW cases per capita. Your tables should have clear column names and descriptive captions.
(Hint: Use `knitr::kable`.)
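One of the two tables might be built like this (a sketch using the illustrative column names from above; the second table follows the same pattern with the new-case counts):

```r
# Top 5 counties by cumulative cases per capita, using the most recent date
ca_pop %>%
  filter(date == max(date)) %>%
  mutate(cases_per_capita = cases / population) %>%
  slice_max(cases_per_capita, n = 5) %>%
  select(county, cases_per_capita) %>%
  knitr::kable(
    caption   = "Top 5 California counties by cumulative cases per capita",
    col.names = c("County", "Cumulative cases per capita")
  )
```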
Use the `group_by`/`summarize` paradigm to determine the total number of new cases in the last 14 days per 100,000 people.
(Hint: Dates are numeric in R, so operations like `max`, `min`, `-`, `+`, `>`, and `<` work.)
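A sketch of that calculation, again using the illustrative column names created earlier:

```r
# Total new cases in the last 14 days per 100,000 people, by county
ca_pop %>%
  filter(date > max(date) - 14) %>%
  group_by(county) %>%
  summarize(
    new_cases_14d  = sum(new_cases, na.rm = TRUE),
    population     = first(population),
    cases_per_100k = new_cases_14d / population * 100000
  )
```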
In this question, we are going to look at the story of four states and the impact that scale can have on data interpretation. The states are: New York, California, Louisiana, and Florida.
Your task is to make a faceted bar plot showing the number of daily, new cases at the state level.
To do this, `group_by`/`summarize` the county-level data to the state level, `filter` it to the four states of interest, and calculate the number of daily new cases (`diff`/`lag`) and the 7-day rolling mean.
(Hint: You will need two `group_by` calls and the `zoo::rollmean` function; a sketch follows.)
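A sketch under the same column-name assumptions as before (the rolling mean will contain `NA`s at the start of each series, which is expected):

```r
states_of_interest <- c("New York", "California", "Louisiana", "Florida")

state_covid <- covid %>%
  filter(state %in% states_of_interest) %>%
  # first group_by: collapse counties to one row per state and date
  group_by(state, date) %>%
  summarize(cases = sum(cases), .groups = "drop") %>%
  # second group_by: compute daily new cases and the rolling mean within each state
  group_by(state) %>%
  mutate(
    new_cases = cases - lag(cases),
    roll_mean = zoo::rollmean(new_cases, k = 7, fill = NA, align = "right")
  ) %>%
  ungroup()
```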
Using the modified data, make a facet plot of the daily new cases and the 7-day rolling mean. Your plot should use compelling geoms, labels, colors, and themes.
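For example (a sketch; geoms, colors, labels, and theme are all yours to improve):

```r
ggplot(state_covid, aes(x = date)) +
  geom_col(aes(y = new_cases), fill = "steelblue") +   # daily new cases
  geom_line(aes(y = roll_mean), color = "darkred") +   # 7-day rolling mean
  facet_wrap(~state, scales = "free_y") +
  labs(
    title   = "Daily new COVID-19 cases",
    x       = "Date",
    y       = "New cases",
    caption = "Data: The New York Times"
  ) +
  theme_minimal()
```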
The story told by raw case counts can be misleading. To understand why, let's explore the cases per capita of each state. To do this, join the state COVID data to the USDA population estimates and calculate \(\text{new cases} / \text{total population}\). Additionally, calculate the 7-day rolling mean of the new cases per capita. This is a tricky task and will (most likely) take some thought, time, and modification to existing code!
Hint: You may need to modify the columns you kept in your original population data. Be creative with how you join data (inner vs outer vs full)!
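One possible approach is sketched below. It assumes the USDA file contains state-level rows (FIPS codes ending in `000`) and columns named `Area_Name` and `POP_ESTIMATE_2019`; check your file and adjust the column names and join strategy as needed.

```r
# State-level population from the USDA file (assumed column names)
state_pop <- pop %>%
  filter(str_detect(FIPStxt, "000$")) %>%    # keep the state-level rows only
  select(state = Area_Name, population = POP_ESTIMATE_2019)

# Join to the state-level COVID data and compute per-capita values
state_covid_pc <- state_covid %>%
  left_join(state_pop, by = "state") %>%
  group_by(state) %>%
  mutate(
    new_cases_pc = new_cases / population,
    roll_mean_pc = zoo::rollmean(new_cases_pc, k = 7, fill = NA, align = "right")
  ) %>%
  ungroup()
```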
Using the per capita data, replicate the previous facet plot with compelling labels, colors, and themes. While it will look similar to the other plot, it should be visually distinct (e.g. choose different colors but keep the same theme).
Briefly describe the influence that scaling by population had on the analysis. Does it make some states look better? Some worse? How so?
This question is extra credit!
Here we will explore our first spatial example. In it we will calculate the Weighted Mean Center of the COVID-19 outbreak in the USA to understand the movement of the virus through time.
To do this, we need to join the COVID data with location information. I have staged the latitude and longitude of county centers here. For reference, this data was processed like this:
```r
# For reference only: how the centroid file was built
# (requires the sf and USAboundaries packages)
library(sf)

counties = USAboundaries::us_counties() %>%
  select(fips = geoid, name, state_name) %>%
  st_centroid() %>%                          # collapse each county polygon to its center point
  mutate(LON = st_coordinates(.)[,1],        # pull the X/Y coordinates into columns
         LAT = st_coordinates(.)[,2]) %>%
  st_drop_geometry()                         # drop the geometry, keep a plain data frame

write.csv(counties, "../docs/data/county-centroids.csv")
```
Please download the data, place it in your `data` directory, read it in (`readr::read_csv()`), and join it to your raw COVID-19 data using the `fips` attributes.
Then, for each month, compute the weighted mean center of the outbreak, where \(X_i\) and \(Y_i\) are the coordinates of county \(i\)'s centroid and \(w_i\) is its weight (here, its case count):

\[X_{coord} = \frac{\sum{(X_i \cdot w_i)}}{\sum{w_i}} \qquad Y_{coord} = \frac{\sum{(Y_i \cdot w_i)}}{\sum{w_i}}\]
(Hint: the month can be extracted from the date column using `format(date, "%m")`.)
To plot your weighted mean centers over a map of the United States, you can add state outlines to a ggplot with `borders("state", fill = "gray90", colour = "white")` (feel free to modify `fill` and `colour`; note that it must be spelled `colour`, see the documentation).
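Putting it together, a sketch (using case counts as the weight and the column names from the centroid file above; `borders()` requires the maps package to be installed):

```r
# Join the county centroids onto the COVID data
# (make sure both fips columns have the same type before joining)
centroids <- read_csv("data/county-centroids.csv")

wmc <- covid %>%
  inner_join(centroids, by = "fips") %>%
  mutate(month = format(date, "%m")) %>%
  group_by(month) %>%
  summarize(
    X_coord = sum(LON * cases) / sum(cases),   # weighted mean longitude
    Y_coord = sum(LAT * cases) / sum(cases),   # weighted mean latitude
    cases   = sum(cases)
  )

# Plot the monthly weighted mean centers over a map of the lower 48
ggplot(wmc, aes(x = X_coord, y = Y_coord)) +
  borders("state", fill = "gray90", colour = "white") +
  geom_point(aes(size = cases), color = "red", alpha = 0.7) +
  coord_quickmap() +
  theme_void() +
  labs(size = "Cases")
```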
Total: 100 points (130 possible)
For this lab you will submit a URL to a webpage deployed with GitHub Pages. To do this, make sure your rendered lab-02.html ends up in the `docs` directory of your repository; your lab will then be available at:
https://USERNAME.github.io/geog-13-labs/lab-02.html
Submit this URL in the appropriate Gauchospace dropbox. Also take a moment to update your personal webpage with this link and some bullet points of what you learned. While not graded as part of this lab, it will be your final!