In this lab we will explore the impacts of tessellated surfaces and the modifiable areal unit problem (MAUP) using the National Dam Inventory maintained by the United States Army Corps of Engineers. Doing this will require repetitive tasks that we will write as functions, along with careful consideration of feature aggregation/simplification, spatial joins, and data visualization. The end goal is to visualize the distribution of dams and their purposes across the country.
DISCLAIMER: This lab will be crunching a TON of data, in some cases 562,590,604 values for a single process! Therefore, I encourage you to run your code chunk-by-chunk rather than knitting regularly. Your final knit may take a couple of minutes to process. I know this is painful, but be proud that, all said, your report will be analyzing billions of meaningful data points and geometric relations.
This lab covers 4 main skills:
Here we will prepare five tessellated surfaces of CONUS and write a function to plot them in a descriptive way.
First, we need a spatial file of CONUS counties. For future area calculations we want these in an equal area projection (EPSG:5070).
To achieve this:

- get an `sf` object of US counties (`USAboundaries::us_counties()`)
- remove those not in CONUS
- transform the data to EPSG:5070
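A minimal sketch of these steps, assuming the `USAboundaries`, `sf`, and `dplyr` packages are available (the excluded states below are a judgment call about what defines CONUS):

```r
library(sf)
library(dplyr)

# Counties in an equal-area projection; the filter assumes the
# state_name column returned by USAboundaries::us_counties()
counties <- USAboundaries::us_counties() %>%
  filter(!state_name %in% c("Alaska", "Hawaii", "Puerto Rico")) %>%
  st_transform(5070)
```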
For triangle-based tessellations we need point locations to serve as our “anchors”.
To achieve this:

- generate county centroids using `st_centroid`
- since we can only tessellate over a single feature, combine/union the resulting 3,108 `POINT` features into a single `MULTIPOINT` feature
- since these are point objects, the difference between union and combine is moot
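Building on the hypothetical `counties` object above, this might look like:

```r
# Centroids of all counties, unioned into one MULTIPOINT feature
county_cent <- counties %>%
  st_centroid() %>%
  st_union()
```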
Tessellations/coverages describe the extent of a region with geometric shapes, called tiles, with no overlaps or gaps. Tiles can range in size, shape, and area, and there are different methods for creating them:

- some methods generate tiles from a set of defined points (e.g. Voronoi tessellation and Delaunay triangulation)
- others generate equal area tiles over a known extent (`st_make_grid`)
For this lab, we will create surfaces of CONUS using 4 methods: 2 based on an extent and 2 based on point anchors:
Tessellations:

- `st_voronoi`: creates a Voronoi tessellation
- `st_triangulate`: triangulates a set of points (not constrained)

Coverages:

- `st_make_grid`: creates a square grid covering the geometry of an `sf` or `sfc` object
- `st_make_grid(square = FALSE)`: creates a hexagonal grid covering the geometry of an `sf` or `sfc` object

The size of coverage tiles can be defined by a cell resolution or a specified number of cells in the X and Y directions.
For this step:

- make a Voronoi tessellation over your county centroids (`MULTIPOINT`)
- make a triangulated tessellation over your county centroids (`MULTIPOINT`)
- make a gridded coverage over your projected counties
- make a hexagonal coverage over your projected counties

In addition to creating these 4 coverages we need to add an ID to each tile.
To do this:

- add a new column to each tessellation that spans from `1:n()`
- remember that ALL tessellation methods return an `sfc` `GEOMETRYCOLLECTION`, and to add attribute information - like our ID - you will have to coerce the `sfc` list into an `sf` object (`st_sf` or `st_as_sf`)
Last, we want to ensure that our surfaces are topologically valid/simple. To ensure this, we can pass our surfaces through `st_cast`.

- Remember that casting an object explicitly (e.g. `st_cast(x, "POINT")`) changes the geometry type
- If no output type is specified (e.g. `st_cast(x)`) then the cast attempts to simplify the geometry
- If you don’t do this you might get unexpected “TopologyException” errors
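A combined sketch of the four surfaces, the ID column, and the simplifying cast. The helper name `to_sf_id` and the grid dimension `n = 70` are illustrative choices, not values prescribed by the lab:

```r
# Coerce an sfc geometry to sf and add a tile ID (hypothetical helper)
to_sf_id <- function(geom) {
  st_cast(geom) %>%        # unspecified cast simplifies the GEOMETRYCOLLECTION
    st_as_sf() %>%
    mutate(id = 1:n())
}

v_grid   <- to_sf_id(st_voronoi(county_cent))       # Voronoi tessellation
t_grid   <- to_sf_id(st_triangulate(county_cent))   # Delaunay triangulation
sq_grid  <- to_sf_id(st_make_grid(counties, n = 70))                  # square coverage
hex_grid <- to_sf_id(st_make_grid(counties, n = 70, square = FALSE))  # hex coverage
```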
If you plot the above tessellations you’ll see the triangulated surfaces produce regions far beyond the boundaries of CONUS. We need to cut these back to the CONUS border. To do this, we will call on `st_intersection`, but we first need a geometry of CONUS to serve as our differencing feature. We can get this by unioning our existing county boundaries.
With a single feature boundary, we must carefully consider the complexity of the geometry. Remember, the more points our geometry contains, the more computations are needed for spatial predicates and differencing. For a task like ours, we do not need a finely resolved coastal border.
To achieve this:

- simplify your unioned border using the Visvalingam algorithm provided by `rmapshaper::ms_simplify`
- choose what percentage of vertices to retain using the `keep` argument, and work to find the lowest value that still provides a shape you are comfortable with for the analysis:
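For example, a sketch assuming the `counties` object from earlier (the `keep` value here is just a starting point to iterate on):

```r
# One border for CONUS, then an aggressive simplification
conus_u    <- st_union(counties)
conus_simp <- rmapshaper::ms_simplify(conus_u, keep = 0.05)

mapview::npts(conus_u)     # vertices before simplification
mapview::npts(conus_simp)  # vertices after simplification
```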
Once you are happy with your simplification, use the `mapview::npts` function to report the number of points in your original object and the number of points in your simplified object.

How many points were you able to remove? What are the consequences of doing this computationally?
Finally, use your simplified object to crop the two triangulated tessellations with `st_intersection`:
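Using the hypothetical object names from the earlier sketches:

```r
# Clip the triangulated surfaces back to the simplified CONUS border
v_grid <- st_intersection(v_grid, conus_simp)
t_grid <- st_intersection(t_grid, conus_simp)
```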
The last step is to plot your tessellations. We don’t want to write out 5 ggplots (or mindlessly copy and paste 😄). Instead, let’s make a function that takes an `sf` object as arg1 and a character string as arg2, and returns a ggplot object showing arg1 titled with arg2.
The form of a function is:

```r
name = function(arg1, arg2) {
  # ... code goes here ...
}
```
For this function:

- the name can be anything you choose; arg1 should take an `sf` object, and arg2 should take a character string that will title the plot
- in your function, the code should follow our standard `ggplot` practice where your data is arg1, and your title is arg2

The function should also enforce the following:

- a `white` fill
- a `navy` border
- a `size` of 0.2
- `theme_void()`
- a caption that reports the number of features in arg1
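One way to write this (the function name `plot_tess` is illustrative):

```r
library(ggplot2)

# Plot an sf object (data) with a title (title)
plot_tess <- function(data, title) {
  ggplot() +
    geom_sf(data = data, fill = "white", col = "navy", size = 0.2) +
    theme_void() +
    labs(title = title,
         caption = paste(nrow(data), "features"))
}
```

For example, `plot_tess(v_grid, "Voronoi Coverage")` would return the titled plot for the Voronoi surface.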
Use your new function to plot each of your tessellated surfaces and the original county data (5 plots in total):
In this question, we will write a function to summarize our tessellated surfaces. First, we need a function that takes an `sf` object and a `character` string and returns a `data.frame`.
For this function:

- the function name can be anything you choose; arg1 should take an `sf` object, and arg2 should take a character string describing the object
- in your function, calculate the area of arg1, convert the units to km2, and then drop the units
- next, create a `data.frame` containing the following:
  - the text from arg2
  - the number of features in arg1
  - the mean area of the features in arg1 (km2)
  - the standard deviation of the areas of the features in arg1
  - the total area (km2) of arg1
- return this `data.frame`
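A sketch of such a function, using the `sum_tess` name that the `bind_rows` example below also uses; the `units` package handles the km2 conversion, and this is just one way to do it:

```r
sum_tess <- function(data, description) {
  # feature areas in km2, with the units class dropped
  areas <- st_area(data) %>%
    units::set_units("km^2") %>%
    units::drop_units()

  data.frame(description = description,
             n_features  = nrow(data),
             mean_area   = mean(areas),
             sd_area     = sd(areas),
             total_area  = sum(areas))
}
```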
Use your new function to summarize each of your tessellations and the original counties.
Multiple `data.frame` objects can be bound row-wise with `bind_rows` into a single `data.frame`. For example, if your function is called `sum_tess`, the following would bind your summaries of the triangulation and Voronoi objects:
```r
tess_summary = bind_rows(
  sum_tess(triangulation, "triangulation"),
  sum_tess(voronoi, "voronoi"))
```
Once your 5 summaries are bound (2 tessellations, 2 coverages, and the raw counties), print the `data.frame` as a nice table using `knitr`/`kableExtra`.
Comment on the traits of each tessellation. Be specific about how these traits might impact the results of a point-in-polygon analysis in the context of the modifiable areal unit problem and with respect to computational requirements.
The data we are going to analyze in this lab comes from the US Army Corps of Engineers National Dam Inventory (NID). This dataset documents ~91,000 dams in the United States along with a variety of attribute information including design specifications, risk level, age, and purpose.
For the remainder of this lab we will analyze the distribution of these dams (Q3) and their purposes (Q4) using a point-in-polygon analysis.
In the tradition of this class - and true to data science/GIS work - you need to find, download, and manage raw data:

- use the NID2019 hyperlink to download the data for the country
- save it to the `data` folder of your `geog-13-labs` project
- read it in with `read_excel` from the `readxl` package
- remove rows with missing coordinate values (`!is.na()`)
- convert the `data.frame` to an `sf` object by defining the coordinates and CRS

Following the in-class examples, develop an efficient point-in-polygon function that takes:
- the points of interest as arg1,
- the tessellated surface (tiles) as arg2,
- the name of the tile ID column as arg3

The function should make use of spatial and non-spatial joins, `sf` coercion, and `dplyr::count`. The returned object should be the input tile `sf` object with a column - `n` - counting the number of points in each tile.
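One possible sketch, assuming the argument roles described above (the function and argument names are illustrative):

```r
point_in_polygon <- function(points, polygons, id) {
  counts <- st_join(polygons, points) %>%  # spatial join: one row per point per tile
    st_drop_geometry() %>%                 # coerce to a plain data.frame
    count(.data[[id]]) %>%                 # non-spatial tally of points per tile
    setNames(c(id, "n"))

  left_join(polygons, counts, by = id)     # attach counts back to the tile geometries
}
```

Note that in this sketch a tile containing zero points inherits a count of 1 from its unmatched join row; that edge case is left for you to handle.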
Apply your point-in-polygon function to each of your five tessellated surfaces, using the NID dams as your points and each surface’s tile ID column as arg3.
Let’s continue the trend of automating our repetitive tasks through function creation. This time, make a new function that extends your previous plotting function.
For this function:

- the name can be anything you choose; arg1 should take an `sf` object, and arg2 should take a character string that will title the plot

The function should also enforce the following:

- the fill aesthetic is driven by the count column `n`
- the col is `NA`
- the fill is scaled to a continuous viridis color ramp
- `theme_void()`
- a caption that reports the number of dams in arg1 (e.g. `sum(n)`)
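Extending the earlier plotting sketch (names again illustrative):

```r
plot_pip <- function(data, title) {
  ggplot() +
    geom_sf(data = data, aes(fill = n), col = NA) +
    scale_fill_viridis_c() +
    theme_void() +
    labs(title = title,
         caption = paste(sum(data$n, na.rm = TRUE), "dams"))
}
```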
Apply your plotting function to each of the 5 tessellated surfaces with Point-in-Polygon counts:
Comment on the influence of the tessellated surface on the visualization of point counts. How does this relate to the MAUP problem? Moving forward you will only use one tessellation; which will you choose and why? While there is no “right” answer, justify your selection here.
The NID provides a comprehensive data dictionary here. In it we find that dam purposes are designated by a character code. These are shown below for convenience (built using `knitr` on a `data.frame` called `nid_classifier`):
| abbr | purpose |
|------|---------|
| I | Irrigation |
| H | Hydroelectric |
| C | Flood Control |
| N | Navigation |
| S | Water Supply |
| R | Recreation |
| P | Fire Protection |
| F | Fish and Wildlife |
| D | Debris Control |
| T | Tailings |
| G | Grade Stabilization |
| O | Other |
In the data dictionary, we see a dam can have multiple purposes. In these cases, the purpose codes are concatenated in order of decreasing importance. For example, `SCR` would indicate the primary purpose is Water Supply, followed by Flood Control and Recreation.
A standard summary indicates there are nearly 500 unique combinations of dam purposes:

```r
unique(dams2$PURPOSES) %>% length()
[1] 493
```
To overcome this data structure limitation, we can identify how many dams serve each purpose by splitting the PURPOSES values (`strsplit`) and tabulating the unlisted results as a `data.frame`. Effectively this is double/triple/quadruple counting dams based on how many purposes they serve:
```r
# create a vector of all characters in your purpose and unlist
dam_freq <- strsplit(dams2$PURPOSES, split = "") %>%
  unlist() %>%
  table() %>%
  as.data.frame() %>%
  setNames(c("abbr", "count")) %>%
  left_join(nid_classifier, by = "abbr") %>%
  mutate(lab = paste0(purpose, "\n(", abbr, ")"))
```
The result of this tabulation indicates how many dams serve each purpose.
Your task is to create point-in-polygon counts for at least 4 of the above dam purposes:

- you will use `grepl` to filter the complete dataset to those with your chosen purpose
- remember that `grepl` returns TRUE if a given pattern is matched in a string
- `grepl` is vectorized so it can be used in `dplyr::filter`
For example:

```r
# Find fish and wildlife ("F") dams in the first 5 records:
dams2$PURPOSES[1:5]
[1] "FR" "R"  "C"  "FR" "R"

grepl("F", dams2$PURPOSES[1:5])
[1]  TRUE FALSE FALSE  TRUE FALSE
```
For your analysis, choose at least four of the above codes, and describe why you chose them. Then, for each of them, create a subset of dams that serve that purpose using `dplyr::filter` and `grepl`. Finally, use your point-in-polygon function to count each subset across your selected tessellation (an example follows).
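For instance, for flood control dams ("C"), using the hypothetical function and surface names from the earlier sketches:

```r
# Subset dams serving flood control, then count them per tile
fc_dams <- dplyr::filter(dams2, grepl("C", PURPOSES))
fc_pip  <- point_in_polygon(fc_dams, v_grid, "id")
```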
Now use your plotting function from Q3 to map these counts. But! You will use `gghighlight` to only color those tiles where the count (n) is greater than the (mean + 1 standard deviation) of the set. Since your plotting function returns a `ggplot` object already, the `gghighlight` call can be added “+” directly to the function:
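For example, assuming the `plot_pip` and `fc_pip` sketches above:

```r
library(gghighlight)

# Highlight only tiles with counts above mean + 1 SD
plot_pip(fc_pip, "Flood Control Dams") +
  gghighlight(n > (mean(n, na.rm = TRUE) + sd(n, na.rm = TRUE)))
```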
The result of this exploration is to highlight the areas of the country with the most dams serving each of your chosen purposes.
Comment on the geographic distribution of dams you found. Does it make sense? How might the tessellation you chose impact your findings? How does the distribution of dams coincide with other geographic factors such as river systems, climate, etc.?
You have also been asked to identify the largest, at-risk, flood control dams in the country.

You must also map the Mississippi River System:

- this data is available here under the ‘Data & Resources’ tab
- download the shapefile and unzip it into your data directory
- use `read_sf` to import this data and filter it to only include the Mississippi SYSTEM
To achieve this:

- create an interactive map using `leaflet` to show the largest (NID_STORAGE), high-hazard (HAZARD == “H”) dam in each state
- the popups should be built with `leafem::popup` and should only include the dam name, storage, purposes, and year completed
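A rough sketch of the state-by-state selection and map. The `STATE` and `DAM_NAME` column names are assumptions based on the NID data dictionary, and the popup here uses leaflet’s own `popup` argument as a simple stand-in:

```r
library(leaflet)

# Largest high-hazard flood control dam per state
big_dams <- dams2 %>%
  filter(HAZARD == "H", grepl("C", PURPOSES)) %>%
  group_by(STATE) %>%                                # STATE is an assumed NID column
  slice_max(NID_STORAGE, n = 1, with_ties = FALSE) %>%
  ungroup()

leaflet(data = big_dams) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addCircleMarkers(radius = ~ NID_STORAGE / max(NID_STORAGE) * 20,
                   stroke = FALSE, fillOpacity = 0.7,
                   popup  = ~ DAM_NAME)              # DAM_NAME is an assumed NID column
```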
Total: 100 points (120 points total)

You will submit a URL to your web page deployed with GitHub pages.
To do this:
https://USERNAME.github.io/geog-13-labs/lab-04.html
Submit this URL in the appropriate Gauchospace dropbox. Also take a moment to update your personal webpage with this link and some bullet points of what you learned. While not graded as part of this lab, it will be your final!