Writing Functions
We have learned a lot!
R is a statistical computing language that provides features as functions
Even with just its base installation, R provides hundreds of functions:
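One way to get a rough sense of this is to list the objects (mostly functions) that base R exports; a minimal check, not an exact count:

length(ls("package:base"))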
We have been using functions written for us - mostly by sf and the tidyverse
This is how any commercial GIS suite operates as well: analysis and workflows are limited to the toolkits and options exposed to the user
In R, a lot more is actually exposed!
Every time we install a new package, we download code that provides new specific features (as functions)
Every time we attach a package to a working session (library()) we are making those functions available/visible
Just like x = 10 binds the value of 10 to the name x, creating an object visible in the environment, functions are objects that can be called by name to execute a set of directions over defined arguments.
One of the best ways to improve your skills as a data scientist is to write functions.
Functions allow you to automate common tasks in a more general way than copy-and-pasting.
The more times you apply a function, the more incentive you have to optimize it for speed/accuracy
The more creative/unique your analyses and questions can be
The process can be named
As requirements change, you only need to update code in one place
You eliminate the chance of making incidental mistakes when you copy and paste (forgetting to change dist_to_state to dist_to_border).
Functions can be ‘sourced’ into Qmd/Rmd files and R scripts
You save yourself time
Data is the first argument (better for pipes!)
Whenever you have copy-and-pasted code more than twice, you should write a function
For example, how many times have we coded the same filtering step (see the sketches below)?
Or made the same table with knitr/kableExtra, only changing the col.names and data.frame?
Or calculated a distance, changed the units, and dropped them?
All of these tasks are repetitive and prone to errors that could impact our analysis but not break our code…
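For illustration, a minimal sketch of these repeated patterns; the object and column names (us_states, name, cities, border) are assumptions used only to show the shape of the code:

# 1. filtering a state layer down to the lower 48
filter(us_states, !name %in% c("Hawaii", "Alaska", "Puerto Rico"))

# 2. re-making "the same" table, changing only the data.frame and col.names
knitr::kable(head(mtcars[, 1:2]),
             col.names = c("Miles per gallon", "Cylinders")) |>
  kableExtra::kable_styling()

# 3. calculating a distance, changing the units, and dropping them
st_distance(cities, border) |>
  units::set_units("km") |>
  units::drop_units()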
Creating a function follows the form:
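A schematic of the template, followed by a tiny hypothetical example of the same pattern:

# schematic (not runnable as written):
#   name = function(arg1, arg2, *) {
#     code ...
#     return(...)
#   }

# a toy instance of the pattern:
multiply = function(arg1, arg2) {
  out = arg1 * arg2   # code ... : the directions applied to the arguments
  return(out)         # return(...) : what the function gives back
}

multiply(3, 4)
#> [1] 12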
Where:
- name is the function name (e.g. st_as_sf)
- arg1 is the first input
- arg2 is the second input
- * is any other argument you want to define
- code ... defines the instructions to carry out on arg1 and arg2
- return(...) is what the function returns
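As a concrete starting point, here is a sketch of the copy-and-paste version of the CONUS filter we want to generalize (assuming a us_states object with a name column):

filter(us_states, !name %in% c("Hawaii", "Alaska", "Puerto Rico"))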
Here the input data (us_states) could change
So could the variable name we filter by (name)
So, let's start with a function that takes general input data and a variable name
Function arguments typically include two broad sets:
- the data to compute on
- arguments that control the details of the calculation (these often have defaults, e.g. na.rm = FALSE)
In st_transform, x is the data and crs is the proj4string/EPSG code
In ms_simplify, input is the data and keep controls the details of the simplification
In get_conus, data provides the data and var defines the column to filter on
We then have to carry these generalizations into the function directions using the arguments as our operators:
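A sketch of that function; the excluded names are an assumption, and the structure follows the bullets below:

get_conus = function(data, var) {
  # use get() to look up the column named in var, and keep only CONUS rows
  conus = filter(data, !get(var) %in% c("Hawaii", "Alaska", "Puerto Rico"))
  return(conus)
}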
- here, we replace us_states() with data
- we use get() to return the value of a named object
- we assign our filtered object to the name conus
- and explicitly return the conus object from the function
The value returned by the function is usually the last evaluated statement, if we don’t specify return we can take advantage of this default:
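For example, the same sketch without an explicit return():

get_conus = function(data, var) {
  # the last evaluated expression is returned automatically
  filter(data, !get(var) %in% c("Hawaii", "Alaska", "Puerto Rico"))
}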
Like any object, we have to run the lines of code that define the function to save it as an object before we can use it directly in our code. But then …
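For instance, once the definition above has been run:

conus = get_conus(us_states, "name")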
Another advantage of functions is that if our requirements change, we only need to change our code in one place.
This also means we can spend more time fine-tuning our code since we know it will be recycled.
Here we can be more focused and make sure to remove other potential “non-conus” states from any input object:
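One possible refinement; the extra names are assumptions about what other input objects might contain:

get_conus = function(data, var) {
  filter(data, !get(var) %in% c("Hawaii", "Alaska", "Puerto Rico",
                                "Guam", "American Samoa",
                                "United States Virgin Islands",
                                "Commonwealth of the Northern Mariana Islands"))
}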
The power of GIS lies in analyzing multiple data sources together.
Often the answer you want lies in many different layers and you need to do some analysis to extract and compile information.
One common analysis is Points-in-Polygon (PIP).
PIP is useful when you want to know how many - or what kind of - points fall within the bounds of each polygon
To count the Starbucks locations in Colorado counties, we start by joining the counties to the locations:
Here we use the counties as the x table and the locations as the y table. This is because we want to add the Starbucks information to the county sf object. Remember the default of st_join is a left_join on the st_intersects predicate.
(starbucks1 = st_join(co, starbucks))
#> Simple feature collection with 511 features and 4 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -1146480 ymin: 1566911 xmax: -504612.5 ymax: 2073715
#> Projected CRS: NAD83 / Conus Albers
#> First 10 features:
#> name geoid state_name store_name
#> 1 Yuma 08125 Colorado <NA>
#> 2 San Juan 08111 Colorado <NA>
#> 3 Baca 08009 Colorado <NA>
#> 4 Prowers 08099 Colorado <NA>
#> 5 Custer 08027 Colorado <NA>
#> 6 Fremont 08043 Colorado City Market - Canon City #417
#> 7 Mesa 08077 Colorado I-70 Business Loop & 32 Rd - C
#> 7.1 Mesa 08077 Colorado Hwy 6 & 25 Rd - Grand Junction
#> 7.2 Mesa 08077 Colorado City Market-Grand Junction #444
#> 7.3 Mesa 08077 Colorado City Market - Grand Junction #432
#> geometry
#> 1 MULTIPOLYGON (((-575240.5 1...
#> 2 MULTIPOLYGON (((-1042591 16...
#> 3 MULTIPOLYGON (((-617878.8 1...
#> 4 MULTIPOLYGON (((-587200.4 1...
#> 5 MULTIPOLYGON (((-847581.3 1...
#> 6 MULTIPOLYGON (((-863875.9 1...
#> 7 MULTIPOLYGON (((-1114392 18...
#> 7.1 MULTIPOLYGON (((-1114392 18...
#> 7.2 MULTIPOLYGON (((-1114392 18...
#> 7.3 MULTIPOLYGON (((-1114392 18...
count() is a dplyr function that "lets you quickly count the unique values of one or more variables: df |> count(a, b) is roughly equivalent to df |> group_by(a, b) |> summarize(n = n())"
(count(starbucks1, geoid))
#> Simple feature collection with 64 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -1146480 ymin: 1566911 xmax: -504612.5 ymax: 2073715
#> Projected CRS: NAD83 / Conus Albers
#> First 10 features:
#> geoid n geometry
#> 1 08001 35 MULTIPOLYGON (((-765835.6 1...
#> 2 08003 1 MULTIPOLYGON (((-878691.4 1...
#> 3 08005 58 MULTIPOLYGON (((-769015.2 1...
#> 4 08007 1 MULTIPOLYGON (((-1004101 16...
#> 5 08009 1 MULTIPOLYGON (((-617878.8 1...
#> 6 08011 1 MULTIPOLYGON (((-640691.8 1...
#> 7 08013 36 MULTIPOLYGON (((-819541.3 1...
#> 8 08014 9 MULTIPOLYGON (((-775389.1 1...
#> 9 08015 1 MULTIPOLYGON (((-905015.4 1...
#> 10 08017 1 MULTIPOLYGON (((-616732 176...
We can anticipate that PIP is a useful process we want to implement over varying point and polygon pairs. So, let's make a function named point_in_polygon that takes a point dataset and a polygon dataset.
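A minimal sketch, assuming the polygon layer carries a geoid column like the county data above:

point_in_polygon = function(points, polygon) {
  st_join(polygon, points) |>
    count(geoid)
}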
In its current form, point_in_polygon only counts on geoid. Let's modify that by making the count variable an input name, again using get() to return the value of a named object.
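A sketch of that generalization:

point_in_polygon2 = function(points, polygon, var) {
  st_join(polygon, points) |>
    count(get(var))
}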
Let's apply our function over the counties and see how long it takes. We can check the time by wrapping our function call in system.time():
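For example, assuming starbucks holds all US locations and counties all US counties (as the numbers below imply):

system.time({
  pip = point_in_polygon2(starbucks, counties, "geoid")
})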
~ 30 seconds to test ~282,100,000 point/polygon relations is not bad, but could be a bottleneck in an analysis
Let's look at a common means of improvement…
Remember our geometries are sticky: they carry through all calculations, whether they are needed or not. We can ease a lot of computational overhead by being mindful of when we retain our geometry data with our attribute data.
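One sketch of this idea: drop the geometry before the expensive count, then join it back on at the end (the column handling here is an assumption):

point_in_polygon3 = function(points, polygon, var) {
  st_join(polygon, points) |>
    st_drop_geometry() |>            # work on attributes only
    count(get(var)) |>
    setNames(c(var, "n")) |>         # rename the count key back to the input name
    left_join(polygon, by = var) |>  # re-attach the polygon geometries
    st_as_sf()
}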
Effectively an 86% decrease in the time needed ((29 - 4) / 29)
What about a really nice, clean, informative plot?
ggplots look great but can be time consuming to program…
A function would allow us to take care of the groundwork
plot_pip = function(data){
  ggplot() +
    # fill each polygon by the log of its point count
    geom_sf(data = data, aes(fill = log(n)), alpha = .9, size = .2) +
    scale_fill_gradient(low = "white", high = "darkgreen") +
    theme_void() +
    theme(legend.position = 'none',
          plot.title = element_text(face = "bold", color = "darkgreen", hjust = .5, size = 24)) +
    labs(title = "Starbucks Locations",
         caption = paste0(sum(data$n), " locations represented"))
}
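Used together (with the object names assumed above):

point_in_polygon3(starbucks, counties, "geoid") |>
  plot_pip()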
Here is a nice tutorial of point-in-polygon in QGIS. It is looking at earthquakes by country.
Just like us, they use naturalearth data for the polygon areas
And they are looking at earthquake data maintained by NOAA.
In R, we can read the NOAA data directly from a URL.
The data is a tab-delimited txt file, so we use readr::read_delim()
# turn off spherical (s2) geometry so planar operations are used
sf::sf_use_s2(FALSE)

# country polygons from Natural Earth (1:110m)
countries = st_as_sf(rnaturalearth::countries110)

# read the tab-delimited NOAA file, drop rows with missing coordinates,
# promote to sf points (WGS84), and match the countries' CRS
quakes = 'data/earthquakes-2025-03-29_21-55-22_-0600.tsv' |>
  read_delim(delim = "\t") |>
  filter(!is.na(Latitude), !is.na(Longitude)) |>
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) |>
  st_transform(st_crs(countries))
nrow(countries)
#> [1] 177
nrow(quakes)
#> [1] 6443
nrow(st_intersection(quakes, countries))
#> [1] 4317
We can use our functions right out of the box for this data. But… some things are not quite right…
point_in_polygon3(quakes, countries, var = "ADMIN") |>
  plot_pip() +
  labs(title = "Earthquake Locations",
       subtitle = "Most impacted countries") +
  theme(plot.subtitle = element_text(hjust = .5),
        plot.title = element_text(color = "navy")) +
  scale_fill_viridis_c() +
  geom_sf(data = quakes, size = .3, alpha = .05, col = 'red') +
  gghighlight::gghighlight(n > (mean(n) + sd(n)))