Week 3

Writing Functions

Over the last few weeks:

We have learned a lot!

tools for reading and writing data:

  • read_sf / write_sf
  • read_csv / write_csv
  • read_excel

tools for data manipulation:

  • filter
  • select
  • mutate
  • group_by
  • summarize
  • rollmean, lag

tools to manage/manipulate data type/structure:

  • as.numeric
  • as.factor
  • st_as_sf
  • data.frame

tools for merging and shaping data:

  • inner_join
  • left_join
  • right_join
  • full_join
  • pivot_longer
  • pivot_wider

tools for measuring geometries:

  • st_distance
  • st_area
  • st_length
  • set_units / drop_units

These tools are all functions

  • R is a statistical computing language that provides features as functions

  • Even with just its base installation, R provides nearly 2,000 functions:

length(lsf.str("package:base")) + length(lsf.str("package:stats")) + length(lsf.str("package:utils"))
#> [1] 1947
  • sf provides over 100 more
length(lsf.str("package:sf")) 
#> [1] 150
  • and the core tidyverse packages (that we use) an additional ~1,000
length(lsf.str("package:dplyr")) +
length(lsf.str("package:ggplot2")) +
length(lsf.str("package:tidyr")) +
length(lsf.str("package:forcats")) +
length(lsf.str("package:purrr")) 
#> [1] 1025

To date …

  • We have been using functions written for us - mostly by sf and the tidyverse

  • This is how any commercial GIS suite operates as well

  • Analysis and workflows are limited to the toolkits and options exposed to the user

  • In R, a lot more is actually exposed!

  • Every time we install a new package, we download code that provides new specific features (as functions)

  • Every time we attach a package to a working session (library()) we are making those functions available/visible
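
For example (assuming sf is installed but not yet attached in a fresh session):

# before attaching, functions are reached via the namespace prefix, e.g. sf::st_area()
library(sf)            # attaching puts sf on the search path
"package:sf" %in% search()
#> [1] TRUE
exists("st_area")      # its functions are now visible by their bare names
#> [1] TRUE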

Functions are objects

  • Just like x = 10 binds the value of 10 to the name x creating an object visible in the environment,

  • functions are objects that can be called by name to execute a set of directions over defined arguments.

class(sf::st_intersects)
#> [1] "function"
class(sf::st_as_sf)
#> [1] "function"

Our own functions are visible as objects in the environment

x = 10
y = data.frame(x = 1:10, y = 10:1)
f = function(x,y){ x  + y }
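
Calling f by name executes its body over the supplied arguments; in a fresh session holding only these three objects, ls() shows them all:

f(x, 5)
#> [1] 15
ls()
#> [1] "f" "x" "y"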

Advancing your programming skills

  • One of the best ways to improve your skills as a data scientist is to write functions.

  • Functions allow you to automate common tasks in a more general way than copy-and-pasting.

  • The more times you apply a function, the more incentive you have to optimize it for speed/accuracy

  • And the more general your functions become, the more creative/unique your analyses and questions can be

So why write functions as opposed to scripts?

  • The process can be named

  • As requirements change, you only need to update code in one place

  • You eliminate the chance of making incidental mistakes when you copy and paste (forgetting to change dist_to_state to dist_to_border).

  • Functions can be ‘sourced’ into Qmd/Rmd files and R scripts

  • You save yourself time

Rule of thumb

  • Data is the first argument (better for pipes! - sketched below)

  • Whenever you have copied and pasted code more than twice, you should write a function
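
A minimal sketch of the data-first convention (top_rows is a hypothetical helper, not a package function):

top_rows = function(data, n = 3) head(data, n) # data first, detail argument after

# because data comes first, the function drops straight into a pipeline:
mtcars |> top_rows(n = 2)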

For example, how many times have we coded:

states = USAboundaries::us_states() |> 
  filter(!name %in% c("Hawaii", "Puerto Rico", "Alaska"))
  • Or made the same table with knitr/kableExtra, only changing the col.names and the data.frame?

  • Or calculated a distance, changed the units, and dropped them?

  • All of these tasks are repetitive and prone to errors that could impact our analysis without breaking our code…

The form of a function:

Creating a function follows the form:

name = function(arg1, arg2, *){
  code
  ..
  return(...)
}

Where:

  • name is the function name (e.g. st_as_sf)

    • This is the name by which R is able to call the object
  • arg1 is the first input

  • arg2 is the second input

  • * is any other argument you want to define

  • code ... defines the instructions to carry out on arg1 and arg2

  • return(...) is what the function returns
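
A tiny concrete instance of this template, using nothing beyond base R:

celsius_to_f = function(celsius, round_to = 1){
  f = celsius * 9/5 + 32      # the code operates on the arguments
  return(round(f, round_to))  # what the function hands back
}
celsius_to_f(20)
#> [1] 68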

Defining a function

  • To define a function we need to identify the code we have and decide what can/should be generalized for future uses
states = USAboundaries::us_states() |> 
  filter(!name %in% c("Hawaii", "Puerto Rico", "Alaska"))
  • Here the input data (us_states) could change

  • So could the variable name we filter by (name)

Function Signature

So, let's start with a function that takes general input data and a variable name

get_conus = function(data, var){

}

Function arguments

Function arguments typically include two broad sets:

  • the data to compute on
  • arguments that control the details of the calculation

  • In st_transform x is the data, crs is the proj4string/EPSG code

  • In ms_simplify input is the data, keep defines the details

  • In get_conus: data provides the data, var defines the column to filter

  • Generally, data arguments should come first
  • Detail arguments should go on the end
  • It can be useful - and good practice - to define default values (see the sketch below)
    • The default should almost always be the most common value
    • The exceptions to this rule are to do with the safety of the process
      • e.g. na.rm = FALSE
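
For example, a minimal sketch that mirrors base R's na.rm = FALSE convention (normalize is a made-up function for illustration):

normalize = function(x, na.rm = FALSE){ # data first, safe default last
  rng = range(x, na.rm = na.rm)
  (x - rng[1]) / (rng[2] - rng[1])
}
normalize(c(1, 5, 10))
#> [1] 0.0000000 0.4444444 1.0000000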

Code body

We then have to carry these generalizations into the function body, using the arguments in place of the hard-coded values:

get_conus = function(data, var){
  conus = filter(data, !get(var) %in% c("Hawaii", "Puerto Rico", "Alaska"))
  return(conus)
}
  • here, we replace us_states() with data

  • we use get() to return the value of a named object

  • We assign our filtered object to the name conus

  • And explicitly return the conus object from the function

  • The value returned by a function is the last evaluated statement; if we don’t specify return(), we can take advantage of this default:

get_conus = function(data, var){
  filter(data, !get(var) %in% c("Hawaii", "Puerto Rico", "Alaska"))
}
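
As a quick standalone demo of get() (inside dplyr verbs, the string is looked up among the data's columns first, which is why get("name") finds the name column):

val = 10
get("val")
#> [1] 10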

Using our function:

  • Like any object, a function must be created by running its definition before we can use it directly in our code:

  • But then …

States

x = get_conus(data = us_states(), var = "name")
plot(x$geometry)

Counties

x2 = get_conus(data = us_counties()[,-9], var = "state_name") # [,-9] drops a duplicated state_name column
plot(x2$geometry)

Cities

cities = read_csv("../labs/data/uscities.csv") |>
  st_as_sf(coords = c("lng", "lat"), crs = 4326) |> 
  get_conus("state_name")

plot(cities$geometry, pch = 16, cex = .1)

It’s ok to be more detailed

  • Another advantage of functions is that if our requirements change, we only need to change our code in one place.

  • This also means we can spend more time fine-tuning our code since we know it will be recycled.

  • Here we can be more focused and make sure to remove other potential “non-CONUS” areas from any input object:

get_conus = function(data, var){
  filter(data, !get(var) %in% 
           c("Hawaii", "Puerto Rico", "Alaska",
             "Guam", "District of Columbia"))
}

Using our function

conus = get_conus(us_states(), "name")
nrow(conus)
#> [1] 48

Point-in-Polygon Case Study

  • The power of GIS lies in analyzing multiple data sources together.

  • Often the answer you want lies in many different layers and you need to do some analysis to extract and compile information.

  • One common analysis is Points-in-Polygon (PIP).

  • PIP is useful when you want to know how many - or what kind of - points fall within the bounds of each polygon
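
Conceptually, PIP reduces to counting intersections per polygon. A toy sketch with one unit square and three points (two inside, one outside):

library(sf)
poly = st_sfc(st_polygon(list(rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0)))))
pts  = st_sfc(st_point(c(.5, .5)), st_point(c(.9, .1)), st_point(c(2, 2)))
lengths(st_intersects(poly, pts)) # number of points in each polygon
#> [1] 2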

Data

CONUS counties

counties = st_transform(us_counties()[,-9], 5070) |> # [,-9] drops a duplicated state_name column
  select(name, geoid, state_name) |> 
  get_conus("state_name")

CONUS Starbucks

starbucks = read_csv('data/directory.csv') |> 
  filter(!is.na(Latitude), Country == "US") |> 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) |> 
  st_transform(5070) |> 
  st_filter(counties) |> 
  select(store_name = `Store Name`)

Colorado Counties

co = filter(counties, state_name == "Colorado")

Step 1: Spatial Join

To count the Starbucks locations in Colorado counties, we start by joining the counties to the locations:

  • Here we use the counties as the x table and the locations as the y table

  • This is because we want to add the Starbucks information to the county sf object.

  • Remember the default of st_join is a left_join on the st_intersects predicate
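
Spelled out, those defaults look like this (equivalent to the shorter call below):

st_join(co, starbucks, join = st_intersects, left = TRUE)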

(starbucks1 = st_join(co, starbucks))
#> Simple feature collection with 511 features and 4 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -1146480 ymin: 1566911 xmax: -504612.5 ymax: 2073715
#> Projected CRS: NAD83 / Conus Albers
#> First 10 features:
#>         name geoid state_name                        store_name
#> 1       Yuma 08125   Colorado                              <NA>
#> 2   San Juan 08111   Colorado                              <NA>
#> 3       Baca 08009   Colorado                              <NA>
#> 4    Prowers 08099   Colorado                              <NA>
#> 5     Custer 08027   Colorado                              <NA>
#> 6    Fremont 08043   Colorado     City Market - Canon City #417
#> 7       Mesa 08077   Colorado    I-70 Business Loop & 32 Rd - C
#> 7.1     Mesa 08077   Colorado    Hwy 6 & 25 Rd - Grand Junction
#> 7.2     Mesa 08077   Colorado   City Market-Grand Junction #444
#> 7.3     Mesa 08077   Colorado City Market - Grand Junction #432
#>                           geometry
#> 1   MULTIPOLYGON (((-575240.5 1...
#> 2   MULTIPOLYGON (((-1042591 16...
#> 3   MULTIPOLYGON (((-617878.8 1...
#> 4   MULTIPOLYGON (((-587200.4 1...
#> 5   MULTIPOLYGON (((-847581.3 1...
#> 6   MULTIPOLYGON (((-863875.9 1...
#> 7   MULTIPOLYGON (((-1114392 18...
#> 7.1 MULTIPOLYGON (((-1114392 18...
#> 7.2 MULTIPOLYGON (((-1114392 18...
#> 7.3 MULTIPOLYGON (((-1114392 18...

Step 2: Point counts by Polygon

count() is a dplyr function that "lets you quickly count the unique values of one or more variables: df |> count(a, b) is roughly equivalent to df |> group_by(a, b) |> summarize(n = n())".
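
A quick way to verify that claim on our joined data (geometry dropped so we compare plain vectors):

a = starbucks1 |> st_drop_geometry() |> count(geoid)
b = starbucks1 |> st_drop_geometry() |> group_by(geoid) |> summarize(n = n())
all.equal(a$n, b$n)
#> [1] TRUE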

(count(starbucks1, geoid))
#> Simple feature collection with 64 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -1146480 ymin: 1566911 xmax: -504612.5 ymax: 2073715
#> Projected CRS: NAD83 / Conus Albers
#> First 10 features:
#>    geoid  n                       geometry
#> 1  08001 35 MULTIPOLYGON (((-765835.6 1...
#> 2  08003  1 MULTIPOLYGON (((-878691.4 1...
#> 3  08005 58 MULTIPOLYGON (((-769015.2 1...
#> 4  08007  1 MULTIPOLYGON (((-1004101 16...
#> 5  08009  1 MULTIPOLYGON (((-617878.8 1...
#> 6  08011  1 MULTIPOLYGON (((-640691.8 1...
#> 7  08013 36 MULTIPOLYGON (((-819541.3 1...
#> 8  08014  9 MULTIPOLYGON (((-775389.1 1...
#> 9  08015  1 MULTIPOLYGON (((-905015.4 1...
#> 10 08017  1 MULTIPOLYGON (((-616732 176...

Step 3: Combine the processes …

starbucks1 = st_join(co, starbucks) |> 
   count(geoid)

plot(starbucks1['n'])

Now for other states?

We can anticipate that PIP is a useful process we will want to apply over varying point and polygon pairs

So, let's make a function named point_in_polygon that takes a point dataset and a polygon dataset

point_in_polygon = function(points, polygon){
  st_join(polygon, points) |> 
    count(geoid)
}

Test

co_sb = point_in_polygon(starbucks, co)
plot(co_sb['n'])

ca_sb = point_in_polygon(starbucks, 
                         filter(counties, state_name == "California"))
plot(ca_sb['n'])

or_sb = point_in_polygon(starbucks, 
                         filter(counties, 
                                state_name == "Oregon"))
plot(or_sb['n'])

ny_sb = point_in_polygon(starbucks, 
                         filter(counties, 
                                state_name == "New York"))
plot(ny_sb['n'])

Generalizing the count variable

  • In its current form, point_in_polygon only counts on geoid

  • Let's modify that by making the variable name an input

    • again, we use get() to return the value of a named object
    • we call this argument var
point_in_polygon2 = function(points, polygon, var){
  st_join(polygon, points) |> 
    count(get(var))
}

Applying the new function

  • Here, we can apply the PIP function over states and count by name
states = get_conus(us_states(), "name") |> 
  st_transform(5070)

state_sb = point_in_polygon2(starbucks, states, 'name')

plot(state_sb['n'])

Optimizing functions

  • Let's apply our function over the counties and see how long it takes

  • We can check the time it takes by wrapping our function in system.time

system.time({
  us = point_in_polygon(starbucks, counties)
})

# user    system  elapsed 
# 3.719   0.354   4.309 

That's not bad…

# How many seconds per point?
(sec_per_point = 4.3 / (nrow(counties) * nrow(starbucks)))
#> [1] 1.036916e-07
# How will this scale to our dams data
# (assuming the process is linear)

sec_per_point * (nrow(counties) * 91000)
#> [1] 29.31745
  • ~30 seconds to test ~282,100,000 point/polygon relations is not bad, but could be a bottleneck in an analysis

  • Let's look at a common means for improvement…

To keep geometry or not?

  • Remember, our geometries are sticky; they carry through all calculations - whether they are needed or not

  • We can avoid a lot of computational overhead by being mindful of when we retain our geometry data with our attribute data.

system.time({
  st_join(counties, starbucks) |> 
    count(geoid) 
})

#user    system  elapsed 
#3.970   0.421   5.521
system.time({
  st_join(counties, starbucks) |> 
    st_drop_geometry() |> 
    count(geoid) |> 
    left_join(counties, by = 'geoid') |> 
    st_as_sf() 
})

# user    system  elapsed 
# 0.396   0.017   0.598

# How many seconds per point?
(sec_per_point = .598 / (nrow(counties) * nrow(starbucks)))
#> [1] 1.442037e-08

# How will this scale to our dams data
# (assuming the process is linear)

sec_per_point * (nrow(counties) * 91000)
#> [1] 4.077171

Awesome!

Effectively an 86% decrease in the time needed ((29 - 4) / 29 ≈ 0.86)

“Function-izing” our improvements

point_in_polygon3 = function(points, polygon, var){
  st_join(polygon, points) |> 
    st_drop_geometry() |>            # drop the sticky geometry before counting
    count(get(var)) |> 
    setNames(c(var, "n")) |>         # count() names the column "get(var)"; restore the real name
    left_join(polygon, by = var) |>  # re-attach the attributes and geometry
    st_as_sf() 
}

What else can we wrap?

  • What about a really nice, clean, informative plot?

  • ggplots look great but can be time-consuming to program…

  • A function would allow us to take care of the groundwork

plot_pip = function(data){
  ggplot() + 
    geom_sf(data = data, aes(fill = log(n)), alpha = .9, size = .2) + 
    scale_fill_gradient(low = "white", high = "darkgreen") + 
    theme_void() + 
    theme(legend.position = 'none',
          plot.title = element_text(face = "bold", color = "darkgreen", hjust = .5, size = 24)) +
    labs(title = "Starbucks Locations",
         caption = paste0(sum(data$n), " locations represented")) 
}
  • This is great because we can devote time to making a nice plot, knowing we will be able to recycle the work over other cases…

Test

point_in_polygon3(starbucks, filter(counties, state_name == "California"), "geoid") |> 
plot_pip()

point_in_polygon3(starbucks, filter(counties, state_name == "New York"), "geoid") |> 
plot_pip()

point_in_polygon3(starbucks, counties, "geoid") |> 
  plot_pip()

point_in_polygon3(starbucks, states, var = "name") |> 
  plot_pip()

Moving beyond Starbucks?

  • Here is a nice tutorial on point-in-polygon analysis in QGIS, looking at earthquakes by country.

  • Just like us, they use Natural Earth data for the polygon areas

  • And they are looking at earthquake data maintained by NOAA.

  • In R, we can read the NOAA data directly from a URL (here, we use a downloaded copy).

  • The data is a tab-delimited text file, so we use readr::read_delim()
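
The same call also works with a remote file, since readr accepts URLs in place of local paths (the URL below is a placeholder, not the real NOAA endpoint):

# quakes_raw = read_delim("https://example.org/earthquakes.tsv", delim = "\t")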

sf::sf_use_s2(FALSE) # fall back to planar (GEOS) geometry for these global layers
countries = st_as_sf(rnaturalearth::countries110)

quakes = 'data/earthquakes-2025-03-29_21-55-22_-0600.tsv' |> 
  read_delim(delim = "\t") |> 
  filter(!is.na(Latitude), !is.na(Longitude)) |> 
  st_as_sf(coords = c("Longitude", "Latitude"), crs = 4326) |>     
  st_transform(st_crs(countries))

nrow(countries)
#> [1] 177
nrow(quakes)
#> [1] 6443
nrow(st_intersection(quakes, countries))
#> [1] 4317

PIP -> Plotting Method

  • We can use our functions right out of the box for this data

  • But… some things are not quite right…

point_in_polygon3(quakes, countries, var = "ADMIN") |> 
  plot_pip() 

Modify for our analysis …

point_in_polygon3(quakes, countries, var = "ADMIN") |> 
  plot_pip() + 
  labs(title = "Earthquake Locations") + 
  scale_fill_viridis_c() + 
  geom_sf(data = quakes, size = .25, alpha = .05, col = 'red')

Improve the analysis…

point_in_polygon3(quakes, countries, var = "ADMIN") |> 
  plot_pip() + 
  labs(title = "Earthquake Locations",
       subtitle = "Most impacted countries") + 
  theme(plot.subtitle = element_text(hjust = .5),
        plot.title = element_text(color = "navy")) + 
  scale_fill_viridis_c() + 
  geom_sf(data = quakes, size = .3, alpha = .05, col = 'red') +
  gghighlight::gghighlight(n > (mean(n) + sd(n)))