Week 2

Vector Data — Features, Projections & Measures

Today’s Arc

From Tabular to Spatial

Last week: data as rows and columns — wrangling, joining, visualizing, modeling.

This week: data as features in space — the same verbs, but now every observation has a geometry that describes where it is and what shape it takes.

Today’s arc:

Simple Features — the standard and the R implementation
Spatial Computing: The Stack — GDAL, PROJ, GEOS (the C++ libraries under sf)
Coordinate Systems — GCS, datums, projections
Measurements — distance, area, units

(We’ll take a 5-minute break after Simple Features.)

R Objects and Classes

Why Classes Matter: One Data Structure, Many Behaviors

You’ve already used S3 classes in tidyverse without noticing. The same object type can behave differently based on its class.

Understanding S3 classes in R explains:

Why print() on a tibble looks different than a regular data.frame

Why filter() on grouped data filters within each group instead of globally

Why summarize() on a grouped_df produces a different output shape

This is the callback pattern — R looks at an object’s class and dispatches to the right function. Later, sf will extend data.frame with spatial methods that work the same way.

Adding Class: Matrix → Data Frame

A class is metadata attached to an object telling R how to interpret and operate on it:

# Start with a matrix — just numbers in rows and columns
v <- c(1,2,3,4,5,6)
class(v)
#> [1] "numeric"
m <- matrix(v, nrow = 3)
class(m)
#> [1] "matrix" "array"
m
#>      [,1] [,2]
#> [1,]    1    4
#> [2,]    2    5
#> [3,]    3    6

# Add class metadata: "this is a data frame"
df <- as.data.frame(m)
class(df)
#> [1] "data.frame"
df
#>   V1 V2
#> 1  1  4
#> 2  2  5
#> 3  3  6

Same data. But now df has column names, printed summary, and methods that respect rectangular structure. The class changed the behavior without changing the underlying values.

Class Hierarchies: Adding Layers

Objects can have multiple classes in a hierarchy. Adding structure adds more classes:

# Plain data.frame
df <- data.frame(gauge = c("A", "B", "C"), flow = c(100, 120, 95))
class(df)
#> [1] "data.frame"

# Add grouping — now it has *two* classes
gdf <- group_by(df, gauge)
class(gdf)
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

# Methods are dispatched: dplyr functions check for "grouped_df" first
slice(gdf, 1)  # respects grouping — one row per group, not one global row
#> # A tibble: 3 × 2
#> # Groups:   gauge [3]
#>   gauge  flow
#>   <chr> <dbl>
#> 1 A       100
#> 2 B       120
#> 3 C        95
slice(df, 1)   # no grouping — just one row total
#>   gauge flow
#> 1     A  100

The key insight: the same function behaves differently based on class. This is the callback mechanism. When you call slice(), R internally asks: “What class is this object?” and dispatches to the right implementation.

Methods: Functions That Know Their Object’s Class

Every function that respects class has methods or versions specialized for each class:

# Base R example: print() has different methods
print.data.frame()   # specialized for data.frames
print.default()      # fallback for other objects

# You can list all methods for a class:
methods(class = "data.frame")   # all functions with .data.frame versions

select(), filter(), mutate() are dplyr verbs that have methods:

methods("filter")    # shows filter.data.frame, filter.grouped_df, filter.ts, etc.
#> [1] filter.data.frame* filter.sf*         filter.ts*        
#> see '?methods' for accessing help and source code

When you call filter() on a grouped dataframe, R internally does:

Check: “What class is this object?” (grouped_df)
Look for filter.grouped_df()
If it exists, call it; if not, fall back to filter.data.frame()
The filter.grouped_df() method applies the condition within each group — the key behavior

Compare the same code on grouped vs ungrouped data:

df <- data.frame(site = c("A", "A", "B", "B"), flow = c(100, 120, 95, 110))

# On ungrouped data: keeps rows where flow > 105
df |> filter(flow > 105)
#>   site flow
#> 1    A  120
#> 2    B  110

# On grouped data: keeps earliest row per group where flow > 105
group_by(df, site) |> filter(flow > 105)
#> # A tibble: 2 × 2
#> # Groups:   site [2]
#>   site   flow
#>   <chr> <dbl>
#> 1 A       120
#> 2 B       110

The function is identical; the method dispatched changes based on class.

Spatial Computing: The Stack

The Dependencies Under `sf`

sf is powerful because it’s not built from scratch — it wraps three long standing C/C++ libraries, each with a focused job:

The Spatial Stack

Your R code:

sf::st_read("data.shp")      # I/O
sf::st_distance(a, b)        # Geometry
sf::st_transform(x, 4269)    # CRS

📦 GDAL

File I/O

Format translators
Reads/writes Shapefile, GeoJSON, GeoPackage, GeoTIFF
Returns WKB geometry

🧭 PROJ

Coordinate Systems

Transforms between 10,000+ CRS definitions
Handles datum shifts & projections
Underlies st_transform()

⚙️ GEOS

Geometry Ops

Intersection, union, buffer, centroid
Distance calculations
Spatial predicates (st_intersects())

Why This Matters

Each library solves a hard problem:

Library	Solves	Water Resources Example
GDAL	“How do I read a shapefile or GeoPackage?”	Read NHD flowlines, USGS boundaries, stream gauges
PROJ	“How do I convert WGS84 to Albers?”	Transform NWIS gauges from GPS coordinates to federal projection
GEOS	“Is this point inside this polygon? What’s the distance?”	Snap gauge to nearest flowline, clip to basin

If we had to write these from scratch in R, it would take years. Instead, spatial computing in R benefits from decades of work by the OSGeo community.

The `sf` Wrapper

sf::sf_extSoftVersion()
#>           GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H 
#>       "3.14.1"       "3.12.2"        "9.7.1"         "true"         "true" 
#>           PROJ 
#>        "9.7.1"

sf doesn’t reinvent these wheels. It:

Calls GDAL for I/O (st_read(), st_write())
Calls PROJ for CRS transformations (st_transform())
Calls GEOS for geometry operations (st_distance(), st_buffer(), spatial predicates)
Wraps results with S3 classes and methods so dplyr verbs (filter(), mutate(), select()) just work — same as tidyverse

This diagram shows the relationship: sf methods dispatch to GEOS for the heavy lifting, then return R objects.

Note

When you install the sf package on your machine, the R package manager compiles and links against GDAL, PROJ, and GEOS. If installation fails, it’s usually because one of these C/C++ libraries is missing from your system. This is why setting up spatial computing can be more complex than installing a pure-R package.

What Is GDAL?

GDAL = Geospatial Data Abstraction Layer

A C++ library (developed by OSGeo, used by every GIS software) that:

Abstracts file formats — speaks shapefile, GeoJSON, GeoPackage, etc., but exposes the same API to your program
Translates between formats — convert any format to any other with one command
Manages drivers — each format has a “driver” (code that knows how to read/write that specific format)
Handles metadata — CRS, bounds, layer names, attribute types

# GDAL (C++ library) <- drivers (one per format)
#     ├── Shapefile driver
#     ├── GeoJSON driver
#     ├── GeoPackage driver
#     ├── GeoTIFF driver
#     ├── NetCDF driver
#     └── ... 200+ more

When you call st_read("data.shp"), here’s what happens:

R/sf calls GDAL: “Read this file”
GDAL looks at the extension (.shp)
GDAL finds the Shapefile driver
The driver reads the .shp and its sidecars
GDAL returns data to sf
sf wraps it in an sfc geometry + sf object structure and hands it to your R code

Note

You don’t write GDAL code directly. But st_read() and st_write() are thin wrappers around GDAL. Understanding this explains why st_read() “just works” on 50+ formats and why format conversions are trivial in R but painful elsewhere.

GDAL Drivers and Water Resources

sf::st_drivers() # shows Format, type (vector/raster), open, write

Key drivers for water resources:

Driver	Format	Extension	Common Use
ESRI Shapefile	Shapefile	.shp	USGS NHD, NOAA, boundaries
GeoJSON	GeoJSON	.geojson	Google APIs, web services
GPKG	GeoPackage	.gpkg	NHDPlus HR, modern data
GTiff	GeoTIFF	.tif	USGS 3DEP elevation, imagery
netCDF	NetCDF	.nc	NOAA climate grids
HDF5	HDF5	.h5	NASA satellite data

Important

GeoPackage replaces Shapefiles: one file, complete metadata, SQLite-based. Shapefiles require sidecars (.shx, .dbf, .prj) and lack metadata. Use GeoPackage when possible.

How Core Libraries Pass Data to R: WKT and WKB

When GDAL reads a file or GEOS performs an operation, it uses two standard encodings to pass geometry to R:

Well-Known Text (WKT) — human-readable (you’ll see this in error messages and discussions):

x <- st_linestring(matrix(10:1, 5))
st_as_text(x)
#> [1] "LINESTRING (10 5, 9 4, 8 3, 7 2, 6 1)"

LINESTRING = type; parentheses contain points; points separated by commas; coordinates separated by spaces

Well-Known Binary (WKB) — machine-readable, used internally by GDAL/GEOS for speed and lossless communication:

st_as_binary(x)
#>  [1] 01 02 00 00 00 05 00 00 00 00 00 00 00 00 00 24 40 00 00 00 00 00 00 14 40
#> [26] 00 00 00 00 00 00 22 40 00 00 00 00 00 00 10 40 00 00 00 00 00 00 20 40 00
#> [51] 00 00 00 00 00 08 40 00 00 00 00 00 00 1c 40 00 00 00 00 00 00 00 40 00 00
#> [76] 00 00 00 00 18 40 00 00 00 00 00 00 f0 3f

Note

You rarely write WKT/WKB directly, but understanding they exist explains why st_read() “just works”—GDAL converts any format to WKB, sf wraps it as an R object, and the roundtrip is seamless.

Simple Features

Today’s Data: Colorado Counties

co
#> Simple feature collection with 64 features and 4 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -109.0602 ymin: 36.99246 xmax: -102.0415 ymax: 41.00342
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>    geoid       name      aland state_nm                       geometry
#> 1  08001      Adams 3021840487 Colorado MULTIPOLYGON (((-105.0532 3...
#> 2  08003    Alamosa 1871643028 Colorado MULTIPOLYGON (((-105.4855 3...
#> 3  08005   Arapahoe 2066438714 Colorado MULTIPOLYGON (((-103.7065 3...
#> 4  08007  Archuleta 3496712164 Colorado MULTIPOLYGON (((-107.1287 3...
#> 5  08009       Baca 6617400567 Colorado MULTIPOLYGON (((-102.0416 3...
#> 6  08011       Bent 3918255148 Colorado MULTIPOLYGON (((-102.7476 3...
#> 7  08013    Boulder 1881325109 Colorado MULTIPOLYGON (((-105.3978 3...
#> 8  08014 Broomfield   85386685 Colorado MULTIPOLYGON (((-105.1092 3...
#> 9  08015    Chaffee 2624715692 Colorado MULTIPOLYGON (((-105.9698 3...
#> 10 08017   Cheyenne 4605713960 Colorado MULTIPOLYGON (((-103.1729 3...

What Is a Simple Feature?

Simple features (ISO 19125-1:2004) is a standard describing how real-world objects can be represented in computers, with emphasis on their spatial geometry
It defines:
1. A class hierarchy of geometry types
2. A set of operations on those geometries
3. Binary and text encodings for storage and transfer

Widely implemented: PostGIS, ESRI ArcGIS, GDAL, GeoJSON — and sf in R
A subset of simple features (the big 7) forms the GeoJSON specification

What Is a Feature?

A feature is a thing in the real world — a river, a gauge, a watershed, a county.

Features often consist of other features:
- A river system → a river → an outlet point
- A watershed → sub-basins → individual reaches

The standard says: “A simple feature has both spatial and non-spatial attributes. Spatial attributes are geometry valued.”

The geometry describes where the feature is and how it is represented
A river’s geometry could be its watershed polygon, its mainstem linestring, or its outlet point — depending on the question

str(co)
#> Classes 'sf' and 'data.frame':   64 obs. of  5 variables:
#>  $ geoid   : chr  "08001" "08003" "08005" "08007" ...
#>  $ name    : chr  "Adams" "Alamosa" "Arapahoe" "Archuleta" ...
#>  $ aland   : num  3.02e+09 1.87e+09 2.07e+09 3.50e+09 6.62e+09 ...
#>  $ state_nm: chr  "Colorado" "Colorado" "Colorado" "Colorado" ...
#>  $ geometry:sfc_MULTIPOLYGON of length 64; first list element: List of 1
#>   ..$ :List of 1
#>   .. ..$ : num [1:132, 1:2] -105 -105 -105 -105 -105 ...
#>   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
#>  - attr(*, "sf_column")= chr "geometry"
#>  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
#>   ..- attr(*, "names")= chr [1:4] "geoid" "name" "aland" "state_nm"

POINT

A point is composed of one coordinate pair in a specific coordinate system.

Points have no length or area
Example: a gauge location, a dam, a city

ggplot() + 
  geom_point(aes(x = c(1, 2, 3), y = c(3, 1, 2)), size = 3) + 
  labs(x = "Longitude", y = "Latitude") +
  theme_linedraw()

LINESTRING

A linestring is composed of an ordered sequence of two or more coordinate points.

Points in a line are called vertices and explicitly define the connection between two points
A line has length, but no area
Example: a stream centerline, a river reach, a road

ggplot() + 
  geom_line(aes(x = c(1, 2, 3), y = c(3, 1, 2)), linewidth = 1) + 
  geom_point(aes(x = c(1, 2, 3), y = c(3, 1, 2)), col = "red", size = 3) + 
  labs(x = "Longitude", y = "Latitude") +
  theme_linedraw()

POLYGON

A polygon is composed of 4 or more points whose starting and ending point are the same.

Polygons enclose a closed ring (first point = last point)
Polygons have both length and area
Example: a watershed boundary, a county border, a floodplain

ggplot() + 
  geom_polygon(aes(x = c(1, 2, 3, 1), y = c(3, 1, 2, 3)), fill = "steelblue", alpha = 0.5, color = "black") + 
  geom_point(aes(x = c(1, 2, 3), y = c(3, 1, 2)), col = "red", size = 3) +
  labs(x = "Longitude", y = "Latitude") +
  theme_linedraw()

Multi-Geometries and Collections

Multi- variants are just sets of the same type:

MULTIPOINT = set of points
MULTILINESTRING = set of lines (e.g., a braided river system)
MULTIPOLYGON = set of polygons (e.g., an island chain, a fragmented watershed)

GEOMETRYCOLLECTION = a set of mixed geometry types (rare, usually from intersections)

The Big 7 Geometry Types

Single	Description
`POINT`	Zero-dimensional — a single coordinate pair
`LINESTRING`	Sequence of points connected by straight line pieces — 1D
`POLYGON`	Closed ring with positive area — 2D; may have interior holes

Multi (same type)	Description
`MULTIPOINT`	Set of points
`MULTILINESTRING`	Set of linestrings
`MULTIPOLYGON`	Set of polygons

Mixed	Description
`GEOMETRYCOLLECTION`	Set of geometries of any type

Note

These 7 types cover virtually everything in water resources work: gauge locations (POINT), stream reaches (LINESTRING), watershed boundaries (POLYGON), and their multi-feature equivalents.

Dimensions

All geometries are composed of points defined in 2-, 3-, or 4-D space:

XY — standard 2D coordinates (longitude, latitude or easting, northing)
XYZ — adds altitude (elevation of stream bed, DEM surface)
XYM — adds a measure attribute independent of the feature (e.g., streamflow at a vertex — rarely used)
XYZM — both Z and M

Note: “dimension” here means coordinate dimension (XY vs XYZ), not whether the geometry has length or area:

m1 <- rbind(c(8,1), c(2,5), c(3,2))
st_dimension(st_multipoint(m1))    # 0 — point set, no length
#> [1] 0
st_dimension(st_linestring(m1))    # 1 — has length, no area
#> [1] 1

Valid Geometries

Rules for valid geometries:

LINESTRING — no self-intersections
POLYGON — rings must be closed
POLYGON — holes inside exterior only
POLYGON — rings touch at points only
POLYGON — no repeated paths

Warning

Run st_is_valid() to check your data. Fix invalid geometries with st_make_valid().

Invalid Examples

Self-intersecting line:

x1 <- st_linestring(cbind(c(0,1,0,1), 
                          c(0,1,1,0)))
st_is_simple(x1)  # FALSE

Invalid polygon (crossing edges):

x2 <- st_polygon(list(cbind(
  c(0,1,1,1,0,0), 
  c(0,0,1,0.6,1,0))))
st_is_valid(x2)  # FALSE

These operations fail silently unless checked.

Empty Geometries

EMPTY geometries serve as “NA” placeholders for geometry types — they arise naturally from spatial operations:

(e <- st_intersection(st_point(c(0,0)), st_point(c(1,1))))
#> GEOMETRYCOLLECTION EMPTY
st_is_empty(e)
#> [1] TRUE

plot(st_point(c(0,0)), pch = 16)
plot(st_point(c(1,1)), add = TRUE, col = 'red', pch = 16)

sfg → sfc → sf

Storing Geometry: Nested Lists

Remember nested lists? A geometry is just coordinates:

# One geometry: a pair of coordinates
c(x = 1, y = 2)

# Many geometries: nested list of coordinate pairs
list(list(c(1, 2)), list(c(3, 4)), list(c(5, 6)))

R stores multiple geometries side-by-side this way: Each row in a data.frame holds one geometry (which is a list of coordinates). This creates a geometry column — many rows, each with its own nested geometry.

One row = one feature (with attributes + geometry)
Each row’s geometry = one nested coordinate structure
All rows together = a geometry list-column

Three Classes, One Object

🟢 sf — a data.frame with feature attributes and a geometry column (class: sf)
🔴 sfc — the geometry list-column (set of geometries)
🔵 sfg — individual simple feature geometry (one geometry per feature)

Important

This is the S3 class hierarchy in action. sf is a subclass of data.frame adding the geometry column and specialized methods. When you call filter() on an sf object, R looks at its class and dispatches to the spatial-aware filter.sf() method, which respects the geometry column. Same pattern as grouped_df from tidyverse.

`sfg`: Simple Feature Geometry

sfg objects carry the geometry for a single feature. Storage rules:

A single POINT → numeric vector
Points in a LINESTRING or POLYGON ring → matrix (each row = a point)
Any other set → list

Creator functions (rarely used in practice — you usually read existing data):

(x <- st_point(c(1, 2)))
str(x)
#>  'XY' num [1:2] 1 2

(x <- st_linestring(matrix(c(1,2,3,4), ncol = 2)))
str(x)
#>  'XY' num [1:2, 1:2] 1 2 3 4

`sfg`: Class Structure

All geometry objects have an S3 class indicating (1) dimension, (2) type, (3) superclass:

(pt  <- st_point(c(0, 1)))
attributes(pt)
#> $class
#> [1] "XY"    "POINT" "sfg"

(pt2 <- st_point(c(0, 1, 4)))   # XYZ
attributes(pt2)
#> $class
#> [1] "XYZ"   "POINT" "sfg"

`sfg`: Same Points, Different Meaning

The same coordinate matrix creates geometries with entirely different dimensionality:

m1 <- rbind(c(8,1), c(2,5), c(3,2))

(mp <- st_multipoint(m1))   # zero-dimensional point set
(ls <- st_linestring(m1))   # one-dimensional line

st_dimension(mp); st_length(mp)
#> [1] 0
#> [1] 0
st_dimension(ls); st_length(ls)
#> [1] 1
#> [1] 10.37338

Mixed Geometries and Collections

When features have different geometry types, the type becomes GEOMETRY.

Example: A collection with POINT + LINESTRING:

g <- st_sfc(st_point(c(0, 0)), 
            st_linestring(rbind(c(0, 0), c(1, 1))))

Filter by type with st_is():

st_sf(geometry = g) |>
  filter(st_is(geometry, "LINESTRING"))   # keep only lines
#> Simple feature collection with 1 feature and 0 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
#> CRS:           NA
#>                geometry
#> 1 LINESTRING (0 0, 1 1)

When intersections produce mixed types, use st_collection_extract():

pt   <- st_point(c(1, 0))
ls   <- st_linestring(matrix(c(4, 3, 0, 0), ncol = 2))
poly <- st_polygon(list(matrix(c(5.5, 7, 7, 6, 5.5, 0, 0, -0.5, -0.5, 0), ncol = 2)))
collection <- st_geometrycollection(list(pt, ls, poly))

st_collection_extract(collection, "POLYGON")  # extract only polygons

Conversion Between Types: `st_cast`

(co1   <- filter(co, name == "Larimer")$geometry)
#> Geometry set for 1 feature 
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -106.1954 ymin: 40.25778 xmax: -104.9431 ymax: 40.99844
#> Geodetic CRS:  WGS 84
(co_ls <- st_cast(co1, "MULTILINESTRING"))
#> Geometry set for 1 feature 
#> Geometry type: MULTILINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: -106.1954 ymin: 40.25778 xmax: -104.9431 ymax: 40.99844
#> Geodetic CRS:  WGS 84

Casting: Feature vs. Set Level

Casting operates at the feature level, not the set level:

rbind(c(0,0), c(1,1), c(1,0), c(0,1)) |>
  st_linestring() |>
  st_cast("POINT")   # ⚠️ converts ONE line to ONE point — not what you want
#> Warning in st_cast.LINESTRING(st_linestring(rbind(c(0, 0), c(1, 1), c(1, :
#> point from first coordinate only
#> POINT (0 0)

To extract individual points, work at the set level (sfc):

(p <- rbind(c(0,0), c(1,1), c(1,0), c(0,1)) |>
   st_linestring() |>
   st_sfc() |>           # create a SET of one geometry
   st_cast("POINT"))     # now produces 4 individual points
#> Geometry set for 4 features 
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
#> CRS:           NA

Sets of geometries

rbind(c(0,0), c(1,1), c(1,0), c(0,1))

#>      [,1] [,2]
#> [1,]    0    0
#> [2,]    1    1
#> [3,]    1    0
#> [4,]    0    1

Sets of geometries

rbind(c(0,0), c(1,1), c(1,0), c(0,1)) |>
  st_linestring()

Sets of geometries

rbind(c(0,0), c(1,1), c(1,0), c(0,1)) |>
  st_linestring() |>
  st_sfc()

#> Geometry set for 1 feature 
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
#> CRS:           NA

Sets of geometries

rbind(c(0,0), c(1,1), c(1,0), c(0,1)) |>
  st_linestring() |>
  st_sfc() |>
  st_cast("POINT")

#> Geometry set for 4 features 
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
#> CRS:           NA

Sets of geometries

rbind(c(0,0), c(1,1), c(1,0), c(0,1)) |>
  st_linestring() |>
  st_sfc() |>
  st_cast("POINT") ->
  p

Combining and Dissolving Geometries

Two key operations for working with geometry sets:

st_combine() — joins geometries while preserving interior boundaries:

(co_combined <- st_combine(co$geometry))
#> Geometry set for 1 feature 
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -109.0602 ymin: 36.99246 xmax: -102.0415 ymax: 41.00342
#> Geodetic CRS:  WGS 84

st_union() — joins geometries and dissolves interior boundaries (merges into one feature):

(co_unioned <- st_union(co$geometry))
#> Geometry set for 1 feature 
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -109.0602 ymin: 36.99246 xmax: -102.0415 ymax: 41.00342
#> Geodetic CRS:  WGS 84

Visualizing Combine vs. Union

st_combine (boundaries preserved):

st_union (boundaries dissolved):

Water resources example: Use st_combine() to keep sub-basins separate, use st_union() to create a single merged basin outline.

`sfc`: Geometry List-Columns

st_sfc creates sets of geometries — what becomes the sticky list-column in sf objects:

(sfc <- st_sfc(st_point(c(0,1)), st_point(c(-3,2))))
#> Geometry set for 2 features 
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -3 ymin: 1 xmax: 0 ymax: 2
#> CRS:           NA
class(sfc)
#> [1] "sfc_POINT" "sfc"
attributes(sfc) |> names()
#> [1] "class"     "precision" "bbox"      "crs"       "n_empty"

The sfc object records for the whole set: precision, bounding box, CRS, and count of empty geometries.

Combine + Cast: From Boundaries to Lines

You can combine geometries and then cast them to a different type — useful for creating line networks from polygon boundaries:

co_borders_combined <- st_combine(co$geometry) |> st_cast("MULTILINESTRING")
co_borders_unioned  <- st_union(co$geometry) |> st_cast("MULTILINESTRING")

Tip

This distinction matters constantly in water resources: county boundaries vs. watershed boundaries, stream reaches vs. the river network, individual gauge catchments vs. the basin outline.

CONUS: Combine vs. Union

us_c_ml <- st_combine(conus) |> st_cast("MULTILINESTRING")
us_u_ml <- st_union(conus)   |> st_cast("MULTILINESTRING")

Distance to Denver: Thinking About Feature Level

Question: how far is Denver from the nearest state border?

denver_sf <- data.frame(y = 39.7392, x = -104.9903, name = "Denver") |>
  st_as_sf(coords = c("x", "y"), crs = 4326)

Attempt 1: distance to states (polygons)

conus

#> Simple feature collection with 49 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>    state_region state_division feature_code           state_name state_abbr
#> 1             3              6      1779775              Alabama         AL
#> 2             4              8      1779777              Arizona         AZ
#> 3             3              7      0068085             Arkansas         AR
#> 4             4              9      1779778           California         CA
#> 5             4              8      1779779             Colorado         CO
#> 6             1              1      1779780          Connecticut         CT
#> 7             3              5      1779781             Delaware         DE
#> 8             3              5      1702382 District of Columbia         DC
#> 9             3              5      0294478              Florida         FL
#> 10            3              5      1705317              Georgia         GA
#>                    name fip_class tiger_class combined_area_code
#> 1               Alabama      <NA>       G4000                 NA
#> 2               Arizona      <NA>       G4000                 NA
#> 3              Arkansas      <NA>       G4000                 NA
#> 4            California      <NA>       G4000                 NA
#> 5              Colorado      <NA>       G4000                 NA
#> 6           Connecticut      <NA>       G4000                 NA
#> 7              Delaware      <NA>       G4000                 NA
#> 8  District of Columbia      <NA>       G4000                 NA
#> 9               Florida      <NA>       G4000                 NA
#> 10              Georgia      <NA>       G4000                 NA
#>    metropolitan_area_code functional_status    land_area  water_area fip_code
#> 1                    <NA>                 A 131175477769  4591897964       01
#> 2                    <NA>                 A 294363973043   855871553       04
#> 3                    <NA>                 A 134660767709  3121950081       05
#> 4                    <NA>                 A 403671756816 20293573058       06
#> 5                    <NA>                 A 268418796417  1185716938       08
#> 6                    <NA>                 A  12541690473  1816424193       09
#> 7                    <NA>                 A   5046731559  1399179670       10
#> 8                    <NA>                 A    158316124    18709762       11
#> 9                    <NA>                 A 138961722096 45972570361       12
#> 10                   <NA>                 A 149486624386  4418360134       13
#>                          geometry
#> 1  MULTIPOLYGON (((-85.4883 30...
#> 2  MULTIPOLYGON (((-110.7507 3...
#> 3  MULTIPOLYGON (((-90.95577 3...
#> 4  MULTIPOLYGON (((-116.1062 3...
#> 5  MULTIPOLYGON (((-105.155 36...
#> 6  MULTIPOLYGON (((-72.5279 41...
#> 7  MULTIPOLYGON (((-75.13846 3...
#> 8  MULTIPOLYGON (((-77.00244 3...
#> 9  MULTIPOLYGON (((-83.10874 2...
#> 10 MULTIPOLYGON (((-81.09538 3...

Attempt 1: distance to states (polygons)

conus |>
  select(state_name)

#> Simple feature collection with 49 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>              state_name                       geometry
#> 1               Alabama MULTIPOLYGON (((-85.4883 30...
#> 2               Arizona MULTIPOLYGON (((-110.7507 3...
#> 3              Arkansas MULTIPOLYGON (((-90.95577 3...
#> 4            California MULTIPOLYGON (((-116.1062 3...
#> 5              Colorado MULTIPOLYGON (((-105.155 36...
#> 6           Connecticut MULTIPOLYGON (((-72.5279 41...
#> 7              Delaware MULTIPOLYGON (((-75.13846 3...
#> 8  District of Columbia MULTIPOLYGON (((-77.00244 3...
#> 9               Florida MULTIPOLYGON (((-83.10874 2...
#> 10              Georgia MULTIPOLYGON (((-81.09538 3...

Attempt 1: distance to states (polygons)

conus |>
  select(state_name) %>%
  mutate(dist = st_distance(., denver_sf))

#> Simple feature collection with 49 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>              state_name                       geometry          dist
#> 1               Alabama MULTIPOLYGON (((-85.4883 30... 1571007.9 [m]
#> 2               Arizona MULTIPOLYGON (((-110.7507 3...  466602.9 [m]
#> 3              Arkansas MULTIPOLYGON (((-90.95577 3...  975519.6 [m]
#> 4            California MULTIPOLYGON (((-116.1062 3... 1000950.5 [m]
#> 5              Colorado MULTIPOLYGON (((-105.155 36...       0.0 [m]
#> 6           Connecticut MULTIPOLYGON (((-72.5279 41... 2636635.7 [m]
#> 7              Delaware MULTIPOLYGON (((-75.13846 3... 2485997.1 [m]
#> 8  District of Columbia MULTIPOLYGON (((-77.00244 3... 2388920.9 [m]
#> 9               Florida MULTIPOLYGON (((-83.10874 2... 1847447.4 [m]
#> 10              Georgia MULTIPOLYGON (((-81.09538 3... 1788781.6 [m]

Attempt 1: distance to states (polygons)

conus |>
  select(state_name) %>%
  mutate(dist = st_distance(., denver_sf)) |>
  slice_min(dist, n = 3)

#> Simple feature collection with 3 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -111.0546 ymin: 36.99246 xmax: -95.30829 ymax: 45.00582
#> Geodetic CRS:  WGS 84
#>   state_name         dist                       geometry
#> 1   Colorado      0.0 [m] MULTIPOLYGON (((-105.155 36...
#> 2    Wyoming 140002.6 [m] MULTIPOLYGON (((-106.3212 4...
#> 3   Nebraska 161243.2 [m] MULTIPOLYGON (((-95.93779 4...

Attempt 1: distance to states (polygons)

conus |>
  select(state_name) %>%
  mutate(dist = st_distance(., denver_sf)) |>
  slice_min(dist, n = 3) ->
  state_data

NOTE: the `%>% pipe needed here …
Distance to Colorado is 0 — Denver is inside a polygon, so distance = 0
Polygons describe areas — distance to the interior is always 0

Attempt 2: cast to lines first

conus

#> Simple feature collection with 49 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>    state_region state_division feature_code           state_name state_abbr
#> 1             3              6      1779775              Alabama         AL
#> 2             4              8      1779777              Arizona         AZ
#> 3             3              7      0068085             Arkansas         AR
#> 4             4              9      1779778           California         CA
#> 5             4              8      1779779             Colorado         CO
#> 6             1              1      1779780          Connecticut         CT
#> 7             3              5      1779781             Delaware         DE
#> 8             3              5      1702382 District of Columbia         DC
#> 9             3              5      0294478              Florida         FL
#> 10            3              5      1705317              Georgia         GA
#>                    name fip_class tiger_class combined_area_code
#> 1               Alabama      <NA>       G4000                 NA
#> 2               Arizona      <NA>       G4000                 NA
#> 3              Arkansas      <NA>       G4000                 NA
#> 4            California      <NA>       G4000                 NA
#> 5              Colorado      <NA>       G4000                 NA
#> 6           Connecticut      <NA>       G4000                 NA
#> 7              Delaware      <NA>       G4000                 NA
#> 8  District of Columbia      <NA>       G4000                 NA
#> 9               Florida      <NA>       G4000                 NA
#> 10              Georgia      <NA>       G4000                 NA
#>    metropolitan_area_code functional_status    land_area  water_area fip_code
#> 1                    <NA>                 A 131175477769  4591897964       01
#> 2                    <NA>                 A 294363973043   855871553       04
#> 3                    <NA>                 A 134660767709  3121950081       05
#> 4                    <NA>                 A 403671756816 20293573058       06
#> 5                    <NA>                 A 268418796417  1185716938       08
#> 6                    <NA>                 A  12541690473  1816424193       09
#> 7                    <NA>                 A   5046731559  1399179670       10
#> 8                    <NA>                 A    158316124    18709762       11
#> 9                    <NA>                 A 138961722096 45972570361       12
#> 10                   <NA>                 A 149486624386  4418360134       13
#>                          geometry
#> 1  MULTIPOLYGON (((-85.4883 30...
#> 2  MULTIPOLYGON (((-110.7507 3...
#> 3  MULTIPOLYGON (((-90.95577 3...
#> 4  MULTIPOLYGON (((-116.1062 3...
#> 5  MULTIPOLYGON (((-105.155 36...
#> 6  MULTIPOLYGON (((-72.5279 41...
#> 7  MULTIPOLYGON (((-75.13846 3...
#> 8  MULTIPOLYGON (((-77.00244 3...
#> 9  MULTIPOLYGON (((-83.10874 2...
#> 10 MULTIPOLYGON (((-81.09538 3...

Attempt 2: cast to lines first

conus |>
  select(state_name)

#> Simple feature collection with 49 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>              state_name                       geometry
#> 1               Alabama MULTIPOLYGON (((-85.4883 30...
#> 2               Arizona MULTIPOLYGON (((-110.7507 3...
#> 3              Arkansas MULTIPOLYGON (((-90.95577 3...
#> 4            California MULTIPOLYGON (((-116.1062 3...
#> 5              Colorado MULTIPOLYGON (((-105.155 36...
#> 6           Connecticut MULTIPOLYGON (((-72.5279 41...
#> 7              Delaware MULTIPOLYGON (((-75.13846 3...
#> 8  District of Columbia MULTIPOLYGON (((-77.00244 3...
#> 9               Florida MULTIPOLYGON (((-83.10874 2...
#> 10              Georgia MULTIPOLYGON (((-81.09538 3...

Attempt 2: cast to lines first

conus |>
  select(state_name) |>
  st_cast("MULTILINESTRING")

#> Simple feature collection with 49 features and 1 field
#> Geometry type: MULTILINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>              state_name                       geometry
#> 1               Alabama MULTILINESTRING ((-85.4883 ...
#> 2               Arizona MULTILINESTRING ((-110.7507...
#> 3              Arkansas MULTILINESTRING ((-90.95577...
#> 4            California MULTILINESTRING ((-116.1062...
#> 5              Colorado MULTILINESTRING ((-105.155 ...
#> 6           Connecticut MULTILINESTRING ((-72.5279 ...
#> 7              Delaware MULTILINESTRING ((-75.13846...
#> 8  District of Columbia MULTILINESTRING ((-77.00244...
#> 9               Florida MULTILINESTRING ((-83.10874...
#> 10              Georgia MULTILINESTRING ((-81.09538...

Attempt 2: cast to lines first

conus |>
  select(state_name) |>
  st_cast("MULTILINESTRING") %>%
  mutate(dist = st_distance(., denver_sf))

#> Simple feature collection with 49 features and 2 fields
#> Geometry type: MULTILINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>              state_name                       geometry          dist
#> 1               Alabama MULTILINESTRING ((-85.4883 ... 1571007.9 [m]
#> 2               Arizona MULTILINESTRING ((-110.7507...  466602.9 [m]
#> 3              Arkansas MULTILINESTRING ((-90.95577...  975519.6 [m]
#> 4            California MULTILINESTRING ((-116.1062... 1000950.5 [m]
#> 5              Colorado MULTILINESTRING ((-105.155 ...  140002.6 [m]
#> 6           Connecticut MULTILINESTRING ((-72.5279 ... 2636635.7 [m]
#> 7              Delaware MULTILINESTRING ((-75.13846... 2485997.1 [m]
#> 8  District of Columbia MULTILINESTRING ((-77.00244... 2388920.9 [m]
#> 9               Florida MULTILINESTRING ((-83.10874... 1847447.4 [m]
#> 10              Georgia MULTILINESTRING ((-81.09538... 1788781.6 [m]

Attempt 2: cast to lines first

conus |>
  select(state_name) |>
  st_cast("MULTILINESTRING") %>%
  mutate(dist = st_distance(., denver_sf)) |>
  slice_min(dist, n = 3)

#> Simple feature collection with 3 features and 2 fields
#> Geometry type: MULTILINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: -111.0546 ymin: 36.99246 xmax: -95.30829 ymax: 45.00582
#> Geodetic CRS:  WGS 84
#>   state_name         dist                       geometry
#> 1   Colorado 140002.6 [m] MULTILINESTRING ((-105.155 ...
#> 2    Wyoming 140002.6 [m] MULTILINESTRING ((-106.3212...
#> 3   Nebraska 161243.2 [m] MULTILINESTRING ((-95.93779...

The Right Feature Level

Computing distance to 48 state borders when we only need the nearest one is wasteful. The feature should be the complete set of borders — one MULTILINESTRING:

st_distance(denver_sf, st_cast(st_combine(conus), "MULTILINESTRING"))
#> Units: [m]
#>          [,1]
#> [1,] 140002.6

1 calculation instead of 48
Correct answer: distance to the border, not to individual state features
Same principle applies to: distance to national border, distance to any river network, distance to nearest dam

The `sf` Object

`sf`: Attributes + Geometry

co
#> Simple feature collection with 64 features and 4 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -109.0602 ymin: 36.99246 xmax: -102.0415 ymax: 41.00342
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>    geoid       name      aland state_nm                       geometry
#> 1  08001      Adams 3021840487 Colorado MULTIPOLYGON (((-105.0532 3...
#> 2  08003    Alamosa 1871643028 Colorado MULTIPOLYGON (((-105.4855 3...
#> 3  08005   Arapahoe 2066438714 Colorado MULTIPOLYGON (((-103.7065 3...
#> 4  08007  Archuleta 3496712164 Colorado MULTIPOLYGON (((-107.1287 3...
#> 5  08009       Baca 6617400567 Colorado MULTIPOLYGON (((-102.0416 3...
#> 6  08011       Bent 3918255148 Colorado MULTIPOLYGON (((-102.7476 3...
#> 7  08013    Boulder 1881325109 Colorado MULTIPOLYGON (((-105.3978 3...
#> 8  08014 Broomfield   85386685 Colorado MULTIPOLYGON (((-105.1092 3...
#> 9  08015    Chaffee 2624715692 Colorado MULTIPOLYGON (((-105.9698 3...
#> 10 08017   Cheyenne 4605713960 Colorado MULTIPOLYGON (((-103.1729 3...
class(co)
#> [1] "sf"         "data.frame"
attr(co, "sf_column")
#> [1] "geometry"

An sf object is a data.frame with one special list-column: the geometry. It extends data.frame with spatial awareness.

Sticky Geometry

The geometry column persists through dplyr operations:

co |> 
  select(name) |> 
  slice(1:2)   # geometry comes along
#> Simple feature collection with 2 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -106.0393 ymin: 37.3562 xmax: -103.7057 ymax: 40.00137
#> Geodetic CRS:  WGS 84
#>      name                       geometry
#> 1   Adams MULTIPOLYGON (((-105.0532 3...
#> 2 Alamosa MULTIPOLYGON (((-105.4855 3...

Drop it explicitly when you don’t need it:

co |> st_drop_geometry() |> select(name) |> slice(1:2)
#>      name
#> 1   Adams
#> 2 Alamosa

Plotting `sf` Objects with ggplot2

geom_sf() is the bridge between sf and ggplot2 — everything from last week works:

ggplot(data = co) +
  geom_sf(aes(fill = aland)) +
  scale_fill_viridis_c(name = "Land Area (m²)") +
  labs(title = "Colorado Counties by Land Area") +
  theme_minimal()

All dplyr verbs work on sf objects first, then pipe into ggplot:

co |>
  filter(name == "Larimer") |>
  ggplot() + geom_sf(fill = "#1E4D2B") + theme_void()

The Sticky Geometry Structure

The geometry binding is unique — unlike bind_cols() or cbind(), the sfc list-column carries CRS, bounding box, and precision metadata that regular column binding does not.

Simple Features: What You’ve Learned

You now understand:

The seven geometry types — POINT, LINESTRING, POLYGON, and their multi-equivalents
The class hierarchy — sfg (individual geometry) → sfc (set) → sf (dataframe with sticky geometry)
Valid geometries — self-intersecting lines and invalid polygons break spatial operations silently
WKT and WKB — human-readable and machine-readable encodings
Casting and dissolving — changing geometry type and boundary representation
Sticky geometry in dplyr — sf objects work naturally with pipes, filters, and ggplot

Now a 5-minute break, then: What makes these coordinates actually spatial? We’ll cover the Coordinate Reference System — the foundation that anchors every spatial operation.

☕ Break

5 Minutes

Back in 5. Next: coordinate systems, projections, and the three-part system (ellipsoid, datum, prime meridian) that determines whether your spatial analysis is correct or silently wrong.

Coordinate Systems

What Makes Spatial Data Spatial?

The geometry alone is meaningless without knowing where on Earth those coordinates refer to. That’s the job of the Coordinate Reference System.

Coordinate Reference Systems

A Coordinate Reference System (CRS) defines how spatial features relate to the surface of the Earth.

Two broad types:

Geographic Coordinate System (GCS) — locations on the curved surface of the Earth, measured in angular units (degrees)
Projected Coordinate System (PCS) — locations on a flat surface, measured in linear units (meters, feet)

In sf, three tools:

Function	Does
`st_crs()`	Retrieve CRS from an `sf` or `sfc` object
`st_set_crs()`	Set or replace CRS (no coordinate transformation)
`st_transform()`	Transform coordinates from one CRS to another

Geographic Coordinate Systems

A GCS identifies locations on the curved surface of the Earth using angular measurements from the center:

Latitude — vertical angle (North/South from equator)
Longitude — horizontal angle (East/West from prime meridian)
North and East are positive; South and West are negative

A GCS is defined by three components:

Ellipsoid — the mathematical model of Earth’s shape (flattened sphere, not perfect)
Datum — how the ellipsoid is anchored to the real Earth (locally fitted or globally centered)
Prime meridian — the origin of longitude (Greenwich, 0°E/W)

Why this matters: two datasets may use the same coordinates but different datums. The difference is small (sub-meter) but compounds in spatial joins and distance calculations.

Common Datums in Water Resources

Three datums account for most US data:

Datum	Acronym	Used by	Typical error vs. NAD83
North American Datum 1983	NAD83	NHDPlus, most USGS vector data	Reference datum
World Geodetic System 1984	WGS84	GPS, NWIS gauges, web maps	<1 meter in CONUS
NAD27 (historic)	NAD27	Legacy sources only	Can be 100+ meters off

Note

NAD83 and WGS84 differ by less than 1 meter for most of CONUS. But they are not the same CRS. Mixing them without explicit transformation introduces sub-meter errors that compound in spatial joins. Always check with st_crs(a) == st_crs(b) before joining.

Projected Coordinate Systems

A Projected Coordinate System (PCS) transforms the curved Earth onto a 2D plane. Every transformation introduces distortion.

Three main projection families:

Planar — flat surface touches globe at a point (best for polar regions)
Cylindrical — cylinder wraps around globe (world maps, Mercator)
Conic — cone around globe (mid-latitude areas, CONUS)

Spatial Properties and Distortion

Each type of projection prioritizes one property at the cost of others:

Property	Preserved by	Used for
Area	Equal-area (Albers)	Accurate area calculations, choropleth maps
Distance	Equidistant	Distance from a fixed center point
Shape	Conformal (Mercator)	Navigation, preserving local angles
Direction	Azimuthal	Navigation from a pole

Projection Choice: Water Resources Defaults

For CONUS analysis, use these defaults:

Important

Computing area (choropleth maps, watershed statistics) → EPSG:5070 (Albers Equal Area). This is the standard federal projection for CONUS.
Measuring distance (gauge to outlet, confluences, dam spacing) → Keep data in geographic (WGS84, EPSG:4326) and use st_distance() with default S2 spherical geometry — it’s geodesic and correct.
Sub-state local analysis → Use a local UTM zone (e.g., NAD83 / UTM zone 13) if you need projected planar distance.
Visualization/mapping → Use whatever looks right for your region (Web Mercator for web, Albers for print).

EPSG Codes and PROJ Strings

sf stores CRS in two ways:

EPSG code — integer ID for a known CRS (e.g., 4326 = WGS84 geographic)
PROJ4 string — generic text description understood by the PROJ library

Common codes for water resources work:

EPSG	Name	Use
4326	WGS84 geographic	GPS coordinates, web maps, NWIS gauges
4269	NAD83 geographic	NHDPlus, most USGS vector data
5070	NAD83 / Conus Albers	Area calculations, CONUS-wide analysis
26913–26915	UTM NAD83 zones 13–15	Local/state-level distance calculations

Practical rule: Use EPSG:5070 for CONUS area calculations; use a local UTM (e.g., NAD83 / UTM zones) for sub-state analyses.

Checking and Setting CRS

st_crs(conus)$epsg
#> [1] 4326
st_crs(conus)$proj4string
#> [1] "+proj=longlat +datum=WGS84 +no_defs"
st_crs(conus)$datum
#> [1] "WGS84"

conus5070 <- st_transform(conus, 5070)
st_crs(conus5070)$epsg
#> [1] 5070
st_crs(conus5070)$proj4string
#> [1] "+proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs"

Visualizing Projections

#1 Reproducibility Habit: Always Check CRS

Before any spatial operation — joins, distance calculations, area calculations — verify CRS match:

# Guard pattern: check before joining
if (st_crs(gauges) != st_crs(flowlines)) {
  flowlines <- st_transform(flowlines, st_crs(gauges))
}

# Or use the equality check
st_crs(gauges) == st_crs(flowlines)  # must be TRUE

Real example: NWIS gauges (WGS84, EPSG:4326) + NHD flowlines (NAD83, EPSG:4269):

st_crs(4326) == st_crs(4269)    # FALSE — they are not the same!
#> [1] FALSE

They look similar (sub-meter differences in CONUS), but snapping a point to a line or computing a distance will be subtly wrong without transformation.

Important

sf will warn you about CRS mismatch but won’t stop you. The result will execute and look plausible — but numbers will be wrong. Build the guard-check habit now.

Measurements

Geodesic vs. Planar Measurements

Geodesic distance: shortest path on an ellipsoid — true Earth surface distance.

Planar distance: Euclidean distance in projected coordinates — fast but introduces error.

pts <- data.frame(y = c(40.7128, 34.4208), x = c(-74.0060, -119.6982)) |>
  st_as_sf(coords = c("x", "y"), crs = 4326)

st_distance(pts)                    # Geodesic: ~3,962 km (correct)
st_distance(st_transform(pts, 5070))# Planar: ~3,950 km (distorted)

Important

For water resources: use geodesic (default). In geographic CRS, st_distance() automatically computes geodesic (spherical) distances — correct for continental scales.

Note

sf uses geodesic by default in geographic CRS. Toggle S2 with sf::sf_use_s2(TRUE/FALSE) for fine control (rarely needed).

Note

Performance: S2 geodesic calculations used to be substantially slower, but modern versions of sf and the S2 library have optimized performance significantly. Geodesic is now only ~25% slower than planar for large operations — a measurable but not critical difference for most workflows. Use geodesic unless you’re profiling and find it’s a real bottleneck.

What Is S2?

S2 is Google’s spherical geometry library — a modern C++ implementation for accurate calculations on a sphere/ellipsoid.

sf uses S2 under the hood when data is in geographic coordinates (e.g., WGS84). Instead of flattening Earth to a 2D plane (which introduces distortion), S2 works directly with spherical coordinates:

Ellipsoidal geometry — computes distances, intersections, areas on the actual Earth shape
Efficient spatial indexing — divides the sphere into hierarchical cells for fast operations
Replaces GEOS for geographic data — when your CRS is in degrees, S2 handles the heavy lifting
Fast enough in practice — modern optimizations have made geodesic calculations competitive with planar

Important

S2 is automatic: if your data is in geographic coordinates, st_distance(), st_intersects(), and other spatial operations use S2 geodesic calculations by default. You don’t need to do anything.

Controlling S2: When and How

S2 is the right default for water resources work (continental scales, mixed coordinate systems), but you can toggle it for specialized cases:

# Disable S2 (use GEOS planar geometry instead)
sf::sf_use_s2(FALSE)

# Re-enable S2
sf::sf_use_s2(TRUE)

# Check current setting
sf::sf_use_s2()

When to disable S2: - Very small local areas (< 10 km) where planar error is negligible and you want marginal speed gains - Specialized geometric algorithms that require GEOS planar operations (rare in practice)

Leave S2 enabled (default): - Continental and larger scales (your typical water resources workflow) - Mixing geographic and projected data - When correctness matters more than squeezing out microseconds

Units as an S3 Class — Workflow

Results carry units as an S3 class attribute — both powerful for correctness and a source of confusion in practice.

class(conus)
#> [1] "sf"         "data.frame"
attributes(conus)
#> $names
#>  [1] "state_region"           "state_division"         "feature_code"          
#>  [4] "state_name"             "state_abbr"             "name"                  
#>  [7] "fip_class"              "tiger_class"            "combined_area_code"    
#> [10] "metropolitan_area_code" "functional_status"      "land_area"             
#> [13] "water_area"             "fip_code"               "geometry"              
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
#> 
#> $sf_column
#> [1] "geometry"
#> 
#> $agr
#>           state_region         state_division           feature_code 
#>                   <NA>                   <NA>                   <NA> 
#>             state_name             state_abbr                   name 
#>                   <NA>                   <NA>                   <NA> 
#>              fip_class            tiger_class     combined_area_code 
#>                   <NA>                   <NA>                   <NA> 
#> metropolitan_area_code      functional_status              land_area 
#>                   <NA>                   <NA>                   <NA> 
#>             water_area               fip_code 
#>                   <NA>                   <NA> 
#> Levels: constant aggregate identity
#> 
#> $class
#> [1] "sf"         "data.frame"

…

class(st_length(conus))
#> [1] "units"
attributes(st_length(conus)) |> unlist()
#> units.numerator           class 
#>             "m"         "units"

Units Workflow: Create → Convert → Drop

Create units naturally from measurements, convert between units, then drop before plotting or filtering:

Convert between units with the units package:

co <- AOI::aoi_get(state = "CO")
(l = st_length(st_cast(co, "MULTILINESTRING")))
#> 2100640 [m]

units::set_units(l, "km")
#> 2100.64 [km]
units::set_units(l, "mile")
#> 1305.277 [mile]

(a = st_area(co))
#> 2.69382e+11 [m^2]

units::set_units(a, "ha")
#> 26938196 [ha]
units::set_units(a, "km2")
#> 269382 [km^2]

Strip units when needed for arithmetic or plotting:

units::drop_units(l)
#> [1] 2100640
as.numeric(l)
#> [1] 2100640

Units Are a Class — Know the Rules

Units enforce dimensional correctness but can surprise you:

st_length(conus) + 100   # can't add meters + unitless number

filter(cities, dist_to_border < 160)  # may warn or fail

The corrected patterns:

# Create numeric version for filtering/plotting
cities <- cities |>
  mutate(dist_to_border_km = as.numeric(units::set_units(dist_to_border, "km")))

# Now filter and plot safely
cities |>
  filter(dist_to_border_km < 160) |>
  ggplot(aes(color = dist_to_border_km)) +
  geom_sf()

Getting Spatial Data into R

Building `sf` from a CSV

When coordinates are stored as numeric columns in a CSV — the most common format for point data from APIs, sensors, and databases — use st_as_sf():

cities <- read_csv("data/uscities.csv") |>
  st_as_sf(coords = c("lng", "lat"),  # column names: X first, then Y
           crs    = 4326)             # raw lat/lon is almost always WGS84

Three things to always specify:

coords = — column names in order X (longitude), Y (latitude)
crs = — the CRS the raw numbers are in (not what you want them in)
Then st_transform() to convert to your analysis CRS

Full pipeline:

cities <- read_csv("data/uscities.csv") |>
  st_as_sf(coords = c("lng", "lat"), crs = 4326) |>
  st_transform(5070)                # reproject for distance analysis

End-to-End: CSV → Filtered → Transformed → Measured

This is the complete workflow to compute the distance from every city to the national border:

# Pre-compute boundary in projected space (outside the pipeline)
boundary <- st_cast(st_union(st_transform(conus, 5070)), "MULTILINESTRING")

# Main workflow: read, build, transform, measure
system.time({
cities <- read_csv("data/uscities.csv", show_col_types =FALSE) |>
  st_as_sf(coords = c("lng", "lat"), crs = 4326) |>
  st_transform(5070) %>%
  mutate(
    dist_to_border_m = st_distance(., boundary),
    dist_to_border_km = units::set_units(dist_to_border_m, "km")
  )
})
#>    user  system elapsed 
#>   1.665   0.018   1.633

# user  system elapsed 
# 1.648   0.014   1.677

Key takeaways: - Build sf: CSV coordinates + CRS they’re in → st_as_sf() - Transform: project to EPSG:5070 for measurements - Measure: distance returns units object; convert as needed - Reuse: pass pre-computed boundary to st_distance() — clean, fast, readable

Common Pitfalls ⚠️

Pitfall: Distances in Geographic Space

boundary <- st_cast(st_union(conus),"MULTILINESTRING")

cities <- read_csv("data/uscities.csv") |>
  st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
  mutate(dist = st_distance(., boundary))

# user    system  elapsed
# 80.763  0.300   81.055

Rule: Transform geometry to projected CRS before expensive operations like st_union(), st_intersection(), etc.

Pitfall 2: Comparing Units

dist <- st_distance(cities, boundary)  # Returns [m]

filter(cities, dist < 100)              # ❌ ERROR
filter(cities, as.numeric(dist) < 100)  # ✅ CORRECT

Rule: Drop units with as.numeric() before comparisons.

Important

Principle: Choose CRS strategically.

Geographic (4326) by default for input/output (standard data format)
Project immediately to EPSG:5070 (or your region’s standard) for analysis
Pre-compute expensive operations (union, clip, buffer) in projected space outside the data pipeline
This respects both correctness (geodesic) and performance (planar operations where they matter)

Visualizing Spatial Data

`geom_sf()`: The Foundation

geom_sf() is ggplot2’s spatial geometry layer. It automatically extracts coordinates from the geometry column and plots them. No need to specify x and y aesthetics.

Points

ggplot(cities) +
  geom_sf(color = "navy", size = 1)

Polygons

ggplot(conus) +
  geom_sf(fill = "lightblue", color = "black")

Key advantage: All standard ggplot2 aesthetics work — color, fill, size, alpha, linetype, etc. And you can layer multiple geom_sf() calls:

ggplot() +
  geom_sf(data = conus, fill = "white") +
  geom_sf(data = st_filter(cities, st_transform(conus, 5070)), color = "red", size = 0.1) +
  theme_void()

Tip

Always add theme_void() to remove axes, gridlines, and background — maps don’t need Cartesian coordinates.

`ggrepel`: Non-Overlapping Labels

ggrepel solves label collision — essential any time you label points on a map:

library(ggrepel)

ggplot() +
  geom_sf(data = conus, fill = "white") +
  geom_sf(data = st_filter(cities, st_transform(conus, 5070)), color = "red", size = 0.1) +
  ggrepel::geom_label_repel(
    data = filter(cities, city == "Denver"),
    aes(geometry = geometry, label = city),
    stat = "sf_coordinates",  # extracts coordinates from sf geometry
    size = 3
  ) +
  theme_void()

stat = "sf_coordinates" is the key argument — it tells ggrepel to extract X/Y from the geometry column rather than expecting explicit x and y aesthetics. Without it, you get an error.

`gghighlight`: Emphasize a Subset

gghighlight greys out features that do not meet a condition, keeping them as spatial context:

library(gghighlight)

ggplot() +
  geom_sf(data = conus) +
  geom_sf(data = st_filter(cities, st_transform(conus, 5070)),
          aes(color = as.numeric(dist_to_border_km)),
          size = 0.3) +
  gghighlight(as.numeric(dist_to_border_km) < 100) +
  scale_color_gradient(low = "orange", high = "darkred") +
  labs(title = "Cities within 100 km of the national border",
       color = "Distance (km)") +
  theme_void()

Features outside the condition are rendered in grey at reduced opacity — they provide spatial context without competing visually with the highlighted subset. Works with any geom_* — points, lines, polygons.

Wrapping Up

What We Covered

Simple Features

The ISO standard and its R implementation
The big 7 geometry types
sfg → sfc → sf class hierarchy
WKT / WKB encodings
Casting, combining, dissolving
Sticky geometry + geom_sf()

Coordinate Systems

GCS: ellipsoid, geoid, datum
PCS: projection types and distortion
EPSG codes and st_transform()
Always check CRS before joining

Measurements

Geodesic vs. planar distance
st_distance(), st_area(), st_length()
Units as an S3 class — set_units(), drop_units()
Converting and dropping units

Key habits built today:

st_is_valid() on every dataset you read
st_crs(a) == st_crs(b) before every spatial join
Use equal-area CRS for areas, equidistant for distances
as.numeric() before filtering on unit quantities

Next Week

Week 3: Spatial Predicates & Joins — relationships between geometries, filtering by space, and combining data sources

Week 2

Today’s Arc

From Tabular to Spatial

R Objects and Classes

Why Classes Matter: One Data Structure, Many Behaviors

Adding Class: Matrix → Data Frame

Class Hierarchies: Adding Layers

Methods: Functions That Know Their Object’s Class

Spatial Computing: The Stack

The Dependencies Under sf

The Spatial Stack

Why This Matters

The sf Wrapper

What Is GDAL?

GDAL Drivers and Water Resources

How Core Libraries Pass Data to R: WKT and WKB

Simple Features

Simple Features

Today’s Data: Colorado Counties

What Is a Simple Feature?

What Is a Feature?

POINT

LINESTRING

POLYGON

Multi-Geometries and Collections

The Big 7 Geometry Types

Dimensions

Valid Geometries

Invalid Examples

Empty Geometries

sfg → sfc → sf

Storing Geometry: Nested Lists

Three Classes, One Object

sfg: Simple Feature Geometry

sfg: Class Structure

sfg: Same Points, Different Meaning

Mixed Geometries and Collections

Conversion Between Types: st_cast

Casting: Feature vs. Set Level

Sets of geometries

Sets of geometries

Sets of geometries

Sets of geometries

Sets of geometries

Combining and Dissolving Geometries

Visualizing Combine vs. Union

sfc: Geometry List-Columns

Combine + Cast: From Boundaries to Lines

CONUS: Combine vs. Union

Distance to Denver: Thinking About Feature Level

Attempt 1: distance to states (polygons)

Attempt 1: distance to states (polygons)

Attempt 1: distance to states (polygons)

Attempt 1: distance to states (polygons)

Attempt 1: distance to states (polygons)

Attempt 2: cast to lines first

Attempt 2: cast to lines first

Attempt 2: cast to lines first

Attempt 2: cast to lines first

Attempt 2: cast to lines first

The Right Feature Level

The sf Object

sf: Attributes + Geometry

Sticky Geometry

Plotting sf Objects with ggplot2

The Sticky Geometry Structure

Simple Features: What You’ve Learned

☕ Break

5 Minutes

Coordinate Systems

What Makes Spatial Data Spatial?

Coordinate Reference Systems

Geographic Coordinate Systems

Common Datums in Water Resources

Projected Coordinate Systems

Spatial Properties and Distortion

Projection Choice: Water Resources Defaults

EPSG Codes and PROJ Strings

Checking and Setting CRS

Visualizing Projections

The Dependencies Under `sf`

The `sf` Wrapper

`sfg`: Simple Feature Geometry

`sfg`: Class Structure

`sfg`: Same Points, Different Meaning

Conversion Between Types: `st_cast`

`sfc`: Geometry List-Columns

The `sf` Object

`sf`: Attributes + Geometry

Plotting `sf` Objects with ggplot2

Building `sf` from a CSV

`geom_sf()`: The Foundation

`ggrepel`: Non-Overlapping Labels

`gghighlight`: Emphasize a Subset