Week 2

Vector Data — Features, Projections & Measures

Today’s Arc

From Tabular to Spatial

Last week: data as rows and columns — wrangling, joining, visualizing, modeling.

This week: data as features in space — the same verbs, but now every observation has a geometry that describes where it is and what shape it takes.

Today’s arc:

  1. Simple Features — the standard and the R implementation
  2. Spatial Computing: The Stack — GDAL, PROJ, GEOS (the C++ libraries under sf)
  3. Coordinate Systems — GCS, datums, projections
  4. Measurements — distance, area, units

(We’ll take a 5-minute break after Simple Features.)

R Objects and Classes

Why Classes Matter: One Data Structure, Many Behaviors

You’ve already used S3 classes in tidyverse without noticing. The same object type can behave differently based on its class.

Understanding S3 classes in R explains:

  • Why print() on a tibble looks different than a regular data.frame
  • Why filter() on grouped data filters within each group instead of globally
  • Why summarize() on a grouped_df produces a different output shape

This is the callback pattern — R looks at an object’s class and dispatches to the right function. Later, sf will extend data.frame with spatial methods that work the same way.

Adding Class: Matrix → Data Frame

A class is metadata attached to an object telling R how to interpret and operate on it:

# Start with a matrix — just numbers in rows and columns
v <- c(1,2,3,4,5,6)
class(v)
#> [1] "numeric"
m <- matrix(v, nrow = 3)
class(m)
#> [1] "matrix" "array"
m
#>      [,1] [,2]
#> [1,]    1    4
#> [2,]    2    5
#> [3,]    3    6
# Add class metadata: "this is a data frame"
df <- as.data.frame(m)
class(df)
#> [1] "data.frame"
df
#>   V1 V2
#> 1  1  4
#> 2  2  5
#> 3  3  6

Same data. But now df has column names, printed summary, and methods that respect rectangular structure. The class changed the behavior without changing the underlying values.

Class Hierarchies: Adding Layers

Objects can have multiple classes in a hierarchy. Adding structure adds more classes:

# Plain data.frame
df <- data.frame(gauge = c("A", "B", "C"), flow = c(100, 120, 95))
class(df)
#> [1] "data.frame"

# Add grouping — now it has *two* classes
gdf <- group_by(df, gauge)
class(gdf)
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

# Methods are dispatched: dplyr functions check for "grouped_df" first
slice(gdf, 1)  # respects grouping — one row per group, not one global row
#> # A tibble: 3 × 2
#> # Groups:   gauge [3]
#>   gauge  flow
#>   <chr> <dbl>
#> 1 A       100
#> 2 B       120
#> 3 C        95
slice(df, 1)   # no grouping — just one row total
#>   gauge flow
#> 1     A  100

The key insight: the same function behaves differently based on class. This is the callback mechanism. When you call slice(), R internally asks: “What class is this object?” and dispatches to the right implementation.

Methods: Functions That Know Their Object’s Class

Every function that respects class has methods or versions specialized for each class:

# Base R example: print() has different methods
print.data.frame()   # specialized for data.frames
print.default()      # fallback for other objects

# You can list all methods for a class:
methods(class = "data.frame")   # all functions with .data.frame versions

select(), filter(), mutate() are dplyr verbs that have methods:

methods("filter")    # shows filter.data.frame, filter.grouped_df, filter.ts, etc.
#> [1] filter.data.frame* filter.sf*         filter.ts*        
#> see '?methods' for accessing help and source code

When you call filter() on a grouped dataframe, R internally does:

  1. Check: “What class is this object?” (grouped_df)
  2. Look for filter.grouped_df()
  3. If it exists, call it; if not, fall back to filter.data.frame()
  4. The filter.grouped_df() method applies the condition within each group — the key behavior

Compare the same code on grouped vs ungrouped data:

df <- data.frame(site = c("A", "A", "B", "B"), flow = c(100, 120, 95, 110))

# On ungrouped data: keeps rows where flow > 105
df |> filter(flow > 105)
#>   site flow
#> 1    A  120
#> 2    B  110

# On grouped data: keeps earliest row per group where flow > 105
group_by(df, site) |> filter(flow > 105)
#> # A tibble: 2 × 2
#> # Groups:   site [2]
#>   site   flow
#>   <chr> <dbl>
#> 1 A       120
#> 2 B       110

The function is identical; the method dispatched changes based on class.

Spatial Computing: The Stack

The Dependencies Under sf

sf is powerful because it’s not built from scratch — it wraps three long standing C/C++ libraries, each with a focused job:

The Spatial Stack

Your R code:

sf::st_read("data.shp")      # I/O
sf::st_distance(a, b)        # Geometry
sf::st_transform(x, 4269)    # CRS

📦 GDAL

File I/O

  • Format translators

  • Reads/writes Shapefile, GeoJSON, GeoPackage, GeoTIFF

  • Returns WKB geometry

🧭 PROJ

Coordinate Systems

  • Transforms between 10,000+ CRS definitions

  • Handles datum shifts & projections

  • Underlies st_transform()

⚙️ GEOS

Geometry Ops

  • Intersection, union, buffer, centroid

  • Distance calculations

  • Spatial predicates (st_intersects())

Why This Matters

Each library solves a hard problem:

Library Solves Water Resources Example
GDAL “How do I read a shapefile or GeoPackage?” Read NHD flowlines, USGS boundaries, stream gauges
PROJ “How do I convert WGS84 to Albers?” Transform NWIS gauges from GPS coordinates to federal projection
GEOS “Is this point inside this polygon? What’s the distance?” Snap gauge to nearest flowline, clip to basin

If we had to write these from scratch in R, it would take years. Instead, spatial computing in R benefits from decades of work by the OSGeo community.

The sf Wrapper

sf::sf_extSoftVersion()
#>           GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H 
#>       "3.14.1"       "3.12.2"        "9.7.1"         "true"         "true" 
#>           PROJ 
#>        "9.7.1"

sf doesn’t reinvent these wheels. It:

  1. Calls GDAL for I/O (st_read(), st_write())
  2. Calls PROJ for CRS transformations (st_transform())
  3. Calls GEOS for geometry operations (st_distance(), st_buffer(), spatial predicates)
  4. Wraps results with S3 classes and methods so dplyr verbs (filter(), mutate(), select()) just work — same as tidyverse

This diagram shows the relationship: sf methods dispatch to GEOS for the heavy lifting, then return R objects.

Note

When you install the sf package on your machine, the R package manager compiles and links against GDAL, PROJ, and GEOS. If installation fails, it’s usually because one of these C/C++ libraries is missing from your system. This is why setting up spatial computing can be more complex than installing a pure-R package.

What Is GDAL?

GDAL = Geospatial Data Abstraction Layer

A C++ library (developed by OSGeo, used by every GIS software) that:

  1. Abstracts file formats — speaks shapefile, GeoJSON, GeoPackage, etc., but exposes the same API to your program
  2. Translates between formats — convert any format to any other with one command
  3. Manages drivers — each format has a “driver” (code that knows how to read/write that specific format)
  4. Handles metadata — CRS, bounds, layer names, attribute types
# GDAL (C++ library) <- drivers (one per format)
#     ├── Shapefile driver
#     ├── GeoJSON driver
#     ├── GeoPackage driver
#     ├── GeoTIFF driver
#     ├── NetCDF driver
#     └── ... 200+ more

When you call st_read("data.shp"), here’s what happens:

  1. R/sf calls GDAL: “Read this file”
  2. GDAL looks at the extension (.shp)
  3. GDAL finds the Shapefile driver
  4. The driver reads the .shp and its sidecars
  5. GDAL returns data to sf
  6. sf wraps it in an sfc geometry + sf object structure and hands it to your R code

Note

You don’t write GDAL code directly. But st_read() and st_write() are thin wrappers around GDAL. Understanding this explains why st_read() “just works” on 50+ formats and why format conversions are trivial in R but painful elsewhere.

GDAL Drivers and Water Resources

sf::st_drivers() # shows Format, type (vector/raster), open, write

Key drivers for water resources:

Driver Format Extension Common Use
ESRI Shapefile Shapefile .shp USGS NHD, NOAA, boundaries
GeoJSON GeoJSON .geojson Google APIs, web services
GPKG GeoPackage .gpkg NHDPlus HR, modern data
GTiff GeoTIFF .tif USGS 3DEP elevation, imagery
netCDF NetCDF .nc NOAA climate grids
HDF5 HDF5 .h5 NASA satellite data

Important

GeoPackage replaces Shapefiles: one file, complete metadata, SQLite-based. Shapefiles require sidecars (.shx, .dbf, .prj) and lack metadata. Use GeoPackage when possible.

How Core Libraries Pass Data to R: WKT and WKB

When GDAL reads a file or GEOS performs an operation, it uses two standard encodings to pass geometry to R:

Well-Known Text (WKT) — human-readable (you’ll see this in error messages and discussions):

x <- st_linestring(matrix(10:1, 5))
st_as_text(x)
#> [1] "LINESTRING (10 5, 9 4, 8 3, 7 2, 6 1)"
  • LINESTRING = type; parentheses contain points; points separated by commas; coordinates separated by spaces

Well-Known Binary (WKB) — machine-readable, used internally by GDAL/GEOS for speed and lossless communication:

st_as_binary(x)
#>  [1] 01 02 00 00 00 05 00 00 00 00 00 00 00 00 00 24 40 00 00 00 00 00 00 14 40
#> [26] 00 00 00 00 00 00 22 40 00 00 00 00 00 00 10 40 00 00 00 00 00 00 20 40 00
#> [51] 00 00 00 00 00 08 40 00 00 00 00 00 00 1c 40 00 00 00 00 00 00 00 40 00 00
#> [76] 00 00 00 00 18 40 00 00 00 00 00 00 f0 3f

Note

You rarely write WKT/WKB directly, but understanding they exist explains why st_read() “just works”—GDAL converts any format to WKB, sf wraps it as an R object, and the roundtrip is seamless.

Simple Features

Simple Features

Today’s Data: Colorado Counties

co
#> Simple feature collection with 64 features and 4 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -109.0602 ymin: 36.99246 xmax: -102.0415 ymax: 41.00342
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>    geoid       name      aland state_nm                       geometry
#> 1  08001      Adams 3021840487 Colorado MULTIPOLYGON (((-105.0532 3...
#> 2  08003    Alamosa 1871643028 Colorado MULTIPOLYGON (((-105.4855 3...
#> 3  08005   Arapahoe 2066438714 Colorado MULTIPOLYGON (((-103.7065 3...
#> 4  08007  Archuleta 3496712164 Colorado MULTIPOLYGON (((-107.1287 3...
#> 5  08009       Baca 6617400567 Colorado MULTIPOLYGON (((-102.0416 3...
#> 6  08011       Bent 3918255148 Colorado MULTIPOLYGON (((-102.7476 3...
#> 7  08013    Boulder 1881325109 Colorado MULTIPOLYGON (((-105.3978 3...
#> 8  08014 Broomfield   85386685 Colorado MULTIPOLYGON (((-105.1092 3...
#> 9  08015    Chaffee 2624715692 Colorado MULTIPOLYGON (((-105.9698 3...
#> 10 08017   Cheyenne 4605713960 Colorado MULTIPOLYGON (((-103.1729 3...

What Is a Simple Feature?

  • Simple features (ISO 19125-1:2004) is a standard describing how real-world objects can be represented in computers, with emphasis on their spatial geometry

  • It defines:

    1. A class hierarchy of geometry types
    2. A set of operations on those geometries
    3. Binary and text encodings for storage and transfer
  • Widely implemented: PostGIS, ESRI ArcGIS, GDAL, GeoJSON — and sf in R
  • A subset of simple features (the big 7) forms the GeoJSON specification

What Is a Feature?

A feature is a thing in the real world — a river, a gauge, a watershed, a county.

  • Features often consist of other features:
    • A river system → a river → an outlet point
    • A watershed → sub-basins → individual reaches

The standard says: “A simple feature has both spatial and non-spatial attributes. Spatial attributes are geometry valued.”

  • The geometry describes where the feature is and how it is represented
  • A river’s geometry could be its watershed polygon, its mainstem linestring, or its outlet point — depending on the question
str(co)
#> Classes 'sf' and 'data.frame':   64 obs. of  5 variables:
#>  $ geoid   : chr  "08001" "08003" "08005" "08007" ...
#>  $ name    : chr  "Adams" "Alamosa" "Arapahoe" "Archuleta" ...
#>  $ aland   : num  3.02e+09 1.87e+09 2.07e+09 3.50e+09 6.62e+09 ...
#>  $ state_nm: chr  "Colorado" "Colorado" "Colorado" "Colorado" ...
#>  $ geometry:sfc_MULTIPOLYGON of length 64; first list element: List of 1
#>   ..$ :List of 1
#>   .. ..$ : num [1:132, 1:2] -105 -105 -105 -105 -105 ...
#>   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
#>  - attr(*, "sf_column")= chr "geometry"
#>  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
#>   ..- attr(*, "names")= chr [1:4] "geoid" "name" "aland" "state_nm"

POINT

A point is composed of one coordinate pair in a specific coordinate system.

  • Points have no length or area
  • Example: a gauge location, a dam, a city
ggplot() + 
  geom_point(aes(x = c(1, 2, 3), y = c(3, 1, 2)), size = 3) + 
  labs(x = "Longitude", y = "Latitude") +
  theme_linedraw()

LINESTRING

A linestring is composed of an ordered sequence of two or more coordinate points.

  • Points in a line are called vertices and explicitly define the connection between two points
  • A line has length, but no area
  • Example: a stream centerline, a river reach, a road
ggplot() + 
  geom_line(aes(x = c(1, 2, 3), y = c(3, 1, 2)), linewidth = 1) + 
  geom_point(aes(x = c(1, 2, 3), y = c(3, 1, 2)), col = "red", size = 3) + 
  labs(x = "Longitude", y = "Latitude") +
  theme_linedraw()

POLYGON

A polygon is composed of 4 or more points whose starting and ending point are the same.

  • Polygons enclose a closed ring (first point = last point)
  • Polygons have both length and area
  • Example: a watershed boundary, a county border, a floodplain
ggplot() + 
  geom_polygon(aes(x = c(1, 2, 3, 1), y = c(3, 1, 2, 3)), fill = "steelblue", alpha = 0.5, color = "black") + 
  geom_point(aes(x = c(1, 2, 3), y = c(3, 1, 2)), col = "red", size = 3) +
  labs(x = "Longitude", y = "Latitude") +
  theme_linedraw()

Multi-Geometries and Collections

Multi- variants are just sets of the same type:

  • MULTIPOINT = set of points
  • MULTILINESTRING = set of lines (e.g., a braided river system)
  • MULTIPOLYGON = set of polygons (e.g., an island chain, a fragmented watershed)

GEOMETRYCOLLECTION = a set of mixed geometry types (rare, usually from intersections)

The Big 7 Geometry Types

Single Description
POINT Zero-dimensional — a single coordinate pair
LINESTRING Sequence of points connected by straight line pieces — 1D
POLYGON Closed ring with positive area — 2D; may have interior holes
Multi (same type) Description
MULTIPOINT Set of points
MULTILINESTRING Set of linestrings
MULTIPOLYGON Set of polygons
Mixed Description
GEOMETRYCOLLECTION Set of geometries of any type

Note

These 7 types cover virtually everything in water resources work: gauge locations (POINT), stream reaches (LINESTRING), watershed boundaries (POLYGON), and their multi-feature equivalents.

Dimensions

All geometries are composed of points defined in 2-, 3-, or 4-D space:

  • XY — standard 2D coordinates (longitude, latitude or easting, northing)
  • XYZ — adds altitude (elevation of stream bed, DEM surface)
  • XYM — adds a measure attribute independent of the feature (e.g., streamflow at a vertex — rarely used)
  • XYZM — both Z and M

Note: “dimension” here means coordinate dimension (XY vs XYZ), not whether the geometry has length or area:

m1 <- rbind(c(8,1), c(2,5), c(3,2))
st_dimension(st_multipoint(m1))    # 0 — point set, no length
#> [1] 0
st_dimension(st_linestring(m1))    # 1 — has length, no area
#> [1] 1

Valid Geometries

Rules for valid geometries:

  • LINESTRING — no self-intersections

  • POLYGON — rings must be closed

  • POLYGON — holes inside exterior only

  • POLYGON — rings touch at points only

  • POLYGON — no repeated paths

Warning

Run st_is_valid() to check your data. Fix invalid geometries with st_make_valid().

Invalid Examples

Self-intersecting line:

x1 <- st_linestring(cbind(c(0,1,0,1), 
                          c(0,1,1,0)))
st_is_simple(x1)  # FALSE

Invalid polygon (crossing edges):

x2 <- st_polygon(list(cbind(
  c(0,1,1,1,0,0), 
  c(0,0,1,0.6,1,0))))
st_is_valid(x2)  # FALSE

These operations fail silently unless checked.

Empty Geometries

EMPTY geometries serve as “NA” placeholders for geometry types — they arise naturally from spatial operations:

(e <- st_intersection(st_point(c(0,0)), st_point(c(1,1))))
#> GEOMETRYCOLLECTION EMPTY
st_is_empty(e)
#> [1] TRUE

plot(st_point(c(0,0)), pch = 16)
plot(st_point(c(1,1)), add = TRUE, col = 'red', pch = 16)

sfg → sfc → sf

Storing Geometry: Nested Lists

Remember nested lists? A geometry is just coordinates:

# One geometry: a pair of coordinates
c(x = 1, y = 2)

# Many geometries: nested list of coordinate pairs
list(list(c(1, 2)), list(c(3, 4)), list(c(5, 6)))

R stores multiple geometries side-by-side this way: Each row in a data.frame holds one geometry (which is a list of coordinates). This creates a geometry column — many rows, each with its own nested geometry.

  • One row = one feature (with attributes + geometry)
  • Each row’s geometry = one nested coordinate structure
  • All rows together = a geometry list-column

Three Classes, One Object

  • 🟢 sf — a data.frame with feature attributes and a geometry column (class: sf)
  • 🔴 sfc — the geometry list-column (set of geometries)
  • 🔵 sfg — individual simple feature geometry (one geometry per feature)

Important

This is the S3 class hierarchy in action. sf is a subclass of data.frame adding the geometry column and specialized methods. When you call filter() on an sf object, R looks at its class and dispatches to the spatial-aware filter.sf() method, which respects the geometry column. Same pattern as grouped_df from tidyverse.

sfg: Simple Feature Geometry

sfg objects carry the geometry for a single feature. Storage rules:

  1. A single POINT → numeric vector
  2. Points in a LINESTRING or POLYGON ring → matrix (each row = a point)
  3. Any other set → list

Creator functions (rarely used in practice — you usually read existing data):

(x <- st_point(c(1, 2)))
str(x)
#>  'XY' num [1:2] 1 2

(x <- st_linestring(matrix(c(1,2,3,4), ncol = 2)))
str(x)
#>  'XY' num [1:2, 1:2] 1 2 3 4

sfg: Class Structure

All geometry objects have an S3 class indicating (1) dimension, (2) type, (3) superclass:

(pt  <- st_point(c(0, 1)))
attributes(pt)
#> $class
#> [1] "XY"    "POINT" "sfg"

(pt2 <- st_point(c(0, 1, 4)))   # XYZ
attributes(pt2)
#> $class
#> [1] "XYZ"   "POINT" "sfg"

sfg: Same Points, Different Meaning

The same coordinate matrix creates geometries with entirely different dimensionality:

m1 <- rbind(c(8,1), c(2,5), c(3,2))

(mp <- st_multipoint(m1))   # zero-dimensional point set
(ls <- st_linestring(m1))   # one-dimensional line

st_dimension(mp); st_length(mp)
#> [1] 0
#> [1] 0
st_dimension(ls); st_length(ls)
#> [1] 1
#> [1] 10.37338

Mixed Geometries and Collections

When features have different geometry types, the type becomes GEOMETRY.

Example: A collection with POINT + LINESTRING:

g <- st_sfc(st_point(c(0, 0)), 
            st_linestring(rbind(c(0, 0), c(1, 1))))

Filter by type with st_is():

st_sf(geometry = g) |>
  filter(st_is(geometry, "LINESTRING"))   # keep only lines
#> Simple feature collection with 1 feature and 0 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
#> CRS:           NA
#>                geometry
#> 1 LINESTRING (0 0, 1 1)

When intersections produce mixed types, use st_collection_extract():

pt   <- st_point(c(1, 0))
ls   <- st_linestring(matrix(c(4, 3, 0, 0), ncol = 2))
poly <- st_polygon(list(matrix(c(5.5, 7, 7, 6, 5.5, 0, 0, -0.5, -0.5, 0), ncol = 2)))
collection <- st_geometrycollection(list(pt, ls, poly))

st_collection_extract(collection, "POLYGON")  # extract only polygons

Conversion Between Types: st_cast

(co1   <- filter(co, name == "Larimer")$geometry)
#> Geometry set for 1 feature 
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -106.1954 ymin: 40.25778 xmax: -104.9431 ymax: 40.99844
#> Geodetic CRS:  WGS 84
(co_ls <- st_cast(co1, "MULTILINESTRING"))
#> Geometry set for 1 feature 
#> Geometry type: MULTILINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: -106.1954 ymin: 40.25778 xmax: -104.9431 ymax: 40.99844
#> Geodetic CRS:  WGS 84

Casting: Feature vs. Set Level

Casting operates at the feature level, not the set level:

rbind(c(0,0), c(1,1), c(1,0), c(0,1)) |>
  st_linestring() |>
  st_cast("POINT")   # ⚠️ converts ONE line to ONE point — not what you want
#> Warning in st_cast.LINESTRING(st_linestring(rbind(c(0, 0), c(1, 1), c(1, :
#> point from first coordinate only
#> POINT (0 0)

To extract individual points, work at the set level (sfc):

(p <- rbind(c(0,0), c(1,1), c(1,0), c(0,1)) |>
   st_linestring() |>
   st_sfc() |>           # create a SET of one geometry
   st_cast("POINT"))     # now produces 4 individual points
#> Geometry set for 4 features 
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
#> CRS:           NA

Sets of geometries

rbind(c(0,0), c(1,1), c(1,0), c(0,1))
#>      [,1] [,2]
#> [1,]    0    0
#> [2,]    1    1
#> [3,]    1    0
#> [4,]    0    1

Sets of geometries

rbind(c(0,0), c(1,1), c(1,0), c(0,1)) |>
  st_linestring()

Sets of geometries

rbind(c(0,0), c(1,1), c(1,0), c(0,1)) |>
  st_linestring() |>
  st_sfc()
#> Geometry set for 1 feature 
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
#> CRS:           NA

Sets of geometries

rbind(c(0,0), c(1,1), c(1,0), c(0,1)) |>
  st_linestring() |>
  st_sfc() |>
  st_cast("POINT")
#> Geometry set for 4 features 
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: 0 ymin: 0 xmax: 1 ymax: 1
#> CRS:           NA

Sets of geometries

rbind(c(0,0), c(1,1), c(1,0), c(0,1)) |>
  st_linestring() |>
  st_sfc() |>
  st_cast("POINT") ->
  p

Combining and Dissolving Geometries

Two key operations for working with geometry sets:

st_combine() — joins geometries while preserving interior boundaries:

(co_combined <- st_combine(co$geometry))
#> Geometry set for 1 feature 
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -109.0602 ymin: 36.99246 xmax: -102.0415 ymax: 41.00342
#> Geodetic CRS:  WGS 84

st_union() — joins geometries and dissolves interior boundaries (merges into one feature):

(co_unioned <- st_union(co$geometry))
#> Geometry set for 1 feature 
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -109.0602 ymin: 36.99246 xmax: -102.0415 ymax: 41.00342
#> Geodetic CRS:  WGS 84

Visualizing Combine vs. Union

st_combine (boundaries preserved):

st_union (boundaries dissolved):

Water resources example: Use st_combine() to keep sub-basins separate, use st_union() to create a single merged basin outline.

sfc: Geometry List-Columns

st_sfc creates sets of geometries — what becomes the sticky list-column in sf objects:

(sfc <- st_sfc(st_point(c(0,1)), st_point(c(-3,2))))
#> Geometry set for 2 features 
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -3 ymin: 1 xmax: 0 ymax: 2
#> CRS:           NA
class(sfc)
#> [1] "sfc_POINT" "sfc"
attributes(sfc) |> names()
#> [1] "class"     "precision" "bbox"      "crs"       "n_empty"

The sfc object records for the whole set: precision, bounding box, CRS, and count of empty geometries.

Combine + Cast: From Boundaries to Lines

You can combine geometries and then cast them to a different type — useful for creating line networks from polygon boundaries:

co_borders_combined <- st_combine(co$geometry) |> st_cast("MULTILINESTRING")
co_borders_unioned  <- st_union(co$geometry) |> st_cast("MULTILINESTRING")

Tip

This distinction matters constantly in water resources: county boundaries vs. watershed boundaries, stream reaches vs. the river network, individual gauge catchments vs. the basin outline.

CONUS: Combine vs. Union

us_c_ml <- st_combine(conus) |> st_cast("MULTILINESTRING")
us_u_ml <- st_union(conus)   |> st_cast("MULTILINESTRING")

Distance to Denver: Thinking About Feature Level

Question: how far is Denver from the nearest state border?

denver_sf <- data.frame(y = 39.7392, x = -104.9903, name = "Denver") |>
  st_as_sf(coords = c("x", "y"), crs = 4326)

Attempt 1: distance to states (polygons)

conus
#> Simple feature collection with 49 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>    state_region state_division feature_code           state_name state_abbr
#> 1             3              6      1779775              Alabama         AL
#> 2             4              8      1779777              Arizona         AZ
#> 3             3              7      0068085             Arkansas         AR
#> 4             4              9      1779778           California         CA
#> 5             4              8      1779779             Colorado         CO
#> 6             1              1      1779780          Connecticut         CT
#> 7             3              5      1779781             Delaware         DE
#> 8             3              5      1702382 District of Columbia         DC
#> 9             3              5      0294478              Florida         FL
#> 10            3              5      1705317              Georgia         GA
#>                    name fip_class tiger_class combined_area_code
#> 1               Alabama      <NA>       G4000                 NA
#> 2               Arizona      <NA>       G4000                 NA
#> 3              Arkansas      <NA>       G4000                 NA
#> 4            California      <NA>       G4000                 NA
#> 5              Colorado      <NA>       G4000                 NA
#> 6           Connecticut      <NA>       G4000                 NA
#> 7              Delaware      <NA>       G4000                 NA
#> 8  District of Columbia      <NA>       G4000                 NA
#> 9               Florida      <NA>       G4000                 NA
#> 10              Georgia      <NA>       G4000                 NA
#>    metropolitan_area_code functional_status    land_area  water_area fip_code
#> 1                    <NA>                 A 131175477769  4591897964       01
#> 2                    <NA>                 A 294363973043   855871553       04
#> 3                    <NA>                 A 134660767709  3121950081       05
#> 4                    <NA>                 A 403671756816 20293573058       06
#> 5                    <NA>                 A 268418796417  1185716938       08
#> 6                    <NA>                 A  12541690473  1816424193       09
#> 7                    <NA>                 A   5046731559  1399179670       10
#> 8                    <NA>                 A    158316124    18709762       11
#> 9                    <NA>                 A 138961722096 45972570361       12
#> 10                   <NA>                 A 149486624386  4418360134       13
#>                          geometry
#> 1  MULTIPOLYGON (((-85.4883 30...
#> 2  MULTIPOLYGON (((-110.7507 3...
#> 3  MULTIPOLYGON (((-90.95577 3...
#> 4  MULTIPOLYGON (((-116.1062 3...
#> 5  MULTIPOLYGON (((-105.155 36...
#> 6  MULTIPOLYGON (((-72.5279 41...
#> 7  MULTIPOLYGON (((-75.13846 3...
#> 8  MULTIPOLYGON (((-77.00244 3...
#> 9  MULTIPOLYGON (((-83.10874 2...
#> 10 MULTIPOLYGON (((-81.09538 3...

Attempt 1: distance to states (polygons)

conus |>
  select(state_name)
#> Simple feature collection with 49 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>              state_name                       geometry
#> 1               Alabama MULTIPOLYGON (((-85.4883 30...
#> 2               Arizona MULTIPOLYGON (((-110.7507 3...
#> 3              Arkansas MULTIPOLYGON (((-90.95577 3...
#> 4            California MULTIPOLYGON (((-116.1062 3...
#> 5              Colorado MULTIPOLYGON (((-105.155 36...
#> 6           Connecticut MULTIPOLYGON (((-72.5279 41...
#> 7              Delaware MULTIPOLYGON (((-75.13846 3...
#> 8  District of Columbia MULTIPOLYGON (((-77.00244 3...
#> 9               Florida MULTIPOLYGON (((-83.10874 2...
#> 10              Georgia MULTIPOLYGON (((-81.09538 3...

Attempt 1: distance to states (polygons)

conus |>
  select(state_name) %>%
  mutate(dist = st_distance(., denver_sf))
#> Simple feature collection with 49 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>              state_name                       geometry          dist
#> 1               Alabama MULTIPOLYGON (((-85.4883 30... 1571007.9 [m]
#> 2               Arizona MULTIPOLYGON (((-110.7507 3...  466602.9 [m]
#> 3              Arkansas MULTIPOLYGON (((-90.95577 3...  975519.6 [m]
#> 4            California MULTIPOLYGON (((-116.1062 3... 1000950.5 [m]
#> 5              Colorado MULTIPOLYGON (((-105.155 36...       0.0 [m]
#> 6           Connecticut MULTIPOLYGON (((-72.5279 41... 2636635.7 [m]
#> 7              Delaware MULTIPOLYGON (((-75.13846 3... 2485997.1 [m]
#> 8  District of Columbia MULTIPOLYGON (((-77.00244 3... 2388920.9 [m]
#> 9               Florida MULTIPOLYGON (((-83.10874 2... 1847447.4 [m]
#> 10              Georgia MULTIPOLYGON (((-81.09538 3... 1788781.6 [m]

Attempt 1: distance to states (polygons)

conus |>
  select(state_name) %>%
  mutate(dist = st_distance(., denver_sf)) |>
  slice_min(dist, n = 3)
#> Simple feature collection with 3 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -111.0546 ymin: 36.99246 xmax: -95.30829 ymax: 45.00582
#> Geodetic CRS:  WGS 84
#>   state_name         dist                       geometry
#> 1   Colorado      0.0 [m] MULTIPOLYGON (((-105.155 36...
#> 2    Wyoming 140002.6 [m] MULTIPOLYGON (((-106.3212 4...
#> 3   Nebraska 161243.2 [m] MULTIPOLYGON (((-95.93779 4...

Attempt 1: distance to states (polygons)

conus |>
  select(state_name) %>%
  mutate(dist = st_distance(., denver_sf)) |>
  slice_min(dist, n = 3) ->
  state_data
  • NOTE: the `%>% pipe needed here …
  • Distance to Colorado is 0 — Denver is inside a polygon, so distance = 0
  • Polygons describe areas — distance to the interior is always 0

Attempt 2: cast to lines first

conus
#> Simple feature collection with 49 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>    state_region state_division feature_code           state_name state_abbr
#> 1             3              6      1779775              Alabama         AL
#> 2             4              8      1779777              Arizona         AZ
#> 3             3              7      0068085             Arkansas         AR
#> 4             4              9      1779778           California         CA
#> 5             4              8      1779779             Colorado         CO
#> 6             1              1      1779780          Connecticut         CT
#> 7             3              5      1779781             Delaware         DE
#> 8             3              5      1702382 District of Columbia         DC
#> 9             3              5      0294478              Florida         FL
#> 10            3              5      1705317              Georgia         GA
#>                    name fip_class tiger_class combined_area_code
#> 1               Alabama      <NA>       G4000                 NA
#> 2               Arizona      <NA>       G4000                 NA
#> 3              Arkansas      <NA>       G4000                 NA
#> 4            California      <NA>       G4000                 NA
#> 5              Colorado      <NA>       G4000                 NA
#> 6           Connecticut      <NA>       G4000                 NA
#> 7              Delaware      <NA>       G4000                 NA
#> 8  District of Columbia      <NA>       G4000                 NA
#> 9               Florida      <NA>       G4000                 NA
#> 10              Georgia      <NA>       G4000                 NA
#>    metropolitan_area_code functional_status    land_area  water_area fip_code
#> 1                    <NA>                 A 131175477769  4591897964       01
#> 2                    <NA>                 A 294363973043   855871553       04
#> 3                    <NA>                 A 134660767709  3121950081       05
#> 4                    <NA>                 A 403671756816 20293573058       06
#> 5                    <NA>                 A 268418796417  1185716938       08
#> 6                    <NA>                 A  12541690473  1816424193       09
#> 7                    <NA>                 A   5046731559  1399179670       10
#> 8                    <NA>                 A    158316124    18709762       11
#> 9                    <NA>                 A 138961722096 45972570361       12
#> 10                   <NA>                 A 149486624386  4418360134       13
#>                          geometry
#> 1  MULTIPOLYGON (((-85.4883 30...
#> 2  MULTIPOLYGON (((-110.7507 3...
#> 3  MULTIPOLYGON (((-90.95577 3...
#> 4  MULTIPOLYGON (((-116.1062 3...
#> 5  MULTIPOLYGON (((-105.155 36...
#> 6  MULTIPOLYGON (((-72.5279 41...
#> 7  MULTIPOLYGON (((-75.13846 3...
#> 8  MULTIPOLYGON (((-77.00244 3...
#> 9  MULTIPOLYGON (((-83.10874 2...
#> 10 MULTIPOLYGON (((-81.09538 3...

Attempt 2: cast to lines first

conus |>
  select(state_name)
#> Simple feature collection with 49 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>              state_name                       geometry
#> 1               Alabama MULTIPOLYGON (((-85.4883 30...
#> 2               Arizona MULTIPOLYGON (((-110.7507 3...
#> 3              Arkansas MULTIPOLYGON (((-90.95577 3...
#> 4            California MULTIPOLYGON (((-116.1062 3...
#> 5              Colorado MULTIPOLYGON (((-105.155 36...
#> 6           Connecticut MULTIPOLYGON (((-72.5279 41...
#> 7              Delaware MULTIPOLYGON (((-75.13846 3...
#> 8  District of Columbia MULTIPOLYGON (((-77.00244 3...
#> 9               Florida MULTIPOLYGON (((-83.10874 2...
#> 10              Georgia MULTIPOLYGON (((-81.09538 3...

Attempt 2: cast to lines first

conus |>
  select(state_name) |>
  st_cast("MULTILINESTRING")
#> Simple feature collection with 49 features and 1 field
#> Geometry type: MULTILINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>              state_name                       geometry
#> 1               Alabama MULTILINESTRING ((-85.4883 ...
#> 2               Arizona MULTILINESTRING ((-110.7507...
#> 3              Arkansas MULTILINESTRING ((-90.95577...
#> 4            California MULTILINESTRING ((-116.1062...
#> 5              Colorado MULTILINESTRING ((-105.155 ...
#> 6           Connecticut MULTILINESTRING ((-72.5279 ...
#> 7              Delaware MULTILINESTRING ((-75.13846...
#> 8  District of Columbia MULTILINESTRING ((-77.00244...
#> 9               Florida MULTILINESTRING ((-83.10874...
#> 10              Georgia MULTILINESTRING ((-81.09538...

Attempt 2: cast to lines first

conus |>
  select(state_name) |>
  st_cast("MULTILINESTRING") %>%
  mutate(dist = st_distance(., denver_sf))
#> Simple feature collection with 49 features and 2 fields
#> Geometry type: MULTILINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>              state_name                       geometry          dist
#> 1               Alabama MULTILINESTRING ((-85.4883 ... 1571007.9 [m]
#> 2               Arizona MULTILINESTRING ((-110.7507...  466602.9 [m]
#> 3              Arkansas MULTILINESTRING ((-90.95577...  975519.6 [m]
#> 4            California MULTILINESTRING ((-116.1062... 1000950.5 [m]
#> 5              Colorado MULTILINESTRING ((-105.155 ...  140002.6 [m]
#> 6           Connecticut MULTILINESTRING ((-72.5279 ... 2636635.7 [m]
#> 7              Delaware MULTILINESTRING ((-75.13846... 2485997.1 [m]
#> 8  District of Columbia MULTILINESTRING ((-77.00244... 2388920.9 [m]
#> 9               Florida MULTILINESTRING ((-83.10874... 1847447.4 [m]
#> 10              Georgia MULTILINESTRING ((-81.09538... 1788781.6 [m]

Attempt 2: cast to lines first

conus |>
  select(state_name) |>
  st_cast("MULTILINESTRING") %>%
  mutate(dist = st_distance(., denver_sf)) |>
  slice_min(dist, n = 3)
#> Simple feature collection with 3 features and 2 fields
#> Geometry type: MULTILINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: -111.0546 ymin: 36.99246 xmax: -95.30829 ymax: 45.00582
#> Geodetic CRS:  WGS 84
#>   state_name         dist                       geometry
#> 1   Colorado 140002.6 [m] MULTILINESTRING ((-105.155 ...
#> 2    Wyoming 140002.6 [m] MULTILINESTRING ((-106.3212...
#> 3   Nebraska 161243.2 [m] MULTILINESTRING ((-95.93779...

The Right Feature Level

Computing distance to 48 state borders when we only need the nearest one is wasteful. The feature should be the complete set of borders — one MULTILINESTRING:

st_distance(denver_sf, st_cast(st_combine(conus), "MULTILINESTRING"))
#> Units: [m]
#>          [,1]
#> [1,] 140002.6
  • 1 calculation instead of 48
  • Correct answer: distance to the border, not to individual state features
  • Same principle applies to: distance to national border, distance to any river network, distance to nearest dam

The sf Object

sf: Attributes + Geometry

co
#> Simple feature collection with 64 features and 4 fields
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -109.0602 ymin: 36.99246 xmax: -102.0415 ymax: 41.00342
#> Geodetic CRS:  WGS 84
#> First 10 features:
#>    geoid       name      aland state_nm                       geometry
#> 1  08001      Adams 3021840487 Colorado MULTIPOLYGON (((-105.0532 3...
#> 2  08003    Alamosa 1871643028 Colorado MULTIPOLYGON (((-105.4855 3...
#> 3  08005   Arapahoe 2066438714 Colorado MULTIPOLYGON (((-103.7065 3...
#> 4  08007  Archuleta 3496712164 Colorado MULTIPOLYGON (((-107.1287 3...
#> 5  08009       Baca 6617400567 Colorado MULTIPOLYGON (((-102.0416 3...
#> 6  08011       Bent 3918255148 Colorado MULTIPOLYGON (((-102.7476 3...
#> 7  08013    Boulder 1881325109 Colorado MULTIPOLYGON (((-105.3978 3...
#> 8  08014 Broomfield   85386685 Colorado MULTIPOLYGON (((-105.1092 3...
#> 9  08015    Chaffee 2624715692 Colorado MULTIPOLYGON (((-105.9698 3...
#> 10 08017   Cheyenne 4605713960 Colorado MULTIPOLYGON (((-103.1729 3...
class(co)
#> [1] "sf"         "data.frame"
attr(co, "sf_column")
#> [1] "geometry"

An sf object is a data.frame with one special list-column: the geometry. It extends data.frame with spatial awareness.

Sticky Geometry

The geometry column persists through dplyr operations:

co |> 
  select(name) |> 
  slice(1:2)   # geometry comes along
#> Simple feature collection with 2 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -106.0393 ymin: 37.3562 xmax: -103.7057 ymax: 40.00137
#> Geodetic CRS:  WGS 84
#>      name                       geometry
#> 1   Adams MULTIPOLYGON (((-105.0532 3...
#> 2 Alamosa MULTIPOLYGON (((-105.4855 3...

Drop it explicitly when you don’t need it:

co |> st_drop_geometry() |> select(name) |> slice(1:2)
#>      name
#> 1   Adams
#> 2 Alamosa

Plotting sf Objects with ggplot2

geom_sf() is the bridge between sf and ggplot2 — everything from last week works:

ggplot(data = co) +
  geom_sf(aes(fill = aland)) +
  scale_fill_viridis_c(name = "Land Area (m²)") +
  labs(title = "Colorado Counties by Land Area") +
  theme_minimal()

All dplyr verbs work on sf objects first, then pipe into ggplot:

co |>
  filter(name == "Larimer") |>
  ggplot() + geom_sf(fill = "#1E4D2B") + theme_void()

The Sticky Geometry Structure

The geometry binding is unique — unlike bind_cols() or cbind(), the sfc list-column carries CRS, bounding box, and precision metadata that regular column binding does not.

Simple Features: What You’ve Learned

You now understand:

  • The seven geometry types — POINT, LINESTRING, POLYGON, and their multi-equivalents
  • The class hierarchysfg (individual geometry) → sfc (set) → sf (dataframe with sticky geometry)
  • Valid geometries — self-intersecting lines and invalid polygons break spatial operations silently
  • WKT and WKB — human-readable and machine-readable encodings
  • Casting and dissolving — changing geometry type and boundary representation
  • Sticky geometry in dplyrsf objects work naturally with pipes, filters, and ggplot

Now a 5-minute break, then: What makes these coordinates actually spatial? We’ll cover the Coordinate Reference System — the foundation that anchors every spatial operation.

☕ Break

5 Minutes

Back in 5. Next: coordinate systems, projections, and the three-part system (ellipsoid, datum, prime meridian) that determines whether your spatial analysis is correct or silently wrong.

Coordinate Systems

What Makes Spatial Data Spatial?

The geometry alone is meaningless without knowing where on Earth those coordinates refer to. That’s the job of the Coordinate Reference System.

Coordinate Reference Systems

A Coordinate Reference System (CRS) defines how spatial features relate to the surface of the Earth.

Two broad types:

  • Geographic Coordinate System (GCS) — locations on the curved surface of the Earth, measured in angular units (degrees)
  • Projected Coordinate System (PCS) — locations on a flat surface, measured in linear units (meters, feet)

In sf, three tools:

Function Does
st_crs() Retrieve CRS from an sf or sfc object
st_set_crs() Set or replace CRS (no coordinate transformation)
st_transform() Transform coordinates from one CRS to another

Geographic Coordinate Systems

A GCS identifies locations on the curved surface of the Earth using angular measurements from the center:

  • Latitude — vertical angle (North/South from equator)
  • Longitude — horizontal angle (East/West from prime meridian)
  • North and East are positive; South and West are negative

A GCS is defined by three components:

  1. Ellipsoid — the mathematical model of Earth’s shape (flattened sphere, not perfect)
  2. Datum — how the ellipsoid is anchored to the real Earth (locally fitted or globally centered)
  3. Prime meridian — the origin of longitude (Greenwich, 0°E/W)

Why this matters: two datasets may use the same coordinates but different datums. The difference is small (sub-meter) but compounds in spatial joins and distance calculations.

Common Datums in Water Resources

Three datums account for most US data:

Datum Acronym Used by Typical error vs. NAD83
North American Datum 1983 NAD83 NHDPlus, most USGS vector data Reference datum
World Geodetic System 1984 WGS84 GPS, NWIS gauges, web maps <1 meter in CONUS
NAD27 (historic) NAD27 Legacy sources only Can be 100+ meters off

Note

NAD83 and WGS84 differ by less than 1 meter for most of CONUS. But they are not the same CRS. Mixing them without explicit transformation introduces sub-meter errors that compound in spatial joins. Always check with st_crs(a) == st_crs(b) before joining.

Projected Coordinate Systems

A Projected Coordinate System (PCS) transforms the curved Earth onto a 2D plane. Every transformation introduces distortion.

Three main projection families:

  • Planar — flat surface touches globe at a point (best for polar regions)
  • Cylindrical — cylinder wraps around globe (world maps, Mercator)
  • Conic — cone around globe (mid-latitude areas, CONUS)

Spatial Properties and Distortion

Each type of projection prioritizes one property at the cost of others:

Property Preserved by Used for
Area Equal-area (Albers) Accurate area calculations, choropleth maps
Distance Equidistant Distance from a fixed center point
Shape Conformal (Mercator) Navigation, preserving local angles
Direction Azimuthal Navigation from a pole

Projection Choice: Water Resources Defaults

For CONUS analysis, use these defaults:

Important

  • Computing area (choropleth maps, watershed statistics) → EPSG:5070 (Albers Equal Area). This is the standard federal projection for CONUS.
  • Measuring distance (gauge to outlet, confluences, dam spacing) → Keep data in geographic (WGS84, EPSG:4326) and use st_distance() with default S2 spherical geometry — it’s geodesic and correct.
  • Sub-state local analysis → Use a local UTM zone (e.g., NAD83 / UTM zone 13) if you need projected planar distance.
  • Visualization/mapping → Use whatever looks right for your region (Web Mercator for web, Albers for print).

EPSG Codes and PROJ Strings

sf stores CRS in two ways:

  • EPSG code — integer ID for a known CRS (e.g., 4326 = WGS84 geographic)
  • PROJ4 string — generic text description understood by the PROJ library

Common codes for water resources work:

EPSG Name Use
4326 WGS84 geographic GPS coordinates, web maps, NWIS gauges
4269 NAD83 geographic NHDPlus, most USGS vector data
5070 NAD83 / Conus Albers Area calculations, CONUS-wide analysis
26913–26915 UTM NAD83 zones 13–15 Local/state-level distance calculations

Practical rule: Use EPSG:5070 for CONUS area calculations; use a local UTM (e.g., NAD83 / UTM zones) for sub-state analyses.

Checking and Setting CRS

st_crs(conus)$epsg
#> [1] 4326
st_crs(conus)$proj4string
#> [1] "+proj=longlat +datum=WGS84 +no_defs"
st_crs(conus)$datum
#> [1] "WGS84"
conus5070 <- st_transform(conus, 5070)
st_crs(conus5070)$epsg
#> [1] 5070
st_crs(conus5070)$proj4string
#> [1] "+proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs"

Visualizing Projections

#1 Reproducibility Habit: Always Check CRS

Before any spatial operation — joins, distance calculations, area calculations — verify CRS match:

# Guard pattern: check before joining
if (st_crs(gauges) != st_crs(flowlines)) {
  flowlines <- st_transform(flowlines, st_crs(gauges))
}

# Or use the equality check
st_crs(gauges) == st_crs(flowlines)  # must be TRUE

Real example: NWIS gauges (WGS84, EPSG:4326) + NHD flowlines (NAD83, EPSG:4269):

st_crs(4326) == st_crs(4269)    # FALSE — they are not the same!
#> [1] FALSE

They look similar (sub-meter differences in CONUS), but snapping a point to a line or computing a distance will be subtly wrong without transformation.

Important

sf will warn you about CRS mismatch but won’t stop you. The result will execute and look plausible — but numbers will be wrong. Build the guard-check habit now.

Measurements

Geodesic vs. Planar Measurements

Geodesic distance: shortest path on an ellipsoid — true Earth surface distance.

Planar distance: Euclidean distance in projected coordinates — fast but introduces error.

pts <- data.frame(y = c(40.7128, 34.4208), x = c(-74.0060, -119.6982)) |>
  st_as_sf(coords = c("x", "y"), crs = 4326)

st_distance(pts)                    # Geodesic: ~3,962 km (correct)
st_distance(st_transform(pts, 5070))# Planar: ~3,950 km (distorted)

Important

For water resources: use geodesic (default). In geographic CRS, st_distance() automatically computes geodesic (spherical) distances — correct for continental scales.

Note

sf uses geodesic by default in geographic CRS. Toggle S2 with sf::sf_use_s2(TRUE/FALSE) for fine control (rarely needed).

Note

Performance: S2 geodesic calculations used to be substantially slower, but modern versions of sf and the S2 library have optimized performance significantly. Geodesic is now only ~25% slower than planar for large operations — a measurable but not critical difference for most workflows. Use geodesic unless you’re profiling and find it’s a real bottleneck.

What Is S2?

S2 is Google’s spherical geometry library — a modern C++ implementation for accurate calculations on a sphere/ellipsoid.

sf uses S2 under the hood when data is in geographic coordinates (e.g., WGS84). Instead of flattening Earth to a 2D plane (which introduces distortion), S2 works directly with spherical coordinates:

  • Ellipsoidal geometry — computes distances, intersections, areas on the actual Earth shape
  • Efficient spatial indexing — divides the sphere into hierarchical cells for fast operations
  • Replaces GEOS for geographic data — when your CRS is in degrees, S2 handles the heavy lifting
  • Fast enough in practice — modern optimizations have made geodesic calculations competitive with planar

Important

S2 is automatic: if your data is in geographic coordinates, st_distance(), st_intersects(), and other spatial operations use S2 geodesic calculations by default. You don’t need to do anything.

Controlling S2: When and How

S2 is the right default for water resources work (continental scales, mixed coordinate systems), but you can toggle it for specialized cases:

# Disable S2 (use GEOS planar geometry instead)
sf::sf_use_s2(FALSE)

# Re-enable S2
sf::sf_use_s2(TRUE)

# Check current setting
sf::sf_use_s2()

When to disable S2: - Very small local areas (< 10 km) where planar error is negligible and you want marginal speed gains - Specialized geometric algorithms that require GEOS planar operations (rare in practice)

Leave S2 enabled (default): - Continental and larger scales (your typical water resources workflow) - Mixing geographic and projected data - When correctness matters more than squeezing out microseconds

Units as an S3 Class — Workflow

Results carry units as an S3 class attribute — both powerful for correctness and a source of confusion in practice.

class(conus)
#> [1] "sf"         "data.frame"
attributes(conus)
#> $names
#>  [1] "state_region"           "state_division"         "feature_code"          
#>  [4] "state_name"             "state_abbr"             "name"                  
#>  [7] "fip_class"              "tiger_class"            "combined_area_code"    
#> [10] "metropolitan_area_code" "functional_status"      "land_area"             
#> [13] "water_area"             "fip_code"               "geometry"              
#> 
#> $row.names
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
#> 
#> $sf_column
#> [1] "geometry"
#> 
#> $agr
#>           state_region         state_division           feature_code 
#>                   <NA>                   <NA>                   <NA> 
#>             state_name             state_abbr                   name 
#>                   <NA>                   <NA>                   <NA> 
#>              fip_class            tiger_class     combined_area_code 
#>                   <NA>                   <NA>                   <NA> 
#> metropolitan_area_code      functional_status              land_area 
#>                   <NA>                   <NA>                   <NA> 
#>             water_area               fip_code 
#>                   <NA>                   <NA> 
#> Levels: constant aggregate identity
#> 
#> $class
#> [1] "sf"         "data.frame"

class(st_length(conus))
#> [1] "units"
attributes(st_length(conus)) |> unlist()
#> units.numerator           class 
#>             "m"         "units"

Units Workflow: Create → Convert → Drop

Create units naturally from measurements, convert between units, then drop before plotting or filtering:

Convert between units with the units package:

co <- AOI::aoi_get(state = "CO")
(l = st_length(st_cast(co, "MULTILINESTRING")))
#> 2100640 [m]

units::set_units(l, "km")
#> 2100.64 [km]
units::set_units(l, "mile")
#> 1305.277 [mile]

(a = st_area(co))
#> 2.69382e+11 [m^2]

units::set_units(a, "ha")
#> 26938196 [ha]
units::set_units(a, "km2")
#> 269382 [km^2]

Strip units when needed for arithmetic or plotting:

units::drop_units(l)
#> [1] 2100640
as.numeric(l)
#> [1] 2100640

Units Are a Class — Know the Rules

Units enforce dimensional correctness but can surprise you:

st_length(conus) + 100   # can't add meters + unitless number

filter(cities, dist_to_border < 160)  # may warn or fail

The corrected patterns:

# Create numeric version for filtering/plotting
cities <- cities |>
  mutate(dist_to_border_km = as.numeric(units::set_units(dist_to_border, "km")))

# Now filter and plot safely
cities |>
  filter(dist_to_border_km < 160) |>
  ggplot(aes(color = dist_to_border_km)) +
  geom_sf()

Getting Spatial Data into R

Building sf from a CSV

When coordinates are stored as numeric columns in a CSV — the most common format for point data from APIs, sensors, and databases — use st_as_sf():

cities <- read_csv("data/uscities.csv") |>
  st_as_sf(coords = c("lng", "lat"),  # column names: X first, then Y
           crs    = 4326)             # raw lat/lon is almost always WGS84

Three things to always specify:

  1. coords = — column names in order X (longitude), Y (latitude)
  2. crs = — the CRS the raw numbers are in (not what you want them in)
  3. Then st_transform() to convert to your analysis CRS

Full pipeline:

cities <- read_csv("data/uscities.csv") |>
  st_as_sf(coords = c("lng", "lat"), crs = 4326) |>
  st_transform(5070)                # reproject for distance analysis

End-to-End: CSV → Filtered → Transformed → Measured

This is the complete workflow to compute the distance from every city to the national border:

# Pre-compute boundary in projected space (outside the pipeline)
boundary <- st_cast(st_union(st_transform(conus, 5070)), "MULTILINESTRING")

# Main workflow: read, build, transform, measure
system.time({
cities <- read_csv("data/uscities.csv", show_col_types =FALSE) |>
  st_as_sf(coords = c("lng", "lat"), crs = 4326) |>
  st_transform(5070) %>%
  mutate(
    dist_to_border_m = st_distance(., boundary),
    dist_to_border_km = units::set_units(dist_to_border_m, "km")
  )
})
#>    user  system elapsed 
#>   1.665   0.018   1.633

# user  system elapsed 
# 1.648   0.014   1.677

Key takeaways: - Build sf: CSV coordinates + CRS they’re in → st_as_sf() - Transform: project to EPSG:5070 for measurements - Measure: distance returns units object; convert as needed - Reuse: pass pre-computed boundary to st_distance() — clean, fast, readable

Common Pitfalls ⚠️

Pitfall: Distances in Geographic Space

boundary <- st_cast(st_union(conus),"MULTILINESTRING")

cities <- read_csv("data/uscities.csv") |>
  st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
  mutate(dist = st_distance(., boundary))

# user    system  elapsed
# 80.763  0.300   81.055

Rule: Transform geometry to projected CRS before expensive operations like st_union(), st_intersection(), etc.

Pitfall 2: Comparing Units

dist <- st_distance(cities, boundary)  # Returns [m]

filter(cities, dist < 100)              # ❌ ERROR
filter(cities, as.numeric(dist) < 100)  # ✅ CORRECT

Rule: Drop units with as.numeric() before comparisons.

Important

Principle: Choose CRS strategically.

  • Geographic (4326) by default for input/output (standard data format)
  • Project immediately to EPSG:5070 (or your region’s standard) for analysis
  • Pre-compute expensive operations (union, clip, buffer) in projected space outside the data pipeline
  • This respects both correctness (geodesic) and performance (planar operations where they matter)

Visualizing Spatial Data

geom_sf(): The Foundation

geom_sf() is ggplot2’s spatial geometry layer. It automatically extracts coordinates from the geometry column and plots them. No need to specify x and y aesthetics.

Points

ggplot(cities) +
  geom_sf(color = "navy", size = 1)

Polygons

ggplot(conus) +
  geom_sf(fill = "lightblue", color = "black")

Key advantage: All standard ggplot2 aesthetics work — color, fill, size, alpha, linetype, etc. And you can layer multiple geom_sf() calls:

ggplot() +
  geom_sf(data = conus, fill = "white") +
  geom_sf(data = st_filter(cities, st_transform(conus, 5070)), color = "red", size = 0.1) +
  theme_void()

Tip

Always add theme_void() to remove axes, gridlines, and background — maps don’t need Cartesian coordinates.

ggrepel: Non-Overlapping Labels

ggrepel solves label collision — essential any time you label points on a map:

library(ggrepel)

ggplot() +
  geom_sf(data = conus, fill = "white") +
  geom_sf(data = st_filter(cities, st_transform(conus, 5070)), color = "red", size = 0.1) +
  ggrepel::geom_label_repel(
    data = filter(cities, city == "Denver"),
    aes(geometry = geometry, label = city),
    stat = "sf_coordinates",  # extracts coordinates from sf geometry
    size = 3
  ) +
  theme_void()

stat = "sf_coordinates" is the key argument — it tells ggrepel to extract X/Y from the geometry column rather than expecting explicit x and y aesthetics. Without it, you get an error.

gghighlight: Emphasize a Subset

gghighlight greys out features that do not meet a condition, keeping them as spatial context:

library(gghighlight)

ggplot() +
  geom_sf(data = conus) +
  geom_sf(data = st_filter(cities, st_transform(conus, 5070)),
          aes(color = as.numeric(dist_to_border_km)),
          size = 0.3) +
  gghighlight(as.numeric(dist_to_border_km) < 100) +
  scale_color_gradient(low = "orange", high = "darkred") +
  labs(title = "Cities within 100 km of the national border",
       color = "Distance (km)") +
  theme_void()

Features outside the condition are rendered in grey at reduced opacity — they provide spatial context without competing visually with the highlighted subset. Works with any geom_* — points, lines, polygons.

Wrapping Up

What We Covered

Simple Features

  • The ISO standard and its R implementation
  • The big 7 geometry types
  • sfg → sfc → sf class hierarchy
  • WKT / WKB encodings
  • Casting, combining, dissolving
  • Sticky geometry + geom_sf()

Coordinate Systems

  • GCS: ellipsoid, geoid, datum
  • PCS: projection types and distortion
  • EPSG codes and st_transform()
  • Always check CRS before joining

Measurements

  • Geodesic vs. planar distance
  • st_distance(), st_area(), st_length()
  • Units as an S3 class — set_units(), drop_units()
  • Converting and dropping units

Key habits built today:

  1. st_is_valid() on every dataset you read
  2. st_crs(a) == st_crs(b) before every spatial join
  3. Use equal-area CRS for areas, equidistant for distances
  4. as.numeric() before filtering on unit quantities

Next Week

Week 3: Spatial Predicates & Joins — relationships between geometries, filtering by space, and combining data sources