Vector Data — Features, Projections & Measures
Last week: data as rows and columns — wrangling, joining, visualizing, modeling.
This week: data as features in space — the same verbs, but now every observation has a geometry that describes where it is and what shape it takes.
Today’s arc:
sf)(We’ll take a 5-minute break after Simple Features.)
You’ve already used S3 classes in tidyverse without noticing. The same object type can behave differently based on its class.
Understanding S3 classes in R explains:
print() on a tibble looks different than a regular data.framefilter() on grouped data filters within each group instead of globallysummarize() on a grouped_df produces a different output shapeThis is the callback pattern — R looks at an object’s class and dispatches to the right function. Later, sf will extend data.frame with spatial methods that work the same way.
A class is metadata attached to an object telling R how to interpret and operate on it:
Same data. But now df has column names, printed summary, and methods that respect rectangular structure. The class changed the behavior without changing the underlying values.
Objects can have multiple classes in a hierarchy. Adding structure adds more classes:
# Plain data.frame
df <- data.frame(gauge = c("A", "B", "C"), flow = c(100, 120, 95))
class(df)
#> [1] "data.frame"
# Add grouping — now it has *two* classes
gdf <- group_by(df, gauge)
class(gdf)
#> [1] "grouped_df" "tbl_df" "tbl" "data.frame"
# Methods are dispatched: dplyr functions check for "grouped_df" first
slice(gdf, 1) # respects grouping — one row per group, not one global row
#> # A tibble: 3 × 2
#> # Groups: gauge [3]
#> gauge flow
#> <chr> <dbl>
#> 1 A 100
#> 2 B 120
#> 3 C 95
slice(df, 1) # no grouping — just one row total
#> gauge flow
#> 1 A 100The key insight: the same function behaves differently based on class. This is the callback mechanism. When you call slice(), R internally asks: “What class is this object?” and dispatches to the right implementation.
Every function that respects class has methods or versions specialized for each class:
When you call filter() on a grouped dataframe, R internally does:
grouped_df)filter.grouped_df()filter.data.frame()filter.grouped_df() method applies the condition within each group — the key behaviorCompare the same code on grouped vs ungrouped data:
df <- data.frame(site = c("A", "A", "B", "B"), flow = c(100, 120, 95, 110))
# On ungrouped data: keeps rows where flow > 105
df |> filter(flow > 105)
#> site flow
#> 1 A 120
#> 2 B 110
# On grouped data: keeps earliest row per group where flow > 105
group_by(df, site) |> filter(flow > 105)
#> # A tibble: 2 × 2
#> # Groups: site [2]
#> site flow
#> <chr> <dbl>
#> 1 A 120
#> 2 B 110The function is identical; the method dispatched changes based on class.
sfsf is powerful because it’s not built from scratch — it wraps three long standing C/C++ libraries, each with a focused job:
Your R code:
📦 GDAL
File I/O
Format translators
Reads/writes Shapefile, GeoJSON, GeoPackage, GeoTIFF
Returns WKB geometry
🧭 PROJ
Coordinate Systems
Transforms between 10,000+ CRS definitions
Handles datum shifts & projections
Underlies st_transform()
⚙️ GEOS
Geometry Ops
Intersection, union, buffer, centroid
Distance calculations
Spatial predicates (st_intersects())
Each library solves a hard problem:
| Library | Solves | Water Resources Example |
|---|---|---|
| GDAL | “How do I read a shapefile or GeoPackage?” | Read NHD flowlines, USGS boundaries, stream gauges |
| PROJ | “How do I convert WGS84 to Albers?” | Transform NWIS gauges from GPS coordinates to federal projection |
| GEOS | “Is this point inside this polygon? What’s the distance?” | Snap gauge to nearest flowline, clip to basin |
If we had to write these from scratch in R, it would take years. Instead, spatial computing in R benefits from decades of work by the OSGeo community.
sf Wrappersf doesn’t reinvent these wheels. It:
st_read(), st_write())st_transform())st_distance(), st_buffer(), spatial predicates)filter(), mutate(), select()) just work — same as tidyverseThis diagram shows the relationship: sf methods dispatch to GEOS for the heavy lifting, then return R objects.
Note
When you install the sf package on your machine, the R package manager compiles and links against GDAL, PROJ, and GEOS. If installation fails, it’s usually because one of these C/C++ libraries is missing from your system. This is why setting up spatial computing can be more complex than installing a pure-R package.
GDAL = Geospatial Data Abstraction Layer
A C++ library (developed by OSGeo, used by every GIS software) that:
When you call st_read("data.shp"), here’s what happens:
sf calls GDAL: “Read this file”sfsf wraps it in an sfc geometry + sf object structure and hands it to your R codeNote
You don’t write GDAL code directly. But st_read() and st_write() are thin wrappers around GDAL. Understanding this explains why st_read() “just works” on 50+ formats and why format conversions are trivial in R but painful elsewhere.
Key drivers for water resources:
| Driver | Format | Extension | Common Use |
|---|---|---|---|
| ESRI Shapefile | Shapefile | .shp | USGS NHD, NOAA, boundaries |
| GeoJSON | GeoJSON | .geojson | Google APIs, web services |
| GPKG | GeoPackage | .gpkg | NHDPlus HR, modern data |
| GTiff | GeoTIFF | .tif | USGS 3DEP elevation, imagery |
| netCDF | NetCDF | .nc | NOAA climate grids |
| HDF5 | HDF5 | .h5 | NASA satellite data |
Important
GeoPackage replaces Shapefiles: one file, complete metadata, SQLite-based. Shapefiles require sidecars (.shx, .dbf, .prj) and lack metadata. Use GeoPackage when possible.
When GDAL reads a file or GEOS performs an operation, it uses two standard encodings to pass geometry to R:
Well-Known Text (WKT) — human-readable (you’ll see this in error messages and discussions):
Well-Known Binary (WKB) — machine-readable, used internally by GDAL/GEOS for speed and lossless communication:
st_as_binary(x)
#> [1] 01 02 00 00 00 05 00 00 00 00 00 00 00 00 00 24 40 00 00 00 00 00 00 14 40
#> [26] 00 00 00 00 00 00 22 40 00 00 00 00 00 00 10 40 00 00 00 00 00 00 20 40 00
#> [51] 00 00 00 00 00 08 40 00 00 00 00 00 00 1c 40 00 00 00 00 00 00 00 40 00 00
#> [76] 00 00 00 00 18 40 00 00 00 00 00 00 f0 3fNote
You rarely write WKT/WKB directly, but understanding they exist explains why st_read() “just works”—GDAL converts any format to WKB, sf wraps it as an R object, and the roundtrip is seamless.
co
#> Simple feature collection with 64 features and 4 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -109.0602 ymin: 36.99246 xmax: -102.0415 ymax: 41.00342
#> Geodetic CRS: WGS 84
#> First 10 features:
#> geoid name aland state_nm geometry
#> 1 08001 Adams 3021840487 Colorado MULTIPOLYGON (((-105.0532 3...
#> 2 08003 Alamosa 1871643028 Colorado MULTIPOLYGON (((-105.4855 3...
#> 3 08005 Arapahoe 2066438714 Colorado MULTIPOLYGON (((-103.7065 3...
#> 4 08007 Archuleta 3496712164 Colorado MULTIPOLYGON (((-107.1287 3...
#> 5 08009 Baca 6617400567 Colorado MULTIPOLYGON (((-102.0416 3...
#> 6 08011 Bent 3918255148 Colorado MULTIPOLYGON (((-102.7476 3...
#> 7 08013 Boulder 1881325109 Colorado MULTIPOLYGON (((-105.3978 3...
#> 8 08014 Broomfield 85386685 Colorado MULTIPOLYGON (((-105.1092 3...
#> 9 08015 Chaffee 2624715692 Colorado MULTIPOLYGON (((-105.9698 3...
#> 10 08017 Cheyenne 4605713960 Colorado MULTIPOLYGON (((-103.1729 3...Simple features (ISO 19125-1:2004) is a standard describing how real-world objects can be represented in computers, with emphasis on their spatial geometry
It defines:
sf in RA feature is a thing in the real world — a river, a gauge, a watershed, a county.
The standard says: “A simple feature has both spatial and non-spatial attributes. Spatial attributes are geometry valued.”
str(co)
#> Classes 'sf' and 'data.frame': 64 obs. of 5 variables:
#> $ geoid : chr "08001" "08003" "08005" "08007" ...
#> $ name : chr "Adams" "Alamosa" "Arapahoe" "Archuleta" ...
#> $ aland : num 3.02e+09 1.87e+09 2.07e+09 3.50e+09 6.62e+09 ...
#> $ state_nm: chr "Colorado" "Colorado" "Colorado" "Colorado" ...
#> $ geometry:sfc_MULTIPOLYGON of length 64; first list element: List of 1
#> ..$ :List of 1
#> .. ..$ : num [1:132, 1:2] -105 -105 -105 -105 -105 ...
#> ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
#> - attr(*, "sf_column")= chr "geometry"
#> - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
#> ..- attr(*, "names")= chr [1:4] "geoid" "name" "aland" "state_nm"A point is composed of one coordinate pair in a specific coordinate system.
A linestring is composed of an ordered sequence of two or more coordinate points.
A polygon is composed of 4 or more points whose starting and ending point are the same.
Multi- variants are just sets of the same type:
MULTIPOINT = set of pointsMULTILINESTRING = set of lines (e.g., a braided river system)MULTIPOLYGON = set of polygons (e.g., an island chain, a fragmented watershed)GEOMETRYCOLLECTION = a set of mixed geometry types (rare, usually from intersections)
| Single | Description |
|---|---|
POINT |
Zero-dimensional — a single coordinate pair |
LINESTRING |
Sequence of points connected by straight line pieces — 1D |
POLYGON |
Closed ring with positive area — 2D; may have interior holes |
| Multi (same type) | Description |
|---|---|
MULTIPOINT |
Set of points |
MULTILINESTRING |
Set of linestrings |
MULTIPOLYGON |
Set of polygons |
| Mixed | Description |
|---|---|
GEOMETRYCOLLECTION |
Set of geometries of any type |
Note
These 7 types cover virtually everything in water resources work: gauge locations (POINT), stream reaches (LINESTRING), watershed boundaries (POLYGON), and their multi-feature equivalents.
All geometries are composed of points defined in 2-, 3-, or 4-D space:
Rules for valid geometries:
LINESTRING — no self-intersections
POLYGON — rings must be closed
POLYGON — holes inside exterior only
POLYGON — rings touch at points only
POLYGON — no repeated paths
Warning
Run st_is_valid() to check your data. Fix invalid geometries with st_make_valid().
Self-intersecting line:
Invalid polygon (crossing edges):
These operations fail silently unless checked.
EMPTY geometries serve as “NA” placeholders for geometry types — they arise naturally from spatial operations:
Remember nested lists? A geometry is just coordinates:
R stores multiple geometries side-by-side this way: Each row in a data.frame holds one geometry (which is a list of coordinates). This creates a geometry column — many rows, each with its own nested geometry.
data.frame with feature attributes and a geometry column (class: sf)Important
This is the S3 class hierarchy in action. sf is a subclass of data.frame adding the geometry column and specialized methods. When you call filter() on an sf object, R looks at its class and dispatches to the spatial-aware filter.sf() method, which respects the geometry column. Same pattern as grouped_df from tidyverse.
sfg: Simple Feature Geometrysfg objects carry the geometry for a single feature. Storage rules:
POINT → numeric vectorLINESTRING or POLYGON ring → matrix (each row = a point)sfg: Class StructureAll geometry objects have an S3 class indicating (1) dimension, (2) type, (3) superclass:
sfg: Same Points, Different MeaningThe same coordinate matrix creates geometries with entirely different dimensionality:
When features have different geometry types, the type becomes GEOMETRY.
Example: A collection with POINT + LINESTRING:
Filter by type with st_is():
When intersections produce mixed types, use st_collection_extract():
st_cast(co1 <- filter(co, name == "Larimer")$geometry)
#> Geometry set for 1 feature
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -106.1954 ymin: 40.25778 xmax: -104.9431 ymax: 40.99844
#> Geodetic CRS: WGS 84
(co_ls <- st_cast(co1, "MULTILINESTRING"))
#> Geometry set for 1 feature
#> Geometry type: MULTILINESTRING
#> Dimension: XY
#> Bounding box: xmin: -106.1954 ymin: 40.25778 xmax: -104.9431 ymax: 40.99844
#> Geodetic CRS: WGS 84Casting operates at the feature level, not the set level:
To extract individual points, work at the set level (sfc):
Two key operations for working with geometry sets:
st_combine() — joins geometries while preserving interior boundaries:
st_union() — joins geometries and dissolves interior boundaries (merges into one feature):
st_combine (boundaries preserved):
st_union (boundaries dissolved):
Water resources example: Use st_combine() to keep sub-basins separate, use st_union() to create a single merged basin outline.
sfc: Geometry List-Columnsst_sfc creates sets of geometries — what becomes the sticky list-column in sf objects:
The sfc object records for the whole set: precision, bounding box, CRS, and count of empty geometries.
You can combine geometries and then cast them to a different type — useful for creating line networks from polygon boundaries:
Tip
This distinction matters constantly in water resources: county boundaries vs. watershed boundaries, stream reaches vs. the river network, individual gauge catchments vs. the basin outline.
Question: how far is Denver from the nearest state border?
#> Simple feature collection with 49 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS: WGS 84
#> First 10 features:
#> state_region state_division feature_code state_name state_abbr
#> 1 3 6 1779775 Alabama AL
#> 2 4 8 1779777 Arizona AZ
#> 3 3 7 0068085 Arkansas AR
#> 4 4 9 1779778 California CA
#> 5 4 8 1779779 Colorado CO
#> 6 1 1 1779780 Connecticut CT
#> 7 3 5 1779781 Delaware DE
#> 8 3 5 1702382 District of Columbia DC
#> 9 3 5 0294478 Florida FL
#> 10 3 5 1705317 Georgia GA
#> name fip_class tiger_class combined_area_code
#> 1 Alabama <NA> G4000 NA
#> 2 Arizona <NA> G4000 NA
#> 3 Arkansas <NA> G4000 NA
#> 4 California <NA> G4000 NA
#> 5 Colorado <NA> G4000 NA
#> 6 Connecticut <NA> G4000 NA
#> 7 Delaware <NA> G4000 NA
#> 8 District of Columbia <NA> G4000 NA
#> 9 Florida <NA> G4000 NA
#> 10 Georgia <NA> G4000 NA
#> metropolitan_area_code functional_status land_area water_area fip_code
#> 1 <NA> A 131175477769 4591897964 01
#> 2 <NA> A 294363973043 855871553 04
#> 3 <NA> A 134660767709 3121950081 05
#> 4 <NA> A 403671756816 20293573058 06
#> 5 <NA> A 268418796417 1185716938 08
#> 6 <NA> A 12541690473 1816424193 09
#> 7 <NA> A 5046731559 1399179670 10
#> 8 <NA> A 158316124 18709762 11
#> 9 <NA> A 138961722096 45972570361 12
#> 10 <NA> A 149486624386 4418360134 13
#> geometry
#> 1 MULTIPOLYGON (((-85.4883 30...
#> 2 MULTIPOLYGON (((-110.7507 3...
#> 3 MULTIPOLYGON (((-90.95577 3...
#> 4 MULTIPOLYGON (((-116.1062 3...
#> 5 MULTIPOLYGON (((-105.155 36...
#> 6 MULTIPOLYGON (((-72.5279 41...
#> 7 MULTIPOLYGON (((-75.13846 3...
#> 8 MULTIPOLYGON (((-77.00244 3...
#> 9 MULTIPOLYGON (((-83.10874 2...
#> 10 MULTIPOLYGON (((-81.09538 3...
#> Simple feature collection with 49 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS: WGS 84
#> First 10 features:
#> state_name geometry
#> 1 Alabama MULTIPOLYGON (((-85.4883 30...
#> 2 Arizona MULTIPOLYGON (((-110.7507 3...
#> 3 Arkansas MULTIPOLYGON (((-90.95577 3...
#> 4 California MULTIPOLYGON (((-116.1062 3...
#> 5 Colorado MULTIPOLYGON (((-105.155 36...
#> 6 Connecticut MULTIPOLYGON (((-72.5279 41...
#> 7 Delaware MULTIPOLYGON (((-75.13846 3...
#> 8 District of Columbia MULTIPOLYGON (((-77.00244 3...
#> 9 Florida MULTIPOLYGON (((-83.10874 2...
#> 10 Georgia MULTIPOLYGON (((-81.09538 3...
#> Simple feature collection with 49 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS: WGS 84
#> First 10 features:
#> state_name geometry dist
#> 1 Alabama MULTIPOLYGON (((-85.4883 30... 1571007.9 [m]
#> 2 Arizona MULTIPOLYGON (((-110.7507 3... 466602.9 [m]
#> 3 Arkansas MULTIPOLYGON (((-90.95577 3... 975519.6 [m]
#> 4 California MULTIPOLYGON (((-116.1062 3... 1000950.5 [m]
#> 5 Colorado MULTIPOLYGON (((-105.155 36... 0.0 [m]
#> 6 Connecticut MULTIPOLYGON (((-72.5279 41... 2636635.7 [m]
#> 7 Delaware MULTIPOLYGON (((-75.13846 3... 2485997.1 [m]
#> 8 District of Columbia MULTIPOLYGON (((-77.00244 3... 2388920.9 [m]
#> 9 Florida MULTIPOLYGON (((-83.10874 2... 1847447.4 [m]
#> 10 Georgia MULTIPOLYGON (((-81.09538 3... 1788781.6 [m]
#> Simple feature collection with 3 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -111.0546 ymin: 36.99246 xmax: -95.30829 ymax: 45.00582
#> Geodetic CRS: WGS 84
#> state_name dist geometry
#> 1 Colorado 0.0 [m] MULTIPOLYGON (((-105.155 36...
#> 2 Wyoming 140002.6 [m] MULTIPOLYGON (((-106.3212 4...
#> 3 Nebraska 161243.2 [m] MULTIPOLYGON (((-95.93779 4...
#> Simple feature collection with 49 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS: WGS 84
#> First 10 features:
#> state_region state_division feature_code state_name state_abbr
#> 1 3 6 1779775 Alabama AL
#> 2 4 8 1779777 Arizona AZ
#> 3 3 7 0068085 Arkansas AR
#> 4 4 9 1779778 California CA
#> 5 4 8 1779779 Colorado CO
#> 6 1 1 1779780 Connecticut CT
#> 7 3 5 1779781 Delaware DE
#> 8 3 5 1702382 District of Columbia DC
#> 9 3 5 0294478 Florida FL
#> 10 3 5 1705317 Georgia GA
#> name fip_class tiger_class combined_area_code
#> 1 Alabama <NA> G4000 NA
#> 2 Arizona <NA> G4000 NA
#> 3 Arkansas <NA> G4000 NA
#> 4 California <NA> G4000 NA
#> 5 Colorado <NA> G4000 NA
#> 6 Connecticut <NA> G4000 NA
#> 7 Delaware <NA> G4000 NA
#> 8 District of Columbia <NA> G4000 NA
#> 9 Florida <NA> G4000 NA
#> 10 Georgia <NA> G4000 NA
#> metropolitan_area_code functional_status land_area water_area fip_code
#> 1 <NA> A 131175477769 4591897964 01
#> 2 <NA> A 294363973043 855871553 04
#> 3 <NA> A 134660767709 3121950081 05
#> 4 <NA> A 403671756816 20293573058 06
#> 5 <NA> A 268418796417 1185716938 08
#> 6 <NA> A 12541690473 1816424193 09
#> 7 <NA> A 5046731559 1399179670 10
#> 8 <NA> A 158316124 18709762 11
#> 9 <NA> A 138961722096 45972570361 12
#> 10 <NA> A 149486624386 4418360134 13
#> geometry
#> 1 MULTIPOLYGON (((-85.4883 30...
#> 2 MULTIPOLYGON (((-110.7507 3...
#> 3 MULTIPOLYGON (((-90.95577 3...
#> 4 MULTIPOLYGON (((-116.1062 3...
#> 5 MULTIPOLYGON (((-105.155 36...
#> 6 MULTIPOLYGON (((-72.5279 41...
#> 7 MULTIPOLYGON (((-75.13846 3...
#> 8 MULTIPOLYGON (((-77.00244 3...
#> 9 MULTIPOLYGON (((-83.10874 2...
#> 10 MULTIPOLYGON (((-81.09538 3...
#> Simple feature collection with 49 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS: WGS 84
#> First 10 features:
#> state_name geometry
#> 1 Alabama MULTIPOLYGON (((-85.4883 30...
#> 2 Arizona MULTIPOLYGON (((-110.7507 3...
#> 3 Arkansas MULTIPOLYGON (((-90.95577 3...
#> 4 California MULTIPOLYGON (((-116.1062 3...
#> 5 Colorado MULTIPOLYGON (((-105.155 36...
#> 6 Connecticut MULTIPOLYGON (((-72.5279 41...
#> 7 Delaware MULTIPOLYGON (((-75.13846 3...
#> 8 District of Columbia MULTIPOLYGON (((-77.00244 3...
#> 9 Florida MULTIPOLYGON (((-83.10874 2...
#> 10 Georgia MULTIPOLYGON (((-81.09538 3...
#> Simple feature collection with 49 features and 1 field
#> Geometry type: MULTILINESTRING
#> Dimension: XY
#> Bounding box: xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS: WGS 84
#> First 10 features:
#> state_name geometry
#> 1 Alabama MULTILINESTRING ((-85.4883 ...
#> 2 Arizona MULTILINESTRING ((-110.7507...
#> 3 Arkansas MULTILINESTRING ((-90.95577...
#> 4 California MULTILINESTRING ((-116.1062...
#> 5 Colorado MULTILINESTRING ((-105.155 ...
#> 6 Connecticut MULTILINESTRING ((-72.5279 ...
#> 7 Delaware MULTILINESTRING ((-75.13846...
#> 8 District of Columbia MULTILINESTRING ((-77.00244...
#> 9 Florida MULTILINESTRING ((-83.10874...
#> 10 Georgia MULTILINESTRING ((-81.09538...
#> Simple feature collection with 49 features and 2 fields
#> Geometry type: MULTILINESTRING
#> Dimension: XY
#> Bounding box: xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS: WGS 84
#> First 10 features:
#> state_name geometry dist
#> 1 Alabama MULTILINESTRING ((-85.4883 ... 1571007.9 [m]
#> 2 Arizona MULTILINESTRING ((-110.7507... 466602.9 [m]
#> 3 Arkansas MULTILINESTRING ((-90.95577... 975519.6 [m]
#> 4 California MULTILINESTRING ((-116.1062... 1000950.5 [m]
#> 5 Colorado MULTILINESTRING ((-105.155 ... 140002.6 [m]
#> 6 Connecticut MULTILINESTRING ((-72.5279 ... 2636635.7 [m]
#> 7 Delaware MULTILINESTRING ((-75.13846... 2485997.1 [m]
#> 8 District of Columbia MULTILINESTRING ((-77.00244... 2388920.9 [m]
#> 9 Florida MULTILINESTRING ((-83.10874... 1847447.4 [m]
#> 10 Georgia MULTILINESTRING ((-81.09538... 1788781.6 [m]
#> Simple feature collection with 3 features and 2 fields
#> Geometry type: MULTILINESTRING
#> Dimension: XY
#> Bounding box: xmin: -111.0546 ymin: 36.99246 xmax: -95.30829 ymax: 45.00582
#> Geodetic CRS: WGS 84
#> state_name dist geometry
#> 1 Colorado 140002.6 [m] MULTILINESTRING ((-105.155 ...
#> 2 Wyoming 140002.6 [m] MULTILINESTRING ((-106.3212...
#> 3 Nebraska 161243.2 [m] MULTILINESTRING ((-95.93779...
Computing distance to 48 state borders when we only need the nearest one is wasteful. The feature should be the complete set of borders — one MULTILINESTRING:
sf Objectsf: Attributes + Geometryco
#> Simple feature collection with 64 features and 4 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -109.0602 ymin: 36.99246 xmax: -102.0415 ymax: 41.00342
#> Geodetic CRS: WGS 84
#> First 10 features:
#> geoid name aland state_nm geometry
#> 1 08001 Adams 3021840487 Colorado MULTIPOLYGON (((-105.0532 3...
#> 2 08003 Alamosa 1871643028 Colorado MULTIPOLYGON (((-105.4855 3...
#> 3 08005 Arapahoe 2066438714 Colorado MULTIPOLYGON (((-103.7065 3...
#> 4 08007 Archuleta 3496712164 Colorado MULTIPOLYGON (((-107.1287 3...
#> 5 08009 Baca 6617400567 Colorado MULTIPOLYGON (((-102.0416 3...
#> 6 08011 Bent 3918255148 Colorado MULTIPOLYGON (((-102.7476 3...
#> 7 08013 Boulder 1881325109 Colorado MULTIPOLYGON (((-105.3978 3...
#> 8 08014 Broomfield 85386685 Colorado MULTIPOLYGON (((-105.1092 3...
#> 9 08015 Chaffee 2624715692 Colorado MULTIPOLYGON (((-105.9698 3...
#> 10 08017 Cheyenne 4605713960 Colorado MULTIPOLYGON (((-103.1729 3...
class(co)
#> [1] "sf" "data.frame"
attr(co, "sf_column")
#> [1] "geometry"An sf object is a data.frame with one special list-column: the geometry. It extends data.frame with spatial awareness.
The geometry column persists through dplyr operations:
co |>
select(name) |>
slice(1:2) # geometry comes along
#> Simple feature collection with 2 features and 1 field
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -106.0393 ymin: 37.3562 xmax: -103.7057 ymax: 40.00137
#> Geodetic CRS: WGS 84
#> name geometry
#> 1 Adams MULTIPOLYGON (((-105.0532 3...
#> 2 Alamosa MULTIPOLYGON (((-105.4855 3...sf Objects with ggplot2geom_sf() is the bridge between sf and ggplot2 — everything from last week works:
The geometry binding is unique — unlike bind_cols() or cbind(), the sfc list-column carries CRS, bounding box, and precision metadata that regular column binding does not.
You now understand:
sfg (individual geometry) → sfc (set) → sf (dataframe with sticky geometry)sf objects work naturally with pipes, filters, and ggplotNow a 5-minute break, then: What makes these coordinates actually spatial? We’ll cover the Coordinate Reference System — the foundation that anchors every spatial operation.
Back in 5. Next: coordinate systems, projections, and the three-part system (ellipsoid, datum, prime meridian) that determines whether your spatial analysis is correct or silently wrong.
The geometry alone is meaningless without knowing where on Earth those coordinates refer to. That’s the job of the Coordinate Reference System.
A Coordinate Reference System (CRS) defines how spatial features relate to the surface of the Earth.
Two broad types:
In sf, three tools:
| Function | Does |
|---|---|
st_crs() |
Retrieve CRS from an sf or sfc object |
st_set_crs() |
Set or replace CRS (no coordinate transformation) |
st_transform() |
Transform coordinates from one CRS to another |
A GCS identifies locations on the curved surface of the Earth using angular measurements from the center:
A GCS is defined by three components:
Why this matters: two datasets may use the same coordinates but different datums. The difference is small (sub-meter) but compounds in spatial joins and distance calculations.
Three datums account for most US data:
| Datum | Acronym | Used by | Typical error vs. NAD83 |
|---|---|---|---|
| North American Datum 1983 | NAD83 | NHDPlus, most USGS vector data | Reference datum |
| World Geodetic System 1984 | WGS84 | GPS, NWIS gauges, web maps | <1 meter in CONUS |
| NAD27 (historic) | NAD27 | Legacy sources only | Can be 100+ meters off |
Note
NAD83 and WGS84 differ by less than 1 meter for most of CONUS. But they are not the same CRS. Mixing them without explicit transformation introduces sub-meter errors that compound in spatial joins. Always check with st_crs(a) == st_crs(b) before joining.
A Projected Coordinate System (PCS) transforms the curved Earth onto a 2D plane. Every transformation introduces distortion.
Three main projection families:
Each type of projection prioritizes one property at the cost of others:
| Property | Preserved by | Used for |
|---|---|---|
| Area | Equal-area (Albers) | Accurate area calculations, choropleth maps |
| Distance | Equidistant | Distance from a fixed center point |
| Shape | Conformal (Mercator) | Navigation, preserving local angles |
| Direction | Azimuthal | Navigation from a pole |
For CONUS analysis, use these defaults:
Important
st_distance() with default S2 spherical geometry — it’s geodesic and correct.sf stores CRS in two ways:
4326 = WGS84 geographic)Common codes for water resources work:
| EPSG | Name | Use |
|---|---|---|
| 4326 | WGS84 geographic | GPS coordinates, web maps, NWIS gauges |
| 4269 | NAD83 geographic | NHDPlus, most USGS vector data |
| 5070 | NAD83 / Conus Albers | Area calculations, CONUS-wide analysis |
| 26913–26915 | UTM NAD83 zones 13–15 | Local/state-level distance calculations |
Practical rule: Use EPSG:5070 for CONUS area calculations; use a local UTM (e.g., NAD83 / UTM zones) for sub-state analyses.
Before any spatial operation — joins, distance calculations, area calculations — verify CRS match:
Real example: NWIS gauges (WGS84, EPSG:4326) + NHD flowlines (NAD83, EPSG:4269):
They look similar (sub-meter differences in CONUS), but snapping a point to a line or computing a distance will be subtly wrong without transformation.
Important
sf will warn you about CRS mismatch but won’t stop you. The result will execute and look plausible — but numbers will be wrong. Build the guard-check habit now.
Geodesic distance: shortest path on an ellipsoid — true Earth surface distance.
Planar distance: Euclidean distance in projected coordinates — fast but introduces error.
Important
For water resources: use geodesic (default). In geographic CRS, st_distance() automatically computes geodesic (spherical) distances — correct for continental scales.
Note
sf uses geodesic by default in geographic CRS. Toggle S2 with sf::sf_use_s2(TRUE/FALSE) for fine control (rarely needed).
Note
Performance: S2 geodesic calculations used to be substantially slower, but modern versions of sf and the S2 library have optimized performance significantly. Geodesic is now only ~25% slower than planar for large operations — a measurable but not critical difference for most workflows. Use geodesic unless you’re profiling and find it’s a real bottleneck.
S2 is Google’s spherical geometry library — a modern C++ implementation for accurate calculations on a sphere/ellipsoid.
sf uses S2 under the hood when data is in geographic coordinates (e.g., WGS84). Instead of flattening Earth to a 2D plane (which introduces distortion), S2 works directly with spherical coordinates:
Important
S2 is automatic: if your data is in geographic coordinates, st_distance(), st_intersects(), and other spatial operations use S2 geodesic calculations by default. You don’t need to do anything.
S2 is the right default for water resources work (continental scales, mixed coordinate systems), but you can toggle it for specialized cases:
When to disable S2: - Very small local areas (< 10 km) where planar error is negligible and you want marginal speed gains - Specialized geometric algorithms that require GEOS planar operations (rare in practice)
Leave S2 enabled (default): - Continental and larger scales (your typical water resources workflow) - Mixing geographic and projected data - When correctness matters more than squeezing out microseconds
Results carry units as an S3 class attribute — both powerful for correctness and a source of confusion in practice.
class(conus)
#> [1] "sf" "data.frame"
attributes(conus)
#> $names
#> [1] "state_region" "state_division" "feature_code"
#> [4] "state_name" "state_abbr" "name"
#> [7] "fip_class" "tiger_class" "combined_area_code"
#> [10] "metropolitan_area_code" "functional_status" "land_area"
#> [13] "water_area" "fip_code" "geometry"
#>
#> $row.names
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
#>
#> $sf_column
#> [1] "geometry"
#>
#> $agr
#> state_region state_division feature_code
#> <NA> <NA> <NA>
#> state_name state_abbr name
#> <NA> <NA> <NA>
#> fip_class tiger_class combined_area_code
#> <NA> <NA> <NA>
#> metropolitan_area_code functional_status land_area
#> <NA> <NA> <NA>
#> water_area fip_code
#> <NA> <NA>
#> Levels: constant aggregate identity
#>
#> $class
#> [1] "sf" "data.frame"…
Create units naturally from measurements, convert between units, then drop before plotting or filtering:
Convert between units with the units package:
co <- AOI::aoi_get(state = "CO")
(l = st_length(st_cast(co, "MULTILINESTRING")))
#> 2100640 [m]
units::set_units(l, "km")
#> 2100.64 [km]
units::set_units(l, "mile")
#> 1305.277 [mile]
(a = st_area(co))
#> 2.69382e+11 [m^2]
units::set_units(a, "ha")
#> 26938196 [ha]
units::set_units(a, "km2")
#> 269382 [km^2]Units enforce dimensional correctness but can surprise you:
sf from a CSVWhen coordinates are stored as numeric columns in a CSV — the most common format for point data from APIs, sensors, and databases — use st_as_sf():
Three things to always specify:
coords = — column names in order X (longitude), Y (latitude)crs = — the CRS the raw numbers are in (not what you want them in)st_transform() to convert to your analysis CRSThis is the complete workflow to compute the distance from every city to the national border:
# Pre-compute boundary in projected space (outside the pipeline)
boundary <- st_cast(st_union(st_transform(conus, 5070)), "MULTILINESTRING")
# Main workflow: read, build, transform, measure
system.time({
cities <- read_csv("data/uscities.csv", show_col_types =FALSE) |>
st_as_sf(coords = c("lng", "lat"), crs = 4326) |>
st_transform(5070) %>%
mutate(
dist_to_border_m = st_distance(., boundary),
dist_to_border_km = units::set_units(dist_to_border_m, "km")
)
})
#> user system elapsed
#> 1.665 0.018 1.633
# user system elapsed
# 1.648 0.014 1.677Key takeaways: - Build sf: CSV coordinates + CRS they’re in → st_as_sf() - Transform: project to EPSG:5070 for measurements - Measure: distance returns units object; convert as needed - Reuse: pass pre-computed boundary to st_distance() — clean, fast, readable
Rule: Transform geometry to projected CRS before expensive operations like st_union(), st_intersection(), etc.
Rule: Drop units with as.numeric() before comparisons.
Important
Principle: Choose CRS strategically.
geom_sf(): The Foundationgeom_sf() is ggplot2’s spatial geometry layer. It automatically extracts coordinates from the geometry column and plots them. No need to specify x and y aesthetics.
Key advantage: All standard ggplot2 aesthetics work — color, fill, size, alpha, linetype, etc. And you can layer multiple geom_sf() calls:
Tip
Always add theme_void() to remove axes, gridlines, and background — maps don’t need Cartesian coordinates.
ggrepel: Non-Overlapping Labelsggrepel solves label collision — essential any time you label points on a map:
library(ggrepel)
ggplot() +
geom_sf(data = conus, fill = "white") +
geom_sf(data = st_filter(cities, st_transform(conus, 5070)), color = "red", size = 0.1) +
ggrepel::geom_label_repel(
data = filter(cities, city == "Denver"),
aes(geometry = geometry, label = city),
stat = "sf_coordinates", # extracts coordinates from sf geometry
size = 3
) +
theme_void()stat = "sf_coordinates" is the key argument — it tells ggrepel to extract X/Y from the geometry column rather than expecting explicit x and y aesthetics. Without it, you get an error.
gghighlight: Emphasize a Subsetgghighlight greys out features that do not meet a condition, keeping them as spatial context:
library(gghighlight)
ggplot() +
geom_sf(data = conus) +
geom_sf(data = st_filter(cities, st_transform(conus, 5070)),
aes(color = as.numeric(dist_to_border_km)),
size = 0.3) +
gghighlight(as.numeric(dist_to_border_km) < 100) +
scale_color_gradient(low = "orange", high = "darkred") +
labs(title = "Cities within 100 km of the national border",
color = "Distance (km)") +
theme_void()Features outside the condition are rendered in grey at reduced opacity — they provide spatial context without competing visually with the highlighted subset. Works with any geom_* — points, lines, polygons.
Simple Features
geom_sf()Coordinate Systems
st_transform()Measurements
st_distance(), st_area(), st_length()set_units(), drop_units()Key habits built today:
st_is_valid() on every dataset you readst_crs(a) == st_crs(b) before every spatial joinas.numeric() before filtering on unit quantitiesWeek 3: Spatial Predicates & Joins — relationships between geometries, filtering by space, and combining data sources