Week 1
Expectations / Level Setting
🚀 Getting Started with R for Data Science
- Welcome to 523C: Environmental Data Science Applications: Water Resources!
- This first lecture will introduce essential, high-level topics to help you build a strong foundation in R for environmental data science.
- Throughout the lecture, you will be asked to assess your comfort level with various topics via a Google survey.
- The survey results will help tailor the course focus, ensuring that we reinforce challenging concepts while avoiding unnecessary review of familiar topics.
Google Survey
- Please open this survey and answer the questions as we work through this lecture.
- Your responses will provide valuable insights into areas where additional explanations or hands-on exercises may be beneficial.
~ Week 1: Data Science Basics
Data Types
R has four principal data types (excluding raw and complex):
- Character: A string of text, represented with quotes (e.g., “hello”).
- Used to store words, phrases, and categorical data.
- Integer: A whole number, explicitly defined with an L suffix (e.g., 42L).
- Stored more efficiently than numeric values when decimals are not needed.
- Numeric: A floating-point number, used for decimal values (e.g., 3.1415).
- This is the default type for numbers in R.
- Boolean (Logical): A logical value that represents TRUE or FALSE.
- Commonly used in logical operations and conditional statements.
<- "a"
character <- 1L
integer <- 3.3
numeric <- TRUE boolean
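- A quick way to confirm each type (a small check, not from the original slides) is typeof():
typeof(character)   # "character"
typeof(integer)     # "integer"
typeof(numeric)     # "double" -- the internal type behind "numeric"
typeof(boolean)     # "logical"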
Data Structures
- When working with multiple values, we need data structures to store and manipulate data efficiently.
- R provides several types of data structures, each suited for different use cases.
Vector
- A vector is the most basic data structure in R and contains elements of the same type.
- Vectors are created using the c() function.
<- c("a", "b", "c")
char.vec <- c(TRUE, FALSE, TRUE) boolean.vec
List
- Lists allow for heterogeneous data types.
list <- list(a = c(1,2,3),
             b = c(TRUE, FALSE),
             c = "test")
Matrix
# Creating a sequence of numbers:
(vec <- 1:9)
#> [1] 1 2 3 4 5 6 7 8 9
- A matrix is a two-dimensional data structure created by adding a dimension attribute (dim) to an atomic vector.
- Matrices are created using the matrix() function.
# Default column-wise filling
matrix(vec, nrow = 3)
#> [,1] [,2] [,3]
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
# Row-wise filling
matrix(vec, nrow = 3, byrow = TRUE)
#> [,1] [,2] [,3]
#> [1,] 1 2 3
#> [2,] 4 5 6
#> [3,] 7 8 9
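- To verify that a matrix really is an atomic vector with a dim attribute, a quick sketch (not in the original slides):
m <- 1:9
dim(m) <- c(3, 3)   # setting dim by hand turns the vector into a 3x3 matrix
is.matrix(m)
#> [1] TRUE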
Array
- An array extends matrices to higher dimensions.
- It is useful when working with multi-dimensional data.
# Creating a 2x2x2 array
array(vec, dim = c(2,2,2))
#> , , 1
#>
#> [,1] [,2]
#> [1,] 1 3
#> [2,] 2 4
#>
#> , , 2
#>
#> [,1] [,2]
#> [1,] 5 7
#> [2,] 6 8
Data Frame / Tibble
- Data Frames: A table-like (rectangular) structure where each column is a vector of equal length.
- Used for storing datasets where different columns can have different data types.
- Tibble: A modern version of a data frame that supports list-columns and better printing.
- Offers improved performance and formatting for large datasets.
(df <- data.frame(char.vec, boolean.vec))
#> char.vec boolean.vec
#> 1 a TRUE
#> 2 b FALSE
#> 3 c TRUE
(tib <- tibble::tibble(char.vec, list))
#> # A tibble: 3 × 2
#> char.vec list
#> <chr> <named list>
#> 1 a <dbl [3]>
#> 2 b <lgl [2]>
#> 3 c <chr [1]>
📦 Installing Packages
- R has a vast ecosystem of packages that extend its capabilities, on both CRAN and GitHub.
- To install a package from CRAN, use install.packages().
- To install a package from GitHub, use remotes::install_github() (a sketch follows the CRAN example below).
- We’ll start by installing palmerpenguins, which contains a dataset on penguins.
install.packages('palmerpenguins')
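- A sketch of the GitHub route (assuming the remotes package is installed; the AOI package used later in these notes is hosted on GitHub):
# remotes::install_github("mikejohnson51/AOI")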
Attaching/Loading Packages
- To use an installed package, you need to load it into your current working session using library().
- Here, we load palmerpenguins for dataset exploration and tidyverse for data science workflows.
library(palmerpenguins) # 🐧 Fun dataset about penguins!
library(tidyverse) # 🛠 Essential for data science in R
Help & Documentation
- R has built-in documentation that provides information about functions and datasets.
- To access documentation, use ?function_name.
- Example: Viewing the help page for the penguins dataset.
?penguins
- You can also use help.search("keyword") to look up topics of interest.
- For vignettes (detailed guides), use vignette("package_name").
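- For example (a quick sketch of these helpers):
help.search("linear model")    # search installed documentation for a keyword
vignette(package = "dplyr")    # list the vignettes a package ships
vignette("dplyr")              # open the introductory dplyr vignette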
Quarto: Communication
- In this class we will use Quarto, a more modern, cross-language version of RMarkdown.
- If you are comfortable with Rmd, you’ll quickly be able to transition to Qmd
- If you are new to Rmd, you’ll be able to learn the latest and greatest
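- A minimal .qmd file looks something like this (a sketch; the title and chunk contents are illustrative):
---
title: "My analysis"
format: html
---

Narrative text written in markdown.

```{r}
library(tidyverse)
glimpse(penguins)
```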
🌟 Tidyverse: A Swiss Army Knife for Data Science in R
The tidyverse is a collection of packages designed for data science. We can see what it includes using the tidyverse_packages() function:
tidyverse_packages()
#> [1] "broom" "conflicted" "cli" "dbplyr"
#> [5] "dplyr" "dtplyr" "forcats" "ggplot2"
#> [9] "googledrive" "googlesheets4" "haven" "hms"
#> [13] "httr" "jsonlite" "lubridate" "magrittr"
#> [17] "modelr" "pillar" "purrr" "ragg"
#> [21] "readr" "readxl" "reprex" "rlang"
#> [25] "rstudioapi" "rvest" "stringr" "tibble"
#> [29] "tidyr" "xml2" "tidyverse"
While all tidyverse packages are valuable, the main ones we will focus on are:
- readr: Reading data
- tibble: Enhanced data frames
- dplyr: Data manipulation
- tidyr: Data reshaping
- purrr: Functional programming
- ggplot2: Visualization
Combined, this provides us a complete “data science” toolset.
readr 
- The readr package provides functions for reading data into R.
- The read_csv() function reads comma-separated files.
- The read_tsv() function reads tab-separated files.
- The read_delim() function reads files with custom delimiters (a sketch follows the example below).
- In all cases, more intelligent parsing is done than with base R equivalents.
read_csv 
path = 'https://raw.githubusercontent.com/mikejohnson51/csu-ess-330/refs/heads/main/resources/county-centroids.csv'
# base R
read.csv(path) |>
head()
#> fips LON LAT
#> 1 1061 -85.83575 31.09404
#> 2 8125 -102.42587 40.00307
#> 3 17177 -89.66239 42.35138
#> 4 28153 -88.69577 31.64132
#> 5 34041 -74.99570 40.85940
#> 6 46051 -96.76981 45.17255
# More intuitive readr
read_csv(path) |>
head()
#> # A tibble: 6 × 3
#> fips LON LAT
#> <chr> <dbl> <dbl>
#> 1 01061 -85.8 31.1
#> 2 08125 -102. 40.0
#> 3 17177 -89.7 42.4
#> 4 28153 -88.7 31.6
#> 5 34041 -75.0 40.9
#> 6 46051 -96.8 45.2
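- As a sketch, a semicolon-delimited file (the file name here is hypothetical) would be read with an explicit delimiter:
# read_delim("data/stations.txt", delim = ";")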
dplyr 
- The dplyr package provides functions for data manipulation through ‘a grammar of data manipulation’.
- It provides capabilities similar to SQL for data manipulation.
- It includes functions for viewing, filtering, selecting, mutating, summarizing, and joining data.
%>% / |>
- The pipe operator %>% is used to chain operations in R.
- The pipe operator |> is a base R version of %>% introduced in R 4.1.
- The pipe passes the object on the left-hand side to the function on the right-hand side as its first argument.
penguins |>
  glimpse()
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <fct> male, female, female, NA, female, male, female, male…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
glimpse 
- The glimpse() function provides a concise summary of a dataset.
glimpse(penguins)
#> Rows: 344
#> Columns: 8
#> $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
#> $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
#> $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
#> $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
#> $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
#> $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
#> $ sex <fct> male, female, female, NA, female, male, female, male…
#> $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
select 
- The select() function is used to select columns from a dataset.
- It is useful when you want to work with specific columns.
- Example: Selecting the species column from the penguins dataset.
select(penguins, species)
#> # A tibble: 344 × 1
#> species
#> <fct>
#> 1 Adelie
#> 2 Adelie
#> 3 Adelie
#> 4 Adelie
#> 5 Adelie
#> 6 Adelie
#> 7 Adelie
#> 8 Adelie
#> 9 Adelie
#> 10 Adelie
#> # ℹ 334 more rows
filter 
- The filter() function is used to filter rows based on a condition.
- It is useful when you want to work with specific rows.
- Example: Filtering the penguins dataset to include only Adelie penguins.
filter(penguins, species == "Adelie")
#> # A tibble: 152 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> 4 Adelie Torgersen NA NA NA NA
#> 5 Adelie Torgersen 36.7 19.3 193 3450
#> 6 Adelie Torgersen 39.3 20.6 190 3650
#> 7 Adelie Torgersen 38.9 17.8 181 3625
#> 8 Adelie Torgersen 39.2 19.6 195 4675
#> 9 Adelie Torgersen 34.1 18.1 193 3475
#> 10 Adelie Torgersen 42 20.2 190 4250
#> # ℹ 142 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>
mutate 
- The mutate() function is used to create new columns or modify existing ones.
- It is useful when you want to add new information to your dataset.
- Example: Creating a new column bill_length_cm from bill_length_mm.
Note the use of the tidyselect helper starts_with().
mutate(penguins, bill_length_cm = bill_length_mm / 10) |>
select(starts_with("bill"))
#> # A tibble: 344 × 3
#> bill_length_mm bill_depth_mm bill_length_cm
#> <dbl> <dbl> <dbl>
#> 1 39.1 18.7 3.91
#> 2 39.5 17.4 3.95
#> 3 40.3 18 4.03
#> 4 NA NA NA
#> 5 36.7 19.3 3.67
#> 6 39.3 20.6 3.93
#> 7 38.9 17.8 3.89
#> 8 39.2 19.6 3.92
#> 9 34.1 18.1 3.41
#> 10 42 20.2 4.2
#> # ℹ 334 more rows
summarize 
- The summarize() function is used to aggregate data.
- It is useful when you want to calculate summary statistics.
- It produces a one-row output (one row per group when the data are grouped).
- Example: Calculating the mean bill_length_mm for all penguins.
summarize(penguins, bill_length_mm = mean(bill_length_mm, na.rm = TRUE))
#> # A tibble: 1 × 1
#> bill_length_mm
#> <dbl>
#> 1 43.9
group_by / ungroup 
- The group_by() function is used to group data by one or more columns.
- It is useful when you want to perform operations on groups.
- It does this by adding a grouped_df class to the dataset.
- The ungroup() function removes grouping from a dataset.
groups <- group_by(penguins, species)

dplyr::group_keys(groups)
#> # A tibble: 3 × 1
#> species
#> <fct>
#> 1 Adelie
#> 2 Chinstrap
#> 3 Gentoo

dplyr::group_indices(groups)[1:5]
#> [1] 1 1 1 1 1
Group operations 
- Example: Grouping the penguins dataset by species and calculating the mean bill_length_mm.
penguins |>
  group_by(species) |>
summarize(bill_length_mm = mean(bill_length_mm, na.rm = TRUE)) |>
ungroup()
#> # A tibble: 3 × 2
#> species bill_length_mm
#> <fct> <dbl>
#> 1 Adelie 38.8
#> 2 Chinstrap 48.8
#> 3 Gentoo 47.5
Joins 
- The dplyr package provides functions for joining datasets.
- Common join functions include inner_join(), left_join(), right_join(), and full_join().
- Joins are used to combine datasets based on shared keys (primary and foreign).
Mutating joins 
- Mutating joins add columns from one dataset to another based on a shared key.
- Example: Adding species information to the penguins dataset based on the species_id.
species <- tribble(
  ~species_id, ~species,
  1, "Adelie",
  2, "Chinstrap",
  3, "Gentoo"
)
left_join 
select(penguins, species, contains('bill')) |>
left_join(species, by = "species")
#> # A tibble: 344 × 4
#> species bill_length_mm bill_depth_mm species_id
#> <chr> <dbl> <dbl> <dbl>
#> 1 Adelie 39.1 18.7 1
#> 2 Adelie 39.5 17.4 1
#> 3 Adelie 40.3 18 1
#> 4 Adelie NA NA 1
#> 5 Adelie 36.7 19.3 1
#> 6 Adelie 39.3 20.6 1
#> 7 Adelie 38.9 17.8 1
#> 8 Adelie 39.2 19.6 1
#> 9 Adelie 34.1 18.1 1
#> 10 Adelie 42 20.2 1
#> # ℹ 334 more rows
right_join 
select(penguins, species, contains('bill')) |>
right_join(species, by = "species")
#> # A tibble: 344 × 4
#> species bill_length_mm bill_depth_mm species_id
#> <chr> <dbl> <dbl> <dbl>
#> 1 Adelie 39.1 18.7 1
#> 2 Adelie 39.5 17.4 1
#> 3 Adelie 40.3 18 1
#> 4 Adelie NA NA 1
#> 5 Adelie 36.7 19.3 1
#> 6 Adelie 39.3 20.6 1
#> 7 Adelie 38.9 17.8 1
#> 8 Adelie 39.2 19.6 1
#> 9 Adelie 34.1 18.1 1
#> 10 Adelie 42 20.2 1
#> # ℹ 334 more rows
inner_join 
select(penguins, species, contains('bill')) |>
inner_join(species, by = "species")
#> # A tibble: 344 × 4
#> species bill_length_mm bill_depth_mm species_id
#> <chr> <dbl> <dbl> <dbl>
#> 1 Adelie 39.1 18.7 1
#> 2 Adelie 39.5 17.4 1
#> 3 Adelie 40.3 18 1
#> 4 Adelie NA NA 1
#> 5 Adelie 36.7 19.3 1
#> 6 Adelie 39.3 20.6 1
#> 7 Adelie 38.9 17.8 1
#> 8 Adelie 39.2 19.6 1
#> 9 Adelie 34.1 18.1 1
#> 10 Adelie 42 20.2 1
#> # ℹ 334 more rows
full_join 
select(penguins, species, contains('bill')) |>
full_join(species, by = "species")
#> # A tibble: 344 × 4
#> species bill_length_mm bill_depth_mm species_id
#> <chr> <dbl> <dbl> <dbl>
#> 1 Adelie 39.1 18.7 1
#> 2 Adelie 39.5 17.4 1
#> 3 Adelie 40.3 18 1
#> 4 Adelie NA NA 1
#> 5 Adelie 36.7 19.3 1
#> 6 Adelie 39.3 20.6 1
#> 7 Adelie 38.9 17.8 1
#> 8 Adelie 39.2 19.6 1
#> 9 Adelie 34.1 18.1 1
#> 10 Adelie 42 20.2 1
#> # ℹ 334 more rows
Filtering Joins 
- Filtering joins retain only rows that match between datasets (semi_join()); the anti_join() sketch below shows the complement.
- Example: Filtering the penguins dataset to include only rows with a matching species.
select(penguins, species, contains('bill')) |>
semi_join(species, by = "species")
#> # A tibble: 344 × 3
#> species bill_length_mm bill_depth_mm
#> <fct> <dbl> <dbl>
#> 1 Adelie 39.1 18.7
#> 2 Adelie 39.5 17.4
#> 3 Adelie 40.3 18
#> 4 Adelie NA NA
#> 5 Adelie 36.7 19.3
#> 6 Adelie 39.3 20.6
#> 7 Adelie 38.9 17.8
#> 8 Adelie 39.2 19.6
#> 9 Adelie 34.1 18.1
#> 10 Adelie 42 20.2
#> # ℹ 334 more rows
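- The complement of semi_join() is anti_join(), which keeps rows without a match. A quick sketch with the same tables (here it returns zero rows, since all three species appear in the lookup table):
select(penguins, species, contains('bill')) |>
  anti_join(species, by = "species")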
ggplot2: Visualization 
- The ggplot2 package is used for data visualization.
- It is based on the “grammar of graphics”, which allows for a high level of customization.
- ggplot2 is built on the concept of layers, where each layer adds a different element to the plot.
ggplot
- The ggplot() function initializes a plot.
- It provides a blank canvas to which layers can be added.
ggplot()
data / aesthetics 
- Data must be provided to ggplot().
- The aes() function is used to map variables to aesthetics (e.g., x and y axes).
- aes() arguments provided in ggplot() are inherited by all layers.
- Example: Creating a plot of body_mass_g vs. bill_length_mm.
ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm))
geom_* 
- The geom_*() functions add geometric objects to the plot.
- They describe how to render the mapping created in aes().
- Example: Adding points to the plot.
ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm)) +
geom_point()
facet_wrap / facet_grid 
- The facet_wrap() function is used to create small multiples of a plot.
- It is useful when you want to compare subsets of data.
- The facet_grid() function is used to create a grid of plots.
- Example: Faceting the plot by species.
ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm)) +
geom_point() +
facet_wrap(~species)
theme_* 
- The theme_*() functions are used to customize the appearance of the plot.
- They allow you to modify the plot’s background, gridlines, and text.
- Example: Applying the theme_linedraw() theme to the plot.
ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm)) +
geom_point() +
facet_wrap(~species) +
theme_linedraw()
- There are 1000s of themes available in the ggplot2 ecosystem:
- ggthemes
- ggpubr
- hrbrthemes
- ggsci
- …
ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm)) +
geom_point() +
facet_wrap(~species) +
ggthemes::theme_economist()
labs 
- The labs() function is used to add titles, subtitles, and axis labels to the plot.
- It is useful for providing context and making the plot more informative.
- Example: Adding titles and axis labels to the plot.
ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm)) +
geom_point() +
facet_wrap(~species) +
theme_linedraw() +
labs(title = "Penguins Weight by Bill Size",
x = "Body Mass",
y = "Bill Length",
subtitle = "Made for 523c")
tidyr
- The tidyr package provides functions for data reshaping.
- It includes functions for pivoting and nesting data.
pivot_longer 
- The pivot_longer() function is used to convert wide data to long data.
- It is useful when you want to work with data in a tidy format.
- Example: Converting the penguins dataset from wide to long format.
(data.long = penguins |>
   select(species, bill_length_mm, body_mass_g) |>
   mutate(penguin_id = 1:n()) |>
   pivot_longer(-c(penguin_id, species),
                names_to = "Measure",
                values_to = "value"))
#> # A tibble: 688 × 4
#> species penguin_id Measure value
#> <fct> <int> <chr> <dbl>
#> 1 Adelie 1 bill_length_mm 39.1
#> 2 Adelie 1 body_mass_g 3750
#> 3 Adelie 2 bill_length_mm 39.5
#> 4 Adelie 2 body_mass_g 3800
#> 5 Adelie 3 bill_length_mm 40.3
#> 6 Adelie 3 body_mass_g 3250
#> 7 Adelie 4 bill_length_mm NA
#> 8 Adelie 4 body_mass_g NA
#> 9 Adelie 5 bill_length_mm 36.7
#> 10 Adelie 5 body_mass_g 3450
#> # ℹ 678 more rows
pivot_wider 
- The pivot_wider() function is used to convert long data to wide data.
- It is useful when you want to work with data in a wide format.
- Example: Converting the data.long dataset from long to wide format.
data.long |>
  pivot_wider(names_from = "Measure",
              values_from = "value")
#> # A tibble: 344 × 4
#> species penguin_id bill_length_mm body_mass_g
#> <fct> <int> <dbl> <dbl>
#> 1 Adelie 1 39.1 3750
#> 2 Adelie 2 39.5 3800
#> 3 Adelie 3 40.3 3250
#> 4 Adelie 4 NA NA
#> 5 Adelie 5 36.7 3450
#> 6 Adelie 6 39.3 3650
#> 7 Adelie 7 38.9 3625
#> 8 Adelie 8 39.2 4675
#> 9 Adelie 9 34.1 3475
#> 10 Adelie 10 42 4250
#> # ℹ 334 more rows
nest / unnest 
- The nest() function is used to nest data into a list-column.
- It is useful when you want to group data together.
- Example: Nesting the penguins dataset by species.
penguins |>
  nest(data = -species)
#> # A tibble: 3 × 2
#> species data
#> <fct> <list>
#> 1 Adelie <tibble [152 × 7]>
#> 2 Gentoo <tibble [124 × 7]>
#> 3 Chinstrap <tibble [68 × 7]>
penguins |>
  nest(data = -species) |>
  unnest(data)
#> # A tibble: 344 × 8
#> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> <fct> <fct> <dbl> <dbl> <int> <int>
#> 1 Adelie Torgersen 39.1 18.7 181 3750
#> 2 Adelie Torgersen 39.5 17.4 186 3800
#> 3 Adelie Torgersen 40.3 18 195 3250
#> 4 Adelie Torgersen NA NA NA NA
#> 5 Adelie Torgersen 36.7 19.3 193 3450
#> 6 Adelie Torgersen 39.3 20.6 190 3650
#> 7 Adelie Torgersen 38.9 17.8 181 3625
#> 8 Adelie Torgersen 39.2 19.6 195 4675
#> 9 Adelie Torgersen 34.1 18.1 193 3475
#> 10 Adelie Torgersen 42 20.2 190 4250
#> # ℹ 334 more rows
#> # ℹ 2 more variables: sex <fct>, year <int>
linear modeling: lm
- The lm() function is used to fit linear models.
- It is useful when you want to model the relationship between two variables.
- Example: Fitting a linear model to predict body_mass_g from flipper_length_mm.
model <- lm(body_mass_g ~ flipper_length_mm, data = drop_na(penguins))
summary(model)
#>
#> Call:
#> lm(formula = body_mass_g ~ flipper_length_mm, data = drop_na(penguins))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1057.33 -259.79 -12.24 242.97 1293.89
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -5872.09 310.29 -18.93 <2e-16 ***
#> flipper_length_mm 50.15 1.54 32.56 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 393.3 on 331 degrees of freedom
#> Multiple R-squared: 0.7621, Adjusted R-squared: 0.7614
#> F-statistic: 1060 on 1 and 331 DF, p-value: < 2.2e-16
broom 
- The broom package is used to tidy model outputs.
- It provides functions to convert model outputs into tidy data frames.
- Example: Tidying the model output.
tidy 
- The tidy() function is used to tidy model coefficients.
- It is useful when you want to extract model coefficients.
- Example: Tidying the model output.
tidy(model)
#> # A tibble: 2 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -5872. 310. -18.9 1.18e- 54
#> 2 flipper_length_mm 50.2 1.54 32.6 3.13e-105
glance 
- The glance() function is used to provide a summary of model fit.
- It is useful when you want to assess model performance.
- Example: Glancing at the model output.
glance(model)
#> # A tibble: 1 × 12
#> r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.762 0.761 393. 1060. 3.13e-105 1 -2461. 4928. 4940.
#> # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
augment 
- The augment() function is used to add model predictions and residuals to the dataset.
- It is useful when you want to visualize model performance.
- Example: Augmenting the model output.
a <- augment(model)
ggplot(a, aes(x = .fitted, y = body_mass_g)) +
geom_point() +
geom_smooth(method = "lm")
ggplot(a, aes(x = .resid)) +
geom_histogram()
purrr 
- The purrr package is used for functional programming.
- It provides functions for working with lists and vectors.
map 
- The map() function is used to apply a function to each element of a list.
- It is useful when you want to iterate over a list.
- Example: Fitting a linear model to each species in the penguins dataset.
penguins |>
  nest(data = -species) |>
  mutate(lm = map(data, ~lm(body_mass_g ~ flipper_length_mm, data = .x)))
#> # A tibble: 3 × 3
#> species data lm
#> <fct> <list> <list>
#> 1 Adelie <tibble [152 × 7]> <lm>
#> 2 Gentoo <tibble [124 × 7]> <lm>
#> 3 Chinstrap <tibble [68 × 7]> <lm>
map_* 
- The map_*() functions (e.g., map_dbl(), map_chr()) return a vector of a specific type rather than a list.
- They are useful when you want typed output from an iteration.
- Example: Extracting the R-squared values (doubles) from the linear models.
penguins |>
  nest(data = -species) |>
  mutate(lm = map(data, ~lm(body_mass_g ~ flipper_length_mm, data = .x)),
         r2 = map_dbl(lm, ~summary(.x)$r.squared))
#> # A tibble: 3 × 4
#> species data lm r2
#> <fct> <list> <list> <dbl>
#> 1 Adelie <tibble [152 × 7]> <lm> 0.219
#> 2 Gentoo <tibble [124 × 7]> <lm> 0.494
#> 3 Chinstrap <tibble [68 × 7]> <lm> 0.412
map2 
- The map2() function is used to iterate over two lists in parallel.
- It is useful when you want to apply a function to two lists simultaneously.
- Example: Augmenting the linear models with the original data.
penguins |>
  drop_na() |>
  nest(data = -species) |>
  mutate(lm_mod = map(data, ~lm(body_mass_g ~ flipper_length_mm, data = .x)),
         r2 = map_dbl(lm_mod, ~summary(.x)$r.squared),
         a = map2(lm_mod, data, ~broom::augment(.x, .y)))
#> # A tibble: 3 × 5
#> species data lm_mod r2 a
#> <fct> <list> <list> <dbl> <list>
#> 1 Adelie <tibble [146 × 7]> <lm> 0.216 <tibble [146 × 13]>
#> 2 Gentoo <tibble [119 × 7]> <lm> 0.506 <tibble [119 × 13]>
#> 3 Chinstrap <tibble [68 × 7]> <lm> 0.412 <tibble [68 × 13]>
~ Week 2-3: Spatial Data (Vector)
sf 
- The sf package is used for working with spatial data.
- sf binds to common spatial libraries like GDAL, GEOS, and PROJ.
- It provides functions for reading, writing, and manipulating spatial data.
library(sf)
sf::sf_extSoftVersion()
#> GEOS GDAL proj.4 GDAL_with_GEOS USE_PROJ_H
#> "3.11.0" "3.5.3" "9.1.0" "true" "true"
#> PROJ
#> "9.1.0"
I/O 
- The st_read() function is used to read spatial data.
- It is useful when you want to import spatial data into R from local or remote files.
- Example: Reading a Major Global Rivers shapefile.
From package 
# via packages
(counties <- AOI::aoi_get(state = "conus", county = "all"))
#> Simple feature collection with 3108 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -124.8485 ymin: 24.39631 xmax: -66.88544 ymax: 49.38448
#> Geodetic CRS: WGS 84
#> First 10 features:
#> state_region state_division feature_code state_name state_abbr name
#> 1 3 6 0161526 Alabama AL Autauga
#> 2 3 6 0161527 Alabama AL Baldwin
#> 3 3 6 0161528 Alabama AL Barbour
#> 4 3 6 0161529 Alabama AL Bibb
#> 5 3 6 0161530 Alabama AL Blount
#> 6 3 6 0161531 Alabama AL Bullock
#> 7 3 6 0161532 Alabama AL Butler
#> 8 3 6 0161533 Alabama AL Calhoun
#> 9 3 6 0161534 Alabama AL Chambers
#> 10 3 6 0161535 Alabama AL Cherokee
#> fip_class tiger_class combined_area_code metropolitan_area_code
#> 1 H1 G4020 388 <NA>
#> 2 H1 G4020 380 <NA>
#> 3 H1 G4020 NA <NA>
#> 4 H1 G4020 142 <NA>
#> 5 H1 G4020 142 <NA>
#> 6 H1 G4020 NA <NA>
#> 7 H1 G4020 NA <NA>
#> 8 H1 G4020 NA <NA>
#> 9 H1 G4020 122 <NA>
#> 10 H1 G4020 NA <NA>
#> functional_status land_area water_area fip_code
#> 1 A 1539634184 25674812 01001
#> 2 A 4117656514 1132955729 01003
#> 3 A 2292160149 50523213 01005
#> 4 A 1612188717 9572303 01007
#> 5 A 1670259090 14860281 01009
#> 6 A 1613083467 6030667 01011
#> 7 A 2012002546 2701199 01013
#> 8 A 1569246126 16536293 01015
#> 9 A 1545085601 16971700 01017
#> 10 A 1433620850 120310807 01019
#> geometry
#> 1 MULTIPOLYGON (((-86.81491 3...
#> 2 MULTIPOLYGON (((-87.59883 3...
#> 3 MULTIPOLYGON (((-85.41644 3...
#> 4 MULTIPOLYGON (((-87.01916 3...
#> 5 MULTIPOLYGON (((-86.5778 33...
#> 6 MULTIPOLYGON (((-85.65767 3...
#> 7 MULTIPOLYGON (((-86.4482 31...
#> 8 MULTIPOLYGON (((-85.79605 3...
#> 9 MULTIPOLYGON (((-85.59315 3...
#> 10 MULTIPOLYGON (((-85.51361 3...
From file 
(rivers <- sf::read_sf('data/majorrivers_0_0/MajorRivers.shp'))
#> Simple feature collection with 98 features and 4 fields
#> Geometry type: MULTILINESTRING
#> Dimension: XY
#> Bounding box: xmin: -164.8874 ymin: -36.96945 xmax: 160.7636 ymax: 71.39249
#> Geodetic CRS: WGS 84
#> # A tibble: 98 × 5
#> NAME SYSTEM MILES KILOMETERS geometry
#> <chr> <chr> <dbl> <dbl> <MULTILINESTRING [°]>
#> 1 Kolyma <NA> 2552. 4106. ((144.8419 61.75915, 144.8258 61.8036,…
#> 2 Parana Parana 1616. 2601. ((-51.0064 -20.07941, -51.02972 -20.22…
#> 3 San Francisco <NA> 1494. 2404. ((-46.43639 -20.25807, -46.49835 -20.2…
#> 4 Japura Amazon 1223. 1968. ((-76.71056 1.624166, -76.70029 1.6883…
#> 5 Putumayo Amazon 890. 1432. ((-76.86806 1.300553, -76.86695 1.295,…
#> 6 Rio Maranon Amazon 889. 1431. ((-73.5079 -4.459834, -73.79197 -4.621…
#> 7 Ucayali Amazon 1298. 2089. ((-73.5079 -4.459834, -73.51585 -4.506…
#> 8 Guapore Amazon 394. 634. ((-65.39585 -10.39333, -65.39578 -10.3…
#> 9 Madre de Dios Amazon 568. 914. ((-65.39585 -10.39333, -65.45279 -10.4…
#> 10 Amazon Amazon 1890. 3042. ((-73.5079 -4.459834, -73.45141 -4.427…
#> # ℹ 88 more rows
via url 
# via url
<- sf::read_sf("https://reference.geoconnex.us/collections/gages/items/1000001"))
(gage #> Simple feature collection with 1 feature and 17 fields
#> Geometry type: POINT
#> Dimension: XY
#> Bounding box: xmin: -107.2826 ymin: 35.94568 xmax: -107.2826 ymax: 35.94568
#> Geodetic CRS: WGS 84
#> # A tibble: 1 × 18
#> nhdpv2_reachcode mainstem_uri fid nhdpv2_reach_measure cluster uri
#> <chr> <chr> <int> <dbl> <chr> <chr>
#> 1 13020205000216 https://geoconnex.u… 1 80.3 https:… http…
#> # ℹ 12 more variables: nhdpv2_comid <dbl>, name <chr>, nhdpv2_totdasqkm <dbl>,
#> # description <chr>, nhdpv2_link_source <chr>, subjectof <chr>,
#> # nhdpv2_offset_m <dbl>, provider <chr>, gage_totdasqkm <dbl>,
#> # provider_id <chr>, dasqkm_diff <dbl>, geometry <POINT [°]>
# write out data
# write_sf(counties, "data/counties.shp")
Geometry list columns 
- The geometry column contains the spatial information.
- It is stored as a list-column of sfc objects.
- Example: Accessing the first geometry in the rivers dataset.
rivers$geometry[1]
#> Geometry set for 1 feature
#> Geometry type: MULTILINESTRING
#> Dimension: XY
#> Bounding box: xmin: 144.8258 ymin: 61.40833 xmax: 160.7636 ymax: 68.8008
#> Geodetic CRS: WGS 84
Projections 
- CRS (Coordinate Reference System) is used to define the spatial reference.
- The st_crs() function is used to get the CRS of a dataset.
- The st_transform() function is used to transform the CRS of a dataset.
- Example: Transforming the rivers dataset to EPSG:5070.
st_crs(rivers) |> sf::st_is_longlat()
#> [1] TRUE
st_crs(rivers)$units
#> NULL
riv_5070 <- st_transform(rivers, 5070)
st_crs(riv_5070) |> sf::st_is_longlat()
#> [1] FALSE
st_crs(riv_5070)$units
#> [1] "m"
Data Manipulation 
- All dplyr verbs work with sf objects.
- Example: Filtering the rivers dataset to the Mississippi River system, and the counties dataset to Larimer County.
mississippi <- filter(rivers, SYSTEM == "Mississippi")
larimer <- filter(counties, name == "Larimer")
Unions / Combines 
- The st_union() function is used to combine geometries.
- It is useful when you want to merge geometries.
mississippi
#> Simple feature collection with 4 features and 4 fields
#> Geometry type: MULTILINESTRING
#> Dimension: XY
#> Bounding box: xmin: -112 ymin: 28.92983 xmax: -77.86168 ymax: 48.16158
#> Geodetic CRS: WGS 84
#> # A tibble: 4 × 5
#> NAME SYSTEM MILES KILOMETERS geometry
#> * <chr> <chr> <dbl> <dbl> <MULTILINESTRING [°]>
#> 1 Arkansas Mississippi 1446. 2327. ((-106.3789 39.36165, -106.3295 39.3…
#> 2 Mississippi Mississippi 2385. 3838. ((-95.02364 47.15609, -94.98973 47.3…
#> 3 Missouri Mississippi 2739. 4408. ((-110.5545 44.76081, -110.6122 44.7…
#> 4 Ohio Mississippi 1368. 2202. ((-89.12166 36.97756, -89.17502 37.0…
st_union(mississippi)
#> Geometry set for 1 feature
#> Geometry type: MULTILINESTRING
#> Dimension: XY
#> Bounding box: xmin: -112 ymin: 28.92983 xmax: -77.86168 ymax: 48.16158
#> Geodetic CRS: WGS 84
Measures 
- The st_length() function is used to calculate the length of a geometry.
- The st_area() function is used to calculate the area of a geometry.
- The st_distance() function is used to calculate the distance between two geometries.
- Example: Calculating the length of the Mississippi River system and the area of Larimer County.
st_length(mississippi)
#> Units: [m]
#> [1] 1912869 3147943 3331900 1785519
st_area(larimer)
#> 6813621254 [m^2]
st_distance(larimer, mississippi)
#> Units: [m]
#> [,1] [,2] [,3] [,4]
#> [1,] 116016.6 1009375 526454 1413983
Predicates 
- Spatial predicates are used to check relationships between geometries using the DE-9IM model.
- The st_intersects() function is used to check if geometries intersect.
- The st_filter() function is used to filter geometries based on a predicate.
st_intersects(counties, mississippi)
#> Sparse geometry binary predicate list of length 3108, where the
#> predicate was `intersects'
#> first 10 elements:
#> 1: (empty)
#> 2: (empty)
#> 3: (empty)
#> 4: (empty)
#> 5: (empty)
#> 6: (empty)
#> 7: (empty)
#> 8: (empty)
#> 9: (empty)
#> 10: (empty)
ints <- st_filter(counties, mississippi, .predicate = st_intersects)
ggplot() +
geom_sf(data = ints) +
geom_sf(data = mississippi, col = "blue") +
theme_bw()
~ Week 4-5: Spatial Data (Raster) 
terra 
- The terra package is used for working with raster data.
- It provides functions for reading, writing, and manipulating raster data.
library(terra)
gdal()
#> [1] "3.10.1"
I/O 
- Any raster format that GDAL can read can be read with rast().
- The package links to the native GDAL library (like sf).
- rast() reads the data headers, not the data itself, until the values are needed.
- Example: Reading a GeoTIFF of Colorado elevation.
(elev = terra::rast('data/colorado_elevation.tif'))
#> class : SpatRaster
#> dimensions : 16893, 21395, 1 (nrow, ncol, nlyr)
#> resolution : 30, 30 (x, y)
#> extent : -1146465, -504615, 1566915, 2073705 (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
#> source : colorado_elevation.tif
#> name : CONUS_dem
#> min value : 98679
#> max value : 439481
Raster Structure 
- Raster data is stored as a multi-dimensional array of values.
- Remember, this is an atomic vector with dimensions, the same way we built matrices and arrays earlier.
v <- values(elev)
head(v)
#> CONUS_dem
#> [1,] 242037
#> [2,] 243793
#> [3,] 244464
#> [4,] 244302
#> [5,] 244060
#> [6,] 243888
class(v[,1])
#> [1] "integer"
dim(v)
#> [1] 361425735 1
dim(elev)
#> [1] 16893 21395 1
nrow(elev)
#> [1] 16893
Additional Structure
In addition to the values and dimensions, rasters have:
- Extent: The spatial extent of the raster.
- Resolution: The spatial resolution of the raster pixels.
- CRS: The coordinate reference system of the raster.
crs(elev)
#> [1] "PROJCRS[\"unnamed\",\n BASEGEOGCRS[\"NAD83\",\n DATUM[\"North American Datum 1983\",\n ELLIPSOID[\"GRS 1980\",6378137,298.257222101004,\n LENGTHUNIT[\"metre\",1]]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n ID[\"EPSG\",4269]],\n CONVERSION[\"Albers Equal Area\",\n METHOD[\"Albers Equal Area\",\n ID[\"EPSG\",9822]],\n PARAMETER[\"Latitude of false origin\",23,\n ANGLEUNIT[\"degree\",0.0174532925199433],\n ID[\"EPSG\",8821]],\n PARAMETER[\"Longitude of false origin\",-96,\n ANGLEUNIT[\"degree\",0.0174532925199433],\n ID[\"EPSG\",8822]],\n PARAMETER[\"Latitude of 1st standard parallel\",29.5,\n ANGLEUNIT[\"degree\",0.0174532925199433],\n ID[\"EPSG\",8823]],\n PARAMETER[\"Latitude of 2nd standard parallel\",45.5,\n ANGLEUNIT[\"degree\",0.0174532925199433],\n ID[\"EPSG\",8824]],\n PARAMETER[\"Easting at false origin\",0,\n LENGTHUNIT[\"metre\",1],\n ID[\"EPSG\",8826]],\n PARAMETER[\"Northing at false origin\",0,\n LENGTHUNIT[\"metre\",1],\n ID[\"EPSG\",8827]]],\n CS[Cartesian,2],\n AXIS[\"easting\",east,\n ORDER[1],\n LENGTHUNIT[\"metre\",1,\n ID[\"EPSG\",9001]]],\n AXIS[\"northing\",north,\n ORDER[2],\n LENGTHUNIT[\"metre\",1,\n ID[\"EPSG\",9001]]]]"
ext(elev)
#> SpatExtent : -1146465, -504615, 1566915, 2073705 (xmin, xmax, ymin, ymax)
res(elev)
#> [1] 30 30
Crop/Mask 
- The crop() function is used to crop a raster to a specific extent.
- It is useful when you want to work with a subset of the data.
- crop() extracts data (whether from a remote or local source).
- The mask() function is used to mask a raster using a vector or other extent, keeping only the data within the mask.
- Input extents must match the CRS of the raster data.
- Example: Cropping, then masking, the elevation raster to Larimer County.
larimer_5070 <- st_transform(larimer, crs(elev))

larimer_elev = crop(elev, larimer_5070)
plot(larimer_elev)

larimer_mask <- mask(larimer_elev, larimer_5070)
plot(larimer_mask)
Summary / Algebra 
- Rasters can be added, subtracted, multiplied, and divided
- Any form of map algebra can be done with rasters
- For example, squaring the Larimer mask, then dividing the original raster by the result
raw
larimer_mask
#> class : SpatRaster
#> dimensions : 3054, 3469, 1 (nrow, ncol, nlyr)
#> resolution : 30, 30 (x, y)
#> extent : -849255, -745185, 1952655, 2044275 (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
#> source(s) : memory
#> varname : colorado_elevation
#> name : CONUS_dem
#> min value : 145787
#> max value : 412773
Data Operation
elev2 <- larimer_mask^2
rast modified by rast
larimer_mask / elev2
#> class : SpatRaster
#> dimensions : 3054, 3469, 1 (nrow, ncol, nlyr)
#> resolution : 30, 30 (x, y)
#> extent : -849255, -745185, 1952655, 2044275 (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
#> source(s) : memory
#> varname : colorado_elevation
#> name : CONUS_dem
#> min value : 2.422639e-06
#> max value : 6.859322e-06
statistical methods
(scaled = scale(larimer_mask))
#> class : SpatRaster
#> dimensions : 3054, 3469, 1 (nrow, ncol, nlyr)
#> resolution : 30, 30 (x, y)
#> extent : -849255, -745185, 1952655, 2044275 (xmin, xmax, ymin, ymax)
#> coord. ref. : +proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs
#> source(s) : memory
#> varname : colorado_elevation
#> name : CONUS_dem
#> min value : -1.562331
#> max value : 3.053412
Value Subsetting
- Rasters are matrices or arrays of values, and can be manipulated as such
- For example, setting 35% of the raster to NA
larimer_elev[sample(ncell(larimer_elev), .35 * ncell(larimer_elev))] <- NA
plot(larimer_elev)
Focal 
- The focal() function is used to calculate focal statistics.
- It is useful when you want to calculate statistics for each cell based on its neighbors.
- Example: Calculating the mean elevation within a 30-cell window to fill the NAs we just created.
xx = terra::focal(larimer_elev, win = 30, fun = "mean", na.policy = "only")
plot(xx)
~ Week 6-7: Machine Learning 
library(tidymodels)
tidymodels_packages()
#> [1] "broom" "cli" "conflicted" "dials" "dplyr"
#> [6] "ggplot2" "hardhat" "infer" "modeldata" "parsnip"
#> [11] "purrr" "recipes" "rlang" "rsample" "rstudioapi"
#> [16] "tibble" "tidyr" "tune" "workflows" "workflowsets"
#> [21] "yardstick" "tidymodels"
Seeds for reproducibility
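- Splitting and resampling rely on random number generation, so set a seed first to make results repeatable; a minimal sketch:
set.seed(42); runif(2)   # some pair of random numbers
set.seed(42); runif(2)   # the identical pair again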
rsample for resampling and cross-validation
- The rsample package is used for resampling and cross-validation.
- It provides functions for creating resamples and cross-validation folds.
- Example: Creating a 5-fold cross-validation object for the penguins dataset.
set.seed(123)
(penguins_split <- initial_split(drop_na(penguins), prop = 0.8, strata = species))
#> <Training/Testing/Total>
#> <265/68/333>

penguins_train <- training(penguins_split)
penguins_test <- testing(penguins_split)

penguin_folds <- vfold_cv(penguins_train, v = 5)
recipes for feature engineering
- The recipes package is used for feature engineering.
- It provides functions for preprocessing data before modeling.
- Example: Defining a recipe for feature engineering the penguins dataset.
# Define recipe for feature engineering
penguin_recipe <- recipe(species ~ ., data = penguins_train) |>
  step_impute_knn(all_predictors()) |>      # Impute missing values
  step_normalize(all_numeric_predictors())  # Normalize numeric features
Parsnip for model selection 
- The parsnip package is used for model implementation.
- It provides functions for defining model types, engines, and modes.
- Example: Defining models for logistic regression, random forest, and decision tree.
# Define models
log_reg_model <- multinom_reg() |>
  set_engine("nnet") |>
  set_mode("classification")

rf_model <- rand_forest(trees = 500) |>
  set_engine("ranger") |>
  set_mode("classification")

dt_model <- decision_tree() |>
  set_mode("classification")
Workflows for model execution 
- The workflows package is used for model execution.
- It provides functions for defining and executing workflows.
- Example: Creating a workflow for logistic regression.
# Create workflow
log_reg_workflow <- workflow() |>
  add_model(log_reg_model) |>
  add_recipe(penguin_recipe) |>
  fit_resamples(resamples = penguin_folds,
                metrics = metric_set(roc_auc, accuracy))
yardstick for model evaluation
- The yardstick package provides the metric functions (e.g., accuracy, roc_auc) used to evaluate models.
- The collect_metrics() function summarizes these metrics across the resamples.
collect_metrics(log_reg_workflow)
#> # A tibble: 2 × 6
#> .metric .estimator mean n std_err .config
#> <chr> <chr> <dbl> <int> <dbl> <chr>
#> 1 accuracy multiclass 1 5 0 Preprocessor1_Model1
#> 2 roc_auc hand_till 1 5 0 Preprocessor1_Model1
workflowsets for model comparison
- The workflowsets package is used for model comparison.
- It provides functions for comparing multiple models using the purrr mapping paradigm.
- Example: Comparing logistic regression, random forest, and decision tree models.
(workflowset <- workflow_set(list(penguin_recipe),
                             list(log_reg_model, rf_model, dt_model)) |>
   workflow_map("fit_resamples",
                resamples = penguin_folds,
                metrics = metric_set(roc_auc, accuracy)))
#> # A workflow set/tibble: 3 × 4
#> wflow_id info option result
#> <chr> <list> <list> <list>
#> 1 recipe_multinom_reg <tibble [1 × 4]> <opts[2]> <rsmp[+]>
#> 2 recipe_rand_forest <tibble [1 × 4]> <opts[2]> <rsmp[+]>
#> 3 recipe_decision_tree <tibble [1 × 4]> <opts[2]> <rsmp[+]>
autoplot / rank_results
- The autoplot() function is used to visualize model performance.
- The rank_results() function is used to rank models based on a metric.
- Example: Visualizing and ranking the model results based on the roc_auc (area under the curve) metric.
autoplot(workflowset)
rank_results(workflowset, rank_metric = "roc_auc")
#> # A tibble: 6 × 9
#> wflow_id .config .metric mean std_err n preprocessor model rank
#> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
#> 1 recipe_multinom_… Prepro… accura… 1 0 5 recipe mult… 1
#> 2 recipe_multinom_… Prepro… roc_auc 1 0 5 recipe mult… 1
#> 3 recipe_rand_fore… Prepro… accura… 0.981 5.97e-3 5 recipe rand… 2
#> 4 recipe_rand_fore… Prepro… roc_auc 1.00 3.60e-4 5 recipe rand… 2
#> 5 recipe_decision_… Prepro… accura… 0.955 1.28e-2 5 recipe deci… 3
#> 6 recipe_decision_… Prepro… roc_auc 0.953 1.39e-2 5 recipe deci… 3
Model Validation
- Finally, we can validate the model on the test set.
- The augment() function is used to add model predictions to the dataset.
- The conf_mat() function is used to create a confusion matrix.
- Example: Validating the logistic regression model on the test set.
workflow() |>
  # Add model and recipe
  add_model(log_reg_model) |>
  add_recipe(penguin_recipe) |>
  # Train model on the training set
  fit(data = penguins_train) |>
  # Generate predictions for the held-out test set
  augment(penguins_test) |>
  conf_mat(truth = species, estimate = .pred_class)
#> Truth
#> Prediction Adelie Chinstrap Gentoo
#> Adelie 30 0 0
#> Chinstrap 0 14 0
#> Gentoo 0 0 24
Conclusion
- Today we reviewed/introduced the foundations of R for environmental data science.
- We discussed data types, structures, and packages for data manipulation and modeling.
- We also explored vector and raster data, along with ML applications.
- We will continue to build on these concepts in future lectures.
- Please complete the survey to help us tailor the course to your needs.