The ROC captures separation: the ROC curve shows the trade-off between sensitivity and specificity at different thresholds.
The Brier score captures calibration: it measures how well the predicted probabilities of an event match the actual outcomes.
Good separation: the densities don’t overlap.
Good calibration: the calibration line follows the diagonal.
Calibration plot: we bin observations according to predicted probability. In the bin for 20%-30% predicted probability, we should see an observed event rate of ~25% if the model is well-calibrated.
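As a quick illustration of both ideas (not part of the original slides), the sketch below uses made-up predicted probabilities and simulated 0/1 outcomes: the Brier score is the mean squared gap between predicted probability and outcome, and the calibration check bins predictions and compares each bin's observed event rate to its average predicted probability.

```r
library(dplyr)

# Hypothetical predictions and outcomes, simulated to be well-calibrated
set.seed(123)
pred_prob <- runif(1000)                        # made-up predicted probabilities
outcome   <- rbinom(1000, 1, prob = pred_prob)  # simulated 0/1 outcomes

# Brier score: mean squared difference between predicted probability and outcome
mean((pred_prob - outcome)^2)

# Calibration check: bin by predicted probability, then compare the observed
# event rate to the mean predicted probability within each bin
tibble(pred_prob, outcome) |>
  mutate(bin = cut(pred_prob, breaks = seq(0, 1, by = 0.1))) |>
  group_by(bin) |>
  summarise(
    mean_pred  = mean(pred_prob),  # average predicted probability in the bin
    event_rate = mean(outcome),    # observed event rate in the bin
    n          = n()
  )
```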
Regression
R-squared (\(R^2\))
Measures how well the model explains the variance in the data.
Formula:
\[
R^2 = 1 - \frac{SS_{res}}{SS_{tot}}
\] where:
- \(SS_{res}\) is the sum of squared residuals.
- \(SS_{tot}\) is the total sum of squares.
Values range from \(-\infty\) to 1.
- 1: Perfect fit
- 0: No improvement over mean prediction
- Negative: Worse than just using the mean
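As a quick sanity check (not from the original slides), \(R^2\) can be computed directly from this definition; the observed and predicted values below are made up for illustration.

```r
# A minimal sketch of R^2 from its definition, using made-up numbers
obs  <- c(3.1, 4.5, 2.2, 5.0, 3.8)   # hypothetical observed values
pred <- c(3.0, 4.2, 2.5, 4.8, 4.0)   # hypothetical model predictions

ss_res <- sum((obs - pred)^2)        # sum of squared residuals
ss_tot <- sum((obs - mean(obs))^2)   # total sum of squares around the mean

1 - ss_res / ss_tot                  # ~0.95 for these made-up values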
Root Mean Squared Error (RMSE)
Measures the model’s prediction error in the same unit as the dependent variable.
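A minimal sketch (not from the original slides) showing RMSE computed by hand and via yardstick's `rmse()`, using the same made-up values as above:

```r
library(yardstick)

d <- tibble::tibble(
  obs  = c(3.1, 4.5, 2.2, 5.0, 3.8),   # hypothetical observed values
  pred = c(3.0, 4.2, 2.5, 4.8, 4.0)    # hypothetical model predictions
)

sqrt(mean((d$obs - d$pred)^2))          # by hand: square, average, take the root
rmse(d, truth = obs, estimate = pred)   # the same computation via yardstick
```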
Metrics and metric sets work with grouped data frames!
augment(forested_fit, new_data = forested_train) |>
  group_by(tree_no_tree) |>
  accuracy(truth = forested, estimate = .pred_class)
#> # A tibble: 2 × 4
#>   tree_no_tree .metric  .estimator .estimate
#>   <fct>        <chr>    <chr>          <dbl>
#> 1 Tree         accuracy binary         0.946
#> 2 No tree      accuracy binary         0.941

augment(forested_fit, new_data = forested_train) |>
  group_by(tree_no_tree) |>
  specificity(truth = forested, estimate = .pred_class)
#> # A tibble: 2 × 4
#>   tree_no_tree .metric     .estimator .estimate
#>   <fct>        <chr>       <chr>          <dbl>
#> 1 Tree         specificity binary         0.582
#> 2 No tree      specificity binary         0.974
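Since metric *sets* also respect grouping, here is a hedged sketch (assuming the same `forested_fit` and `forested_train` objects as above) that computes several metrics per group in one call:

```r
# A metric_set() bundles metrics into one function; grouped data frames are
# respected, so every metric is computed within each level of tree_no_tree
forest_metrics <- metric_set(accuracy, specificity, sensitivity)

augment(forested_fit, new_data = forested_train) |>
  group_by(tree_no_tree) |>
  forest_metrics(truth = forested, estimate = .pred_class)
```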
Note
The specificity for "Tree" is a good bit lower than it is for "No tree".
So, within plots that this index classifies as having trees, the model does a poorer job of correctly identifying the plots that are actually non-forested.
Recap
Previously - Setup
library(tidyverse)

# Ingest Data
# URLs for COVID-19 case data and census population data
covid_url <- 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
pop_url   <- '/Users/mikejohnson/Downloads/co-est2023-alldata.csv'
# pop_url <- 'https://www2.census.gov/programs-surveys/popest/datasets/2020-2023/counties/totals/co-est2023-alldata.csv'

# Clean Census Data
census = readr::read_csv(pop_url) |>
  filter(COUNTY == "000") |>        # Filter for state-level data only
  mutate(fips = STATE) |>           # Create a new FIPS column for merging
  select(fips, contains("2021"))    # Select relevant columns for 2021 data

# Process COVID-19 Data
state_data <- readr::read_csv(covid_url) |>
  group_by(fips) |>
  mutate(
    new_cases  = pmax(0, cases - dplyr::lag(cases)),    # Compute new cases, ensuring no negative values
    new_deaths = pmax(0, deaths - dplyr::lag(deaths))   # Compute new deaths, ensuring no negative values
  ) |>
  ungroup() |>
  left_join(census, by = "fips") |>   # Merge with census data
  mutate(
    m = month(date),
    y = year(date),
    season = case_when(               # Define seasons based on month
      m %in% 3:5  ~ "Spring",
      m %in% 6:8  ~ "Summer",
      m %in% 9:11 ~ "Fall",
      m %in% c(12, 1, 2) ~ "Winter"
    )
  ) |>
  group_by(state, y, season) |>
  mutate(
    season_cases  = sum(new_cases,  na.rm = TRUE),   # Aggregate seasonal cases
    season_deaths = sum(new_deaths, na.rm = TRUE)    # Aggregate seasonal deaths
  ) |>
  distinct(state, y, season, .keep_all = TRUE) |>    # Keep only distinct rows by state, year, season
  ungroup() |>
  select(state, contains('season'), y, POPESTIMATE2021, BIRTHS2021, DEATHS2021) |>  # Select relevant columns
  drop_na() |>                                       # Remove rows with missing values
  mutate(logC = log(season_cases + 1))               # Log-transform case numbers for modeling
Previously - Data Usage
set.seed(4028)
split <- initial_split(state_data, prop = 0.8, strata = season)  # 80/20 train-test split
train <- training(split)   # Training set
test  <- testing(split)    # Test set

set.seed(3045)
folds <- vfold_cv(train, v = 10)  # 10-fold cross-validation
set.seed(9)
ctrl <- control_grid(save_pred = TRUE)

b_res <- b_wflow |>
  tune_grid(
    resamples = folds,
    grid = 25,
    # The options below are not required by default
    param_info = b_param,
    control = ctrl,
    metrics = metric_set(mae)
  )
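The next step finalizes the workflow with `b_best`, which is not defined in this recap; presumably it holds the best hyperparameter combination from the tuning results. A hedged sketch of how it would typically be obtained with tune:

```r
# Assumed step (not shown in the original recap): pick the best parameters
show_best(b_res, metric = "mae")               # inspect the top candidates
b_best <- select_best(b_res, metric = "mae")   # best hyperparameter combination
```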
Let’s fit the model on the training set and verify our performance using the test set.
We’ve seen fit() and predict() (+ augment()), but there is a shortcut:
# the boosted tree workflow can be `finalized` with the best parameters
workflow <- finalize_workflow(b_wflow, b_best)

# `split` has the train + test info
(final_fit <- last_fit(workflow, split))
#> # Resampling results
#> # Manual resampling 
#> # A tibble: 1 × 6
#>   splits            id               .metrics         .notes           .predictions       .workflow 
#>   <list>            <chr>            <list>           <list>           <list>             <list>    
#> 1 <split [570/144]> train/test split <tibble [2 × 4]> <tibble [0 × 3]> <tibble [144 × 4]> <workflow>
The final fit!
collect_metrics(final_fit)
#> # A tibble: 2 × 4
#>   .metric .estimator .estimate .config             
#>   <chr>   <chr>          <dbl> <chr>               
#> 1 rmse    standard       0.391 Preprocessor1_Model1
#> 2 rsq     standard       0.918 Preprocessor1_Model1

collect_predictions(final_fit) |>
  ggplot(aes(.pred, logC)) +
  geom_point() +
  geom_abline() +
  geom_smooth(method = "lm", se = FALSE)
The whole game
Assignment
Starting with the “Whole Game Image” on the last slide, write a sentence or two, in plain language, about each step and item, its importance, and key things to be careful of. The full submission should be a 2-3 paragraph narrative of the entire process, including the final model fit and its performance.