The first step in computing zonal statistics is building a weight map that can be used to reallocate the gridded data according to the fraction of each cell overlapping an aggregation unit. Two primary packages tackle this: intersectr (which uses areal as a back-end) and zonal (which uses exactextractr).
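As a minimal illustration of what such a weight map encodes, the toy table below (made-up cell ids, unit ids, and weights, not output from either package) links each grid cell to an aggregation unit with the fraction of the cell that falls inside it; re-aggregating one layer of cell values is then just a weighted average:

library(dplyr)

# Toy weight map: cell 1 is split 60/40 between units A and B,
# cells 2 and 3 fall entirely within A and B respectively.
weights <- tibble::tibble(
  cell = c(1, 1, 2, 3),
  unit = c("A", "B", "A", "B"),
  w    = c(0.6, 0.4, 1.0, 1.0)
)

# Toy gridded values for a single time step
values <- tibble::tibble(cell = 1:3, pet = c(2, 4, 6))

# Area-weighted mean of the gridded values within each unit
left_join(weights, values, by = "cell") %>%
  group_by(unit) %>%
  summarise(pet = sum(w * pet) / sum(w))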
Setting up the data for an intersectr weights map requires (1) computing a vector representation of the grid and (2) intersecting the aggregation units with that grid to determine the percent overlap, or “coverage fraction”. Below, the workflow from the intersectr docs is wrapped in a function that takes a NetCDF file path, an sf geometry set, the name of an ID variable in the geometries, and the variable to extract from the grid.
library(intersectr)
library(ncmeta)
library(RNetCDF)
library(sf)
library(dplyr)

intersectr_weights = function(file, geom, ID, var) {

  # Identify the coordinate variables associated with the requested variable
  nc_coord_vars <- nc_coord_var(file)
  variable_name <- var
  nc_coord_vars <- filter(nc_coord_vars, variable == variable_name)

  # Read the X/Y coordinate vectors and the grid projection from the NetCDF file
  nc       <- open.nc(file)
  X_coords <- var.get.nc(nc, nc_coord_vars$X, unpack = TRUE)
  Y_coords <- var.get.nc(nc, nc_coord_vars$Y, unpack = TRUE)
  nc_prj   <- nc_gm_to_prj(nc_grid_mapping_atts(file))

  # Build a vector (polygon) representation of the grid cells
  cell_geometry = create_cell_geometry(X_coords = X_coords,
                                       Y_coords = Y_coords,
                                       prj = nc_prj,
                                       geom = geom,
                                       buffer_dist = 0.1, # Degrees
                                       regularize = TRUE)

  data_source_cells <- st_sf(dplyr::select(cell_geometry, grid_ids))
  target_polygons   <- st_sf(dplyr::select(geom, !!ID))

  st_agr(data_source_cells) <- "constant"
  st_agr(target_polygons)   <- "constant"

  # Intersect the grid cells with the aggregation units to get areal weights
  calculate_area_intersection_weights(data_source_cells,
                                      target_polygons,
                                      allow_lonlat = TRUE)
}
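With the file and geometries loaded in the next section, the function would be called as below; if my reading of the intersectr docs is correct, the result is a table pairing each grid cell id with a catchment id and an areal weight column w.

# Hypothetical usage (see the `file` and `geom` objects defined below)
w <- intersectr_weights(file, geom, "comid", "potential_evapotranspiration")
head(w)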
Here, two motivating use cases are shown to compare the efficiency of these two approaches. The first covers a large area with many small aggregation units. The second also covers a large area but has a few large polygon aggregation units. Each of these poses a unique set of demands with respect to how the weight map is computed.
file <- '/Users/mjohnson/Downloads/pet_1979.nc'
geom <- st_make_valid(read_sf('/Users/mjohnson/github/hydrofabric/workflow/nhd_workflows/cache/ngen_01a-4.gpkg', "catchments"))
The gridded data and aggregate units we are working with can be seen below:
(meta = nc_dims(file))
#> # A tibble: 4 x 4
#> id name length unlim
#> <int> <chr> <dbl> <lgl>
#> 1 0 lon 1386 FALSE
#> 2 1 lat 585 FALSE
#> 3 2 day 365 FALSE
#> 4 3 crs 1 FALSE
The grid we are working with holds daily PET values for CONUS for the year 1979. It has:

- 1,386 longitude (X) values
- 585 latitude (Y) values
- 365 daily time steps

for a total of 295,945,650 values.
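This total follows directly from the dimensions reported above:

1386 * 585 * 365
#> [1] 295945650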
glimpse(geom)
#> Rows: 19,700
#> Columns: 3
#> $ comid <dbl> 4289595, 4290057, 4287059, 4287061, 4292649, 4292651, 4287199, 4287195, 4…
#> $ areasqkm <dbl> 12.2707416, 4.0614450, 2.8056073, 6.7866652, 18.1573074, 27.8619553, 11.6…
#> $ geom <MULTIPOLYGON [m]> MULTIPOLYGON (((1976295 290..., MULTIPOLYGON (((1976415 290.…
In total we have 19,700 aggregation units to summarize over the 365 time steps.
bnch <- bench::mark(
  iterations = 1, check = FALSE, time_unit = "s",
  intersectr = intersectr_weights(file, geom, "comid", "potential_evapotranspiration"),
  zonal      = zonal::weighting_grid(file, geom, "comid")
)
| expression | median (s) | mem_alloc | median_rel | mem_rel |
|---|---|---|---|---|
| intersectr | 74.99095 | 860.8167 MB | 4.29406 | 1.327935 |
| zonal | 17.46388 | 648.2371 MB | 1.00000 | 1.000000 |
The alternative use case aggregates a large domain to a small set of large aggregation units: the 64 counties of Colorado.
colorado = AOI::aoi_get(state = "CO", county = "all")
glimpse(colorado)
#> Rows: 64
#> Columns: 13
#> $ statefp <chr> "08", "08", "08", "08", "08", "08", "08", "08", "08", "08", "08"…
#> $ countyfp <chr> "035", "095", "039", "014", "037", "053", "025", "003", "065", "…
#> $ countyns <chr> "00198133", "00198163", "00198136", "01945881", "00198134", "001…
#> $ affgeoid <chr> "0500000US08035", "0500000US08095", "0500000US08039", "0500000US…
#> $ geoid <chr> "08035", "08095", "08039", "08014", "08037", "08053", "08025", "…
#> $ name <chr> "Douglas", "Phillips", "Elbert", "Broomfield", "Eagle", "Hinsdal…
#> $ lsad <chr> "06", "06", "06", "06", "06", "06", "06", "06", "06", "06", "06"…
#> $ aland <dbl> 2176272717, 1781724973, 4793658887, 85478497, 4362874882, 289361…
#> $ awater <dbl> 6752511, 301808, 442148, 1411781, 18849997, 15324686, 33430385, …
#> $ state_name <chr> "Colorado", "Colorado", "Colorado", "Colorado", "Colorado", "Col…
#> $ state_abbr <chr> "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO"…
#> $ jurisdiction_type <chr> "state", "state", "state", "state", "state", "state", "state", "…
#> $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-105.2178 3..., MULTIPOLYGON (((-10…
In total we have 64 aggregation units to summarize over the 365 time steps.
bnch2 <- bench::mark(
  iterations = 1, check = FALSE, time_unit = "s",
  intersectr = intersectr_weights(file, colorado, "name", "potential_evapotranspiration"),
  zonal      = zonal::weighting_grid(file, colorado, "name")
)
| expression | median (s) | mem_alloc | median_rel | mem_rel |
|---|---|---|---|---|
| intersectr | 169.2785367 | 2.354612 GB | 927.5329 | 155.8245 |
| zonal | 0.1825041 | 15.47332 MB | 1.0000 | 1.0000 |
Notably, the intersectr approach, which relies on vector intersections, scales poorly when the aggregation units are large, both in run time and in memory.
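For comparison, the raster-based route that zonal builds on computes coverage fractions directly from the grid via exactextractr, without ever building cell polygons. A rough sketch, assuming the NetCDF can be read with terra and that one layer is enough to define the grid:

library(exactextractr)
library(terra)
library(sf)

# Read one layer of the PET grid and, for each Colorado county,
# list the cells it touches and the fraction of each cell covered.
r   <- rast(file)[[1]]
cov <- exact_extract(r, st_transform(colorado, crs(r)),
                     include_cell = TRUE, include_cols = "name")

# One data.frame per county: name, value, cell index, coverage_fraction
head(cov[[1]])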