The first step in computing zonal statistics is to build a weight map that can be used to reallocate the gridded data to each aggregation unit according to the fraction of each cell that unit overlaps. Two primary packages tackle this: intersectr (which uses areal as a back-end) and zonal (which uses exactextractr).

Option 1: intersectr

Setting up the data for an intersectr weights map requires (1) computing a vector representation of the grid and (2) intersecting the aggregation units with this grid to determine the percent overlap, or “coverage fraction”. Below, the workflow from the intersectr docs is wrapped in a function that requires a NetCDF file path, an sf geometry set, an ID variable in the geometries, and the variable to extract from the grid.

library(intersectr)
library(ncmeta)
library(RNetCDF)
library(sf)
library(dplyr)

intersectr_weights = function(file, geom, ID, var){
  # Identify the X/Y coordinate variables associated with the requested variable
  nc_coord_vars <- nc_coord_var(file)
  variable_name <- var
  nc_coord_vars <- filter(nc_coord_vars, variable == variable_name)

  # Read the X and Y coordinate vectors from the NetCDF file
  nc       <- open.nc(file)
  X_coords <- var.get.nc(nc, nc_coord_vars$X, unpack = TRUE)
  Y_coords <- var.get.nc(nc, nc_coord_vars$Y, unpack = TRUE)

  # Translate the NetCDF grid mapping attributes into a projection
  nc_prj <- nc_gm_to_prj(nc_grid_mapping_atts(file))

  # (1) Build a vector (polygon) representation of the grid cells
  cell_geometry = create_cell_geometry(X_coords = X_coords,
                         Y_coords = Y_coords,
                         prj = nc_prj,
                         geom = geom,
                         buffer_dist = 0.1, # Degrees
                         regularize = TRUE)

  # (2) Intersect the cell polygons with the aggregation units
  data_source_cells <- st_sf(dplyr::select(cell_geometry, grid_ids))
  target_polygons   <- st_sf(dplyr::select(geom, !!ID))
  st_agr(data_source_cells) <- "constant"
  st_agr(target_polygons)   <- "constant"

  calculate_area_intersection_weights(
      data_source_cells,
      target_polygons, allow_lonlat = TRUE)
}
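
Once computed, a weight map like this can be used to reallocate cell values to each aggregation unit with an area-weighted mean. The sketch below is illustrative only: cell_values and its pet column are hypothetical, and the weight-table column names (grid_ids, comid, w) are assumed to match the output of calculate_area_intersection_weights rather than taken from the intersectr docs.

# weights: output of intersectr_weights() -- assumed columns: grid_ids, comid, w
# cell_values: hypothetical data.frame with one value per grid cell for a
#              single time step -- assumed columns: grid_ids, pet
areal_mean <- weights %>%
  left_join(cell_values, by = "grid_ids") %>%
  group_by(comid) %>%
  summarize(pet = sum(w * pet, na.rm = TRUE) / sum(w, na.rm = TRUE))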

Here, two motivating use cases are shown to compare the efficiency of these two approaches. The first covers a large area with many small aggregation units. The second also covers a large area but has a few large polygon aggregation units. Each poses a unique set of demands with respect to how the weights must be computed.

Many small aggregation units

~20,000 watersheds along the east coast

This use case covers a large area with a lot of aggregation units.

file <-  '/Users/mjohnson/Downloads/pet_1979.nc'

geom <- st_make_valid(read_sf('/Users/mjohnson/github/hydrofabric/workflow/nhd_workflows/cache/ngen_01a-4.gpkg', "catchments"))

The gridded data and aggregate units we are working with can be seen below:

(meta = nc_dims(file))
#> # A tibble: 4 x 4
#>      id name  length unlim
#>   <int> <chr>  <dbl> <lgl>
#> 1     0 lon     1386 FALSE
#> 2     1 lat      585 FALSE
#> 3     2 day      365 FALSE
#> 4     3 crs        1 FALSE

The grid we are working with holds daily PET values for CONUS for the year 1979. It has:

  • 1,386 X_cells
  • 585 Y_cells
  • 365 time slices

for a total of 295,945,650 values.
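
As a quick check, that total is just the product of the grid and time dimensions reported by nc_dims():

1386 * 585 * 365
#> [1] 295945650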

glimpse(geom)
#> Rows: 19,700
#> Columns: 3
#> $ comid    <dbl> 4289595, 4290057, 4287059, 4287061, 4292649, 4292651, 4287199, 4287195, 4…
#> $ areasqkm <dbl> 12.2707416, 4.0614450, 2.8056073, 6.7866652, 18.1573074, 27.8619553, 11.6…
#> $ geom     <MULTIPOLYGON [m]> MULTIPOLYGON (((1976295 290..., MULTIPOLYGON (((1976415 290.…

In total we have 19,700 aggregation units to summarize over the 365 time steps.

bnch <- bench::mark(
  iterations = 1, check = FALSE, time_unit = "s",
  intersectr = intersectr_weights(file, geom, "comid", "potential_evapotranspiration"),
  zonal      = zonal::weighting_grid(file, geom, "comid")
)
expression    median (s)   mem_alloc    median_rel  mem_rel
intersectr      74.99095   860.8167MB      4.29406  1.327935
zonal           17.46388   648.2371MB      1.00000  1.000000

A few large aggregation units

~ 64 counties in Colorado

The alternative use case is a large domain that we want to aggregate to a smaller set of large aggregation units.

colorado  = AOI::aoi_get(state = "CO", county = "all")
glimpse(colorado)
#> Rows: 64
#> Columns: 13
#> $ statefp           <chr> "08", "08", "08", "08", "08", "08", "08", "08", "08", "08", "08"…
#> $ countyfp          <chr> "035", "095", "039", "014", "037", "053", "025", "003", "065", "…
#> $ countyns          <chr> "00198133", "00198163", "00198136", "01945881", "00198134", "001…
#> $ affgeoid          <chr> "0500000US08035", "0500000US08095", "0500000US08039", "0500000US…
#> $ geoid             <chr> "08035", "08095", "08039", "08014", "08037", "08053", "08025", "…
#> $ name              <chr> "Douglas", "Phillips", "Elbert", "Broomfield", "Eagle", "Hinsdal…
#> $ lsad              <chr> "06", "06", "06", "06", "06", "06", "06", "06", "06", "06", "06"…
#> $ aland             <dbl> 2176272717, 1781724973, 4793658887, 85478497, 4362874882, 289361…
#> $ awater            <dbl> 6752511, 301808, 442148, 1411781, 18849997, 15324686, 33430385, …
#> $ state_name        <chr> "Colorado", "Colorado", "Colorado", "Colorado", "Colorado", "Col…
#> $ state_abbr        <chr> "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO"…
#> $ jurisdiction_type <chr> "state", "state", "state", "state", "state", "state", "state", "…
#> $ geometry          <MULTIPOLYGON [°]> MULTIPOLYGON (((-105.2178 3..., MULTIPOLYGON (((-10…

In total we have 64 aggregation units to summarize over the 365 time steps.

bnch2 <- bench::mark(
  iterations = 1, check = FALSE, time_unit = "s",
  intersectr = intersectr_weights(file, colorado, "name", "potential_evapotranspiration"),
  zonal      = zonal::weighting_grid(file, colorado, "name")
)
expression    median (s)   mem_alloc    median_rel  mem_rel
intersectr   169.2785367   2.354612GB     927.5329  155.8245
zonal          0.1825041   15.47332MB       1.0000    1.0000

Notably, the intersectr approach, which relies on vector intersections, does not scale well when the aggregation units are large, with respect to both time and memory.
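
Much of that difference comes from zonal leaning on exactextractr, which computes coverage fractions directly against the raster grid rather than vectorizing every cell and intersecting polygons. The sketch below illustrates that idea only; it is not the zonal internals, and it assumes the NetCDF file can be read as a raster with terra::rast().

library(terra)
library(exactextractr)

# Read a single time slice of the PET grid as a raster (assumption: the
# NetCDF variable is readable directly by terra::rast())
r <- rast(file)[[1]]

# One data.frame per county holding the cell index, cell value, and the
# fraction of each cell covered by that county -- the same information as
# the weight maps above, derived without a vector representation of the grid
fracs <- exact_extract(r, colorado, include_cell = TRUE)
head(fracs[[1]])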