The first step in computing zonal statistics is building a weight map that can be used to reallocate the gridded data according to the fraction of each cell overlapping an aggregation unit. Two primary packages tackle this: intersectr (which uses areal as a back-end) and zonal (which uses exactextractr).
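As a minimal illustration of what such a weight map encodes, the toy table below (made-up cell ids, unit ids, and weights, not output from either package) links each grid cell to an aggregation unit with the fraction of the cell that falls inside it; re-aggregating one layer of cell values is then just a weighted average:

library(dplyr)

# Toy weight map: cell 1 is split 60/40 between units A and B,
# cells 2 and 3 fall entirely within A and B respectively.
weights <- tibble::tibble(
  cell = c(1, 1, 2, 3),
  unit = c("A", "B", "A", "B"),
  w    = c(0.6, 0.4, 1.0, 1.0)
)

# Toy gridded values for a single time step
values <- tibble::tibble(cell = 1:3, pet = c(2, 4, 6))

# Area-weighted mean of the gridded values within each unit
left_join(weights, values, by = "cell") %>%
  group_by(unit) %>%
  summarise(pet = sum(w * pet) / sum(w))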
Setting up the data for an intersectr weights map requires (1) computing a vector representation of the grid and (2) intersecting the aggregation units with that grid to determine the percent overlap, or “coverage fraction”. Below, the workflow from the intersectr docs is wrapped in a function that takes a NetCDF file path, an sf geometry set, the name of an ID variable in the geometries, and the variable to extract from the grid.
library(intersectr)
library(ncmeta)
library(RNetCDF)
library(sf)
library(dplyr)

intersectr_weights = function(file, geom, ID, var) {

  # Identify the coordinate variables associated with the requested variable
  nc_coord_vars <- nc_coord_var(file)
  variable_name <- var
  nc_coord_vars <- filter(nc_coord_vars, variable == variable_name)

  # Read the X/Y coordinate vectors and the grid projection from the NetCDF file
  nc       <- open.nc(file)
  X_coords <- var.get.nc(nc, nc_coord_vars$X, unpack = TRUE)
  Y_coords <- var.get.nc(nc, nc_coord_vars$Y, unpack = TRUE)
  nc_prj   <- nc_gm_to_prj(nc_grid_mapping_atts(file))

  # Build a vector (polygon) representation of the grid cells
  cell_geometry = create_cell_geometry(X_coords = X_coords,
                                       Y_coords = Y_coords,
                                       prj = nc_prj,
                                       geom = geom,
                                       buffer_dist = 0.1, # Degrees
                                       regularize = TRUE)

  data_source_cells <- st_sf(dplyr::select(cell_geometry, grid_ids))
  target_polygons   <- st_sf(dplyr::select(geom, !!ID))

  st_agr(data_source_cells) <- "constant"
  st_agr(target_polygons)   <- "constant"

  # Intersect the grid cells with the aggregation units to get areal weights
  calculate_area_intersection_weights(data_source_cells,
                                      target_polygons,
                                      allow_lonlat = TRUE)
}
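With the file and geometries loaded in the next section, the function would be called as below; if my reading of the intersectr docs is correct, the result is a table pairing each grid cell id with a catchment id and an areal weight column w.

# Hypothetical usage (see the `file` and `geom` objects defined below)
w <- intersectr_weights(file, geom, "comid", "potential_evapotranspiration")
head(w)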
Here, two motivating use cases are shown to compare the efficiency of these two approaches. The first covers a large area with many small aggregation units. The second also covers a large area but has a few large polygon aggregation units. Each of these poses a unique set of demands with respect to how the weight map is computed.
file <- '/Users/mjohnson/Downloads/pet_1979.nc'
geom <- st_make_valid(read_sf('/Users/mjohnson/github/hydrofabric/workflow/nhd_workflows/cache/ngen_01a-4.gpkg', "catchments"))
The gridded data and aggregate units we are working with can be seen below:
(meta = nc_dims(file))
#> # A tibble: 4 x 4
#> id name length unlim
#> <int> <chr> <dbl> <lgl>
#> 1 0 lon 1386 FALSE
#> 2 1 lat 585 FALSE
#> 3 2 day 365 FALSE
#> 4 3 crs 1 FALSE
The grid we are working with holds daily PET values for CONUS for the year 1979. It has:

- 1,386 longitude (X) values
- 585 latitude (Y) values
- 365 daily time steps

for a total of 295,945,650 values.
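This total follows directly from the dimensions reported above:

1386 * 585 * 365
#> [1] 295945650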
glimpse(geom)
#> Rows: 19,700
#> Columns: 3
#> $ comid <dbl> 4289595, 4290057, 4287059, 4287061, 4292649, 4292651, 4287199, 4287195, 4…
#> $ areasqkm <dbl> 12.2707416, 4.0614450, 2.8056073, 6.7866652, 18.1573074, 27.8619553, 11.6…
#> $ geom <MULTIPOLYGON [m]> MULTIPOLYGON (((1976295 290..., MULTIPOLYGON (((1976415 290.…
In total we have 19,700 aggregation units to summarize over the 365 time steps.
bnch <- bench::mark(
  iterations = 1, check = FALSE, time_unit = "s",
  intersectr = intersectr_weights(file, geom, "comid", "potential_evapotranspiration"),
  zonal      = zonal::weighting_grid(file, geom, "comid")
)
| expression | median (s) | mem_alloc | median_rel | mem_rel |
|---|---|---|---|---|
| intersectr | 74.99095 | 860.8167 MB | 4.29406 | 1.327935 |
| zonal | 17.46388 | 648.2371 MB | 1.00000 | 1.000000 |
The alternative use case aggregates a large domain to a small set of large aggregation units: the 64 counties of Colorado.
colorado = AOI::aoi_get(state = "CO", county = "all")
glimpse(colorado)
#> Rows: 64
#> Columns: 13
#> $ statefp <chr> "08", "08", "08", "08", "08", "08", "08", "08", "08", "08", "08"…
#> $ countyfp <chr> "035", "095", "039", "014", "037", "053", "025", "003", "065", "…
#> $ countyns <chr> "00198133", "00198163", "00198136", "01945881", "00198134", "001…
#> $ affgeoid <chr> "0500000US08035", "0500000US08095", "0500000US08039", "0500000US…
#> $ geoid <chr> "08035", "08095", "08039", "08014", "08037", "08053", "08025", "…
#> $ name <chr> "Douglas", "Phillips", "Elbert", "Broomfield", "Eagle", "Hinsdal…
#> $ lsad <chr> "06", "06", "06", "06", "06", "06", "06", "06", "06", "06", "06"…
#> $ aland <dbl> 2176272717, 1781724973, 4793658887, 85478497, 4362874882, 289361…
#> $ awater <dbl> 6752511, 301808, 442148, 1411781, 18849997, 15324686, 33430385, …
#> $ state_name <chr> "Colorado", "Colorado", "Colorado", "Colorado", "Colorado", "Col…
#> $ state_abbr <chr> "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO", "CO"…
#> $ jurisdiction_type <chr> "state", "state", "state", "state", "state", "state", "state", "…
#> $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-105.2178 3..., MULTIPOLYGON (((-10…
In total we have 64 aggregation units to summarize over the 365 time steps.
bnch2 <- bench::mark(
  iterations = 1, check = FALSE, time_unit = "s",
  intersectr = intersectr_weights(file, colorado, "name", "potential_evapotranspiration"),
  zonal      = zonal::weighting_grid(file, colorado, "name")
)
| expression | median (s) | mem_alloc | median_rel | mem_rel |
|---|---|---|---|---|
| intersectr | 169.2785367 | 2.354612 GB | 927.5329 | 155.8245 |
| zonal | 0.1825041 | 15.47332 MB | 1.0000 | 1.0000 |
Notably, the intersectr approach, which relies on vector intersections, scales poorly when the aggregation units are large, both in run time and in memory.
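For comparison, the raster-based route that zonal builds on computes coverage fractions directly from the grid via exactextractr, without ever building cell polygons. A rough sketch, assuming the NetCDF can be read with terra and that one layer is enough to define the grid:

library(exactextractr)
library(terra)
library(sf)

# Read one layer of the PET grid and, for each Colorado county,
# list the cells it touches and the fraction of each cell covered.
r   <- rast(file)[[1]]
cov <- exact_extract(r, st_transform(colorado, crs(r)),
                     include_cell = TRUE, include_cols = "name")

# One data.frame per county: name, value, cell index, coverage_fraction
head(cov[[1]])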