This document carries us through the basics of finding and accessing data from the internet using R. While the details (e.g. specific functions) are R-based, the concepts apply to all programming languages.
Section 1: R basics, and foundation spatial libraries
Section 2: Key hydrology based packages (focus on data access)
Section 3: Web Data without services
Section 4: Hands on example
Install guides for R and RStudio can be found online.
R packages are extensions to the base R statistical programming language. Packages contain “code, data, and documentation that can be installed by users of R” [Wikipedia].
install.packages("name")
can install packages from CRAN.

remotes::install_github("user/repo")
can install packages from GitHub that are not released to CRAN (development versions, personal repos, etc.).
Packages are loaded (attached) into a working environment by calling library(name). Once attached, all objects (code, data, and documentation) of that package are exposed for use.
Syntax
Functions are code snippets that can be reused for common and/or complex tasks. Functions can be developed and shared in packages.
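As a minimal sketch, a user-defined function bundles a repeated task (the function name and conversion here are illustrative, not from any package):

```r
# A small reusable function: convert Fahrenheit to Celsius
f_to_c <- function(f) {
  (f - 32) * 5 / 9
}

f_to_c(212)  # boiling point of water -> 100
```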
In R, package::function is the syntax for calling functions from a package that has not been attached, or that has a conflicting function name. Example: sf::read_sf() calls the read_sf function from the sf package.
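To see the difference without installing anything, the base stats package (attached by default) works the same way:

```r
# Qualified call: works even if the package is not attached
stats::median(c(1, 5, 9))
# [1] 5

# After attaching, the function name alone is enough
library(stats)
median(c(1, 5, 9))
# [1] 5
```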
Preceding a function with a ? brings up the help page for that function:
?sf::read_sf
Hashes (#) inserted before lines of code are comments; they are present for human use, not machine interpretation.
# this is a comment, below is code
2 + 2
[1] 4
We are going to discuss spatial data for hydrology in two contexts.
“Vector” data are comprised of points, lines, and polygons that represent discrete spatial entities, such as a river, watershed, or stream gauge.
“Raster” data divides spaces into rectilinear cells (pixels) to represent spatially continuous phenomena, such as elevation or the weather. The cell size (or resolution) defines the fidelity of the data.
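To make the two models concrete, here is a minimal sketch (using sf and terra, introduced below) that builds a toy vector point and a toy raster grid; the coordinates and values are made up for illustration:

```r
library(sf)
library(terra)

# Vector: a single point feature (e.g., a hypothetical stream gauge)
gauge <- st_sfc(st_point(c(-105.1, 40.0)), crs = 4326)

# Raster: a 10 x 10 grid of cells over the same area
# (a toy "elevation" surface with random values)
elev <- rast(nrows = 10, ncols = 10,
             xmin = -106, xmax = -105, ymin = 39.5, ymax = 40.5)
values(elev) <- runif(ncell(elev), 1500, 3000)
```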
R's spatial packages bind to a set of core open source geospatial libraries. These include:
PROJ –> projections, transformations
GEOS –> Geometry operations (measures, relations)
GDAL –> geodata abstraction and processing (read, write)
Simple Features (officially Simple Feature Access) is both an OGC and International Organization for Standardization (ISO) standard that specifies how (mostly) two-dimensional geometries can represent and describe objects in the real world.
It describes how such objects can be stored in and retrieved from databases, and which geometrical operations should be defined for them.
It outlines how the spatial elements of POINTs (XY locations with a specific coordinate reference system) extend to LINESTRINGs, POLYGONs, and GEOMETRYCOLLECTIONs.
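A quick sketch of these geometry building blocks using sf (covered next); the coordinates are arbitrary:

```r
library(sf)

# Simple feature geometries (sfg) of increasing complexity
pt   <- st_point(c(0, 1))
line <- st_linestring(rbind(c(0, 0), c(1, 1)))
poly <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 0))))

# A GEOMETRYCOLLECTION groups heterogeneous geometries
gc <- st_geometrycollection(list(pt, line, poly))
st_geometry_type(st_sfc(gc))
```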
In R, the sf package provides “support for simple features, a standardized way to encode spatial vector data… [and] binds to ‘GDAL’ for reading and writing data, to ‘GEOS’ for geometrical operations, and to ‘PROJ’ for projection conversions and datum transformations.”
Therefore, when using R you are using an interface to the core community standards, software, and practices (this is not exclusive to R). To highlight this, we can install (do this once) and attach sf to view the versions of the external libraries linked to sf.
# install.packages("sf")
library(sf)
sf_extSoftVersion()
GEOS GDAL proj.4 GDAL_with_GEOS USE_PROJ_H
"3.10.2" "3.4.2" "8.2.1" "false" "true"
PROJ
"8.2.1"
The bindings to these lower-level C libraries, and the larger sf ecosystem in R, can be seen below:
What is novel about the sf implementation in the R ecosystem is the way in which the authors built on the list structure in R to store simple feature geometries (sfg) as part of a larger data.frame using a simple feature geometry list-column (sfc). The collection of attribute and spatial information defines a simple feature that can be operated on in both table (SQL) and spatial (GEOS, etc.) contexts. Not only does this allow us to make the most use of the growing spatial community, but also of the growing data science community (see ggplot2, dplyr, data.table, dbplyr, arrow, etc.).
In practice, an sf object in R looks like the following:
This extends the idea of “tidy” data in that each row represents one observation, which has one geometric representation of the real world feature it describes.
# Define file path
(filename <- system.file("shape/nc.shp", package = "sf"))
[1] "/Library/Frameworks/R.framework/Versions/4.2/Resources/library/sf/shape/nc.shp"
# read in file
(nc <- read_sf(filename))
Simple feature collection with 100 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS: NAD27
# A tibble: 100 × 15
AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 10
2 0.061 1.23 1827 1827 Alle… 37005 37005 3 487 0 10
3 0.143 1.63 1828 1828 Surry 37171 37171 86 3188 5 208
4 0.07 2.97 1831 1831 Curr… 37053 37053 27 508 1 123
5 0.153 2.21 1832 1832 Nort… 37131 37131 66 1421 9 1066
6 0.097 1.67 1833 1833 Hert… 37091 37091 46 1452 7 954
7 0.062 1.55 1834 1834 Camd… 37029 37029 15 286 0 115
8 0.091 1.28 1835 1835 Gates 37073 37073 37 420 0 254
9 0.118 1.42 1836 1836 Warr… 37185 37185 93 968 4 748
10 0.124 1.43 1837 1837 Stok… 37169 37169 85 1612 1 160
# … with 90 more rows, and 4 more variables: BIR79 <dbl>, SID79 <dbl>,
# NWBIR79 <dbl>, geometry <MULTIPOLYGON [°]>
# Map
plot(nc['SID79'])
# Spatial measures!
head(st_area(nc))
Units: [m^2]
[1] 1137107793 610916077 1423145355 694378925 1520366979 967504822
# Spatial operations!
{st_union(nc) |> plot()
st_centroid(nc)$geometry |> plot(col = "red", add = TRUE)
}
# data science operations
library(dplyr)
{plot(nc$geometry)
plot(slice_max(nc, AREA, n = 10)$geometry,
col = "red", add = TRUE)
plot(slice_max(nc, AREA, n = 5)$geometry,
col = "yellow", add = TRUE)
plot(slice_max(nc, AREA, n = 1)$geometry,
col = "green", add = TRUE)
}
While sf is a leading package for working with vector data, terra is a primary R package for working with raster data. terra also operates on vector data, but we focus on sf for those purposes.
terra builds on the older raster package and provides methods for low-level data manipulation as well as high-level global, local, zonal, and focal computations. The predict and interpolate methods facilitate the use of regression-type (interpolation, machine learning) models for spatial prediction, including with satellite remote sensing data. Processing of very large files is supported.
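A brief sketch of the global/local/focal vocabulary on a toy raster (values are arbitrary):

```r
library(terra)

# Toy raster for illustration
r <- rast(nrows = 10, ncols = 10)
values(r) <- 1:ncell(r)

# Global: one summary statistic for the whole layer
global(r, "mean")

# Local: cell-by-cell algebra
r2 <- r * 2 + 1

# Focal: a 3x3 moving-window mean
f <- focal(r, w = 3, fun = mean)
```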
Like sf, terra links to GDAL, and the version and drivers can be viewed using package support functions:
# install.packages("terra")
library(terra)
gdal()
[1] "3.4.2"
DT::datatable(gdal(drivers = TRUE))
(filename <- system.file("ex/meuse.tif", package = "terra"))
[1] "/Library/Frameworks/R.framework/Versions/4.2/Resources/library/terra/ex/meuse.tif"
(r <- rast(filename))
class : SpatRaster
dimensions : 115, 80, 1 (nrow, ncol, nlyr)
resolution : 40, 40 (x, y)
extent : 178400, 181600, 329400, 334000 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=sterea +lat_0=52.1561605555556 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +datum=WGS84 +units=m +no_defs
source : meuse.tif
name : test
min value : 138
max value : 1736
plot(r, main='SpatRaster')
mapview is a leaflet helper package designed to quickly and conveniently create interactive visualizations of spatial data.
# install.packages("mapview")
library(mapview)
mapview(nc)
Together, sf/terra/mapview provide much of the functionality needed to get the best of both a programmatic and GUI-driven GIS!
With a strong grasp of the basic spatial libraries for reading, writing, and manipulating spatial data, we can move on to packages that seek to support the hydrologic and earth systems communities.
AOI is NOT on CRAN but provides interfaces for geocoding and retrieving spatial boundaries (e.g. state, county, and country). When working with web-based subsetting services, the ability to quickly define common features is really convenient.
We’ll see through these examples that the concept of an AOI is key to requesting subsets of data. Further, we’ll see that a reproducible way of generating these is a quick way to make the syntax of disparate packages more aligned.
# remotes::install_github("mikejohnson51/AOI")
library(AOI)
# state
aoi_get(state = "CO") |> mapview()
# county
aoi_get(state = "AL", county = "Tuscaloosa") |> mapview()
# country
aoi_get(country = "Ukraine") |> mapview()
# geocoding
geocode("NOAA National Water Center", pt = TRUE) |> mapview()
nhdplusTools is an R package for working with the NHD and the NHD data model. The package is robust and covers 5 primary use topics.
Given the focus here on geospatial data via web services, we will look at functions supporting use case 2.
# install.packages('nhdplusTools')
library(nhdplusTools)
grep("get_", ls("package:nhdplusTools"), value = TRUE)
[1] "get_boundaries" "get_DD"
[3] "get_DM" "get_elev_along_path"
[5] "get_flowline_index" "get_gagesII"
[7] "get_hr_data" "get_huc12"
[9] "get_huc8" "get_hydro_location"
[11] "get_levelpaths" "get_nhdarea"
[13] "get_nhdplus" "get_nhdplushr"
[15] "get_nldi_basin" "get_nldi_characteristics"
[17] "get_nldi_feature" "get_nldi_index"
[19] "get_node" "get_nwis"
[21] "get_partial_length" "get_path_lengths"
[23] "get_pathlength" "get_pfaf"
[25] "get_raindrop_trace" "get_sorted"
[27] "get_split_catchment" "get_streamlevel"
[29] "get_streamorder" "get_terminal"
[31] "get_tocomid" "get_UM"
[33] "get_UT" "get_vaa"
[35] "get_vaa_names" "get_vaa_path"
[37] "get_waterbodies" "get_waterbody_index"
[39] "get_wb_outlet" "get_xs_point"
[41] "get_xs_points"
Here we are interested in getting data for an AOI around Boston. We use AOI::aoi_get to get the OpenStreetMap representation of Boston, and initialize an empty list to populate:
boston = list()
boston$AOI = aoi_get("Boston")

boston$nhdplus     <- get_nhdplus(AOI = boston$AOI)
boston$waterbodies <- get_waterbodies(AOI = boston$AOI)
boston$gages       <- get_nwis(AOI = boston$AOI)
boston$huc12       <- get_huc12(AOI = boston$AOI)
# How many features per entry?
sapply(boston, nrow)
AOI nhdplus waterbodies gages huc12
1 201 58 9 18
mapview(boston)
In both the NHDPlus data model and the emerging international OGC standard for representing surface water features (HY_Features), it’s recognized that hydrologic entities can be “realized” in a number of ways.
For example, the holistic notion of a ‘catchment’ can be realized as a ‘divide’, ‘flowpath’, ‘outlet’, or other representation. The nhdplusTools::get_nhdplus function supports getting three realizations of the NHD features that are consistent with the NHDPlus data model (outlet, flowline, or catchment), or all of them at once (all).
boulder <- get_nhdplus(aoi_get("boulder"), realization = "all")
mapview(boulder)
The USGS-supported dataRetrieval package provides retrieval functions for USGS and EPA hydrologic and water quality data. Recently, a Network Linked Data Index (NLDI) client was also added to the package.
# install.packages('dataRetrieval')
library(dataRetrieval)
# Supply gauge ids from Boston example
# Parameter codes can be found with (`dataRetrieval::parameterCdFile`)
# "00060" is stream flow
flows = readNWISdv(site = boston$gages$site_no, parameterCd = "00060") |>
  renameNWISColumns()

library(ggplot2)  # for plotting

ggplot(data = flows) +
  geom_line(aes(x = Date, y = Flow)) +
  facet_wrap('site_no')
The hydro Network-Linked Data Index (NLDI) is a system that can index spatial and river network-linked data and navigate the river network to allow discovery of indexed information.
The NLDI service requires an origin feature (comid, nwis, wqp, location, and more), a navigation mode, and the features to find. The available feature sources can be listed with the get_nldi_sources function:
DT::datatable(get_nldi_sources())
The NLDI client in dataRetrieval operates by specifying the origin, the navigation mode, and the features to find. In the following example, we want to navigate along the upstream tributary of COMID 101 (up to 1000 km) and find all flowline features, NWIS gages, and the contributing drainage area.
nldi = findNLDI(comid = 101,
                nav = "UT",
                find = c("flowline", "nwis", "basin"),
                distance_km = 1000)
sapply(nldi, nrow)
origin basin UT_nwissite UT_flowlines
1 1 28 1815
mapview(nldi)
While dataRetrieval provides retrieval functions for observed data, there is an increasing amount of simulated data available for use. One example is the 3 retrospective simulations of the National Water Model. This data certainly falls under the umbrella of “big data” and can be difficult to work with given each hourly file (for up to 42 years!) is stored in a separate, cloud-based file.
nwmTools provides retrieval functions for these retrospective streamflow data from the NOAA National Water Model reanalysis products:
# remotes::install_github("mikejohnson51/nwmTools")
library(nwmTools)
nwis = readNWISdv(site = gsub("USGS-", "", nldi$UT_nwissite$identifier[10]),
                  param = "00060") |>
  renameNWISColumns()

nwm = readNWMdata(comid = nldi$UT_nwissite$comid[10])
The timestep of the retrospective NWM data is hourly. In many cases, it's more desirable to have the data aggregated to a different temporal resolution. nwmTools provides a family of aggregation functions that allow you to specify the aggregation time period and the aggregation summary statistic.
grep('aggregate_', ls("package:nwmTools"), value = TRUE)
[1] "aggregate_dowy" "aggregate_j" "aggregate_m" "aggregate_record"
[5] "aggregate_s" "aggregate_wy" "aggregate_wym" "aggregate_wymd"
[9] "aggregate_wys" "aggregate_y" "aggregate_yj" "aggregate_ym"
[13] "aggregate_ymd" "aggregate_ys"
# Core function
nwm_ymd = aggregate_ymd(nwm, "mean")

# User defined function
seasonal = aggregate_s(nwm, fun = function(x){ quantile(x, .75) })

# Multiple functions
month = aggregate_m(nwm, c('mean', "max"))
library(ggplot2)
library(patchwork)  # the + composition of separate ggplots below relies on patchwork

ggplot(data = nwis) +
  geom_line(aes(x = Date, y = Flow * 0.028316846592)) +
  labs(title = "NWIS Data") +
ggplot(data = nwm_ymd) +
  geom_line(aes(x = ymd, y = flow_cms_v2.1)) +
  labs(title = "NWM Daily Average") +
ggplot(data = seasonal) +
  geom_col(aes(x = season, y = flow_cms_v2.1)) +
  labs(title = "NWM Seasonal 75%") +
ggplot(data = month) +
  geom_col(aes(x = as.factor(month), y = flow_cms_v2.1_max)) +
  geom_col(aes(x = as.factor(month), y = flow_cms_v2.1_mean), fill = "red") +
  labs(title = "NWM Monthly Mean/Maximum")
elevatr is an R package focused on returning elevation data as a SpatialPointsDataFrame from point elevation services, or as a raster object from raster elevation services. Currently, the package supports access to the Amazon Web Services Terrain Tiles, the Open Topography Global Datasets API, and the USGS Elevation Point Query Service.
# install.packages("elevatr")
library(elevatr)
x <- get_elev_raster(nldi$basin, z = 12)
plot(rast(x))
One of the biggest challenges with Earth System and spatial research is extracting data. These challenges include not only finding the source data but then downloading, managing, and extracting the partitions critical for a given task.
Services exist to make data more readily available over the web, but they introduce new challenges of identifying subsets and working across a wide array of standards (and non-standards), all without alleviating the challenge of finding resources.
In light of this, opendap.catalog provides 4 primary services.
# remotes::install_github("mikejohnson51/opendap.catalog")
library(opendap.catalog)
= aoi_get(state = "FL")
AOI
<- dap(URL = "https://cida.usgs.gov/thredds/dodsC/bcsd_obs",
dap AOI = AOI,
startDate = "1995-01-01")
source: https://cida.usgs.gov/thredds/dodsC/bcsd_obs
varname(s):
> pr [mm/m] (monthly_sum_pr)
> prate [mm/d] (monthly_avg_prate)
> tas [C] (monthly_avg_tas)
> tasmax [C] (monthly_avg_tasmax)
> tasmin [C] (monthly_avg_tasmin)
> wind [m/s] (monthly_avg_wind)
==================================================
diminsions: 63, 48, 1 (names: longitude,latitude,time)
resolution: 0.125, 0.125, 1 months
extent: -87.75, -79.88, 25.12, 31.12 (xmin, xmax, ymin, ymax)
crs: +proj=longlat +a=6378137 +f=0.00335281066474748 +p...
time: 1995-01-01 to 1995-01-01
==================================================
values: 18,144 (vars*X*Y*T)
str(dap, max.level = 1)
List of 6
$ pr :Formal class 'SpatRaster' [package "terra"] with 1 slot
$ prate :Formal class 'SpatRaster' [package "terra"] with 1 slot
$ tas :Formal class 'SpatRaster' [package "terra"] with 1 slot
$ tasmax:Formal class 'SpatRaster' [package "terra"] with 1 slot
$ tasmin:Formal class 'SpatRaster' [package "terra"] with 1 slot
$ wind :Formal class 'SpatRaster' [package "terra"] with 1 slot
plot(rast(dap))
lai = dap('/Users/mjohnson/mosaic_lai_glass_1.tif',
          AOI = AOI)
{plot(lai)
plot(st_transform(AOI$geometry, crs(lai)), add = TRUE)
}
glimpse(opendap.catalog::params)
Rows: 14,691
Columns: 15
$ id <chr> "hawaii_soest_1727_02e2_b48c", "hawaii_soest_1727_02e2_b48c"…
$ grid_id <chr> "73", "73", "73", "73", "73", "73", "73", "73", "73", "73", …
$ URL <chr> "https://apdrc.soest.hawaii.edu/erddap/griddap/hawaii_soest_…
$ tiled <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ variable <chr> "nudp", "nusf", "nuvdp", "nuvsf", "nvdp", "nvsf", "sudp", "s…
$ varname <chr> "nudp", "nusf", "nuvdp", "nuvsf", "nvdp", "nvsf", "sudp", "s…
$ long_name <chr> "number of deep zonal velocity profiles", "number of surface…
$ units <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ model <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ ensemble <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ scenario <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ T_name <chr> "time", "time", "time", "time", "time", "time", "time", "tim…
$ duration <chr> "2001-01-01/2022-01-01", "2001-01-01/2022-01-01", "2001-01-0…
$ interval <chr> "365 days", "365 days", "365 days", "365 days", "365 days", …
$ nT <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, …
With over 14,000 web resources documented, there are simply too many to search through by hand unless you know exactly what you want, which precludes serendipitous discovery. So, we have added a general fuzzy search tool to help discover datasets.
Say you want to find what resources there are for daily rainfall? search and search_summary can help:
search("daily precipitation") |>
search_summary() |>
::datatable() DT
s = search('gridmet daily precipitation')) (
# A tibble: 1 × 16
id grid_id URL tiled variable varname long_name units model ensemble
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 gridmet 176 http://… "" pr precip… pr mm <NA> <NA>
# … with 6 more variables: scenario <chr>, T_name <chr>, duration <chr>,
# interval <chr>, nT <dbl>, rank <dbl>
Now that you know the precipitation product you want, let's say you want to study the rainfall seen during Hurricane Harvey over Texas and Louisiana:
(harvey = dap(catalog = s,
              AOI = aoi_get(state = c("TX", "LA")),
              startDate = "2017-08-17",
              endDate = "2017-09-03"))
source: http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg...
varname(s):
> precipitation_amount [mm] (pr)
==================================================
diminsions: 430, 257, 18 (names: lon,lat,day)
resolution: 0.042, 0.042, 1 days
extent: -106.66, -88.75, 25.8, 36.5 (xmin, xmax, ymin, ymax)
crs: +proj=longlat +a=6378137 +f=0.00335281066474748 +p...
time: 2017-08-17 to 2017-09-03
==================================================
values: 1,989,180 (vars*X*Y*T)
$precipitation_amount_total
class : SpatRaster
dimensions : 257, 430, 18 (nrow, ncol, nlyr)
resolution : 0.04166667, 0.04166667 (x, y)
extent : -106.6625, -88.74583, 25.79583, 36.50417 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +ellps=WGS84 +no_defs
source : memory
names : 2017-08-17, 2017-08-18, 2017-08-19, 2017-08-20, 2017-08-21, 2017-08-22, ...
min values : 0, 0, 0, 0, 0, 0, ...
max values : 106.3, 56.8, 62.9, 66.0, 55.3, 121.1, ...
unit : mm, mm, mm, mm, mm, mm, ...
plot(harvey[[1]])
harvey_counties = aoi_get(state = c("TX", "LA"), county = "all")
{plot(sum(harvey[[1]]) * 0.001,
main = "Meters of rainfall: Hurricane Harvey")
plot(harvey_counties$geometry, add = TRUE)
}
One of the key features of dap is the capacity to make requests over tiled resources. Some resources (like MODIS) are tiled in space, and some (like MACA and LOCA) are tiled in time. Below we see an example of how a single dap call can consolidate a request that covers multiple spatial tiles:
dap = dap(
  catalog = search("MOD16A2.006 PET"),
  AOI = AOI::aoi_get(state = "FL"),
  startDate = "2010-01-01",
  endDate = "2010-01-31"
)
source: https://opendap.cr.usgs.gov/opendap/hyrax/MOD16A2.006/h10v05...
tiles: 3 XY_modis tiles
varname(s):
> PET_500m [kg/m^2/8day] (MODIS Gridded 500m 8-day Composite potential Evapotranspiration (ET))
==================================================
diminsions: 1383, 1586, 5 (names: XDim,YDim,time)
resolution: 463.313, 463.313, 8 days
extent: -8417233.78, -7776472.29, 2712464.3, 3447278.27 (xmin, xmax, ymin, ymax)
crs: +proj=sinu +lon_0= +x_0= +y_0= +units=m +a=6371007...
time: 2010-01-02 to 2010-02-03
==================================================
values: 10,967,190 (vars*X*Y*T)
plot(dap)
Often getting spatial data is not the end goal; in many cases producing area-based summaries is critical. zonal is a package that wraps the excellent exactextractr package with some added performance and convenience for these types of tasks. To use zonal, you must supply a raster dataset (SpatRaster), a geometry set to summarize over, and a summary function (similar to nwmTools::aggregate_*). Lastly, you must provide the column with a unique identifier for each polygon object, and whether the summary data should be joined (join = TRUE) to the original geometry data.
# remotes::install_github("NOAA-OWP/zonal")
library(zonal)
# Summary options
zonal::ee_functions()
[1] "min" "max"
[3] "count" "sum"
[5] "mean" "median"
[7] "quantile" "mode"
[9] "majority" "minority"
[11] "variety" "variance"
[13] "stdev" "coefficient_of_variation"
[15] "weighted_mean" "weighted_sum"
[17] "frac" "weighted_frac"
zonal::weight_functions()
collapse base
1 fmean mean
2 fmedian median
3 fmode mode
4 fsum sum
5 fprod prod
6 fsd sd
7 fvar var
8 fmin min
9 fmax max
10 fnth nth
11 ffirst first
12 flast last
13 fnobs nobs
14 fndistinct ndistinct
For big data processing, zonal offers a unique ability to produce and work with weight grids; however, this is out of scope for this workshop.
Returning to the Hurricane Harvey example, imagine you wanted the county-wide total rainfall for each day:
library(zonal)
system.time({
  summary_data = execute_zonal(harvey,
                               geom = harvey_counties,
                               fun = "sum",
                               ID = "fip_code",
                               join = TRUE)
})
user system elapsed
0.462 0.072 0.567
plot(summary_data[grepl("sum", names(summary_data))], border = NA,
max.plot = 16)
Often the needed data is not delivered by a service or function. This does not mean we are out of luck, but rather that we have to do a little of the “heavy lifting” ourselves. This final section highlights how to read tabular, vector, and raster data from URLs using the http/https and s3 protocols.
These approaches are advantageous when you don’t want to download data locally for analysis.
Many CSV readers in R have the ability to read from a URL. If your desired data is stored in CSV you have a number of options:
base::read.csv
readr::read_csv
data.table::fread
One example dataset for which there is no web service client is the USACE National Inventory of Dams (NID). This data is stored as a CSV file.
library(readr)
dams = read_csv('https://nid.usace.army.mil/api/nation/csv',
                skip = 1)
glimpse(dams)
Rows: 92,029
Columns: 74
$ `Dam Name` <chr> "Aichi Forge Usa Dam", "Dod …
$ `Other Names` <chr> "Aichi Forge Usa Lake", NA, …
$ `Former Names` <chr> NA, "Dod Usa", NA, "Dod Usa"…
$ `NID ID` <chr> "KY00728", "OK20994", "ND011…
$ `Other Structure ID` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal ID` <chr> "KY00728", "OK20994", "ND011…
$ `Owner Names` <chr> "AICHI", "DEPT. OF DEFENSE/U…
$ `Owner Types` <chr> "Private", "Federal", "Priva…
$ `Primary Owner Type` <chr> "Private", "Federal", "Priva…
$ `State or Federal Agency ID` <chr> "KY00728", NA, NA, NA, NA, N…
$ `Number of Associated Structures` <dbl> 0, 0, NA, 0, 0, 0, 0, 0, 0, …
$ `Designer Names` <chr> "AUSTIN ENGINEERS, INC. SOUT…
$ `Non-Federal Dam on Federal Property` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `State Regulated Dam` <chr> "Yes", "No", "Yes", "No", "Y…
$ `State Jurisdictional Dam` <chr> "Yes", "Yes", "Yes", "Yes", …
$ `State Regulatory Agency` <chr> "KY Division of Water", NA, …
$ `State Permitting Authority` <chr> "Yes", "No", "Yes", "No", "Y…
$ `State Inspection Authority` <chr> "Yes", "No", "Yes", "No", "Y…
$ `State Enforcement Authority` <chr> "Yes", "No", "Yes", "No", "Y…
$ `Source Agency` <chr> "Kentucky", "Oklahoma", "Nor…
$ Latitude <dbl> 38.28828, 35.70341, 47.15639…
$ Longitude <dbl> -84.55619, -95.13638, -102.7…
$ County <chr> "Scott", "Muskogee", "Dunn",…
$ State <chr> "Kentucky", "Oklahoma", "Nor…
$ City <chr> "DELAPLAIN-AREA", "ARROWHEAD…
$ `Distance to Nearest City (Miles)` <dbl> 1, 6, NA, 1, 4, 4, 4, NA, 0,…
$ `River or Stream Name` <chr> "TR-DRY RUN", "TR-WEST SPANI…
$ `Congressional District` <chr> NA, NA, "Congressional Distr…
$ `Section, Township, or Range Location` <chr> NA, "12", NA, "21", "13;1N;5…
$ `Federal Agency Owners` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Funding` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Design` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Construction` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Regulatory` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Inspection` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Operation` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Other` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Primary Purpose` <chr> "Other", "Recreation", "Othe…
$ Purposes <chr> "Other", "Recreation", "Othe…
$ `Primary Dam Type` <chr> "Earth", "Earth", NA, "Earth…
$ `Dam Types` <chr> "Earth", "Earth", NA, "Earth…
$ `Core Types` <chr> NA, "Earth", NA, "Earth", "U…
$ Foundation <chr> NA, "Soil", NA, "Soil", "Unl…
$ `Dam Height (Ft)` <dbl> 15.0, 18.0, 29.0, 15.0, 18.0…
$ `Hydraulic Height (Ft)` <dbl> NA, 10.0, NA, 12.0, NA, NA, …
$ `Structural Height (Ft)` <dbl> NA, 18.0, NA, 15.0, NA, NA, …
$ `NID Height (Ft)` <dbl> 15.0, 18.0, 29.0, 15.0, 18.0…
$ `NID Height Category` <chr> "Less than 25 feet", "Less t…
$ `Dam Length (Ft)` <dbl> 325, 340, 1740, 240, 700, 80…
$ `Volume (Cubic Yards)` <dbl> NA, 0, NA, 0, 0, 0, 0, NA, 0…
$ `Year Completed` <dbl> 1974, 1939, NA, 1940, 1975, …
$ `Year Completed Category` <chr> "1970-1979", "1930-1939", "U…
$ `NID Storage (Acre-Ft)` <dbl> 86.8, 50.0, 682.6, 50.0, 394…
$ `Max Storage (Acre-Ft)` <dbl> 86.8, 50.0, 682.6, 50.0, 394…
$ `Normal Storage (Acre-Ft)` <dbl> 26.6, 25.0, NA, 25.0, 180.0,…
$ `Surface Area (Acres)` <dbl> 8.00, 5.00, 37.50, NA, 31.80…
$ `Drainage Area (Sq Miles)` <dbl> 1.35, NA, NA, NA, 0.14, 1.23…
$ `Max Discharge (Cubic Ft/Second)` <dbl> NA, 200, NA, 20, 2930, 230, …
$ `Spillway Type` <chr> "Uncontrolled", "Controlled"…
$ `Spillway Width (Ft)` <dbl> NA, 1, NA, 1, 0, 0, 2, NA, 1…
$ `Number of Locks` <dbl> NA, 0, NA, 0, 0, 0, 0, 0, 0,…
$ `Length of Locks` <dbl> NA, 0, NA, 0, 0, 0, 0, 0, 0,…
$ `Lock Width (Ft)` <dbl> NA, 0, NA, 0, 0, 0, 0, 0, 0,…
$ `Years Modified` <chr> NA, NA, NA, NA, NA, NA, "199…
$ `Outlet Gate Type` <chr> NA, "Valve1", NA, "Valve1", …
$ `Data Last Updated` <date> 2021-04-09, 2018-06-22, 202…
$ `Last Inspection Date` <date> 2013-04-30, 1998-10-01, NA,…
$ `Inspection Frequency` <dbl> 5, 5, NA, 5, 5, 5, 1, 2, 5, …
$ `Hazard Potential Classification` <chr> "Low", "Low", "Low", "Low", …
$ `Condition Assessment` <chr> "Not Rated", "Not Rated", "N…
$ `Condition Assessment Date` <date> NA, NA, NA, NA, NA, NA, NA,…
$ `EAP Prepared` <chr> "Not Required", "Not Require…
$ `EAP Last Revision Date` <date> NA, NA, NA, NA, NA, NA, 201…
$ `Website URL` <chr> "https://damsafety.org/kentu…
GDAL provides a number of Virtual File System drivers to read compressed and network data.
Three of the most useful for common spatial data tasks include:
vsizip –> file handler that allows reading ZIP archives on-the-fly without decompressing them beforehand.
vsicurl –> a generic file system handler for online resources that do not require particular signed authentication schemes.
vsis3 –> specialized file handlers for commercial cloud storage services (e.g. AWS S3; also see /vsigs/, /vsiaz/, /vsioss/ and /vsiswift/).
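These prefixes are chained by simple string construction; for example (the URL and layer names here are purely illustrative):

```r
# Chain /vsizip/ and /vsicurl/ to point GDAL at a shapefile
# inside a remote zip archive (illustrative URL)
url  <- "https://example.com/archive.zip"
path <- paste0("/vsizip/{/vsicurl/", url, "}/layer.shp")
path
```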
Duke University stores VRT files for the POLARIS soils dataset. terra can generate pointers to these datasets; however, data is not extracted until an operation is called on it. As seen above, dap provides simplified access to extract subsets of data from OpenDAP and VRT endpoints.
rast('/vsicurl/http://hydrology.cee.duke.edu/POLARIS/PROPERTIES/v1.0/vrt/alpha_mean_0_5.vrt')
class : SpatRaster
dimensions : 97200, 212400, 1 (nrow, ncol, nlyr)
resolution : 0.0002777778, 0.0002777778 (x, y)
extent : -125, -66, 23, 50 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source : alpha_mean_0_5.vrt
name : alpha_mean_0_5
dap('/vsicurl/http://hydrology.cee.duke.edu/POLARIS/PROPERTIES/v1.0/vrt/alpha_mean_0_5.vrt',
    AOI = AOI::aoi_get(state = "CO", county = "Larimer")) |>
  plot()
Here is an example where a global elevation dataset can quickly be used in conjunction with the NLDI client to plot the elevation transect of the main river channel draining to the USGS office in Fort Collins, CO!
d = findNLDI(loc = geocode("2150 Centre Ave, Fort Collins, CO 80526", pt = TRUE),
             nav = "UM",
             find = c("flowlines", "basin"))
mapview(d)
= dap("/vsicurl/https://opentopography.s3.sdsc.edu/raster/COP30/COP30_hh.vrt",
COP30 AOI = d$basin)
= extract(COP30, vect(st_union(d$UM_flowlines))) |>
transect mutate(location = c(n():1))
ggplot(transect, aes(x = location, y = COP30_hh)) +
geom_line() +
geom_smooth()
Since the base data reader of sf is GDAL, the VSI capabilities also apply to vector data! The added challenge is that vector data (especially shapefile data) is often stored in zip files, which adds an extra layer of complexity when reading these datasets remotely.
In this example we look at the US Census Bureau's FTP server for TIGER road lines. The data is stored by 5-digit (county) FIPS code in zip files. To access this data we need to identify the FIPS code of interest and then chain a vsizip with a vsicurl call.
= aoi_get(state = "AL", county = "Tuscaloosa")
AOI = paste0('tl_2021_',AOI$fip_code, '_roads')
file = paste0('/vsizip/{/vsicurl/https://www2.census.gov/geo/tiger/TIGER2021/ROADS/', file, '.zip}/',file,'.shp')
url
system.time({
  roads = read_sf(url)
})
user system elapsed
0.201 0.040 4.975
mapview(roads)