Working with Geospatial Hydrologic Data Using Web Services

Impact of Dams on Streamflow - R

Mike Johnson https://github.com/mikejohnson51 (Lynker)https://www.lynker.com/solutions/water-environmental-resources/ , Taher Chegini https://cheginit.github.io/ (University of Houston)https://www.lynker.com/solutions/water-environmental-resources/ , Marc Weber https://mhweber.github.io (US Environmental Protection Agency)

This working example is part 4 of the Internet of Water Workshop on Geospatial data and web services in R. The first three sections can be found here. # Introduction

This working example is part 4 of the Internet of Water Workshop on Geospatial data and web services in R. The first three sections can be found here.

The primary goal of this hands-on tutorial is to introduce a handful of geospatial web services for conducting scientific studies. For this purpose, we’re going to take a look at the effects of building dams on stream flow. Here are some example peer-reviewed papers on this topic.

library(AOI)
library(sf)             # vector data
library(terra)          # raster data
library(dplyr)          # data manipulation
library(dataRetrieval)  # USGS data access
library(ggplot2)        # data visualization
library(opendap.catalog)# VRT support
library(nhdplusTools)   # Data access

Data Gathering

Streamflow Gauges

We set the area of interest (AOI) for this study to Texas and study dams that have been built in the 1995-2005 period.

# Define AOI
texas = aoi_get(state = "TX")

# Instantiate plot
ggplot() +
  # Add Texas POLYGON
  geom_sf(data = texas, color = "red", fill = "blue") +
  # Apply theme
  theme_light()

We use the National Water Information System (NWIS) service to check stream flow availability in Texas, for the period 10 years before and after the 1985-2015 period of interest.

# Query NWIS site data

nwis = whatNWISdata(
  # State to search
  stateCd = "TX",
  # Date range
  startDt = "1985-01-01", endDt = "2015-01-01",
  # Daily Values
  outputDataTypeCd = "dv",
  # Stream flow
  parameterCd = "00060",
  # Mean statistic
  statCd = '00003')

glimpse(nwis)
#> Rows: 650
#> Columns: 24
#> $ agency_cd          <chr> "USGS", "USGS", "USGS", "USGS", "USGS", "…
#> $ site_no            <chr> "07227420", "07227500", "07227890", "0722…
#> $ station_nm         <chr> "Cramer Ck at US Hwy 54 nr Dalhart, TX", …
#> $ site_tp_cd         <chr> "ST", "ST", "ST", "ST", "ST", "ST", "ST",…
#> $ dec_lat_va         <dbl> 35.75125, 35.47033, 35.72078, 35.66476, 3…
#> $ dec_long_va        <dbl> -102.89317, -101.87963, -101.66253, -101.…
#> $ coord_acy_cd       <chr> "S", "F", "1", "F", "F", "F", "F", "F", "…
#> $ dec_coord_datum_cd <chr> "NAD83", "NAD83", "NAD83", "NAD83", "NAD8…
#> $ alt_va             <dbl> 3962.24, 2989.16, 2933.19, 2835.84, 2301.…
#> $ alt_acy_va         <dbl> 0.01, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10,…
#> $ alt_datum_cd       <chr> "NGVD29", "NGVD29", "NAVD88", "NGVD29", "…
#> $ huc_cd             <chr> "11090102", "11090105", "11090105", "1109…
#> $ data_type_cd       <chr> "dv", "dv", "dv", "dv", "dv", "dv", "dv",…
#> $ parm_cd            <chr> "00060", "00060", "00060", "00060", "0006…
#> $ stat_cd            <chr> "00003", "00003", "00003", "00003", "0000…
#> $ ts_id              <chr> "131754", "131763", "131775", "131784", "…
#> $ loc_web_ds         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ medium_grp_cd      <chr> "wat", "wat", "wat", "wat", "wat", "wat",…
#> $ parm_grp_cd        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ srs_id             <chr> "1645423", "1645423", "1645423", "1645423…
#> $ access_cd          <chr> "0", "0", "0", "0", "0", "0", "0", "0", "…
#> $ begin_date         <date> 2007-10-01, 1938-04-01, 2010-02-26, 1974…
#> $ end_date           <date> 2020-12-15, 2022-07-18, 2022-07-18, 1989…
#> $ count_nu           <dbl> 4825, 30784, 4522, 5692, 30790, 20912, 22…

We can see that there are 650 stream flow gauges in Texas that fit our criteria. Now, let’s filter these stations a step further to include only those stations that have around 30 years of daily stream flow data with drainage area of larger than 10 km2 that have been impacted by human activities.

Since upstream drainage area is not part of the returned NWIS values, we need to join our gauge data to another dataset that contains this information - for example the GAGESII dataset!

The nhdplusTools packages provides access to subset the gagesII data for our AOI. We can then further refine the outputs by HDCN class, sqkm, and record count.

sites = 
  # Read data for AOI
  get_gagesII(AOI = texas)  %>%
  # Find those that the reference classification is NA and the drainage area is greater then 10
  filter(hcdn_2009 == "", drain_sqkm > 10) %>%
  # Join to the site IDs returned from NWIS
  inner_join(nwis, by = c("staid" = "site_no")) %>%
  # Find those that have at least 30 years of data
  filter(count_nu > 30*365) |> 
  # Send to a projected CRS
  st_transform(5070)
  

We can see that there are 280 stream flow gauge stations in Texas that fit our criteria.

# Instantiate plot
ggplot() +
  # Add Texas POLYGON
  geom_sf(data = texas, color = "red", fill = "blue") +
  # Add sites to map
  geom_sf(data = sites, pch = 8, color = "white") +
  # Apply theme
  theme_light()

Dams

Next, we need to retrieve the dams in Texas that have been built between 1995 and 2005. For this exercise, their is not a function based web service for getting the latest (2019) National Inventory of Dams (NID) data from the US Army Corps. However, the USACE exposes the data as a CSV file online meaning we can read it directly into our work session:

dams = 
  # Read data from URL
  read.csv('https://nid.usace.army.mil/api/nation/csv', skip = 1) %>%
  # Find those within year range, the state of Texas, and of a given size
  filter(between(Year.Completed, 1995, 2005), State == "Texas") %>%
  # convert table to sf object
  st_as_sf(coords = c('Longitude', 'Latitude'), crs = 4326)  %>%
  # Keep only relevant columns and rename
  select(name      = Dam.Name, 
         nidid     =  NID.ID, 
         storage   = NID.Storage..Acre.Ft., 
         height    = Dam.Height..Ft.,
         completed = Year.Completed) %>%
  # Send to a projected transformation
  st_transform(5070)

# Instantiate plot
ggplot() +
  # Add Texas POLYGON
  geom_sf(data = texas, color = "red", fill = "blue") +
  # Add sites to map
  geom_sf(data = sites, pch = 8, color = "white") +
  # Add dams to map
  geom_sf(data = dams, pch = 16, color = "darkgreen") +
  # Apply theme
  theme_light()

As is evident from the plot above, there are many gauge locations that don’t have any dams in their vicinity. One way to eliminate these stations is using a spatial query based on a search radius. We can determine an estimation for our search radius based on the upstream drainage area distribution of the stream flow gauges.

ggplot(data = sites, aes(x = drain_sqkm)) +
  # Generate a histogram
  geom_histogram() + 
  # Add vertical (v) line
  geom_vline(xintercept = 10000, color = "red") +
  # Add labels
  labs(title = "Drainage Area (km2) Histogram")

We see that most stations have a drainage area of less than 10,000 km2. Now, we define a function that carries out an efficient spatial mapping to find the stations and dams that have at least one (other) feature within a 10-km radius.

distance_filter = function(dams, gages, dist_m){
  # calculate the pairwise distance between gages and dams
  # Find boolean condition if distance is greater then dist_m
  # Extract the row/col indexes
  map = which(st_distance(dams, gages) < units::as_units(dist_m, "m"), arr.ind = TRUE) 
  
  # use the index values to extract IDs
  map = data.frame(NID = dams$nidid[map[,1]], 
                   NWIS = gages$staid[map[,2]])
  
  # use IDs to filter dams and gages
  return(list(dams = filter(dams, nidid %in% map$NID),
              sites = filter(gages, staid %in% map$NWIS)))
}

# Execute function to get a list of features within proximety of 10000m
proxy = distance_filter(dams, gages = sites, dist_m = 10000)

# How many of each feature type?
lengths(proxy)
#>  dams sites 
#>     6    39
# Instantiate plot
ggplot() +
  # Add Texas POLYGON
  geom_sf(data = texas, color = "red", fill = "blue") +
  # Add sites to map
  geom_sf(data = proxy$sites, pch = 8, color = "white") +
  # Add dams to map
  geom_sf(data = proxy$dams, pch = 16, color = "darkgreen") +
  # Apply theme
  theme_light()

Hydrolinking

So far, we obtained only the stations that have at least one dam in their 10-km radius, but we didn’t find if those dams are in their upstream or downstream of the associated gauge. We use the Hydro Network-Linked Data Index (NLDI) web service to obtain the upstream flowlines of the stream flow gauges up to 10 km. Note that this 10 km is the distance along the flowlines.

crosswalk = list()
proxy$dams$dam_comid = NA
proxy$sites$usptreamDam = NA

# Discover NHDPlus catchment comid of each dam 
for(i in 1:nrow(proxy$dams)){
  proxy$dams$dam_comid[i] = findNLDI(location = st_transform(proxy$dams[i,], 4326))$comid
}

# 
for(i in 1:nrow(proxy$sites)){
  
 # Find upstream comids of each gauge within 10km
 UT = findNLDI(nwis = proxy$site$staid[i],
                nav = "UT",
                distance_km = 10)$UT_flowlines$nhdplus_comid
 
 # Mark upstream (us) dams by COMID matching
 us = proxy$dams$nidid[which(proxy$dams$dam_comid %in% UT)]

 # If no dams upstream report NA, else report dam ID
 us = if(length(us) > 0){
   us
 } else {
   NA
 }
 
 # Save index of nwis to nid relationships
 crosswalk[[i]] = data.frame(staid = proxy$site$staid[i], nidid = us)

}

# Merge crosswalk with dams
crosswalk = inner_join(bind_rows(crosswalk), proxy$dams, by = "nidid")
# Merge crosswalk with gages
crosswalk = inner_join(crosswalk, proxy$sites, by = "staid")

# Reduce crosswalk
crosswalk = select(crosswalk, c("staid", dam_comid, nidid, storage, height, name,  completed,  begin_date, end_date))

# View
datatable(crosswalk)

Plot/Evaluate

Upon finalizing the stations that satisfy our criteria, we use NWIS to obtain the daily stream flow data.

We can use this data to make a series of plots showing the annual mean stream flow values overlaid with the years a dam was constructed. We can do this iteritively for each identified gauge:

# Empty list to add plots to...
p = list()

# For each unique site ID
for(i in unique(crosswalk$staid)){
  
  # year of completion minus 1
  years_of_construction = filter(crosswalk, staid == i)$completed - 1
  # Find the 15 year range 
  y = range(years_of_construction - 15, years_of_construction + 15)
  
  # Extact complete NWIS record for streamflow at each site
  flows = readNWISdv(site = i, parameterCd  = "00060") |> 
    # Rename columns
    renameNWISColumns() |> 
    # Strip the Year from each data and group by that year
    group_by(year = as.numeric(format(Date, '%Y'))) |> 
    # Determine the mean flow in each year and convert to CMS
    summarise(meanQ = mean(Flow, na.rm = TRUE) * 0.028316846592,
              # Count number of observations in each year
              n = n()) |> 
    # Only keep data for years within our range and that have 300+ records
    filter(between(year, min(y), max(y)), n > 300) |> 
    # Join to all years (add NAs for years not observered or cut)
    right_join(data.frame(year = min(y):max(y)), by = "year")
  
  # make plot and add to p list
  p[[length(p)  + 1]] = 
    ggplot(data = flows) +
    geom_line(aes(x = year, y = meanQ )) + 
    geom_vline(xintercept = years_of_construction, color  = "red") + 
    labs(title = i, x = "Year", y = "Flow (cfs)")
}

# Combine and render plots:
do.call("grid.arrange", c(p, ncol = 2))  

Influence of Land cover

We can see that based on the available data and visual inspection, the first station shows a noticeable difference before and after the dam construction. Next we take a closer look at this station, USGS-07312200 to find if there is a visible reason for why these dams were added in the land cover.

To do this we identify the up and downstream networks (dataRetrival), and find the associated catchments (nhdplusTools)

# Use NLDI to find up and down stream flowline of the dam COMID within 30 km
region = findNLDI(comid = 13730353, nav = c("UT", "DD"), distance_km = 30)

# Used the returned data to extract catchment realizations of the NHDPlus
catchments = get_nhdplus(comid = c(region$DD_flowlines$nhdplus_comid,
                                   region$UT_flowlines$nhdplus_comid),
                                   realization = "catchment")

We can then use dap (opendap.catalog) to subset a cloud native TIF file with vsicurl, and plot the returned data:

# Subset cloud-native TIF file of NLCD

lc = dap("/vsicurl/https://storage.googleapis.com/feddata-r/nlcd/2019_Land_Cover_L48.tif", 
         AOI = catchments)

# Transform catchments to the projection of the NLCD
catchments_proj = st_transform(catchments, crs(lc))

# Mask land cover data with the catchments
masked_data = mask(lc, vect(catchments_proj))

# Plot!
{
  plot(masked_data)
  plot(dams$geometry, add = TRUE, pch = 25, bg = "red")
  plot(catchments_proj$geometry, add = TRUE)
}

Last, we might be interested in the percent of land cover up and down stream of the dam(s). We can find these with the data already queried:

# Downstream catchments
ds_cat = filter(catchments_proj, 
                featureid %in% region$DD_flowlines$nhdplus_comid)

# Mask Land cover to Downstream catchments, and compute land cover frequency
ds_frac = freq(mask(lc, vect(ds_cat))) %>%
  arrange(-count)  %>%
  select(value, count) %>%
  mutate(DD_sqkm = count * prod(res(lc)) / 1e6, 
         DD_percentage = round(DD_sqkm / sum(DD_sqkm), 2),
         DD_sqkm = NULL, count = NULL)

# Upstream catchments
us_cat = filter(catchments_proj, 
                featureid %in% region$UT_flowlines$nhdplus_comid)

# Mask Land cover to Upstream catchments, and compute land cover frequency
us_frac = freq(mask(lc, vect(us_cat))) %>%
  arrange(-count)  %>%
  select(value, count) %>%
  mutate(UT_sqkm = count * prod(res(lc)) / 1e6, 
         UT_percentage = round(UT_sqkm / sum(UT_sqkm), 2),
         UT_sqkm = NULL, count = NULL,)

# View:
full_join(ds_frac, us_frac) %>% 
  mutate(`delta(DD - UT)` = DD_percentage - UT_percentage) %>%
  filter(`delta(DD - UT)` != 0) %>%
  datatable() %>%
  formatStyle(
    'delta(DD - UT)',
    color = styleInterval(0, c('red', 'green'))
  )

Notably we see that the upstream area has significantly more grassland then downstream, while the downstream catchments have significantly more cropland. The implications is that these dams were constructed to support the transition of grassland to agricultural land.

Conclusions

While a small example, this workflow shows how the details learned today can be pieced together to answer complex and meaningful questions in a programmatic, reproducible, and lightweight way.