This document carries us through the basics of finding and accessing data from the internet using R. While the details (e.g. specific functions) are R-based, the concepts apply to all programming languages.
Section 1: R basics, and foundation spatial libraries
Section 2: Key hydrology based packages (focus on data access)
Section 3: Web Data without services
Section 4: Hands on example
Install guides for R and RStudio can be found online.
R packages are extensions to the base R statistical programming language. Packages contain “code, data, and documentation that can be installed by users of R” [Wikipedia].
install.packages("name")
can install packages from CRAN.

remotes::install_github("user/repo")
can install packages from GitHub that are not released to CRAN (development versions, personal repos, etc.).
Packages are loaded (attached) into a working environment by calling library(name). Once attached, all objects (code, data, and documentation) of that package are exposed for use.
Syntax
Functions are code snippets that can be reused for common and/or complex tasks. Functions can be developed and shared in packages.
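As a minimal sketch, a user-defined function bundles a repeated task (the function name and conversion here are illustrative, not from any package):

```r
# A small reusable function: convert Fahrenheit to Celsius
f_to_c <- function(f) {
  (f - 32) * 5 / 9
}

f_to_c(212)  # boiling point of water -> 100
```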
In R, package::function is the syntax for calling functions from a package that has not been attached, or that has a conflicting function name. Example: sf::read_sf() calls the read_sf function from the sf package.
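To see the difference without installing anything, the base stats package (attached by default) works the same way:

```r
# Qualified call: works even if the package is not attached
stats::median(c(1, 5, 9))
# [1] 5

# After attaching, the function name alone is enough
library(stats)
median(c(1, 5, 9))
# [1] 5
```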
Preceding a function with a ? brings up the help page for that function:
?sf::read_sf
Hashes (#) inserted before lines of code are comments; they are present for human use, not machine interpretation.
# this is a comment, below is code
2 + 2
[1] 4
We are going to discuss spatial data for hydrology in two contexts.
“Vector” data are comprised of points, lines, and polygons that represent discrete spatial entities, such as a river, watershed, or stream gauge.
“Raster” data divides spaces into rectilinear cells (pixels) to represent spatially continuous phenomena, such as elevation or the weather. The cell size (or resolution) defines the fidelity of the data.
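To make the two models concrete, here is a minimal sketch (using sf and terra, introduced below) that builds a toy vector point and a toy raster grid; the coordinates and values are made up for illustration:

```r
library(sf)
library(terra)

# Vector: a single point feature (e.g., a hypothetical stream gauge)
gauge <- st_sfc(st_point(c(-105.1, 40.0)), crs = 4326)

# Raster: a 10 x 10 grid of cells over the same area
# (a toy "elevation" surface with random values)
elev <- rast(nrows = 10, ncols = 10,
             xmin = -106, xmax = -105, ymin = 39.5, ymax = 40.5)
values(elev) <- runif(ncell(elev), 1500, 3000)
```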
R's spatial packages bind to a set of core open source geospatial libraries. These include:
PROJ –> projections, transformations
GEOS –> Geometry operations (measures, relations)
GDAL –> geodata abstraction and processing (read, write)
Simple Features (officially Simple Feature Access) is both an OGC and International Organization for Standardization (ISO) standard that specifies how (mostly) two-dimensional geometries can represent and describe objects in the real world.
It describes how such objects can be stored in and retrieved from databases, and which geometrical operations should be defined for them.
It outlines how the spatial elements of POINTs (XY locations with a specific coordinate reference system) extend to LINESTRINGs, POLYGONs, and GEOMETRYCOLLECTIONs.
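A quick sketch of these geometry building blocks using sf (covered next); the coordinates are arbitrary:

```r
library(sf)

# Simple feature geometries (sfg) of increasing complexity
pt   <- st_point(c(0, 1))
line <- st_linestring(rbind(c(0, 0), c(1, 1)))
poly <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 0))))

# A GEOMETRYCOLLECTION groups heterogeneous geometries
gc <- st_geometrycollection(list(pt, line, poly))
st_geometry_type(st_sfc(gc))
```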
In R, the sf package provides “support for simple features, a standardized way to encode spatial vector data… [and] binds to ‘GDAL’ for reading and writing data, to ‘GEOS’ for geometrical operations, and to ‘PROJ’ for projection conversions and datum transformations.”
Therefore, when using R you are using an interface to the core community standards, software, and practices (this is not exclusive to R). To highlight this, we can install (do this once) and attach sf to view the versions of the external libraries linked to sf.
# install.packages("sf")
library(sf)
sf_extSoftVersion()
GEOS GDAL proj.4 GDAL_with_GEOS USE_PROJ_H
"3.10.2" "3.4.2" "8.2.1" "false" "true"
PROJ
"8.2.1"
The bindings to these lower-level C libraries, and the larger sf ecosystem in R, can be seen below:
What is novel about the sf implementation in the R ecosystem is the way in which the authors built on the list structure in R to store simple feature geometries (sfg) as part of a larger data.frame using a simple feature geometry list-column (sfc). The collection of attribute and spatial information defines a simple feature that can be operated on in both table (SQL) and spatial (GEOS, etc.) contexts. Not only does this allow us to make the most use of the growing spatial community, but also of the growing data science community (see ggplot2, dplyr, data.table, dbplyr, arrow, etc.).
In practice, an sf object in R looks like the following:
This extends the idea of “tidy” data in that each row represents one observation, which has one geometric representation of the real world feature it describes.
# Define file path
(filename <- system.file("shape/nc.shp", package = "sf"))
[1] "/Library/Frameworks/R.framework/Versions/4.2/Resources/library/sf/shape/nc.shp"
# read in file
(nc <- read_sf(filename))
Simple feature collection with 100 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
Geodetic CRS: NAD27
# A tibble: 100 × 15
AREA PERIMETER CNTY_ CNTY_ID NAME FIPS FIPSNO CRESS_ID BIR74 SID74 NWBIR74
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 10
2 0.061 1.23 1827 1827 Alle… 37005 37005 3 487 0 10
3 0.143 1.63 1828 1828 Surry 37171 37171 86 3188 5 208
4 0.07 2.97 1831 1831 Curr… 37053 37053 27 508 1 123
5 0.153 2.21 1832 1832 Nort… 37131 37131 66 1421 9 1066
6 0.097 1.67 1833 1833 Hert… 37091 37091 46 1452 7 954
7 0.062 1.55 1834 1834 Camd… 37029 37029 15 286 0 115
8 0.091 1.28 1835 1835 Gates 37073 37073 37 420 0 254
9 0.118 1.42 1836 1836 Warr… 37185 37185 93 968 4 748
10 0.124 1.43 1837 1837 Stok… 37169 37169 85 1612 1 160
# … with 90 more rows, and 4 more variables: BIR79 <dbl>, SID79 <dbl>,
# NWBIR79 <dbl>, geometry <MULTIPOLYGON [°]>
# Map
plot(nc['SID79'])
# Spatial measures!
head(st_area(nc))
Units: [m^2]
[1] 1137107793 610916077 1423145355 694378925 1520366979 967504822
# Spatial operations!
{st_union(nc) |> plot()
st_centroid(nc)$geometry |> plot(col = "red", add = TRUE)
}
# data science operations
library(dplyr)
{plot(nc$geometry)
plot(slice_max(nc, AREA, n = 10)$geometry,
col = "red", add = TRUE)
plot(slice_max(nc, AREA, n = 5)$geometry,
col = "yellow", add = TRUE)
plot(slice_max(nc, AREA, n = 1)$geometry,
col = "green", add = TRUE)
}
While sf is a leading package for working with vector data, terra is a primary R package for working with raster data. terra also operates on vector data, but we focus on sf for those purposes.
terra builds on the older raster package and provides methods for low-level data manipulation as well as high-level global, local, zonal, and focal computations. The predict and interpolate methods facilitate the use of regression-type (interpolation, machine learning) models for spatial prediction, including with satellite remote sensing data. Processing of very large files is supported.
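A brief sketch of the global/local/focal vocabulary on a toy raster (values are arbitrary):

```r
library(terra)

# Toy raster for illustration
r <- rast(nrows = 10, ncols = 10)
values(r) <- 1:ncell(r)

# Global: one summary statistic for the whole layer
global(r, "mean")

# Local: cell-by-cell algebra
r2 <- r * 2 + 1

# Focal: a 3x3 moving-window mean
f <- focal(r, w = 3, fun = mean)
```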
Like sf, terra links to GDAL, and the version and drivers can be viewed using package support functions:
# install.packages("terra")
library(terra)
gdal()
[1] "3.4.2"
DT::datatable(gdal(drivers = TRUE))
(filename <- system.file("ex/meuse.tif", package = "terra"))
[1] "/Library/Frameworks/R.framework/Versions/4.2/Resources/library/terra/ex/meuse.tif"
(r <- rast(filename))
class : SpatRaster
dimensions : 115, 80, 1 (nrow, ncol, nlyr)
resolution : 40, 40 (x, y)
extent : 178400, 181600, 329400, 334000 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=sterea +lat_0=52.1561605555556 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +datum=WGS84 +units=m +no_defs
source : meuse.tif
name : test
min value : 138
max value : 1736
plot(r, main='SpatRaster')
mapview is a leaflet helper package designed to quickly and conveniently create interactive visualizations of spatial data.
# install.packages("mapview")
library(mapview)
mapview(nc)
Together, sf/terra/mapview provide much of the functionality needed to get the best of both a programmatic and GUI-driven GIS!
With a strong grasp of the basic spatial libraries for reading, writing, and manipulating spatial data, we can move on to packages that seek to support the hydrologic and earth systems communities.
AOI is NOT on CRAN but provides interfaces for geocoding and retrieving spatial boundaries (e.g. state, county, and country). When working with web-based subsetting services, the ability to quickly define common features is really convenient.
We’ll see through these examples that the concept of an AOI is key to requesting subsets of data. Further, we’ll see that a reproducible way of generating these is a quick way to make the syntax of disparate packages more aligned.
# remotes::install_github("mikejohnson51/AOI")
library(AOI)
# state
aoi_get(state = "CO") |> mapview()
# county
aoi_get(state = "AL", county = "Tuscaloosa") |> mapview()
# country
aoi_get(country = "Ukraine") |> mapview()
# geocoding
geocode("NOAA National Water Center", pt = TRUE) |> mapview()
nhdplusTools is an R package for working with the NHD and the NHD data model. The package is robust and covers 5 primary use topics.
Given the focus here on geospatial data via web services, we will look at functions supporting use case 2.
# install.packages('nhdplusTools')
library(nhdplusTools)
grep("get_", ls("package:nhdplusTools"), value = TRUE)
[1] "get_boundaries" "get_DD"
[3] "get_DM" "get_elev_along_path"
[5] "get_flowline_index" "get_gagesII"
[7] "get_hr_data" "get_huc12"
[9] "get_huc8" "get_hydro_location"
[11] "get_levelpaths" "get_nhdarea"
[13] "get_nhdplus" "get_nhdplushr"
[15] "get_nldi_basin" "get_nldi_characteristics"
[17] "get_nldi_feature" "get_nldi_index"
[19] "get_node" "get_nwis"
[21] "get_partial_length" "get_path_lengths"
[23] "get_pathlength" "get_pfaf"
[25] "get_raindrop_trace" "get_sorted"
[27] "get_split_catchment" "get_streamlevel"
[29] "get_streamorder" "get_terminal"
[31] "get_tocomid" "get_UM"
[33] "get_UT" "get_vaa"
[35] "get_vaa_names" "get_vaa_path"
[37] "get_waterbodies" "get_waterbody_index"
[39] "get_wb_outlet" "get_xs_point"
[41] "get_xs_points"
Here we are interested in getting data for an AOI around Boston. We use AOI::aoi_get to get the OpenStreetMap representation of Boston, and initialize an empty list to populate:
boston = list()
boston$AOI = aoi_get("Boston")

boston$nhdplus     <- get_nhdplus(AOI = boston$AOI)
boston$waterbodies <- get_waterbodies(AOI = boston$AOI)
boston$gages       <- get_nwis(AOI = boston$AOI)
boston$huc12       <- get_huc12(AOI = boston$AOI)
# How many features per entry?
sapply(boston, nrow)
AOI nhdplus waterbodies gages huc12
1 201 58 9 18
mapview(boston)
In both the NHDPlus data model and the emerging international OGC standard for representing surface water features (HY_Features), it’s recognized that hydrologic entities can be “realized” in a number of ways.
For example, the holistic notion of a ‘catchment’ can be realized as a ‘divide’, ‘flowpath’, ‘outlet’, or other representation. The nhdplusTools::get_nhdplus function supports getting three realizations of the NHD features that are consistent with the NHDPlus data model (outlet, flowline, or catchment), or all of them at once (all).
boulder <- get_nhdplus(aoi_get("boulder"), realization = "all")
mapview(boulder)
The USGS-supported dataRetrieval package provides retrieval functions for USGS and EPA hydrologic and water quality data. Recently, a Network Linked Data Index (NLDI) client was also added to the package.
# install.packages('dataRetrieval')
library(dataRetrieval)
# Supply gauge ids from Boston example
# Parameter codes can be found with (`dataRetrieval::parameterCdFile`)
# "00060" is stream flow
flows = readNWISdv(site = boston$gages$site_no, parameterCd = "00060") |>
  renameNWISColumns()

library(ggplot2)  # for plotting

ggplot(data = flows) +
  geom_line(aes(x = Date, y = Flow)) +
  facet_wrap('site_no')
The hydro Network-Linked Data Index (NLDI) is a system that can index spatial and river network-linked data and navigate the river network to allow discovery of indexed information.
The NLDI service requires an origin feature (comid, nwis, wqp, location, and more), a navigation mode, and the features to find. The available feature sources can be listed with the get_nldi_sources function:
DT::datatable(get_nldi_sources())
The NLDI client in dataRetrieval operates by specifying the origin, the navigation mode, and the features to find. In the following example, we want to navigate along the upstream tributary of COMID 101 (up to 1000 km) and find all flowline features, NWIS gages, and the contributing drainage area.
nldi = findNLDI(comid = 101,
                nav = "UT",
                find = c("flowline", "nwis", "basin"),
                distance_km = 1000)
sapply(nldi, nrow)
origin basin UT_nwissite UT_flowlines
1 1 28 1815
mapview(nldi)
While dataRetrieval provides retrieval functions for observed data, there is an increasing amount of simulated data available for use. One example is the 3 retrospective simulations of the National Water Model. This data certainly falls under the umbrella of “big data” and can be difficult to work with given each hourly file (for up to 42 years!) is stored in a separate, cloud-based file.
nwmTools provides retrieval functions for these retrospective streamflow data from the NOAA National Water Model reanalysis products:
# remotes::install_github("mikejohnson51/nwmTools")
library(nwmTools)
nwis = readNWISdv(site = gsub("USGS-", "", nldi$UT_nwissite$identifier[10]),
                  param = "00060") |>
  renameNWISColumns()

nwm = readNWMdata(comid = nldi$UT_nwissite$comid[10])
The timestep of the retrospective NWM data is hourly. In many cases, it's more desirable to have the data aggregated to a different temporal resolution. nwmTools provides a family of aggregation functions that allow you to specify the aggregation time period and the aggregation summary statistic.
grep('aggregate_', ls("package:nwmTools"), value = TRUE)
[1] "aggregate_dowy" "aggregate_j" "aggregate_m" "aggregate_record"
[5] "aggregate_s" "aggregate_wy" "aggregate_wym" "aggregate_wymd"
[9] "aggregate_wys" "aggregate_y" "aggregate_yj" "aggregate_ym"
[13] "aggregate_ymd" "aggregate_ys"
# Core function
nwm_ymd = aggregate_ymd(nwm, "mean")

# User defined function
seasonal = aggregate_s(nwm, fun = function(x){ quantile(x, .75) })

# Multiple functions
month = aggregate_m(nwm, c('mean', "max"))
library(ggplot2)
library(patchwork)  # the + composition of separate ggplots below relies on patchwork

ggplot(data = nwis) +
  geom_line(aes(x = Date, y = Flow * 0.028316846592)) +
  labs(title = "NWIS Data") +
ggplot(data = nwm_ymd) +
  geom_line(aes(x = ymd, y = flow_cms_v2.1)) +
  labs(title = "NWM Daily Average") +
ggplot(data = seasonal) +
  geom_col(aes(x = season, y = flow_cms_v2.1)) +
  labs(title = "NWM Seasonal 75%") +
ggplot(data = month) +
  geom_col(aes(x = as.factor(month), y = flow_cms_v2.1_max)) +
  geom_col(aes(x = as.factor(month), y = flow_cms_v2.1_mean), fill = "red") +
  labs(title = "NWM Monthly Mean/Maximum")
elevatr is an R package focused on returning elevation data as a SpatialPointsDataFrame from point elevation services, or as a raster object from raster elevation services. Currently, the package supports access to the Amazon Web Services Terrain Tiles, the Open Topography Global Datasets API, and the USGS Elevation Point Query Service.
# install.packages("elevatr")
library(elevatr)
x <- get_elev_raster(nldi$basin, z = 12)
plot(rast(x))
One of the biggest challenges with Earth System and spatial research is extracting data. These challenges include not only finding the source data but then downloading, managing, and extracting the partitions critical for a given task.
Services exist to make data more readily available over the web, but they introduce new challenges of identifying subsets and working across a wide array of standards (and non-standards), all without alleviating the challenge of finding resources.
In light of this, opendap.catalog provides 4 primary services.
# remotes::install_github("mikejohnson51/opendap.catalog")
library(opendap.catalog)
= aoi_get(state = "FL")
AOI
<- dap(URL = "https://cida.usgs.gov/thredds/dodsC/bcsd_obs",
dap AOI = AOI,
startDate = "1995-01-01")
source: https://cida.usgs.gov/thredds/dodsC/bcsd_obs
varname(s):
> pr [mm/m] (monthly_sum_pr)
> prate [mm/d] (monthly_avg_prate)
> tas [C] (monthly_avg_tas)
> tasmax [C] (monthly_avg_tasmax)
> tasmin [C] (monthly_avg_tasmin)
> wind [m/s] (monthly_avg_wind)
==================================================
diminsions: 63, 48, 1 (names: longitude,latitude,time)
resolution: 0.125, 0.125, 1 months
extent: -87.75, -79.88, 25.12, 31.12 (xmin, xmax, ymin, ymax)
crs: +proj=longlat +a=6378137 +f=0.00335281066474748 +p...
time: 1995-01-01 to 1995-01-01
==================================================
values: 18,144 (vars*X*Y*T)
str(dap, max.level = 1)
List of 6
$ pr :Formal class 'SpatRaster' [package "terra"] with 1 slot
$ prate :Formal class 'SpatRaster' [package "terra"] with 1 slot
$ tas :Formal class 'SpatRaster' [package "terra"] with 1 slot
$ tasmax:Formal class 'SpatRaster' [package "terra"] with 1 slot
$ tasmin:Formal class 'SpatRaster' [package "terra"] with 1 slot
$ wind :Formal class 'SpatRaster' [package "terra"] with 1 slot
plot(rast(dap))
lai = dap('/Users/mjohnson/mosaic_lai_glass_1.tif',
          AOI = AOI)
{plot(lai)
plot(st_transform(AOI$geometry, crs(lai)), add = TRUE)
}
glimpse(opendap.catalog::params)
Rows: 14,691
Columns: 15
$ id <chr> "hawaii_soest_1727_02e2_b48c", "hawaii_soest_1727_02e2_b48c"…
$ grid_id <chr> "73", "73", "73", "73", "73", "73", "73", "73", "73", "73", …
$ URL <chr> "https://apdrc.soest.hawaii.edu/erddap/griddap/hawaii_soest_…
$ tiled <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", …
$ variable <chr> "nudp", "nusf", "nuvdp", "nuvsf", "nvdp", "nvsf", "sudp", "s…
$ varname <chr> "nudp", "nusf", "nuvdp", "nuvsf", "nvdp", "nvsf", "sudp", "s…
$ long_name <chr> "number of deep zonal velocity profiles", "number of surface…
$ units <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ model <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ ensemble <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ scenario <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ T_name <chr> "time", "time", "time", "time", "time", "time", "time", "tim…
$ duration <chr> "2001-01-01/2022-01-01", "2001-01-01/2022-01-01", "2001-01-0…
$ interval <chr> "365 days", "365 days", "365 days", "365 days", "365 days", …
$ nT <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, …
With over 14,000 web resources documented, there are simply too many to search through by hand unless you know exactly what you want, which precludes serendipitous discovery. So, we have added a general fuzzy search tool to help discover datasets.
Say you want to find what resources there are for daily rainfall? search and search_summary can help:
search("daily precipitation") |>
search_summary() |>
::datatable() DT
s = search('gridmet daily precipitation')) (
# A tibble: 1 × 16
id grid_id URL tiled variable varname long_name units model ensemble
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 gridmet 176 http://… "" pr precip… pr mm <NA> <NA>
# … with 6 more variables: scenario <chr>, T_name <chr>, duration <chr>,
# interval <chr>, nT <dbl>, rank <dbl>
Now that you know the precipitation product you want, let's say you want to study the rainfall seen during Hurricane Harvey over Texas and Louisiana:
(harvey = dap(catalog = s,
              AOI = aoi_get(state = c("TX", "LA")),
              startDate = "2017-08-17",
              endDate = "2017-09-03"))
source: http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg...
varname(s):
> precipitation_amount [mm] (pr)
==================================================
diminsions: 430, 257, 18 (names: lon,lat,day)
resolution: 0.042, 0.042, 1 days
extent: -106.66, -88.75, 25.8, 36.5 (xmin, xmax, ymin, ymax)
crs: +proj=longlat +a=6378137 +f=0.00335281066474748 +p...
time: 2017-08-17 to 2017-09-03
==================================================
values: 1,989,180 (vars*X*Y*T)
$precipitation_amount_total
class : SpatRaster
dimensions : 257, 430, 18 (nrow, ncol, nlyr)
resolution : 0.04166667, 0.04166667 (x, y)
extent : -106.6625, -88.74583, 25.79583, 36.50417 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +ellps=WGS84 +no_defs
source : memory
names : 2017-08-17, 2017-08-18, 2017-08-19, 2017-08-20, 2017-08-21, 2017-08-22, ...
min values : 0, 0, 0, 0, 0, 0, ...
max values : 106.3, 56.8, 62.9, 66.0, 55.3, 121.1, ...
unit : mm, mm, mm, mm, mm, mm, ...
plot(harvey[[1]])
harvey_counties = aoi_get(state = c("TX", "LA"), county = "all")
{plot(sum(harvey[[1]]) * 0.001,
main = "Meters of rainfall: Hurricane Harvey")
plot(harvey_counties$geometry, add = TRUE)
}
One of the key features of dap is the capacity to make requests over tiled resources. Some resources (like MODIS) are tiled in space, and some (like MACA and LOCA) are tiled in time. Below we see an example of how a single dap call can consolidate a request that covers multiple spatial tiles:
dap = dap(
  catalog = search("MOD16A2.006 PET"),
  AOI = AOI::aoi_get(state = "FL"),
  startDate = "2010-01-01",
  endDate = "2010-01-31"
)
source: https://opendap.cr.usgs.gov/opendap/hyrax/MOD16A2.006/h10v05...
tiles: 3 XY_modis tiles
varname(s):
> PET_500m [kg/m^2/8day] (MODIS Gridded 500m 8-day Composite potential Evapotranspiration (ET))
==================================================
diminsions: 1383, 1586, 5 (names: XDim,YDim,time)
resolution: 463.313, 463.313, 8 days
extent: -8417233.78, -7776472.29, 2712464.3, 3447278.27 (xmin, xmax, ymin, ymax)
crs: +proj=sinu +lon_0= +x_0= +y_0= +units=m +a=6371007...
time: 2010-01-02 to 2010-02-03
==================================================
values: 10,967,190 (vars*X*Y*T)
plot(dap)
Often getting spatial data is not the end goal; in many cases producing area-based summaries is critical. zonal is a package that wraps the excellent exactextractr package with some added performance and convenience for these types of tasks. To use zonal, you must supply a raster dataset (SpatRaster), a geometry set to summarize over, and a summary function (similar to nwmTools::aggregate_*). Lastly, you must provide the column with a unique identifier for each polygon object, and whether the summary data should be joined (join = TRUE) to the original geometry data.
# remotes::install_github("NOAA-OWP/zonal")
library(zonal)
# Summary options
zonal::ee_functions()
[1] "min" "max"
[3] "count" "sum"
[5] "mean" "median"
[7] "quantile" "mode"
[9] "majority" "minority"
[11] "variety" "variance"
[13] "stdev" "coefficient_of_variation"
[15] "weighted_mean" "weighted_sum"
[17] "frac" "weighted_frac"
zonal::weight_functions()
collapse base
1 fmean mean
2 fmedian median
3 fmode mode
4 fsum sum
5 fprod prod
6 fsd sd
7 fvar var
8 fmin min
9 fmax max
10 fnth nth
11 ffirst first
12 flast last
13 fnobs nobs
14 fndistinct ndistinct
For big data processing, zonal offers a unique ability to produce and work with weight grids; however, this is out of scope for this workshop.
Returning to the Hurricane Harvey example, imagine you wanted the county-wide total rainfall for each day:
library(zonal)
system.time({
  summary_data = execute_zonal(harvey,
                               geom = harvey_counties,
                               fun = "sum",
                               ID = "fip_code",
                               join = TRUE)
})
user system elapsed
0.462 0.072 0.567
plot(summary_data[grepl("sum", names(summary_data))], border = NA,
max.plot = 16)
Often the needed data is not delivered by a service or function. This does not mean we are out of luck, but rather that we have to do a little of the “heavy lifting” ourselves. This final section highlights how to read tabular, vector, and raster data from URLs using the http/https and s3 protocols.
These approaches are advantageous when you don’t want to download data locally for analysis.
Many CSV readers in R have the ability to read from a URL. If your desired data is stored in CSV you have a number of options:
base::read.csv
readr::read_csv
data.table::fread
One example dataset for which there is no web service client is the USACE National Inventory of Dams (NID). This data is stored as a CSV file.
library(readr)
dams = read_csv('https://nid.usace.army.mil/api/nation/csv',
                skip = 1)
glimpse(dams)
Rows: 92,029
Columns: 74
$ `Dam Name` <chr> "Aichi Forge Usa Dam", "Dod …
$ `Other Names` <chr> "Aichi Forge Usa Lake", NA, …
$ `Former Names` <chr> NA, "Dod Usa", NA, "Dod Usa"…
$ `NID ID` <chr> "KY00728", "OK20994", "ND011…
$ `Other Structure ID` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal ID` <chr> "KY00728", "OK20994", "ND011…
$ `Owner Names` <chr> "AICHI", "DEPT. OF DEFENSE/U…
$ `Owner Types` <chr> "Private", "Federal", "Priva…
$ `Primary Owner Type` <chr> "Private", "Federal", "Priva…
$ `State or Federal Agency ID` <chr> "KY00728", NA, NA, NA, NA, N…
$ `Number of Associated Structures` <dbl> 0, 0, NA, 0, 0, 0, 0, 0, 0, …
$ `Designer Names` <chr> "AUSTIN ENGINEERS, INC. SOUT…
$ `Non-Federal Dam on Federal Property` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `State Regulated Dam` <chr> "Yes", "No", "Yes", "No", "Y…
$ `State Jurisdictional Dam` <chr> "Yes", "Yes", "Yes", "Yes", …
$ `State Regulatory Agency` <chr> "KY Division of Water", NA, …
$ `State Permitting Authority` <chr> "Yes", "No", "Yes", "No", "Y…
$ `State Inspection Authority` <chr> "Yes", "No", "Yes", "No", "Y…
$ `State Enforcement Authority` <chr> "Yes", "No", "Yes", "No", "Y…
$ `Source Agency` <chr> "Kentucky", "Oklahoma", "Nor…
$ Latitude <dbl> 38.28828, 35.70341, 47.15639…
$ Longitude <dbl> -84.55619, -95.13638, -102.7…
$ County <chr> "Scott", "Muskogee", "Dunn",…
$ State <chr> "Kentucky", "Oklahoma", "Nor…
$ City <chr> "DELAPLAIN-AREA", "ARROWHEAD…
$ `Distance to Nearest City (Miles)` <dbl> 1, 6, NA, 1, 4, 4, 4, NA, 0,…
$ `River or Stream Name` <chr> "TR-DRY RUN", "TR-WEST SPANI…
$ `Congressional District` <chr> NA, NA, "Congressional Distr…
$ `Section, Township, or Range Location` <chr> NA, "12", NA, "21", "13;1N;5…
$ `Federal Agency Owners` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Funding` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Design` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Construction` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Regulatory` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Inspection` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Operation` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Federal Agency Involvement Other` <chr> NA, NA, NA, NA, NA, NA, NA, …
$ `Primary Purpose` <chr> "Other", "Recreation", "Othe…
$ Purposes <chr> "Other", "Recreation", "Othe…
$ `Primary Dam Type` <chr> "Earth", "Earth", NA, "Earth…
$ `Dam Types` <chr> "Earth", "Earth", NA, "Earth…
$ `Core Types` <chr> NA, "Earth", NA, "Earth", "U…
$ Foundation <chr> NA, "Soil", NA, "Soil", "Unl…
$ `Dam Height (Ft)` <dbl> 15.0, 18.0, 29.0, 15.0, 18.0…
$ `Hydraulic Height (Ft)` <dbl> NA, 10.0, NA, 12.0, NA, NA, …
$ `Structural Height (Ft)` <dbl> NA, 18.0, NA, 15.0, NA, NA, …
$ `NID Height (Ft)` <dbl> 15.0, 18.0, 29.0, 15.0, 18.0…
$ `NID Height Category` <chr> "Less than 25 feet", "Less t…
$ `Dam Length (Ft)` <dbl> 325, 340, 1740, 240, 700, 80…
$ `Volume (Cubic Yards)` <dbl> NA, 0, NA, 0, 0, 0, 0, NA, 0…
$ `Year Completed` <dbl> 1974, 1939, NA, 1940, 1975, …
$ `Year Completed Category` <chr> "1970-1979", "1930-1939", "U…
$ `NID Storage (Acre-Ft)` <dbl> 86.8, 50.0, 682.6, 50.0, 394…
$ `Max Storage (Acre-Ft)` <dbl> 86.8, 50.0, 682.6, 50.0, 394…
$ `Normal Storage (Acre-Ft)` <dbl> 26.6, 25.0, NA, 25.0, 180.0,…
$ `Surface Area (Acres)` <dbl> 8.00, 5.00, 37.50, NA, 31.80…
$ `Drainage Area (Sq Miles)` <dbl> 1.35, NA, NA, NA, 0.14, 1.23…
$ `Max Discharge (Cubic Ft/Second)` <dbl> NA, 200, NA, 20, 2930, 230, …
$ `Spillway Type` <chr> "Uncontrolled", "Controlled"…
$ `Spillway Width (Ft)` <dbl> NA, 1, NA, 1, 0, 0, 2, NA, 1…
$ `Number of Locks` <dbl> NA, 0, NA, 0, 0, 0, 0, 0, 0,…
$ `Length of Locks` <dbl> NA, 0, NA, 0, 0, 0, 0, 0, 0,…
$ `Lock Width (Ft)` <dbl> NA, 0, NA, 0, 0, 0, 0, 0, 0,…
$ `Years Modified` <chr> NA, NA, NA, NA, NA, NA, "199…
$ `Outlet Gate Type` <chr> NA, "Valve1", NA, "Valve1", …
$ `Data Last Updated` <date> 2021-04-09, 2018-06-22, 202…
$ `Last Inspection Date` <date> 2013-04-30, 1998-10-01, NA,…
$ `Inspection Frequency` <dbl> 5, 5, NA, 5, 5, 5, 1, 2, 5, …
$ `Hazard Potential Classification` <chr> "Low", "Low", "Low", "Low", …
$ `Condition Assessment` <chr> "Not Rated", "Not Rated", "N…
$ `Condition Assessment Date` <date> NA, NA, NA, NA, NA, NA, NA,…
$ `EAP Prepared` <chr> "Not Required", "Not Require…
$ `EAP Last Revision Date` <date> NA, NA, NA, NA, NA, NA, 201…
$ `Website URL` <chr> "https://damsafety.org/kentu…
GDAL provides a number of Virtual File System drivers to read compressed and network data.
Three of the most useful for common spatial data tasks include:
vsizip –> file handler that allows reading ZIP archives on-the-fly without decompressing them beforehand.
vsicurl –> a generic file system handler for online resources that do not require particular signed authentication schemes.
vsis3 –> specialized file handlers for commercial cloud storage services (e.g. AWS S3; also see /vsigs/, /vsiaz/, /vsioss/ and /vsiswift/).
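These prefixes are chained by simple string construction; for example (the URL and layer names here are purely illustrative):

```r
# Chain /vsizip/ and /vsicurl/ to point GDAL at a shapefile
# inside a remote zip archive (illustrative URL)
url  <- "https://example.com/archive.zip"
path <- paste0("/vsizip/{/vsicurl/", url, "}/layer.shp")
path
```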
Duke University stores VRT files for the POLARIS soils dataset. terra can generate pointers to these datasets; however, data is not extracted until an operation is called on it. As seen above, dap provides simplified access to extract subsets of data from OpenDAP and VRT endpoints.
rast('/vsicurl/http://hydrology.cee.duke.edu/POLARIS/PROPERTIES/v1.0/vrt/alpha_mean_0_5.vrt')
class : SpatRaster
dimensions : 97200, 212400, 1 (nrow, ncol, nlyr)
resolution : 0.0002777778, 0.0002777778 (x, y)
extent : -125, -66, 23, 50 (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (EPSG:4326)
source : alpha_mean_0_5.vrt
name : alpha_mean_0_5
dap('/vsicurl/http://hydrology.cee.duke.edu/POLARIS/PROPERTIES/v1.0/vrt/alpha_mean_0_5.vrt',
    AOI = AOI::aoi_get(state = "CO", county = "Larimer")) |>
  plot()
Here is an example where a global elevation dataset can quickly be used in conjunction with the NLDI client to plot the elevation transect of the main river channel draining to the USGS office in Fort Collins, CO!
d = findNLDI(loc = geocode("2150 Centre Ave, Fort Collins, CO 80526", pt = TRUE),
             nav = "UM",
             find = c("flowlines", "basin"))
mapview(d)
= dap("/vsicurl/https://opentopography.s3.sdsc.edu/raster/COP30/COP30_hh.vrt",
COP30 AOI = d$basin)
= extract(COP30, vect(st_union(d$UM_flowlines))) |>
transect mutate(location = c(n():1))
ggplot(transect, aes(x = location, y = COP30_hh)) +
geom_line() +
geom_smooth()
Since the base data reader of sf is GDAL, the VSI capabilities also apply to vector data! The added challenge is that vector data (especially shapefile data) is often stored in zip files, which adds an extra layer of complexity when reading these datasets remotely.
In this example we look at the US Census Bureau's FTP server for TIGER road lines. The data is stored by 5-digit (county) FIPS code in zip files. To access this data we need to identify the FIPS code of interest and then chain a vsizip with a vsicurl call.
= aoi_get(state = "AL", county = "Tuscaloosa")
AOI = paste0('tl_2021_',AOI$fip_code, '_roads')
file = paste0('/vsizip/{/vsicurl/https://www2.census.gov/geo/tiger/TIGER2021/ROADS/', file, '.zip}/',file,'.shp')
url
system.time({
  roads = read_sf(url)
})
user system elapsed
0.201 0.040 4.975
mapview(roads)