
Welcome + Your Digital Environment
This is a class about water.
It is also a class about data.
Because in 2026, you cannot do one without the other.
Every flood forecast, drought assessment, water quality model, and basin-scale analysis that matters is built on someone’s ability to find, wrangle, analyze, and communicate data.
That person should be you. This class is about making sure it is.

Most environmental scientists have one or two of these three skills: writing, data analysis, and domain expertise. Few have all three.
Combining them creates a rare and flexible professional, one who can move between science, policy, and practice.
AI can generate code. It cannot ask the right scientific question, evaluate whether the answer makes physical sense, or stand in front of a stakeholder and defend it. That combination is what we’re building.
“If not you, then who?”


The work we do in this course is in many ways the same work being done at the federal/state level right now. You’re learning a living skillset.
Course site:
https://mikejohnson51.github.io/csu-ess-523c/
| Week | Topic |
|---|---|
| Week 1 | Data Science Tools & Digital Environment |
| Week 2–3 | Vector Data |
| Week 4 | Raster Data |
| Week 5–6 | Machine Learning |
| Week 7 | Time Series |
Setup (50 pts): due Wednesday; ensures your environment is ready for the course
Labs (6 × 150 pts = 900 pts)
Total: 950 pts (1100 possible with EC)
Grade Scale:
| Grade | Range |
|---|---|
| A+ | ≥ 96.67% |
| A | 93.33–96.67% |
| A– | 90.0–93.33% |
| B+ | 86.67–90.0% |
| B | 83.33–86.67% |
| B– | 80.0–83.33% |
| C+ | 76.67–80.0% |
| C | 70.0–76.67% |
| D | 60.0–70.0% |
| F | < 60% |
Tip
Collaboration — encouraged
Discuss problems, share approaches, and help each other debug — you learn more working with peers.
Important
Individual work — required
Your code, writing, and results must be submitted individually.
Note
AI tools (ChatGPT, Claude, Copilot)


Working Definition: Thinking about data from our world to better understand and make choices about the past, present, and future.
The rest of today is about building the foundation for that, focusing on how computers store, find, and interpret information, and why that matters for doing science well.

Every piece of work in this course — and in your career — follows this chain:
Raw Data ← files on disk, often from URLs or APIs
↓ R Scripts ← read, clean, transform
Processed Data ← files on disk
↓ R Scripts ← model, summarize, visualize
Outputs ← figures, tables, reports — files on disk
A broken link anywhere in this chain means the analysis cannot be reproduced.
A file you can’t find, a path that only works on your machine, a format no one else can open — these are not minor inconveniences. They are scientific failures.
“An analysis is only as reproducible as the data you can find next year.”
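The chain above can be sketched end to end in base R. This is a minimal illustration, not the course workflow: the folder names, file names, and flow values are made up, and everything is written to a temporary directory.

```r
# Hypothetical project folders inside a temporary directory
root     <- file.path(tempdir(), "pipeline-demo")
raw_dir  <- file.path(root, "data", "raw")
proc_dir <- file.path(root, "data", "processed")
out_dir  <- file.path(root, "outputs")
for (d in c(raw_dir, proc_dir, out_dir)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# Raw data: a made-up flow record with one missing value
raw <- data.frame(
  date     = as.Date("2024-01-01") + 0:4,
  flow_cfs = c(12.1, NA, 15.4, 14.0, 13.2)
)
write.csv(raw, file.path(raw_dir, "flow.csv"), row.names = FALSE)

# Script step 1: read, clean, write processed data
flow  <- read.csv(file.path(raw_dir, "flow.csv"))
clean <- flow[!is.na(flow$flow_cfs), ]
write.csv(clean, file.path(proc_dir, "flow_clean.csv"), row.names = FALSE)

# Script step 2: summarize, write an output table
summary_tbl <- data.frame(
  n_days        = nrow(clean),
  mean_flow_cfs = mean(clean$flow_cfs)
)
write.csv(summary_tbl, file.path(out_dir, "flow_summary.csv"), row.names = FALSE)
```

Every arrow in the chain is a file written to a known, relative location, so each step can be rerun from the files the previous step left behind.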
Three subsystems you need to understand: storage (disk), memory (RAM), and the processor (CPU).
Why this matters for you:
When you load a 4 GB raster into R, it moves from disk into RAM. If your RAM is full, R crashes or slows to a crawl. Knowing where the bottleneck is (I/O? memory? compute?) is how you diagnose and fix slow analyses.

Data flows: disk → memory → CPU → memory → disk
Every read_csv(), st_read(), and rast() call you make starts this cycle. Every write_csv() and writeRaster() ends it.
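A small sketch of one trip around that cycle, using base R's read.csv() and write.csv() as stand-ins for the readr functions named above. The data frame is synthetic, and the exact byte counts will vary by platform.

```r
# Synthetic table held in memory (RAM)
df <- data.frame(id = 1:10000, flow = seq(0.1, 1000, length.out = 10000))

# memory -> disk
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)
bytes_on_disk <- file.size(path)

# disk -> memory
df2 <- read.csv(path)
bytes_in_memory <- as.numeric(object.size(df2))

# CPU works on the in-memory copy
mean_flow <- mean(df2$flow)

cat("On disk:  ", bytes_on_disk, "bytes\n")
cat("In memory:", bytes_in_memory, "bytes\n")
```

The two sizes generally differ, because a text file and an in-memory data frame lay out the same values in different ways.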
A bit is a single 0 or 1; eight bits make one byte.
Why it matters in practice:
| Dataset | Size estimate |
|---|---|
| USGS daily flow record, 1 gauge, 50 years | ~150 KB |
| NHDPlus HR flowlines, CONUS | ~12 GB |
| NWM retrospective, 1 variable, 1 year | ~200 GB |
| 3DEP 1m DEM, single HUC4 | ~8–40 GB |

These numbers stop being surprising once you know the unit chain: KB → MB → GB → TB (each ×1,024).
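As a worked example of the unit chain, here is a back-of-the-envelope size estimate for an uncompressed raster. The grid dimensions and the 4 bytes per cell (a 32-bit value) are assumptions chosen for illustration, not the specification of any dataset in the table.

```r
# Assumed grid: 10,000 x 10,000 cells, 4 bytes per cell
n_rows         <- 10000
n_cols         <- 10000
bytes_per_cell <- 4

total_bytes <- n_rows * n_cols * bytes_per_cell

# Walk the unit chain, dividing by 1,024 at each step
size_kb <- total_bytes / 1024
size_mb <- size_kb / 1024
size_gb <- size_mb / 1024

cat(sprintf("%.0f bytes = %.1f MB = %.2f GB\n", total_bytes, size_mb, size_gb))
```

Real files are often smaller on disk because of compression, but the uncompressed size is what has to fit in RAM once the data is fully loaded.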
Files save bytes on disk in a structured, meaningful way
Every file has three key properties:
Hard drives don’t understand files — they store bytes. The file system is the organizational layer that makes bytes into named, navigable objects.
How operating systems differ:
Windows: drive letters (C:\), backslash separators
macOS and Linux: a single root (/), forward-slash separators
Key vocabulary:
| Term | Meaning |
|---|---|
| Root | Top-level directory — contains everything |
| Working directory | Where your session is (getwd()) |
| Parent directory | One level up (..) |
| Subdirectory | A folder inside the working directory |
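
Each term in the table has a base R counterpart. A minimal sketch; the output depends entirely on where your R session is running.

```r
wd      <- getwd()                            # working directory
parent  <- dirname(wd)                        # parent directory, one level up
up_one  <- normalizePath("..")                # ".." resolved to a full path
subdirs <- list.dirs(".", recursive = FALSE)  # subdirectories of the working directory

cat("Working directory:", wd, "\n")
cat("Parent directory: ", parent, "\n")
cat("Subdirectories:   ", length(subdirs), "\n")
```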

Tip
“There are only two hard things in Computer Science: cache invalidation and naming things.” ~Phil Karlton
File names are how you — and your collaborators — navigate a project. There is no enforced standard. The decisions are always yours.
Bad names compound silently. One bad name is annoying. A project full of them is a crisis.
Dates use YYYY-MM-DD: they sort correctly as strings
_ separates metadata fields
- separates words within a field
This makes files easy to …
#> [1] "1903-07-01_08033500_00060_tyler_tx.txt"
#> [2] "1923-10-01_08033000_00060_angelina_tx.txt"
Metadata is recoverable without opening a single file:
#> StartDate siteID parameterCode county state extension
#> 1 1903-07-01 08033500 00060 tyler tx txt
#> 2 1923-10-01 08033000 00060 angelina tx txt
#> 3 1923-10-01 08180500 00060 medina tx txt
#> 4 1923-12-01 08082500 00060 baylor tx txt
#> 5 1924-08-01 08062500 00060 ellis tx txt
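A minimal sketch of how a table like this could be recovered from names alone, using only base R. The code that produced the output above is not shown, so treat this as one possible approach; it uses the two file names listed earlier.

```r
files <- c(
  "1903-07-01_08033500_00060_tyler_tx.txt",
  "1923-10-01_08033000_00060_angelina_tx.txt"
)

# Drop the extension, then split each name on the field separator "_"
stem  <- tools::file_path_sans_ext(files)
parts <- do.call(rbind, strsplit(stem, "_", fixed = TRUE))

meta <- data.frame(
  StartDate     = as.Date(parts[, 1]),
  siteID        = parts[, 2],
  parameterCode = parts[, 3],
  county        = parts[, 4],
  state         = parts[, 5],
  extension     = tools::file_ext(files),
  stringsAsFactors = FALSE
)
print(meta)
```

Keeping siteID and parameterCode as character strings preserves their leading zeros.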
Human readable: the name communicates content and purpose. Here, the order in which the files run (utilities, download, clean, analyze, figures) and the project they belong to (src) are recorded in the file names.
#> Order Project Purpose extension
#> 1 00 src utils R
#> 2 01 src data-download R
#> 3 02 src data-clean R
#> 4 03 src analysis R
#> 5 04 src figures R
Sortable — leading zeros and ISO dates keep files in logical order:
✅ 01_src_download.R 2024-01-15_sw-discharge_co.csv
02_src_clean.R 2024-02-03_sw-discharge_co.csv
03_src_analyze.R 2024-11-20_sw-discharge_co.csv
❌ 1_download.R 01-15-2024_data.csv
10_figures.R 11-20-2024_data.csv
2_clean.R 02-03-2024_data.csv
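The sorting behavior can be checked directly. A small sketch with made-up names; method = "radix" is used so the result does not depend on your computer's locale settings.

```r
# Script names: without and with leading zeros
scripts_bad  <- c("1_download.R", "2_clean.R", "10_figures.R")
scripts_good <- c("01_download.R", "02_clean.R", "10_figures.R")

sorted_bad  <- sort(scripts_bad,  method = "radix")
sorted_good <- sort(scripts_good, method = "radix")

# Dates: month-first versus ISO 8601, spanning two years
dates_us  <- c("01-15-2024", "03-01-2023")
dates_iso <- c("2024-01-15", "2023-03-01")

sorted_us  <- sort(dates_us,  method = "radix")
sorted_iso <- sort(dates_iso, method = "radix")

print(sorted_bad)   # step 10 no longer comes last
print(sorted_good)  # run order preserved
print(sorted_us)    # 2024 listed before 2023
print(sorted_iso)   # chronological
```

Strings are compared character by character from the left, so the most significant piece (the year, or the padded step number) has to come first for alphabetical order to match logical order.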
File paths tell us the location of a file within the file system
Directories are organized as a hierarchy, with the root directory holding everything on the system
The folder you are in is called your working directory (think pwd)
The folder above the working directory is the parent directory
All folders within the working directory are subfolders, or child folders

Absolute — starts from root, always resolves, but only on your machine:
Relative — starts from the working directory, works anywhere the project is opened:
. and .. shorthand:
. = this directory
.. = the parent directory
../data/raw/flow.csv → up one level, then into data/raw/
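A short sketch contrasting the two kinds of path. The project folder and file here are hypothetical and live in a temporary directory; the working directory is restored at the end.

```r
# Build a throwaway project: <tempdir>/demo-project/{data/raw, R}
proj <- file.path(tempdir(), "demo-project")
dir.create(file.path(proj, "data", "raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(proj, "R"), showWarnings = FALSE)
writeLines(c("date,flow", "2024-01-15,12.3"),
           file.path(proj, "data", "raw", "flow.csv"))

old_wd <- setwd(proj)                       # work from the project root

rel_ok   <- file.exists("data/raw/flow.csv")     # relative path
abs_path <- normalizePath("data/raw/flow.csv")   # absolute path on THIS machine

setwd("R")                                  # move into a subdirectory
up_ok <- file.exists("../data/raw/flow.csv")     # up one level, then down

setwd(old_wd)                               # restore the original directory

cat("Relative path found:", rel_ok, "\n")
cat("Absolute equivalent:", abs_path, "\n")
cat("'..' path found:    ", up_ok, "\n")
```

The relative path is the same string on every machine; the absolute one printed here will be different on yours.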
Important
Absolute paths are the #1 reproducibility killer in shared projects. If your code contains /Users/yourname/, it will break the moment anyone else runs it — including future you on a new machine.
Tip
RStudio Projects (.Rproj) automatically set the working directory to the project root when opened — making relative paths reliable by default.
Best practices:
Use here::here() for paths inside packages or Quarto documents; it anchors to the project root regardless of where the .qmd lives
Tip
Working with absolute paths can be a pain compared to relative paths…
It is a good practice to keep all the files associated with a project — input data, R scripts, analytic results, figures - together.
This is such a common practice that RStudio has built-in support for this via projects.
A good project layout will ultimately make your life easier:
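One possible layout, sketched in base R. The folder names are illustrative assumptions, not a course requirement, and the project is created in a temporary directory so nothing on your machine is touched.

```r
# Hypothetical project root
proj <- file.path(tempdir(), "my-water-project")

# One folder per role: inputs, scripts, results
folders <- c(
  file.path("data", "raw"),        # untouched input data
  file.path("data", "processed"),  # cleaned data written by scripts
  "R",                             # R scripts
  "figures",                       # plots
  "reports"                        # rendered documents
)

for (f in folders) {
  dir.create(file.path(proj, f), recursive = TRUE, showWarnings = FALSE)
}

# Show the resulting tree, relative to the project root
layout <- list.dirs(proj, full.names = FALSE, recursive = TRUE)
print(layout[layout != ""])
```

Keeping raw and processed data in separate folders makes it harder to overwrite an original file by accident.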
All files store bits.
Extensions are a type of metadata: they signal how the data in a file is likely stored
There are thousands of data formats, ranging from common to custom
Each format defines how the sequence of bits and bytes is laid out
Extensions indicate the characteristics of the file, its intended use, and the default applications that can open it
If you double click a .docx file it opens in Word which interprets the meaning of the bytes
If you double click an .R file it opens with RStudio, and R interprets the meaning of the bytes

#> [1] "CSU 523C"
#> [1] 43 53 55 20 35 32 33 43
#> [1] 01 01 00 00 00 00 01 00 01 01 00 00 01 00 01 00 01 00 01 00 01 00 01 00 00
#> [26] 00 00 00 00 01 00 00 01 00 01 00 01 01 00 00 00 01 00 00 01 01 00 00 01 01
#> [51] 00 00 01 01 00 00 01 01 00 00 00 00 01 00
#> [1] 64
#> [1] 8
#> [1] TRUE
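The code behind this output is not shown. The sketch below is one way to produce values like these in base R, and it is an assumption about what was run rather than the original code.

```r
x <- "CSU 523C"

bytes <- charToRaw(x)      # one byte per character, shown in hexadecimal
bits  <- rawToBits(bytes)  # eight bits per byte, least significant bit first

print(x)
print(bytes)
print(bits)
print(length(bits))                       # number of bits
print(length(bytes))                      # number of bytes
print(length(bits) == 8 * length(bytes))  # 8 bits per byte

# The same bytes, interpreted as text again
round_trip <- rawToChar(bytes)
```

The bytes are only numbers; it is the decision to interpret them as text that turns 43 53 55 back into "CSU".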
| Format | Type | Common Use |
|---|---|---|
| .csv | Text | Tabular field data, gauge records, water quality |
| .json / .geojson | Structured | Web APIs, geographic feature exchange |
| .tif / .geotiff | Binary | Satellite imagery, DEMs, classified rasters |
| .nc (NetCDF) | Binary | NWM output, climate models, multi-dimensional arrays |
| .gpkg (GeoPackage) | Structured | Vector + raster, open standard, replaces Shapefile |
| .shp (Shapefile) | Binary | Legacy vector data, still everywhere, many limitations |
| .parquet | Binary | Large tabular/spatial data, fast I/O, cloud-native |
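
Because the extension hints at the format, it can also hint at which function to reach for. The helper below, suggest_reader(), is hypothetical and written only for illustration; it returns the name of a commonly used reader as text and does not load any package or open any file.

```r
# Hypothetical helper: map a file extension to a typical R reader
suggest_reader <- function(path) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv     = "readr::read_csv()",
    json    = "jsonlite::fromJSON()",
    geojson = ,
    gpkg    = ,
    shp     = "sf::st_read()",
    tif     = ,
    nc      = "terra::rast()",
    parquet = "arrow::read_parquet()",
    "unknown extension: inspect the file first"
  )
}

suggest_reader("2024-01-15_sw-discharge_co.csv")
suggest_reader("basins.gpkg")
suggest_reader("dem.TIF")
```

An extension is only a hint: renaming a file does not change its bytes, so a mislabeled file will still fail to open with the suggested reader.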
| Part | Example | Analogy |
|---|---|---|
| Protocol | https://, s3:// | How to travel |
| Domain | waterservices.usgs.gov | The server (building) |
| Path | /nwis/iv/ | Directory (floor/room) |
| File | flow_2024.csv | The file |
| Parameters | ?sites=06752260&parameterCd=00060 | Filters on the request |
The file system logic you just learned applies directly to URLs:
https://mikejohnson51.github.io/csu-ess-330/slides/1-welcome.html#/title-slide
↑ server (domain) ↑ directory ↑ file ↑ anchor
s3://noaa-nwm-retrospective-3-0-pds/CONUS/zarr/chrtout.zarr
↑ bucket ↑ directory ↑ file
And URLs are your data pipeline:
# USGS streamflow: the path IS the query
# (requires the jsonlite and dplyr packages and a network connection)
url <- "https://waterservices.usgs.gov/nwis/iv/?sites=06752260&parameterCd=00060&format=json"
data <- jsonlite::fromJSON(url)
dplyr::glimpse(data)
# NWM retrospective on S3: same concept, different protocol
s3_path <- "s3://noaa-nwm-retrospective-3-0-pds/CONUS/zarr/chrtout.zarr"
Query parameters (?sites=...&parameterCd=...) filter data at the server before it reaches your machine. Understanding URL structure means understanding how to ask for exactly what you need.
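A sketch of assembling and taking apart a query string in base R, using the same USGS endpoint and parameter values shown above. No request is sent; the code only manipulates text.

```r
base_url <- "https://waterservices.usgs.gov/nwis/iv/"
params   <- list(sites = "06752260", parameterCd = "00060", format = "json")

# Build: name=value pairs joined by "&", attached to the path with "?"
query <- paste(names(params), unlist(params), sep = "=", collapse = "&")
url   <- paste0(base_url, "?", query)
print(url)

# Parse: strip everything up to "?", split on "&", then on "="
query_string <- sub("^[^?]*\\?", "", url)
pairs  <- strsplit(strsplit(query_string, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
parsed <- setNames(
  vapply(pairs, `[`, character(1), 2),
  vapply(pairs, `[`, character(1), 1)
)
print(parsed)
```

Changing one value in params (a different site number, for example) changes the request without touching the rest of the URL.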
Every analysis: question → data (files + paths + formats) → compute → output (files + paths + formats)
The concepts from today — bytes, files, names, paths, extensions, URLs — are the substrate every analysis runs on. They feel like setup. They are actually the foundation.
Return in 5 minutes.
R: the language and engine
- Does the actual computation
- Runs without RStudio
- Must be installed first
RStudio: the IDE (development environment)
- The interface you’ll look at all day
- Organizes files, environment, plots, packages, terminal
- Makes R usable, but it is not R
Analogy: R is the engine. RStudio is the dashboard and steering wheel. You need both, but they are not the same thing.

| Pane | Location | Purpose |
|---|---|---|
| Source | Top left | Write and save scripts — your permanent work |
| Console / Terminal | Top right | Run R interactively; access the shell |
| History / Packages / Git | Bottom left | Command history, package manager, version control |
| Environment / Files / Plots / Help | Bottom right | Objects in memory, file browser, output viewer, documentation |
Important
The console does not save your work. Anything you run there and don’t put in a script is gone when you close RStudio. Build the habit early: if it matters, it goes in a script.
Core packages for this course:
| Package | Purpose |
|---|---|
| tidyverse | Data manipulation (dplyr, tidyr) and visualization (ggplot2) |
| sf | Vector spatial data: points, lines, polygons |
| terra | Raster spatial data: grids, DEMs, satellite imagery |
| tidymodels | Machine learning with tidy principles |
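
A quick way to see which of these packages are already available on your machine. This is a minimal sketch, not the official course setup script; the install line is left commented out so nothing is installed without you choosing to.

```r
course_pkgs <- c("tidyverse", "sf", "terra", "tidymodels")

# TRUE for each package that can be loaded, FALSE otherwise
installed <- vapply(course_pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- course_pkgs[!installed]

if (length(not_installed) > 0) {
  message("Not installed yet: ", paste(not_installed, collapse = ", "))
  # install.packages(not_installed)
} else {
  message("All course packages are available.")
}
```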
analysis_final.R
analysis_final_v2.R
analysis_final_ACTUAL_FINAL.R
analysis_mikereview_jan12.R
analysis_mikereview_jan12_USETHISONE.R
This is version control, just done badly.
It doesn’t tell you what changed, why it changed, or which version to trust. It breaks completely the moment two people work on the same file.
Git provides a solution.
It tracks every change to every file, with a timestamp, an author, and a message, and lets you revert to any previous state. Not just for code: data, reports, Quarto documents, everything.
Git runs on your machine — the version control engine
GitHub hosts your repositories in the cloud — collaboration, backup, and portfolio in one place
Together:
All labs in this course are submitted via GitHub. We set it up together in this week’s homework — by next class, this will already be part of your workflow.

The terminal is a text interface to your file system. It feels archaic. It is indispensable.
Git, package installation, server connections, working in the cloud — all of it eventually touches the terminal. The sooner it feels normal, the better.
| Command | Does |
|---|---|
| pwd | Where am I? |
| ls | What’s here? |
| cd folder | Move into folder |
| cd .. | Move up one level |
| mkdir name | Create a directory called name |
| cp a b | Copy a to b |
| mv a b | Move / rename a to b |
In RStudio: Terminal tab, next to Console — use it there until the standalone terminal feels comfortable.
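Each of these shell commands has a rough base R counterpart, which can help connect the terminal to what you already do in the console. A sketch using a throwaway folder inside a temporary directory; the folder and file names are made up.

```r
old_wd <- setwd(tempdir())                    # cd <folder>
here_now <- getwd()                           # pwd

dir.create("terminal-demo", showWarnings = FALSE)              # mkdir terminal-demo
writeLines("hello", file.path("terminal-demo", "a.txt"))

file.copy(file.path("terminal-demo", "a.txt"),
          file.path("terminal-demo", "b.txt"), overwrite = TRUE)  # cp a.txt b.txt
file.rename(file.path("terminal-demo", "b.txt"),
            file.path("terminal-demo", "c.txt"))                  # mv b.txt c.txt

listing <- list.files("terminal-demo")        # ls terminal-demo
print(listing)

setwd(old_wd)                                 # cd back to where we started
```

The R functions and the shell commands act on the same file system, so a file created by one is immediately visible to the other.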

Lab 00: Verify Your Environment + Meet Your Tools
Confirm your R, RStudio, and Git setup — install course packages, configure Git, connect to GitHub. Come to the next class ready to code!
