Lecture 02

The Digital Environment

Mike Johnson

mike.johnson@colostate.edu

2025-02-09

Quantitative Reasoning

Working Definition: Thinking about *data* from our world, to better understand and make choices about the past, present and future.
Many people can work with data (e.g., data scientists), many people have domain knowledge (e.g., hydrologists), but few can do both in a careful way. That’s the goal of this class!
Almost all data is digital.
Working with computers is essential for quantitative analysis (a precursor to reasoning).
Today is all about understanding how we interface with computers, their strengths, and limitations.

Your Environment

Computer: An electronic device for storing and processing data, typically in binary form, according to instructions. Computers store persistent data on disk, hold ephemeral data in memory, and execute processes with the CPU (Central Processing Unit).

File: A block of arbitrary information available to a computer program.

Compute

Bytes

Bytes vs Bits

A byte is a group of 8 binary digits (bits) and a unit of memory size.
The bit is a basic unit of information in computing and represents a logical state with two possible values (0 or 1).
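
A minimal sketch of the bit/byte relationship in base R. The bit pattern and the variable names are chosen only for illustration.

```r
# One byte = 8 bits, so a byte can take 2^8 distinct values (0 to 255)
n_values <- 2^8

# Interpret a string of 8 binary digits as a base-2 integer
byte_value <- strtoi("01000101", base = 2L)

# The same byte viewed as a raw value and as a text character
byte_raw  <- as.raw(byte_value)
byte_char <- rawToChar(byte_raw)

n_values    # 256
byte_value  # 69
byte_char   # "E"
```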

Files

File Storage

Files are stored as a collection of bytes on a hard drive
Hard drives do not understand files - they just store bytes and directions to those bytes
We need ways to retrieve (and write) bytes from (and to) the hard drive.
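
One way to see bytes going to and from disk is with base R's writeBin() and readBin(). This is a sketch only: the temporary file and the three example bytes are assumptions made for the demonstration.

```r
# Hypothetical temporary file used only for this illustration
path <- tempfile(fileext = ".bin")

# Write three bytes to disk
bytes_out <- as.raw(c(0x45, 0x53, 0x53))
writeBin(bytes_out, path)

# Ask the file system how many bytes are stored, then read them back
n_bytes  <- file.size(path)
bytes_in <- readBin(path, what = "raw", n = n_bytes)

identical(bytes_in, bytes_out)  # TRUE
rawToChar(bytes_in)             # "ESS"
```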

Interpreting Capacity

New computer capacity

Storage Pattern

File Storage

Defragmentation

What is a file?

Files save data on the hard drive as bytes in a meaningful way.
Files have three key properties:
- a name (machine and human interpretable address)
- a path (a location in the file system)
- an extension (how to/what program reads the format)
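
The three properties can be pulled apart with base R helpers. This is a sketch using the example image path from these slides; the variable names are illustrative.

```r
path <- "/Users/mikejohnson/github/csu-ess-330/slides/images/03-bit-byte.png"

file_dir  <- dirname(path)                         # path: location in the file system
file_full <- basename(path)                        # name plus extension
file_ext  <- tools::file_ext(path)                 # extension, without the dot
file_name <- tools::file_path_sans_ext(file_full)  # name without the extension

file_dir   # "/Users/mikejohnson/github/csu-ess-330/slides/images"
file_name  # "03-bit-byte"
file_ext   # "png"
```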

Filesystem

Tip

For every file, there are paths and directories that lead to that specific file. The way these paths and directories are organized is called the file system.

Filesystem: describes the methods an operating system uses to organize files
- If you are on a Windows device and you want to find a file you just downloaded, you go to “This PC,” then click “Documents,” where you find another folder called “Downloads” that holds the downloaded files.
- If you are on a Mac, you get the same result by clicking “Downloads.” Whether you do it from the menu bar or anywhere else, it takes you exactly where you need to go.

Directories

Directory: a location for storing, organizing, and separating files and other directories on a computer. Think of folders!
Root directory: the “highest” or top-level directory in the hierarchy.
- The root directory contains all other folders/files in the drive or folder
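- Sometimes loosely called the home directory, although strictly a home directory is one user's top-level folder (e.g., /Users/mikejohnson) inside the root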

File Names

Tip

“There are only two hard things in Computer Science: cache invalidation and naming things.” ~Phil Karlton

File names/paths are how we locate and identify information stored on a machine.
Names are always up to us as users!

Three Principles for file names

machine readable
human readable
sortable

machine readable

avoid spaces, punctuation, accented characters, and mixed cases
regular expression friendly (e.g. use patterns!)
use ISO 8601 dates (YYYY-MM-DD)
be consistent with delimiters (easy to compute on)
Use “_” (underscores) to separate “metadata” you want later
Use “-” (hyphens) to separate words for readability (like dates or names)

Doing so makes files …

easy to search for & narrow

# Only those files with pattern "_tx"
files <- list.files("data/usgs-files", pattern = "_tx")
length(files) # total number of Texas files
#> [1] 27

head(files, 2) # first two matches
#> [1] "1903-07-01_08033500_00060_tyler_tx.txt"   
#> [2] "1923-10-01_08033000_00060_angelina_tx.txt"

and metadata easy to recover (easy to compute on)

stringr::str_split(files, "[_\\.]", simplify = TRUE)

#>    StartDate   siteID parameterCode   county state extension
#> 1 1903-07-01 08033500         00060    tyler    tx       txt
#> 2 1923-10-01 08033000         00060 angelina    tx       txt
#> 3 1923-10-01 08180500         00060   medina    tx       txt
#> 4 1923-12-01 08082500         00060   baylor    tx       txt
#> 5 1924-08-01 08062500         00060    ellis    tx       txt
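
A base R sketch of how a labeled table like the one above could be built from the file names. It uses only the two file names printed earlier, and the column names are taken from the printed output; the exact steps are an assumption, not necessarily how the original table was produced.

```r
files <- c("1903-07-01_08033500_00060_tyler_tx.txt",
           "1923-10-01_08033000_00060_angelina_tx.txt")

# Split each name on "_" or "." (consistent delimiters give 6 pieces per file)
parts <- strsplit(files, "[_.]")

# Stack the pieces into a matrix, then a data frame with descriptive names
meta <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(meta) <- c("StartDate", "siteID", "parameterCode",
                 "county", "state", "extension")

# ISO 8601 dates convert directly to Date objects
meta$StartDate <- as.Date(meta$StartDate)
meta
```

Keeping siteID as text preserves its leading zero, which would be lost if the column were converted to a number.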

human readable

File names contain information about the content and purpose of the file
easy to find the right file a year from now

list.files("R")
#> [1] "00_src_utils.R"         "01_src_data-download.R" "02_src_data-clean.R"   
#> [4] "03_src_analysis.R"      "04_src_figures.R"

Here, the order in which the files run (utilities, download, clean, analyze, figures), the project they belong to (src), and the purpose of each file are all recorded in the file names.

#>   Order Project       Purpose extension
#> 1    00     src         utils         R
#> 2    01     src data-download         R
#> 3    02     src    data-clean         R
#> 4    03     src      analysis         R
#> 5    04     src       figures         R

Sort easily

put numeric values first (use leading 0 for 1-9)
use ISO 8601 dates (YYYY-MM-DD)
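
A short sketch of why both rules matter. The file names and dates are made up for illustration, and method = "radix" is used so the sort order does not depend on the computer's locale.

```r
# Without leading zeros, text sorting puts "10" before "2"
unpadded <- c("1.csv", "2.csv", "10.csv")
sorted_unpadded <- sort(unpadded, method = "radix")
sorted_unpadded  # "1.csv" "10.csv" "2.csv"

# With leading zeros, text order matches numeric order
padded <- sprintf("%02d.csv", c(10L, 2L, 1L))
sorted_padded <- sort(padded, method = "radix")
sorted_padded    # "01.csv" "02.csv" "10.csv"

# ISO 8601 dates sort chronologically even when treated as plain text
dates <- c("2024-10-01", "1999-12-31", "2024-02-15")
sorted_dates <- sort(dates, method = "radix")
sorted_dates     # "1999-12-31" "2024-02-15" "2024-10-01"
```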

File Paths

File paths tell us the location of a file within the file system
Directories are stored as hierarchies, with the root directory holding everything on the system.
The folder you are currently in is called your working directory (think pwd).
The folder above the working directory is the parent directory.
All folders within the working directory are subfolders or child folders.
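
The working, parent, and child directories can be inspected from R. This sketch builds a small hypothetical folder tree in a temporary directory so it can run anywhere; the folder names are assumptions.

```r
# Build a small hypothetical folder tree inside a temporary directory
project <- file.path(tempdir(), "demo-project")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "R"), showWarnings = FALSE)

# Move into the project, remembering where we started
old_wd <- setwd(project)

wd_name  <- basename(getwd())   # working directory: "demo-project"
parent   <- dirname(getwd())    # parent directory (one level up)
children <- list.dirs(".", full.names = FALSE, recursive = FALSE)  # child folders

setwd(old_wd)  # restore the original working directory
```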

Declaring File Paths

Files are located by their path. Think of this as the directions - from the root directory.
Directories are separated with backslashes (“\”) on Windows and forward slashes (“/”) on macOS and Linux machines.
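
In R, file.path() builds paths with forward slashes, which R accepts on Windows as well as macOS and Linux. The sketch below reuses folder names from these slides; the Windows-style string is a made-up illustration.

```r
# file.path() joins the pieces with "/"
p <- file.path("slides", "images", "03-bit-byte.png")
p  # "slides/images/03-bit-byte.png"

# Inside an R string a backslash must be escaped, so a Windows-style path
# is typed with doubled backslashes; the stored string contains single ones
win <- "slides\\images\\03-bit-byte.png"
same_length <- nchar(win) == nchar(p)  # TRUE: each separator is one character

# Convert backslashes to forward slashes
converted <- gsub("\\", "/", win, fixed = TRUE)
converted  # "slides/images/03-bit-byte.png"
```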

Example:

/Users/mikejohnson/github/csu-ess-330/slides/images/03-bit-byte.png

Home directory (the starting point for this user's files): /Users/mikejohnson

Sys.getenv("LOGNAME")
#> [1] "mikejohnson"

git-enabled projects: github
project: csu-ess-330
sub directory: slides
sub directory: images
file: 03-bit-byte
ext: .png

Absolute vs Relative Paths

There are two ways to specify a file path.

An absolute path always begins in the root folder

img <- png::readPNG('/Users/mikejohnson/github/csu-ess-330/slides/images/03-bit-byte.png')

A relative path is relative to the current working directory

getwd()
#> [1] "/Users/mikejohnson/github/csu-ess-330/slides"
img <- png::readPNG('images/03-bit-byte.png')

.. and . notation

The dot (.) and dot-dot (..) notation helps us write shorter paths.
A single dot (“.”) denotes “this directory”.
Two dots (“..”) denote “the parent directory”.
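
A small sketch showing how the two notations resolve, using normalizePath() on a hypothetical folder tree created in a temporary directory.

```r
# Hypothetical folder tree in a temporary directory
top <- file.path(tempdir(), "dot-demo")
sub <- file.path(top, "slides")
dir.create(sub, recursive = TRUE, showWarnings = FALSE)

# "." resolves to the directory itself, ".." to its parent
here   <- normalizePath(file.path(sub, "."))
parent <- normalizePath(file.path(sub, ".."))

here == normalizePath(sub)    # TRUE
parent == normalizePath(top)  # TRUE
```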

Enforcing Relative Paths

Tip

Working with absolute paths can be a pain compared to relative paths…

It is good practice to keep all the files associated with a project (input data, R scripts, analytic results, figures) together.
This is such a common practice that RStudio has built-in support for this via projects.
A good project layout will ultimately make your life easier:
- It will help ensure the integrity of your data;
- It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
- It allows you to easily upload your code with your manuscript submission;
- It makes it easier to pick the project back up after a break.
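
One possible way to set up such a layout from R is sketched below. The folder names are illustrative rather than a required standard, and a temporary directory stands in for a real project folder.

```r
# Hypothetical project root; in practice this would be your project folder
root <- file.path(tempdir(), "my-project")

# Illustrative folder names, one per kind of file
folders <- c("data", "R", "results", "figures")
for (f in folders) {
  dir.create(file.path(root, f), recursive = TRUE, showWarnings = FALSE)
}

# Inside the project, every file can be reached with a short relative path
rel_path  <- file.path("data", "1903-07-01_08033500_00060_tyler_tx.txt")
full_path <- file.path(root, rel_path)

list.dirs(root, full.names = FALSE, recursive = FALSE)
```

Because the relative path does not mention the machine or user, the same script keeps working when the whole project folder is copied to another computer.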

File Extensions

All files store bits.
Extensions can be considered a type of metadata that provides information about the way data might be stored.
There are thousands of different data formats, ranging from common to custom.
Each format defines how the sequence of bits and bytes is laid out.
Extensions indicate the characteristics of the file, its intended use, and the default applications that can open/use the file.
If you double click a .docx file it opens in Word which interprets the meaning of the bytes
If you double click an .R file it opens with RStudio, and R interprets the meaning of the bytes

Extension Interpretation

Readers depend on an anticipated structure

img <- jpeg::readJPEG('images/03-bit-byte.png')
#> Error in jpeg::readJPEG("images/03-bit-byte.png"): JPEG decompression error: Not a JPEG file: starts with 0x89 0x50

The file is a PNG, not a JPEG, so the JPEG reader rejects it. “0x89 0x50” is how a PNG file starts (0x50 is the “P” in “PNG”).

img <- png::readPNG('images/03-bit-byte.png')

The data returned to R is a structured set of bits, interpreted according to the directions of the file and the interpreting language!

dim(img)
#> [1] 394 768   3
class(img)
#> [1] "array"
str(img)
#>  num [1:394, 1:768, 1:3] 1 1 1 1 1 1 1 1 1 1 ...

plot(NA, xlim = c(0, 2), ylim = c(0, 1))
rasterImage(img, 0, 0, 2, 1)
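
The “0x89 0x50” in the error message is the start of the 8-byte signature that begins every PNG file. The sketch below checks for that signature directly; is_png() is a hypothetical helper written for this example, and the two temporary files are stand-ins rather than real images.

```r
# The 8-byte signature that begins every PNG file
png_signature <- as.raw(c(0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a))

# Hypothetical helper: TRUE when a file starts with the PNG signature
is_png <- function(path) {
  first_bytes <- readBin(path, what = "raw", n = 8)
  identical(first_bytes, png_signature)
}

# Two temporary files with deliberately misleading extensions.
# Neither is a complete image; they hold just enough bytes for the check.
looks_like_png <- tempfile(fileext = ".jpg")
writeBin(png_signature, looks_like_png)

plain_text <- tempfile(fileext = ".png")
writeBin(charToRaw("ESS 330, not an image"), plain_text)

is_png(looks_like_png)  # TRUE, despite the .jpg extension
is_png(plain_text)      # FALSE, despite the .png extension
```

The extension is only a hint about which reader to use; the bytes themselves decide whether a reader can interpret the file.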

Bytes Being Interpreted

# Character Object
(x <- "ESS 330")

# Character Object to Raw Type
# How R sees the data
(y <- charToRaw(x))

# Raw to Bits
# What's on disk (bits are listed least-significant first within each byte)
(z <- rawToBits(y))

length(z)

nchar(x)

nchar(x) == (length(z)/8)

#> [1] "ESS 330"
#> [1] 45 53 53 20 33 33 30
#>  [1] 01 00 01 00 00 00 01 00 01 01 00 00 01 00 01 00 01 01 00 00 01 00 01 00 00
#> [26] 00 00 00 00 01 00 00 01 01 00 00 01 01 00 00 01 01 00 00 01 01 00 00 00 00
#> [51] 00 00 01 01 00 00
#> [1] 56
#> [1] 7
#> [1] TRUE
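
The interpretation also runs in reverse: bits can be packed back into bytes, and bytes back into text. This sketch uses base R's packBits() and rawToChar(); the accented character at the end is an added illustration showing that one byte per character holds only for plain ASCII text.

```r
x <- "ESS 330"
y <- charToRaw(x)   # 7 bytes
z <- rawToBits(y)   # 56 bits, least-significant bit first within each byte

# Pack each group of 8 bits back into a byte, then bytes back into text
y_back <- packBits(z, type = "raw")
x_back <- rawToChar(y_back)

identical(y_back, y)  # TRUE
identical(x_back, x)  # TRUE

# Characters outside ASCII can take more than one byte
# (the letter e with an acute accent, written as a Unicode escape)
accented <- "\u00e9"
n_chars <- nchar(accented, type = "chars")  # 1 character
n_bytes <- nchar(accented, type = "bytes")  # more than 1 byte
```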

Example

Overview of File Types

Text Files: Human-readable, easy to edit (e.g., .txt, .csv).
Binary Files: Optimized for performance, not human-readable (e.g., .nc, .tif).
Structured Files: Contain unified metadata and data (e.g., .json, .xlsx).
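
A minimal sketch contrasting a text file with a binary file holding the same numbers. The values and the temporary file names are made up for illustration.

```r
values <- c(1.5, 2.25, 1000000.125)

# Text file: each number is written as human-readable characters
txt_path <- tempfile(fileext = ".txt")
writeLines(as.character(values), txt_path)
txt_lines <- readLines(txt_path)  # readable in any text editor

# Binary file: each double is written as exactly 8 bytes
bin_path <- tempfile(fileext = ".bin")
writeBin(values, bin_path)
bin_size <- file.size(bin_path)   # 24 (3 values x 8 bytes)

# Reading the binary file back requires knowing the layout in advance
values_back <- readBin(bin_path, what = "double", n = length(values))
identical(values_back, values)    # TRUE
```

The binary file has a predictable size and round-trips the numbers exactly, but it cannot be understood without knowing that it holds 8-byte doubles.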

Practical Contexts for Files in Ecosystem Science

CSV: Used for tabular field survey data (e.g., species counts, water quality measurements).
GeoTIFF: Stores spatial data like satellite imagery or digital elevation models (DEMs).
NetCDF: Common for multidimensional arrays that vary over space and time, such as climate model output and hydrological simulations.
JSON/GeoJSON: Facilitates sharing geographic features in web applications or APIs.
GPKG (GeoPackage): An open, standards-based format for spatial data that supports vector and raster data.
SHP (Shapefile): Widely used for vector geographic data, though limited in attribute sizes and modern functionality.