class: center, middle, inverse, title-slide

# Geography 13
## Lecture 03: The Digital Environment
### Mike Johnson

---
class: center, middle

<img src="lec-img/03-R-situated.png" width="75%">

---
class: middle, center

## Today

--

Files

--

Naming Things

--

Data Types

---
class: center, middle

## Managing your projects in a reproducible way doesn't just make your science better; it also makes your life easier

---

# Your Computer

**Computer**: an electronic device for _storing_ and _processing_ data, typically in binary form, according to instructions

<center>
<i class="fas fa-laptop-code fa-5x"></i>
</center>

--

**File**: a block of arbitrary information available to a computer program

<center>
<i class="fas fa-file fa-5x"></i>
</center>

---

# File Storage

- Files are stored as a collection of bytes on a hard drive

--

- Hard drives do not understand files; they just store bytes

--

- We need ways to read **bytes** from, and write them to, the hard drive.

---

# Bytes?

But what is a byte?

- A byte is a group of binary digits, or bits (usually eight), operated on as a unit.

- A byte is also used as a unit of memory size.

- A bit is the basic unit of information in computing and represents a logical state with one of two possible values (0 or 1).

<img src="lec-img/03-bit-byte.png">

---
class: center, middle

# [Example](https://www.rapidtables.com/convert/number/ascii-hex-bin-dec-converter.html)

---
count: false

# Bytes - Raw - Human Readable...

.panel1-bytes-auto[
```r
*(x <- "GIS is Great!!!")
```
]

.panel2-bytes-auto[
```
[1] "GIS is Great!!!"
```
]

---
count: false

# Bytes - Raw - Human Readable...

.panel1-bytes-auto[
```r
(x <- "GIS is Great!!!")
*(y <- charToRaw(x))
```
]

.panel2-bytes-auto[
```
[1] "GIS is Great!!!"
```

```
 [1] 47 49 53 20 69 73 20 47 72 65 61 74 21 21 21
```
]

---
count: false

# Bytes - Raw - Human Readable...

.panel1-bytes-auto[
```r
(x <- "GIS is Great!!!")
(y <- charToRaw(x))
*(z <- rawToBits(y))
```
]

.panel2-bytes-auto[
```
[1] "GIS is Great!!!"
```

```
 [1] 47 49 53 20 69 73 20 47 72 65 61 74 21 21 21
```

```
  [1] 01 01 01 00 00 00 01 00 01 00 00 01 00 00 01 00 01 01 00 00 01 00 01 00 00 00 00 00 00 01 00 00 01 00 00 01 00 01 01 00 01 01 00 00 01 01 01 00 00 00 00 00 00
 [54] 01 00 00 01 01 01 00 00 00 01 00 00 01 00 00 01 01 01 00 01 00 01 00 00 01 01 00 01 00 00 00 00 01 01 00 00 00 01 00 01 01 01 00 01 00 00 00 00 01 00 00 01 00
[107] 00 00 00 01 00 00 01 00 00 00 00 01 00 00
```
]

---
count: false

# Bytes - Raw - Human Readable...

.panel1-bytes-auto[
```r
(x <- "GIS is Great!!!")
(y <- charToRaw(x))
(z <- rawToBits(y))
*length(z)
```
]

.panel2-bytes-auto[
```
[1] "GIS is Great!!!"
```

```
 [1] 47 49 53 20 69 73 20 47 72 65 61 74 21 21 21
```

```
  [1] 01 01 01 00 00 00 01 00 01 00 00 01 00 00 01 00 01 01 00 00 01 00 01 00 00 00 00 00 00 01 00 00 01 00 00 01 00 01 01 00 01 01 00 00 01 01 01 00 00 00 00 00 00
 [54] 01 00 00 01 01 01 00 00 00 01 00 00 01 00 00 01 01 01 00 01 00 01 00 00 01 01 00 01 00 00 00 00 01 01 00 00 00 01 00 01 01 01 00 01 00 00 00 00 01 00 00 01 00
[107] 00 00 00 01 00 00 01 00 00 00 00 01 00 00
```

```
[1] 120
```
]

---
count: false

# Bytes - Raw - Human Readable...

.panel1-bytes-auto[
```r
(x <- "GIS is Great!!!")
(y <- charToRaw(x))
(z <- rawToBits(y))
length(z)
*nchar(x)
```
]

.panel2-bytes-auto[
```
[1] "GIS is Great!!!"
```

```
 [1] 47 49 53 20 69 73 20 47 72 65 61 74 21 21 21
```

```
  [1] 01 01 01 00 00 00 01 00 01 00 00 01 00 00 01 00 01 01 00 00 01 00 01 00 00 00 00 00 00 01 00 00 01 00 00 01 00 01 01 00 01 01 00 00 01 01 01 00 00 00 00 00 00
 [54] 01 00 00 01 01 01 00 00 00 01 00 00 01 00 00 01 01 01 00 01 00 01 00 00 01 01 00 01 00 00 00 00 01 01 00 00 00 01 00 01 01 01 00 01 00 00 00 00 01 00 00 01 00
[107] 00 00 00 01 00 00 01 00 00 00 00 01 00 00
```

```
[1] 120
```

```
[1] 15
```
]

---
count: false

# Bytes - Raw - Human Readable...
.panel1-bytes-auto[
```r
(x <- "GIS is Great!!!")
(y <- charToRaw(x))
(z <- rawToBits(y))
length(z)
nchar(x)
*nchar(x) == (length(z)/8)
```
]

.panel2-bytes-auto[
```
[1] "GIS is Great!!!"
```

```
 [1] 47 49 53 20 69 73 20 47 72 65 61 74 21 21 21
```

```
  [1] 01 01 01 00 00 00 01 00 01 00 00 01 00 00 01 00 01 01 00 00 01 00 01 00 00 00 00 00 00 01 00 00 01 00 00 01 00 01 01 00 01 01 00 00 01 01 01 00 00 00 00 00 00
 [54] 01 00 00 01 01 01 00 00 00 01 00 00 01 00 00 01 01 01 00 01 00 01 00 00 01 01 00 01 00 00 00 00 01 01 00 00 00 01 00 01 01 01 00 01 00 00 00 00 01 00 00 01 00
[107] 00 00 00 01 00 00 01 00 00 00 00 01 00 00
```

```
[1] 120
```

```
[1] 15
```

```
[1] TRUE
```
]

<style>
.panel1-bytes-auto {
  color: black;
  width: 29.1%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel2-bytes-auto {
  color: black;
  width: 67.9%;
  hight: 32%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
.panel3-bytes-auto {
  color: black;
  width: 0%;
  hight: 33%;
  float: left;
  padding-left: 1%;
  font-size: 80%
}
</style>

---
class: center, middle

<img src="lec-img/03-units-of-data.jpg" width="50%">

---
class: center, middle

<img src="lec-img/03-new-mac-hd.png" width="75%">

---
class: center, middle

# File Storage

<img src="lec-img/03-disk-sectors.png" width="50%">

---
class: center, middle

# Defragmentation

<img src="lec-img/03-defrag.jpg">

---

**File**: files save data on the hard drive as bytes in a meaningful way.

<center>
<i class="fas fa-file fa-5x"></i>
</center>

A file has three key properties:

--

- an _extension_ (how, and with what program, the format is read)

--

- a _path_ (a location in the file system)

--

- a _name_ (a machine- and human-interpretable address)

---

# Extensions

- All files store bits.

--

- There are thousands of different formats for data

--

- Each format defines how the sequence of bits and bytes is laid out

--

- Extensions indicate the characteristics of the file, its intended use, and the default applications that can open/use the file.

--

- If you double-click a .docx file, it opens in Word, and the Word software interprets the meaning of the bytes

--

- If you double-click an .R file, it opens in RStudio, and RStudio interprets the meaning of the bytes

--

- Extensions can be considered a type of metadata that provides information about the way data might be stored

--

- Common extensions include .doc, .docx, .txt, .csv, .ppt, .shp, .R ...

<br>

<center>
<i class="fas fa-file-code fa-5x"></i>
<i class="fas fa-file-word fa-5x"></i>
<i class="fas fa-file-pdf fa-5x"></i>
<i class="fas fa-file-image fa-5x"></i>
<i class="fas fa-file-video fa-5x"></i>
<i class="fas fa-file-csv fa-5x"></i>
</center>

---

**Filesystem**: the methods that an operating system uses to organize files

<center>
</center>

---

# ...

**Directory**: a location for storing, organizing, and separating files and other directories on a computer. Think of folders!

<center>
<i class="fas fa-folder fa-5x"></i>
<i class="fas fa-folder-open fa-5x"></i>
</center>

---

**Root directory**: the "highest" or top-level directory in the hierarchy. The root directory contains all other folders/files in the drive or folder.

<center>
<i class="fas fa-home fa-5x"></i>
</center>

- Sometimes referred to as the home directory

---

# File Paths

- File paths tell us the location of a file within the file system

--

- Directories are organized as a hierarchy, with the root (home) directory holding everything on the system

--

![](lec-img/03-file-system.png)

- The folder you are in is called your **working directory** (think `pwd`).
- The folder above
the working directory is the **parent directory**
- All folders within the working directory are **subfolders** or **child** folders

---

# Declaring File Paths

Files are located by their path. Think of the path as directions to the file, starting from the root directory.

- Directories are separated with backslashes ("\") on Windows, and forward slashes ("/") on macOS and Linux machines.

## Example:

<span style=" color: red !important;font-size: 24px;" >/Users/mikejohnson/github/spds/lectures/lec-img/02-isaias-track.png</span>

- **Root**: /Users/mikejohnson

```bash
cd ~
pwd
```

```
/Users/mikejohnson
```

- **git-enabled projects**: github
- **project**: spds
- **subdirectory**: lectures
- **subdirectory**: lec-img
- **file**: 02-isaias-track.png

---

# Absolute vs Relative Paths

There are two ways to specify a file path.

- An absolute path always begins in the **root folder**
- A relative path is relative to the current working directory
- The dot (.) and dot-dot (..) notation helps us write shorter paths
  - A single '.' denotes “this directory”.
  - Two periods (“..”) mean “the parent directory”

<center>
<img src="lec-img/03-realtive-absolute-paths.jpg" width="60%">
</center>

---
class: inverse, middle, center

## Working with absolute paths can be a pain compared to relative paths...

---

# Enforcing Relative Paths ...

It is good practice to keep all the files associated with a project (input data, R scripts, analytic results, figures) together.

--

This is such a common practice that RStudio has built-in support for it via **projects**.

--

A good project layout will ultimately make your life easier:

--

- It will help ensure the integrity of your data;
- It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
- It allows you to easily upload your code with your manuscript submission;
- It makes it easier to pick the project back up after a break.
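
---

# Relative Paths in Practice

A minimal sketch of the idea above, not taken from the lecture's own code. It builds a throwaway project inside a temporary folder (the names `my_project` and `data.csv` are placeholders), treats that folder as the working directory the way an opened project would, and reads a file through a relative path. Only base R functions are used.

```r
# Build a throwaway project in a temporary directory (placeholder names)
project <- file.path(tempdir(), "my_project")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(id = 1:3),
          file.path(project, "data", "data.csv"),
          row.names = FALSE)

# Pretend we opened the project: the working directory becomes the project folder
old_wd <- setwd(project)

# A relative path is resolved against the working directory ...
rel_path <- file.path("data", "data.csv")
found <- file.exists(rel_path)

# ... and normalizePath() expands it to the equivalent absolute path
abs_path <- normalizePath(rel_path)

# "." is this directory, so it resolves to the project folder itself
in_project <- normalizePath(".") == normalizePath(project)

# Read the file without ever typing the full absolute path
rows <- nrow(read.csv(rel_path))

# Restore the original working directory
setwd(old_wd)
```

Because every path in the script is relative to the project folder, the same code should keep working if the whole folder is moved or shared.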

---
class: middle, center

# Things that are good but not fun

<img src="lec-img/03-laundry.jpeg" width="100%">

---

# Code Organization

## Typical Project

```r
.
└── github
    └── my_project
        ├── my_project.Rproj
        ├── .gitignore
        ├── README.md
        ├── R
        │   ├── some-code.R
        │   └── utils.R
        ├── img
        │   └── cool-img.png
        ├── docs
        │   ├── index.Rmd
        │   └── index.html
        └── data
            └── data.csv
```

---

# my_project

- This is a project called "my_project"

--

- The name of the project is the same as the directory

--

  - <i class="fa fa-ban" aria-hidden="true"></i> spaces

--

  - <i class="fa fa-ban" aria-hidden="true"></i> special characters

--

- Each project is a directory containing files relevant to that project

--

- All git-enabled projects can go in our `~/github` directory

---

# Creating an R Project

- RStudio can help us create a new R project.

- An RStudio Project file (.Rproj) is analogous to an .mxd file for ArcMap. It contains information about the specific settings for a “project”.

--

- File --> New Project

--

Everything we do in R should be version controlled (with git!). When making a new project:

--

- File --> New Project --> New Directory --> New Project

--

- Create the project as a subfolder of the `~/github` directory

--

- <i class="fas fa-check-square"></i> create git repository

--

```r
.
└── github
    └── my_project
        ├── my_project.Rproj
        └── .gitignore
```

A "." at the start of a file name (as in `.gitignore`) marks it as a "hidden" file.

---

# R Project

- An R project is a working directory designated with a .Rproj file.

```r
.
└── github
    └── my_project
        ├── my_project.Rproj
        └── .gitignore
```

--

- When you open a project:
  - In RStudio: File --> Open Project
  - Outside RStudio: double-click the .Rproj file

  the working directory is automatically set to the directory where the .Rproj file is located!

--

- This allows you to work with relative rather than absolute paths!

--

- Consider creating a new R Project whenever you are starting a new project.

--

- This will enforce a self-contained project with associated data, scripts, and output

---

# Building the rest of the Project...

---

# README.md

README files are the "user's manual" for the project:

- the project name
- its purpose
- installation directions
- rules of use

We use the md extension (markdown) because GitHub renders Markdown READMEs on the repository's main page.

For us, a title, a 1-2 sentence description, and data attribution is plenty.

```bash
touch README.md
```

```r
.
└── github
    └── my_project
        ├── my_project.Rproj
        ├── .gitignore
        └── README.md
```

---

# R (or src)

- A directory called R (or src) is used to hold all scripts used in the analysis.

- These can be data processing, analysis, or figure generation scripts

```bash
mkdir R           # make an R directory
cd R              # enter the R directory
touch some-code.R # make a file
touch utils.R     # make a file
cd ..             # move back up to the my_project directory
```

```r
.
└── github
    └── my_project
        ├── my_project.Rproj
        ├── .gitignore
        ├── README.md
        └── R
            ├── some-code.R
            └── utils.R
```

---

# img (or imgs, figs, or output)

This folder is for things that are saved as a result of your scripts

- Plot images
- Maps
- Etc.

This directory only contains generated files; that is, you should always be able to delete the contents and regenerate them.

```bash
mkdir img # make an img directory
```

```r
.
└── github
    └── my_project
        ├── my_project.Rproj
        ├── .gitignore
        ├── README.md
        ├── R
        │   ├── some-code.R
        │   └── utils.R
        └── img
            └── cool-img.png
```

---

# docs (only docs)

- The docs folder should hold your R Markdown files and their rendered output

- GitHub Pages can be deployed from the docs folder, making this a good practice if you want to share information over the web in a free, secure way

```bash
mkdir docs      # make a docs directory
cd docs         # enter the docs directory
touch index.Rmd # make a file
cd ..           # move back up to the my_project directory
```

```r
.
└── github
    └── my_project
        ├── my_project.Rproj
        ├── .gitignore
        ├── README.md
        ├── R
        │   ├── some-code.R
        │   └── utils.R
        ├── img
        │   └── cool-img.png
        └── docs
            ├── index.Rmd
            └── index.html
```

---

# data

- The data folder is a storage archive for raw data

- It's **crucial** to distinguish between source/raw data and generated data:
  - Treat source/raw data as **read-only**
  - Treat generated data as disposable

- Some people separate raw and generated data into separate subdirectories. I prefer to distinguish them through file naming.

```bash
mkdir data # make a data directory
```

```r
.
└── github
    └── my_project
        ├── my_project.Rproj
        ├── .gitignore
        ├── README.md
        ├── R
        │   ├── some-code.R
        │   └── utils.R
        ├── img
        │   └── cool-img.png
        ├── docs
        │   ├── index.Rmd
        │   └── index.html
        └── data
            └── data.csv
```

---

# Rules...

### 1. Treat data as read-only

### 2. Treat generated output as disposable

### 3. Other than that, structure should match the project goals and is flexible!

<img src="lec-img/03-rules.jpeg" width="100%">

---

# The goal for workflows:

We will do everything in **well-annotated**, **organized** scripts that contain streamlined and easy-to-follow records of our entire analyses from **raw data** through **final reports**, with **unbreakable** file paths and with a **complete history** of changes made ^[Allison Horst].

--

- **Well-annotated**: Through documentation and comments

--

- **Organized**: Directory structure

--

- **Raw Data**: Keep raw data raw!

--

- **Final Reports**: R Markdown files

--

- **Unbreakable Paths**: Rproj to the rescue

--

- **Complete History**: Version control with git and GitHub

---
class: inverse, middle, center

# File Names

File names are how we locate and identify information stored on a machine. File names should be machine readable, human readable, and play well with default ordering:

---
class: middle, center

## "There are only two hard things in Computer Science: cache invalidation and naming things."
-- Phil Karlton

---

# Three Principles for file names

- machine readable
- human readable
- sortable

---

## machine readable

--

- avoid spaces, punctuation, accented characters, and mixed cases

--

- regular expression friendly (e.g. use patterns!)

--

- use ISO 8601 dates (YYYY-MM-DD)

--

- be consistent with delimiters (easy to compute on)
  - Use "_" (underscores) to separate "metadata" you want later
  - Use "-" (hyphens) to separate words for readability (like dates or names)

---

## Doing so makes files ... easy to search for & narrow

```r
# Only those files with pattern "_tx"
files = list.files("data/usgs-files", pattern = "_tx")
length(files) # total number of Texas files
```

```
[1] 27
```

```
[1] "1903-07-01_08033500_00060_tyler_tx.txt"    "1923-10-01_08033000_00060_angelina_tx.txt"
```

### and metadata easy to recover (easy to compute on)

```r
library(stringr) # provides str_split()
str_split(files, "[_\\.]", simplify = TRUE)
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> StartDate </th>
   <th style="text-align:left;"> siteID </th>
   <th style="text-align:left;"> parameterCode </th>
   <th style="text-align:left;"> county </th>
   <th style="text-align:left;"> state </th>
   <th style="text-align:left;"> extension </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 1903-07-01 </td>
   <td style="text-align:left;"> 08033500 </td>
   <td style="text-align:left;"> 00060 </td>
   <td style="text-align:left;"> tyler </td>
   <td style="text-align:left;"> tx </td>
   <td style="text-align:left;"> txt </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1923-10-01 </td>
   <td style="text-align:left;"> 08033000 </td>
   <td style="text-align:left;"> 00060 </td>
   <td style="text-align:left;"> angelina </td>
   <td style="text-align:left;"> tx </td>
   <td style="text-align:left;"> txt </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1923-10-01 </td>
   <td style="text-align:left;"> 08180500 </td>
   <td style="text-align:left;"> 00060 </td>
   <td style="text-align:left;"> medina </td>
   <td style="text-align:left;"> tx </td>
   <td style="text-align:left;"> txt </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1923-12-01 </td>
   <td style="text-align:left;"> 08082500 </td>
   <td style="text-align:left;"> 00060 </td>
   <td style="text-align:left;"> baylor </td>
   <td style="text-align:left;"> tx </td>
   <td style="text-align:left;"> txt </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 1924-08-01 </td>
   <td style="text-align:left;"> 08062500 </td>
   <td style="text-align:left;"> 00060 </td>
   <td style="text-align:left;"> ellis </td>
   <td style="text-align:left;"> tx </td>
   <td style="text-align:left;"> txt </td>
  </tr>
</tbody>
</table>

---

## human readable

- File names contain information about the content and purpose of the file
- This makes it easy to find the right file a year from now

```r
list.files("R/code-files")
```

```
character(0)
```

Here, the order in which the files run (utilities, download, clean, analyze, figures) and the project they belong to (src) are both recorded in the file names.

---

## Sort easily

- put numeric values first (use a leading 0 for 1-9)
- use ISO 8601 dates (YYYY-MM-DD)

### Chronological

```r
library(magrittr) # provides %>%
list.files("data/usgs-files")[1:5] %>% sort()
```

```
[1] "1873-06-02_05420500_00060_clinton_ia.txt"  "1877-04-01_03193000_00060_fayette_wv.txt"  "1887-10-01_01335754_00060_saratoga_ny.txt"
[4] "1892-11-01_14174000_00060_linn_or.txt"     "1895-03-01_01551500_00060_lycoming_pa.txt"
```

### Logical

```r
list.files("R/code-files") %>% sort()
```

```
character(0)
```

---
class: center, middle

# Now we have a project, a project structure, and well-named files... Let's get into R

![](lec-img/03-R-full.jpeg)

---

# Variables

- Variables store data (values) (my.school = "UCSB")

--

- Values can be changed according to our need. (my.school = "Cal Poly")

--

- A variable provides us with **named** storage that our programs can manipulate.

--

- Variables have a human-readable **name** (a unique identifier)

--

- An operable value

--

- A location in memory where it is stored

--

## So how do we define variables?
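
---

# Variables: A Quick Sketch

As a hedged first look before the details, the sketch below reuses the `my.school` example from the previous slide. The name `first.value` is introduced here purely for illustration, and `<-` is one of R's assignment operators.

```r
# Store a value under a human-readable name
my.school <- "UCSB"

# Keep a second name bound to the value as it is right now
first.value <- my.school

# Values can be changed according to our need: rebind the same name
my.school <- "Cal Poly"

print(my.school)   # [1] "Cal Poly"
print(first.value) # [1] "UCSB"
```

Rebinding `my.school` does not touch `first.value`, which still holds the original value.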

---

# Variable Names & Values

- The variable `name` is arbitrary and helps reference `values`.

--

- Names are used by the reader of the program

--

- Variable `values` are "bound" to a `name` using the `=` or `<-` assignment operators

```r
a = 3
a <- 3
```

--

The result is that the value 3 is bound to the name "a", and R can interpret the name as the object/value it holds.

```r
3*a
```

```
[1] 9
```

```r
a = 5
rep("TEXT", a)
```

```
[1] "TEXT" "TEXT" "TEXT" "TEXT" "TEXT"
```

---

### Binding 101

It is easy to read this statement as "create an object, named x, containing the value 10"

```r
x = 10
```

--

But this is a simplification. In actuality:

--

- It's creating an object of value 10

--

- And binding that object to a name 'x'

--

- Therefore the value (10) does not have a name; rather, the name (x) has a value

---

### Object Address

- Objects have unique identifiers.

- These identifiers have a form that looks like the object’s memory “address”

- The actual memory addresses change every time the code is run, so we use these identifiers instead.

```r
library(lobstr) # provides obj_addr()
x = 10
obj_addr(x)
```

```
[1] "0x7faaccecc3c0"
```

---

## To illustrate this...

In the code below, `y` doesn't make another copy of the value `10`, but instead creates an additional binding to the existing object.

```r
x = 10
y = x
obj_addr(x)
```

```
[1] "0x7faad9239020"
```

```r
obj_addr(y)
```

```
[1] "0x7faad9239020"
```

---

Equally, if we create two unique objects (even with the same value), they are different:

```r
x2 = 10; y2 = 10
obj_addr(x2)
```

```
[1] "0x7faad362a848"
```

```r
obj_addr(y2)
```

```
[1] "0x7faad362a8b8"
```

This is because the values are stored as bytes in memory rather than on the hard disk!

---
class: center, middle

<img src="lec-img/03-new-mac-mem.png" width="75%">

---
class: center, middle

<img src="lec-img/03-memory-storage.jpg">

---

### So what can we do with objects?

Remember our school example <i class='fas fa-university'></i> ?
We wanted to store information about the school as named values:

```r
my.school = "UCSB"
lat = 34.4140
lng = -119.8489
```

--

But these are very different kinds of information with defined capabilities.

--

What would happen if we tried to add `lng` to `lat`?

```r
lng + lat
```

```
[1] -85.4349
```

--

What would happen if we tried to add `lng` to `my.school`?

```r
lng + my.school
```

```
Error in lng + my.school: non-numeric argument to binary operator
```

We see a `non-numeric` argument error telling us that `my.school` is not a numeric value. This is our first hint that values have different classes/types.

---

# Also ...

```r
charToRaw("3")
```

```
[1] 33
```

```r
charToRaw(3)
```

```
Error in charToRaw(3): argument must be a character vector of length 1
```

---

# Assignment

- Today you are going to make your first project.

- Starting from GitHub, you are going to make a repository and clone it in RStudio

- Then use your knowledge of a good project structure and the Terminal to create an empty project structure.

- You will then push this project back to GitHub and submit it via URL

<center>
# [Details](assignment-03.html)
</center>

---
class: center, middle

# END