Geography 13

# Geography 13
## Lecture 03: The Digital Environment
### Mike Johnson

---
class: middle, center

## Today

Files

Naming Things

Data Types

---

## Managing your projects in a reproducible way doesn't just make your science better; it also makes your life easier
---
# Your Computer

**Computer**: an electronic device for _storing_ and _processing_ data, typically in binary form, according to instructions

**File**: a block of arbitrary information available to a computer program

---

# File Storage

- Files are stored as a collection of bytes on a hard drive

- Hard drives do not understand files; they just store bytes

--
 
- We need ways to retrieve (and write) **bytes** from (to) the hard drive.

---
# Bytes?

But what is a byte?

- a group of binary digits or bits (usually eight) operated on as a unit.
- a byte considered as a unit of memory size.

- The bit is a basic unit of information in computing and represents a logical state with two possible values (0 or 1).
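
As a small sketch of these definitions: a byte of eight bits can take 2^8 = 256 distinct values, and a plain ASCII character is stored as one such value. The character "G" is only an illustrative choice.

```r
# One byte = 8 bits, so it can represent 2^8 distinct values (0 to 255)
n_values <- 2^8
n_values

# The character "G" is stored as a single byte
g_raw <- charToRaw("G")      # printed in hexadecimal: 47
g_int <- as.integer(g_raw)   # the same byte as a decimal number: 71

# The eight bits of that byte (R lists the least significant bit first)
g_bits <- as.integer(rawToBits(g_raw))
g_bits

# Going the other way: turn the byte back into a character
rawToChar(as.raw(71))
```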

---

# [Example](https://www.rapidtables.com/convert/number/ascii-hex-bin-dec-converter.html)

---
count: false
 
#Bytes - Raw - Human Readable...
.panel1-bytes-auto[

```r
*(x <- "GIS is Great!!!")
```
]
 
.panel2-bytes-auto[

```
[1] "GIS is Great!!!"
```
]

---
count: false
 
#Bytes - Raw - Human Readable...
.panel1-bytes-auto[

```r
(x <- "GIS is Great!!!")

*(y <- charToRaw(x))
```
]
 
.panel2-bytes-auto[

```
[1] "GIS is Great!!!"
```

```
 [1] 47 49 53 20 69 73 20 47 72 65 61 74 21 21 21
```
]

---
count: false
 
#Bytes - Raw - Human Readable...
.panel1-bytes-auto[

```r
(x <- "GIS is Great!!!")

(y <- charToRaw(x))

*(z <- rawToBits(y))
```
]
 
.panel2-bytes-auto[

```
[1] "GIS is Great!!!"
```

```
 [1] 47 49 53 20 69 73 20 47 72 65 61 74 21 21 21
```

```
  [1] 01 01 01 00 00 00 01 00 01 00 00 01 00 00 01 00 01 01 00 00 01 00 01 00 00 00 00 00 00 01 00 00 01 00 00 01 00 01 01 00 01 01 00 00 01 01 01 00 00 00 00 00 00
 [54] 01 00 00 01 01 01 00 00 00 01 00 00 01 00 00 01 01 01 00 01 00 01 00 00 01 01 00 01 00 00 00 00 01 01 00 00 00 01 00 01 01 01 00 01 00 00 00 00 01 00 00 01 00
[107] 00 00 00 01 00 00 01 00 00 00 00 01 00 00
```
]

---
count: false
 
#Bytes - Raw - Human Readable...
.panel1-bytes-auto[

```r
(x <- "GIS is Great!!!")

(y <- charToRaw(x))

(z <- rawToBits(y))

*length(z)
```
]
 
.panel2-bytes-auto[

```
[1] "GIS is Great!!!"
```

```
 [1] 47 49 53 20 69 73 20 47 72 65 61 74 21 21 21
```

```
[1] 120
```
]

---
count: false
 
#Bytes - Raw - Human Readable...
.panel1-bytes-auto[

```r
(x <- "GIS is Great!!!")

(y <- charToRaw(x))

(z <- rawToBits(y))

length(z)

*nchar(x)
```
]
 
.panel2-bytes-auto[

```
[1] "GIS is Great!!!"
```

```
 [1] 47 49 53 20 69 73 20 47 72 65 61 74 21 21 21
```

```
[1] 120
```

```
[1] 15
```
]

---
count: false
 
#Bytes - Raw - Human Readable...
.panel1-bytes-auto[

```r
(x <- "GIS is Great!!!")

(y <- charToRaw(x))

(z <- rawToBits(y))

length(z)

nchar(x)

*nchar(x) == (length(z)/8)
```
]
 
.panel2-bytes-auto[

```
[1] "GIS is Great!!!"
```

```
 [1] 47 49 53 20 69 73 20 47 72 65 61 74 21 21 21
```

```
[1] 120
```

```
[1] 15
```

```
[1] TRUE
```
]

---
class: center, middle
<img src="lec-img/03-units-of-data.jpg" width="50%">
---
class: center, middle
<img src="lec-img/03-new-mac-hd.png" width="75%">
---
class: center, middle
# File Storage
<img src="lec-img/03-disk-sectors.png" width="50%">
---
class: center, middle
# Defragmentation
<img src="lec-img/03-defrag.jpg">
---

**File**: files save data on the hard drive as bytes in a meaningful way.

A file has three key properties:

- an _extension_ (how to/what program reads the format)

- a _path_ (a location in the file system)

- a _name_ (machine and human interpretable address)
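
A minimal sketch of pulling these three properties out of a path in R. The path is hypothetical and the file does not need to exist, since these functions only work on the text of the path.

```r
# A hypothetical file path used only for illustration
path <- "github/my_project/data/data.csv"

file_dir  <- dirname(path)           # the path to the containing directory
file_name <- basename(path)          # the file name, including its extension
file_ext  <- tools::file_ext(path)   # the extension, without the dot

# The name with the extension removed
file_stem <- tools::file_path_sans_ext(file_name)

c(dir = file_dir, name = file_name, ext = file_ext, stem = file_stem)
```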

---

# Extensions

- All files store bits.

- There are thousands of different data formats

- Each format defines how the sequence of bits and bytes is laid out

- Extensions indicate the characteristics of the file, its intended use, and the default applications that can open/use the file.

- If you double click a .docx file it opens in Word and the Word software interprets the meaning of the bytes
 
--
 
 - If you double click an .R file it opens with RStudio, and RStudio interprets the meaning of the bytes

--
 
- An extension can be considered a type of metadata that provides information about the way data might be stored

- common extensions include .doc, .docx, .txt, .csv, .ppt, .shp, .R ...

---

**Filesystem**: describes the methods that an operating system uses to organize files

---
# ...

**Directory**: a location for storing, organizing, and separating files and other directories on a computer. Think of folders!

---

**Root directory**: the "highest" or top-level directory in the hierarchy. The root directory contains all other folders/files in the drive or folder

- On macOS and Linux the root is `/`. It is often confused with the home directory (`~`), which holds a single user's files
---

# File Paths

- File paths tell us the location of a file within the file system
--

- Directories are stored as hierarchies, with the root directory being the one holding everything on a system

![](lec-img/03-file-system.png)

- The folder you are in is called your **working directory** (think `pwd`)
- The folder above the working directory is the **parent directory**
- All folders within the working directory are **subfolders** or **child** folders
---

# Declaring File Paths

Files are located by their path. Think of a path as the directions to a file, starting from the root directory.

- Directories are separated with backslashes ("\") on Windows, and forward slashes ("/") on macOS and Linux machines.
 
## Example:
 
/Users/mikejohnson/github/spds/lectures/lec-img/02-isaias-track.png

- **Home directory** (`~`): /Users/mikejohnson

```bash
cd ~
pwd
```

```
/Users/mikejohnson
```

- **git-enabled projects**: github
- **project**: spds
- **subdirectory**: lectures
- **subdirectory**: lec-img
- **file**: 02-isaias-track.png
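
As a rough sketch, the same path can be assembled in R with `file.path()`, which joins directory names with forward slashes (R accepts these on Windows as well). `path.expand()` then replaces the leading `~` with the current user's home directory, so the final result will differ from machine to machine.

```r
# Build the path below the ~ directory, one level at a time
rel_path <- file.path("github", "spds", "lectures", "lec-img",
                      "02-isaias-track.png")
rel_path

# path.expand() swaps a leading "~" for the current user's home directory
full_path <- path.expand(file.path("~", rel_path))
full_path
```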

---

# Absolute vs Relative Paths

There are two ways to specify a file path.

- An absolute path always begins in the **root folder**

- A relative path is relative to the current working directory

- The dot (.) and dot-dot (..) notation helps us write shorter paths

- A single '.' denotes “this directory”.

- Two periods (“..”) means “the parent directory”
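
A minimal sketch of how R resolves the dot notation. `normalizePath()` converts a relative path into an absolute one; the values printed depend on wherever your own working directory happens to be.

```r
# The working directory is the starting point for every relative path
wd <- getwd()

# "." is the working directory itself; ".." is its parent
here_abs   <- normalizePath(".")
parent_abs <- normalizePath("..")

# Both relative paths resolve to absolute paths
c(here = here_abs, parent = parent_abs)
```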

---
class: inverse, middle, center

## Working with absolute paths can be a pain compared to relative paths...
---

# Enforcing Relative Paths ...

It is a good practice to keep all the files associated with a project (input data, R scripts, analytic results, figures) together.

This is such a common practice that RStudio has built-in support for this via **projects**.

A good project layout will ultimately make your life easier:

- It will help ensure the integrity of your data;
- It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
- It allows you to easily upload your code with your manuscript submission;
- It makes it easier to pick the project back up after a break.

---
class: middle, center
# Things that are good but not fun

<img src="lec-img/03-laundry.jpeg" width="100%">
---

# Code Organization

## Typical Project

```r
.
└── github
    └── my_project
        ├── my_project.Rproj
        ├── .gitignore
        ├── README.md
        ├── R
        │   ├── some-code.R
        │   └── utils.R
        ├── img
        │   └── cool-img.png
        ├── docs
        │   ├── index.Rmd
        │   └── index.html
        └── data
            └── data.csv
```

---

# my_project

- This is a project called "my_project"

- The name of the project is the same as the directory

- The name contains no spaces
--

- The name contains no special characters
 
--

- Each project is a directory containing files relevant to that project

- All git-enabled projects can go in our `~/github` directory

---

# Creating an R Project

- RStudio can help us create a new R project.
- An RStudio Project file (.Rproj) is analogous to an .mxd file for ArcMap. It contains information about the specific settings for a “project”.

- File --> New Project
 
--
 
Everything we do in R should be version controlled (with git!). When making a new project:

- File --> New Project --> New Directory --> New Project

- Create the project as a subfolder of the `~/github` directory

- Create a git repository
 
--

```r
.
└── github
    └── my_project
        └── my_project.Rproj
        └── .gitignore
```

The "." in front of a file name denotes it is a "hidden" file.
---

# R Project

- An R project is a working directory designated with a .Rproj file.

```r
.
└── github
    └── my_project
        └── my_project.Rproj
        └── .gitignore
```

- When you open a project:
   - In RStudio: File --> Open Project 
   - Outside RStudio: double-click the .Rproj file 
  
  the working directory is automatically set to the directory where the .Rproj file is located!
  
--

- This allows you to work with relative rather than absolute paths!

- Consider creating a new R Project whenever you are starting a new project.

- This will enforce a self-contained project with associated data, scripts, and output
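
A small sketch of why this matters. The project folder here is a stand-in created in a temporary directory, and `setwd()` is used only to imitate what opening the .Rproj file does for you; in a real project you would not call it yourself.

```r
# Stand-in for a project folder; in practice this is where the .Rproj file lives
project <- file.path(tempdir(), "my_project")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(id = 1:3),
          file.path(project, "data", "data.csv"),
          row.names = FALSE)

# Opening the .Rproj file sets the working directory for us;
# setwd() is used here only to imitate that step
old_wd <- setwd(project)

# The relative path works no matter where my_project sits on the computer
dat <- read.csv("data/data.csv")
nrow(dat)

# Restore the previous working directory
setwd(old_wd)
```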

---

# Building the rest of the Project...

---
# README.md

README files are the "user's manual" for the project:

- name
- purpose
- installation directions
- rules of use

We use the md (Markdown) extension because GitHub renders Markdown README files on the repository page.

For us, a title, 1-2 sentence description and data attribution is plenty.

```bash
touch README.md
```

```r
.
└── github
    └── my_project
        └── my_project.Rproj
        └── .gitignore
        └── README.md
```
---
# R (or src)

- A directory called R (or src) is used to hold all scripts used in the analysis. 
- These can be data processing, analysis, or figure generation scripts

```bash
mkdir R # make an R directory
cd R # enter the R directory
touch some-code.R # make a file
touch utils.R # make a file
cd .. # move back up to the my_project directory
```

```r
.
└── github
    └── my_project
        └── my_project.Rproj
        └── .gitignore
        └── README.md
        ├── R
            └── some-code.R
            └── utils.R
```
---
# img (or imgs or figs or output)

This folder is for things that are saved as a result of your scripts:
- Plot images
- Maps
- Etc.

This directory only contains generated files; that is, you should always be able to delete the contents and regenerate them.

```bash
mkdir img # make an img directory
```

```r
.
└── github
    └── my_project
        └── my_project.Rproj
        └── .gitignore
        └── README.md
        ├── R
        │   └── some-code.R
        |   └── utils.R
        ├── img
            └── cool-img.png
```
---
# docs (only docs)

- The docs folder should hold your R Markdown files and their rendered output
- GitHub Pages can be deployed from the docs folder, making this a good practice if you want to share information over the web in a free, secure way

```bash
mkdir docs # make a docs directory
cd docs # enter the docs directory
touch index.Rmd # make a file
cd .. # move back up to the my_project directory
```

```r
.
└── github
    └── my_project
        └── my_project.Rproj
        └── .gitignore
        └── README.md
        ├── R
        │   └── some-code.R
        |   └── utils.R
        ├── img
        |   └── cool-img.png
        ├── docs
            └── index.Rmd
            └── index.html
```
---

# data

- The data folder is a storage archive for raw data
- It's **crucial** to make a distinction between source/raw data and generated data:

  - Treat source/raw data as **read-only**
  - Treat generated data as disposable

- Some might separate raw and generated data into separate subdirectories. I prefer to distinguish them through file naming

```bash
mkdir data # make a data directory
```

```r
.
└── github
    └── my_project
        └── my_project.Rproj
        └── .gitignore
        └── README.md
        ├── R
        │   └── some-code.R
        |   └── utils.R
        ├── img
        |   └── cool-img.png
        ├── docs
        |   └── index.Rmd
        |   └── index.html
        ├── data
            └── data.csv
```
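
The Terminal commands from the previous slides can also be issued from R. The sketch below builds the same skeleton with `dir.create()` and `file.create()`; it works inside a temporary directory (an assumption made so the example is safe to run anywhere), whereas a real project would live under `~/github`.

```r
# Stand-in location for the project; a real one would sit under ~/github
project <- file.path(tempdir(), "my_project_skeleton")

# Create the project folder and its four subdirectories
dirs <- file.path(project, c("R", "img", "docs", "data"))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# Create empty placeholder files (the R equivalent of `touch`)
files <- file.path(project, c("README.md",
                              "R/some-code.R",
                              "R/utils.R",
                              "docs/index.Rmd"))
created <- file.create(files)

# List everything that now exists in the project
list.files(project, recursive = TRUE)
```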

---

# Rules...

### 1. Treat data as read-only
### 2. Treat generated output as disposable
### 3. Other than that, the structure is flexible and should match the project goals!

<img src="lec-img/03-rules.jpeg" width="100%">
---

# The goal for workflows:

We will do everything in **well-annotated**, **organized** scripts that contain streamlined and easy-to-follow records of our entire analyses from **raw data** through **final reports**, with **unbreakable** file paths and with a **complete history** of changes made ^[Allison Horst].

- **Well-annotated**: Through documentation and comments

- **Organized**: Directory Structure

- **Raw Data**: Keep raw data raw!

- **Final Reports**: Rmarkdown files

- **Unbreakable Paths**: Rproj to the rescue

- **Complete History**: Version control with git and GitHub

---

File names are how we locate and identify information stored on a machine. File names should be machine readable, human readable, and play well with default ordering.

---

## "There are only two hard things in Computer Science: cache invalidation and naming things."

-- Phil Karlton
---

# Three Principles for file names

- machine readable
- human readable
- sortable

---

## machine readable

- avoid spaces, punctuation, accented characters, and mixed cases
  
--
  
  - regular expression friendly (e.g. use patterns!)
  
--

- use ISO 8601 dates (YYYY-MM-DD)

- be consistent with delimiters (easy to compute on)
      - Use "_" (underscores) to separate "metadata" you want later
      - Use "-" (hyphens) to separate words for readability (like dates or names)
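
A minimal sketch of building a file name that follows these rules. The metadata values are illustrative; the point is the pattern of an ISO 8601 date, lower-case text, and underscores between fields.

```r
# Metadata for one (illustrative) file
start_date <- as.Date("1903-07-01")
site_id    <- "08033500"
param_code <- "00060"
county     <- "Tyler"
state      <- "TX"

# format() writes the date as YYYY-MM-DD; tolower() avoids mixed case
fields <- c(format(start_date, "%Y-%m-%d"), site_id, param_code,
            tolower(county), tolower(state))

# Underscores separate the metadata fields; the extension is added last
file_name <- paste0(paste(fields, collapse = "_"), ".txt")
file_name
```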
      
      
---

## Doing so makes files ... easy to search for & narrow

```r
# Only those files with pattern "_tx"
files = list.files("data/usgs-files", pattern = "_tx") 
length(files) # total number of Texas files
```

```
[1] 27
```

```
[1] "1903-07-01_08033500_00060_tyler_tx.txt"    "1923-10-01_08033000_00060_angelina_tx.txt"
```

### and metadata easy to recover (easy to compute on)

```r
library(stringr) # provides str_split()
str_split(files, "[_\\.]", simplify = TRUE) 
```

<table>
 <thead>
 <tr>
 <th style="text-align:left;"> StartDate </th>
 <th style="text-align:left;"> siteID </th>
 <th style="text-align:left;"> parameterCode </th>
 <th style="text-align:left;"> county </th>
 <th style="text-align:left;"> state </th>
 <th style="text-align:left;"> extension </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> 1903-07-01 </td>
 <td style="text-align:left;"> 08033500 </td>
 <td style="text-align:left;"> 00060 </td>
 <td style="text-align:left;"> tyler </td>
 <td style="text-align:left;"> tx </td>
 <td style="text-align:left;"> txt </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 1923-10-01 </td>
 <td style="text-align:left;"> 08033000 </td>
 <td style="text-align:left;"> 00060 </td>
 <td style="text-align:left;"> angelina </td>
 <td style="text-align:left;"> tx </td>
 <td style="text-align:left;"> txt </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 1923-10-01 </td>
 <td style="text-align:left;"> 08180500 </td>
 <td style="text-align:left;"> 00060 </td>
 <td style="text-align:left;"> medina </td>
 <td style="text-align:left;"> tx </td>
 <td style="text-align:left;"> txt </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 1923-12-01 </td>
 <td style="text-align:left;"> 08082500 </td>
 <td style="text-align:left;"> 00060 </td>
 <td style="text-align:left;"> baylor </td>
 <td style="text-align:left;"> tx </td>
 <td style="text-align:left;"> txt </td>
 </tr>
 <tr>
 <td style="text-align:left;"> 1924-08-01 </td>
 <td style="text-align:left;"> 08062500 </td>
 <td style="text-align:left;"> 00060 </td>
 <td style="text-align:left;"> ellis </td>
 <td style="text-align:left;"> tx </td>
 <td style="text-align:left;"> txt </td>
 </tr>
</tbody>
</table>
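
The same recovery can be sketched in base R, without stringr. The two file names are the ones shown above and the column names follow the table; treat the rest as one possible approach rather than the only one.

```r
# Two of the file names shown above
files <- c("1903-07-01_08033500_00060_tyler_tx.txt",
           "1923-10-01_08033000_00060_angelina_tx.txt")

# Split every name on underscores and dots, then stack the pieces row-wise
parts <- do.call(rbind, strsplit(files, "[_.]"))

# Label the columns and convert the date text into real dates
meta <- as.data.frame(parts, stringsAsFactors = FALSE)
names(meta) <- c("StartDate", "siteID", "parameterCode",
                 "county", "state", "extension")
meta$StartDate <- as.Date(meta$StartDate)

meta
```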

---
## human readable

- File names contain information about the content and purpose of the file
 - easy to find the right file a year from now

```r
list.files("R/code-files")
```

```
character(0)
```
 
Here, the file names record the order in which the files run (utilities, download, clean, analyze, figures) and the project they belong to (src).

---

## Sort easily

-  put numeric values first (use leading 0 for 1-9)
-  use ISO 8601 dates (YYYY-MM-DD)

### Chronological

```r
library(magrittr) # provides the %>% pipe
list.files("data/usgs-files")[1:5] %>% 
  sort()
```

```
[1] "1873-06-02_05420500_00060_clinton_ia.txt"  "1877-04-01_03193000_00060_fayette_wv.txt"  "1887-10-01_01335754_00060_saratoga_ny.txt"
[4] "1892-11-01_14174000_00060_linn_or.txt"     "1895-03-01_01551500_00060_lycoming_pa.txt"
```

### Logical

```r
list.files("R/code-files") %>% 
  sort()
```

```
character(0)
```
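
A short sketch of why the leading zero matters. The script names are hypothetical, and `method = "radix"` is used only so the sort order does not depend on the computer's locale settings.

```r
# Hypothetical script names numbered without leading zeros
unpadded <- c("1-download.R", "2-clean.R", "10-figures.R")
sorted_unpadded <- sort(unpadded, method = "radix")
sorted_unpadded   # "10-figures.R" lands before "2-clean.R"

# sprintf("%02d", ...) pads single digits with a leading zero
steps  <- c(1L, 2L, 10L)
labels <- c("download", "clean", "figures")
padded <- sprintf("%02d-%s.R", steps, labels)

sorted_padded <- sort(padded, method = "radix")
sorted_padded     # now listed in the order the scripts run
```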

---
class: center, middle
# Now we have a project, a project structure, and well-named files... Let's get into R

![](lec-img/03-R-full.jpeg)
---

# Variables

- Variables store data (values) (my.school = "UCSB")

- Values can be changed according to our need. (my.school = "Cal Poly")

- A variable provides us with **named** storage that our programs can manipulate.

- Variables have a human readable **name** -- a unique identifier

- An operable value
  
--

- A location in memory where it is stored
  
--

## So how do we define variables?

---

# Variable Names & Values

- The variable `name` is arbitrary and helps reference `values`.

- Names are used by readers of the program

- Variable `values` are "bound" to a `name` using the `=` or `<-` assignment operators

```r
a = 3
a <- 3
```

The result is that the value 3 is bound to the name "a", and R can interpret the name as the object/value it holds.

```r
3*a
```

```
[1] 9
```

```r
a = 5

rep("TEXT", a)
```

```
[1] "TEXT" "TEXT" "TEXT" "TEXT" "TEXT"
```

---

### Binding 101

It is easy to read this statement as "create an object, named x, containing the value 10"

```r
x = 10
```

But this is a simplification, in actuality:

- It's creating an object of value 10
  
--

- And binding that object to a name 'x'
  
--

- Therefore the value (10) does not have a name, rather, the name (x) has a value

---
### Object Address

- Objects have unique identifiers. 
- These identifiers have a form that looks like the object’s memory “address” 
- The actual memory addresses change every time the code is run, so we use these identifiers instead.

```r
library(lobstr) # provides obj_addr()
x = 10
obj_addr(x)
```

```
[1] "0x7faaccecc3c0"
```

---

## To illustrate this...

In the code below, `y` doesn't make another copy of the value `10`, but instead creates an additional binding to the existing object.

```r
x = 10
y = x

obj_addr(x)
```

```
[1] "0x7faad9239020"
```

```r
obj_addr(y)
```

```
[1] "0x7faad9239020"
```

---

Equally, if we create two unique objects (even with the same value), they are different:

```r
x2 = 10; y2 = 10

obj_addr(x2)
```

```
[1] "0x7faad362a848"
```

```r
obj_addr(y2)
```

```
[1] "0x7faad362a8b8"
```

This is because each object is stored as bytes at its own location in memory rather than on the hard disk!

---
class: center, middle
<img src="lec-img/03-new-mac-mem.png" width="75%">
---
class: center, middle
<img src="lec-img/03-memory-storage.jpg">
---

### So what can we do with objects?

Remember our school example?

We wanted to store information about the school as named values:

```r
my.school = "UCSB"
lat = 34.4140
lng = -119.8489
```
--

But these are very different kinds of information with defined capabilities.

What would happen if we tried to add `lng` to `lat`?

```r
lng + lat
```

```
[1] -85.4349
```

--
What would happen if we tried to add `lng` to `my.school`?

```r
lng + my.school
```

```
Error in lng + my.school: non-numeric argument to binary operator
```

We see a `non-numeric argument` error telling us that `my.school` is not a numeric value. This is our first hint that values have different classes/types.
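
One way to see those classes directly is sketched below, using the same three values. `class()` reports the type of each value, and `tryCatch()` is used here only so the failing addition returns its message instead of stopping the script.

```r
my.school <- "UCSB"
lat <- 34.4140
lng <- -119.8489

class(my.school)  # "character"
class(lat)        # "numeric"

# Check before computing: only add values that are numeric
if (is.numeric(lng) && is.numeric(lat)) lng + lat

# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(lng + my.school, error = function(e) conditionMessage(e))
msg
```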

---

# Also ...

```r
charToRaw("3")
```

```
[1] 33
```

```r
charToRaw(3)
```

```
Error in charToRaw(3): argument must be a character vector of length 1
```
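
A small sketch of one way around this error: convert the number to a character first with `as.character()`. The variable names are only for illustration.

```r
num <- 3
txt <- as.character(num)   # convert the number 3 to the character "3"

txt_raw <- charToRaw(txt)  # now works: one byte, printed in hexadecimal (33)
txt_raw

as.integer(txt_raw)        # the same byte as a decimal number (51)
```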

---

# Assignment

- Today you are going to make your first project.
- Starting from GitHub, you are going to make a repository and clone it in RStudio
- Then use your knowledge of a good project structure and the Terminal to create an empty project structure.
- You will then push this project back to GitHub and submit it via URL

<center>
# [Details](assignment-03.html)
</center>
---
class: center, middle
# END