You have data saved in an Excel sheet, and you proceed to open it up, create some new columns using Excel functions, remove various rows, and fill in missing values with zeros. A few weeks later, the criteria for missing values change. You can’t remember which zeros were genuine and which were originally missing. Furthermore, the removed rows are actually necessary, and they’re gone! Hopefully the originally collected data still exists somewhere, unmodified.
Problem: there is no good way to track the changes that have been made to an Excel sheet.
Solution: treat data as immutable, and use scripts to make procedural changes to the data.
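As a minimal sketch of this idea in R (the file names and column names below are hypothetical), the raw file is only ever read, and every change is made by a script that writes its output to a new file:
# Sketch only: file names and columns are hypothetical examples.
# The raw file is read, never overwritten.
raw <- read.csv("data/raw/survey_2021.csv")
clean <- raw
# Derive new columns in code rather than in the spreadsheet.
clean$temperature_f <- clean$temperature_c * 9 / 5 + 32
# Keep missing values as NA so genuine zeros stay distinguishable from missing ones.
clean <- clean[!is.na(clean$participant_id), ]
# Write the result to a separate file; the raw data stays untouched.
write.csv(clean, file = "data/processed/survey_2021_clean.csv", row.names = FALSE)
Because the script is the only thing that modifies the data, rerunning it from the raw file reproduces every downstream change.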
You decide that doing everything in Excel is a bad idea, and instead write a script to process the data. You know it’s good to keep backups of files, so every time your script changes, you save it as a new file. You now have process_data.R, process_data2.R, process_data_new.R, and process_data_20210505.R as files. You send the code to a friend to review, and they send you back process_data_20210507.R as their modified version. Two months later you come back to these files because you can reuse some of the code, but you’re not sure which one is the most up to date. You spend the rest of the day reading through the code of each file looking for differences.
Problem: duplicating files can lead to confusion, and changes are not tracked over time.
Solution: utilize a source control management system (like Git or SVN) to always have access to past versions and keep track of the most recent version.
You have a research project that requires data downloaded from the internet. It’s pretty large, so you download it once and leave it in your downloads folder as data.txt. You also have some common functions that you reuse for other projects in your documents folder. Your code references the data and other scripts using absolute file paths (e.g. /home/Hogarth/Documents/code/model_fitting.R and /home/Hogarth/Downloads/data.txt). You want to share this project with a classmate, but when you send the files, the code no longer works for them.
Problems: the project is not self-contained, and absolute file paths make it harder to run code on another PC.
Solution: create a project directory (and project file if possible) and keep it self-contained. Large data sets can be stored on a separate website and downloaded with a script. Reused code should be accessible (e.g. in a GitHub repository).
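A minimal sketch of such a setup (the URL and file names here are hypothetical) downloads the data into the project with a script and refers to everything by paths relative to the project root:
# Sketch only: the URL and file names below are placeholders.
# Run from the top-level project directory.
if (!dir.exists("data/raw")) dir.create("data/raw", recursive = TRUE)
# Large data sets live elsewhere and are fetched by a script instead of being copied by hand.
if (!file.exists("data/raw/data.txt")) {
  download.file("https://example.org/shared-data/data.txt", destfile = "data/raw/data.txt")
}
# Reused code lives inside the project (or its own repository), not in ~/Documents.
source("R/model_fitting.R")
Because every path is relative to the project folder, the whole directory can be zipped up or cloned and run unchanged on a classmate’s machine.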
A worst-case scenario
From Towards Data Science:
Imagine you completed a one-off analysis a few months ago, creating a fairly complex data pipeline, machine learning model, and visualizations. Fast forward to today and you have Emily, a senior executive at your company, asking you to reuse that work to help solve a similar, time-critical business problem. She looks stressed.
Now if only you could remember which copy of your model was the correct one; if you could make sense of the spaghetti code scattered throughout your Jupyter Notebooks, each with helpful names such as Untitled_1 and Untitled_2; what did the data_process function do and why are there six slightly different versions of it? If only there was some documentation!
“No problem!” you assure her, and after a few sleepless nights, during which you had to reverse engineer the entire codebase, the analysis is ready. Emily looks impressed, that promotion you’ve been waiting for might finally happen.
The next day Emily is back from presenting your findings. She’s not happy – apparently there were mistakes in the analysis caused by simple coding errors that could have cost the company millions. If only you had run some tests! You apologise as she walks away muttering under her breath, you sit there and wonder if maybe you should pack up your desk before you head home for the night.
It is important to first decide on the source control management system that you will be using. Cloud storage services like Google Drive, OneDrive, Box, and Dropbox usually offer a limited form of version history. This can be used as a minimum form of backup and version control. A more powerful and flexible solution is to learn a source control management (SCM) system such as Git or SVN. Services like GitHub, GitLab, and Bitbucket use Git to host repositories, i.e. projects that others can view or contribute to.
When starting a new project using SCM, it is often easier to first create the repo (short for repository) on GitHub (or whichever service you decide to use), clone the repo (download it to your PC), and then begin adding files. For beginners, it is also easier to use a git graphical user interface (GUI). Git relies on terminal commands, but a GUI can lower the barrier to entry for those less confident with using a terminal. Common GUI applications are:
For simple projects with only a couple of contributors, a GUI application can work fine. For more complex projects with many contributors, it is recommended to become familiar with using git in a terminal (command line interface – CLI).
It is assumed that you have created a GitHub account.
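If you are working from the terminal instead, cloning the repo and starting to work in it takes only a couple of commands (the username and repository name below are placeholders):
# Placeholder URL: substitute your own username and repository name.
git clone https://github.com/your-username/your-project.git
cd your-project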
README.md is a special file that gets rendered in Markdown and is displayed to anyone who may view your repo (see here for an example).
.gitignore is a file recognized by git that allows you to optionally ignore (not keep track of) files that match a specified pattern (e.g. a folder name or path, log files, etc.).
“Cloning” essentially means that you are downloading the repository to your local machine from the remote host.
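As an illustration, a .gitignore might contain patterns like the following (these entries are generic examples, not requirements):
# Example .gitignore entries (generic; adjust for your project)
# Ignore all log files:
*.log
# Ignore an entire folder by name:
scratch/
# Ignore large raw data files by path and extension:
data/raw/*.txt
# Ignore R session artifacts:
.Rhistory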
In a “project-oriented” workflow, your IDE can manage the project by automatically setting the working directory whenever the project is opened. In RStudio, this is done with an R Project file (.Rproj).
By letting your IDE manage the project and the working directory, you can use relative file paths, ensuring that everything will work even if the project is opened on another computer. In general, the working directory should be the top-level project directory, and all paths should be relative to that.
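For example (file names are hypothetical), with the project root as the working directory, the absolute paths from the earlier scenario become short relative paths that work on any machine:
# Relative to the project root, instead of absolute paths like /home/Hogarth/...
source("R/model_fitting.R")
dat <- read.table("data/raw/data.txt", header = TRUE)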
In the video below, we create a new project from the repository that we cloned earlier.
A virtual environment is a fancy way to describe a system of managing package versions and other dependencies. In R, there is a library called renv that can keep track of libraries installed and used in a project, and can restore the packages when the project is opened on a different computer. This can greatly contribute to reproducibility.
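A minimal renv workflow looks like this (renv::init(), renv::snapshot(), and renv::restore() are the package's core functions; exact usage may vary by project):
# Run once inside the project to start tracking packages in a lockfile (renv.lock).
renv::init()
# After installing or updating packages, record the versions currently in use.
renv::snapshot()
# On another computer (or a fresh clone of the repo), reinstall the recorded versions.
renv::restore()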
When all project folders are organized in the same way, the time spent searching for files is greatly reduced.
├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions,
│                         or model summaries
│
├── notebooks          <- R notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited
│                         description, e.g. `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── R                  <- R scripts
│
└── scratch            <- Other scratch files for testing, etc.
When you create new files and folders in your GitHub project, they are not automatically tracked. Files need to be “staged” and then “committed”. Finally, commits need to be “pushed” to the remote location (GitHub) in order to be backed up.
This leads to a simple workflow: create or edit files, stage the changes, commit them with a short descriptive message, and push the commits to GitHub.
It’s a good idea to commit early and commit often. Git works by tracking the differences between commits, so there’s no need to worry about having too many commits that take up a lot of memory.
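In the terminal, one pass through that workflow looks like the following (the file name and commit message are placeholders):
# See which files have changed since the last commit.
git status
# Stage a new or modified file.
git add R/clean_data.R
# Commit the staged changes with a short descriptive message.
git commit -m "Add data cleaning script"
# Push the new commit(s) to GitHub.
git push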
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. – Brian Kernighan
In a perfect world, code should be self-documenting. For example, the following code should be (relatively) easy to reason about:
df_raw <- read_raw_temperature_data("path/to/file.txt")
df_clean <- parse_raw_temperature_data(df_raw)
write.csv(df_clean, file = "data/processed/temperature_2020.csv")
p <- plot_temperature_by_month(df_clean)
The specifics of what each function is doing are hidden from us, the reader, but it’s clear that this program reads in raw data, cleans it up in some way, writes the clean data back to a new file, and then produces a plot. No comments are required in this code.
However, sometimes comments are necessary to further explain the reasoning behind code. For example, a perfect power is an integer \(m\) that can be written in the form \(m = a^b\), where \(a\) and \(b\) are integers and \(b \ge 2\). Here is a function that checks if a number is a perfect power:
is_perfect_power <- function(n) {
  p <- numbers::primeFactors(n)
  e <- table(p)
  if (length(e) == 1 && e > 1)
    return(TRUE)
  if (!all(e > 1))
    return(FALSE)
  if (numbers::mGCD(e) == 1)
    return(FALSE)
  TRUE
}
is_perfect_power(81) # 3^4
## [1] TRUE
is_perfect_power(72) # 2^3 * 3^2
## [1] FALSE
is_perfect_power(144) # 2^4*3^2 -> 2^2*2^2*3^2
## [1] TRUE
As written, it is difficult to understand what the function is doing or why it is doing it. Let’s break down the logic. If \(m\) is a perfect power, then it can be written as \(a^b\). If \(a\) is not prime, then it can further be written as \((f \cdot g)^b \Rightarrow f^b \cdot g^b\). We can keep breaking down \(a\) into its prime factors, and each prime factor will have an exponent associated with it (the number of times the prime factor is present). For example,
\[48 \rightarrow 6\cdot 8 \rightarrow 2 \cdot 3 \cdot 2 \cdot 2 \cdot 2 \rightarrow 2^4 \cdot 3^1\]
numbers::primeFactors(48)
## [1] 2 2 2 2 3
table(numbers::primeFactors(48))
##
## 2 3
## 4 1
So we see that the first two lines in the function are computing the prime factors, and then counting each prime (getting the exponents).
The first test checks to see if there is only one distinct prime factor of the number (like the number 27). If so, then check to see if the exponent is greater than or equal to 2 (see the definition of perfect power above). If it is, the number must be a perfect power, and we can return TRUE.
The second test then checks to see if all the exponents are greater than or equal to 2. If not, then we can eliminate the given number as a possible perfect power.
The final check is tricky. If the greatest common divisor (GCD) between two (or more) numbers is 1, then they are relatively prime (they share no common factor other than 1). If the exponents in the prime factorization are relatively prime, the number cannot be written as a perfect power. Take 144 as an example.
\[144 \rightarrow 2^4 \cdot 3^2\]
The exponents are not all the same, but the GCD of 4 and 2 is 2, so the number can be written as
\[2^2 \cdot 2^2 \cdot 3^2 \rightarrow (2\cdot 2\cdot 3)^2 \rightarrow 12^2\]
which we can then see is a perfect power. We cannot do this same process if the GCD is 1.
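We can check this reasoning directly, mirroring the earlier example with 48 and using the same numbers package functions:
# Exponents in the prime factorization of 144 (144 = 2^4 * 3^2).
e <- table(numbers::primeFactors(144))  # counts: 2 -> 4, 3 -> 2
# The GCD of the exponents is 2 (not 1), so 144 can be rewritten with equal exponents:
numbers::mGCD(e)                        # 2, hence 144 = (2^2 * 3)^2 = 12^2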
Let’s now add some comments to the function to make it more understandable:
# a perfect power is a number that can be written as `a^b` where `a` and `b` are positive
# integers and `b >= 2`
is_perfect_power <- function(n) {
  # compute the prime factorization and get the exponent associated with each prime
  p <- numbers::primeFactors(n)
  e <- table(p)
  # if there is only one distinct prime factor, then its exponent must be >= 2
  # e.g. 81 -> 3^4 -> perfect power
  # e.g. 23 -> 23^1 -> not a perfect power
  if (length(e) == 1 && e >= 2)
    return(TRUE)
  # if there is more than one distinct prime factor, all exponents must be >= 2 by definition
  if (!all(e >= 2))
    return(FALSE)
  # if the GCD of the exponents is 1, then the number cannot be written in a way where the
  # exponents of all factors are the same
  # e.g. 144 -> 2^4*3^2 -> GCD(4, 2) = 2 -> 2^2*2^2*3^2 -> (2*2*3)^2 -> perfect power
  # e.g. 1944 -> 2^3*3^5 -> GCD(3, 5) = 1 -> not a perfect power
  if (numbers::mGCD(e) == 1)
    return(FALSE)
  # if we've made it to here, then by process of elimination the number is a perfect power
  TRUE
}
At this point, our function has more comments than code, which means that we may be better off writing a longer description above the function and cutting out some of the examples in the code. The main idea is that your future self should be able to come back and read the code you’ve written and understand what is being done and why.
From the Tidyverse style guide:
In data analysis code, use comments to record important findings and analysis decisions. If you need comments to explain what your code is doing, consider rewriting your code to be clearer. If you discover that you have more comments than code, consider switching to R Markdown.
What does immutable mean to you?
Raw data should STAY raw data.
[example given during lecture using psychometric data]
GitHub is great at tracking changes made to code and documents, but what if that code produces large image or video files? Or what if you have massive Illumina datasets that you cannot store on GitHub?
If you have git, using Git Large File Storage (LFS) works very similarly. First, set up your account to use LFS by running (from the command line):
git lfs install
In each Git repository where you want to use Git LFS, select the file types you’d like Git LFS to manage (or directly edit your .gitattributes). You can configure additional file extensions at any time.
git lfs track "*.psd"
# make sure .gitattributes is tracked
git add .gitattributes
Note that defining the file types Git LFS should track will not, by itself, convert any pre-existing files to Git LFS, such as files on other branches or in your prior commit history.
Now, just commit and push to GitHub as you normally would:
git add file.psd
git commit -m "Add design file"
git push origin main
Box
If you have a UNR email address, you automatically have Box storage available to you. Box is great for storing all of your files in one place, and it actively tracks changes made to your files, which can easily be reversed by you or by someone you have shared the directory with.
This type of storage can reliably track any changes that have been made to your large files over the course of your project, and it helps you avoid a directory that looks like: gif_v1, gif_v2, gif_v3, etc.
[in session example from scratch]
[in session overview including examples]