Reproducible Data Science

Introductions and Overview

Examples of bad habits

Having files, data, code, and documentation all scattered in different locations
Using excel to store, modify, and visualize data
Doing version control by copying and renaming files
Not documenting or commenting code

A few scenarios

You have data saved in an excel sheet, and proceed to open it up, create some new columns using excel functions, remove various rows, and fill in missing values with zeros. A few weeks later, the criteria for missing values changes. You can’t remember which zeros were genuine, and which were originally missing. Furthermore, the removed rows are acutally necessary and they’re gone! Hopefully the originally collected data is somewhere and unmodified.

Problem: there is no good way to track the changes that have been made to an excel sheet.

Solution: treat data as immutable, and use scripts to make procedural changes to the data.

You decide that doing everything in excel is a bad idea, and instead write a script to process the data. You know it’s good to keep backups of files, so every time your script changes, you save it as a new file. You now have process_data.R, process_data2.R, process_data_new.R, and process_data_20210505.R as files. You send the code to a friend to review, and they send you back process_data_20210507.R as their modified version. Two months later you come back to these files because you can reuse some of the code, but you’re not sure which one is the most up to date. You spend the rest of the day reading through the code of each file looking for differences.

Problem: duplicating files can lead to confusion, and changes are not tracked over time.

Solution: utilize a source control management system (like Git or SVM) to always have access to past versions and keep track of the most recent version.

You have a research project that requires data downloaded from the internet. It’s pretty large, so you download it once and leave it in your downloads folder as data.txt. You also have some common functions that you reuse for other projects in your documents folder. Your code references the data and other scripts using absolute file paths (e.g. /home/Hogarth/Documents/code/model_fitting.R and /home/Hogarth/Downloads/data.txt). You want to share this project with a classmate, but when you send the files, the code no longer works for them.

Problems: the project is not self-contained, and absolute file paths make it harder to run code on another PC.

Solution: create a project directory (and project file if possible) and keep it self contained. Large data sets can be stored on a separate website and downloaded with a script. Reused code should be accessible (e.g. in a GitHub repository).

A worst case scenario

From Towards Data Science:

Imagine you completed a one-off analysis a few months ago, creating a fairly complex data pipeline, machine learning model, and visualizations. Fast forward to today and you have Emily, a senior executive at your company, asking you to reuse that work to help solve a similar, time-critical business problem. She looks stressed.

Now if only you could remember which copy of your model was the correct one; if you could make sense of the spaghetti code scattered throughout your Jupyter Notebooks, each with helpful names such as Untitled_1 and Untitled_2; what did the data_process function do and why are there six slightly different versions of it? If only there was some documentation!

“No problem!” you assure her, and after a few sleepless nights, during which you had to reverse engineer the entire codebase, the analysis is ready. Emily looks impressed, that promotion you’ve been waiting for might finally happen.

The next day Emily is back from presenting your findings. She’s not happy – apparently there were mistakes in the analysis caused by simple coding errors that could have cost the company millions. If only you had run some tests! You apologise as she walks away muttering under her breath, you sit there and wonder if maybe you should pack up your desk before you head home for the night.

Principles of reproducible data science

Keeping to a common project structure
Creating self-contained projects
Documenting everything
Using version control
Utilizing tools to produce dynamic and parameterized reports
Splitting up reusable code into separate modules
Treating data as an immutable object

Organizing a new project

Let’s walk through an example of setting up a new data science project in R (and maybe Python)
We will be using RStudio for R
- We will be using Visual Studio Code for Python
- Julia and other languages with no designated IDE can likely use the same general template as Python
- Visual Studio Code has support for many languages, and has an integrated terminal to run code

Setting up version control

It is important to first decide on the source control management system that you will be using. Cloud storage services like Google Drive, OneDrive, Box, and dropbox usually offer a limited form of version history. This can be used as a minimum form of backup and version control. A more powerful and flexible solution is to learn a source control management (SCM) system such as Git or SVM. Services like GitHub, GitLab, and Bitbucket use Git to host repositories, or projects that others can view or contribute to.

When starting a new project using SCM, it is often easier to first create the repo (short for repository) on GitHub (or whichever service you decide to use), clone the repo (download to your PC), and then begin adding files. For beginners, it is also easier to use a git graphic user interface (GUI). Git relies on terminal commands, but a GUI can lower the barrier to entry for those less confident with using a terminal. Common GUI applications are:

GitKraken [cross-platform] [free and pro versions] [my personal choice]
- free version only allows for public repos
GitHub Desktop [Windows, Mac] [free]
Sourcetree [Windows, Mac] [free]

For simple projects with only a couple of contributors, a GUI application can work fine. For more complex projects with many contributors, it is recommended to become familiar with using git in a terminal (command line interface – CLI).

Creating a GitHub Repository

It is assumed that you have created a GitHub account.

Open GitHub.com
Start creating a new repo
- There is a “new” button on the left and a “+” button at the top right
Choose a project (repository) name (no spaces)
Make it public or private
- a private repo may not be supported by all free Git GUI’s
Optional
- README.md is a special file that gets rendered in Markdown and is displayed to anyone who may view your repo (see here for an example)
- .gitignore is a file recognized by git that allows you to optionally ignore (not keep track of) files that match a specified pattern (e.g. a folder name or path, log files, etc.)
- Choose a license so that others know what they can and cannot do with your work

Cloning a GitHub Repository

Download and install the GitHub Desktop app
Select the “Clone a Repository” button and browse to the newly created repo
Select a download location (recommend creating a Projects folder)

“Cloning” essentially means that you are downloading the repository to your local machine from the remote host.

Other resources

git (the underlying SCM software)
GitHub (repo host)
GitLab (repo host)
Bitbucket (repo host)
- Part of the Atlassian group (Trello, Jira, Sourcetree)
git best practices (reference)
GitFlow branching model (reference)
Choose a license explanation of various common software licenses

Start thinking about a project-oriented workflow

In a “project-oriented” workflow, your IDE can manage the project by automatically setting the working directory whenever the project is opened. In RStudio, this is done with an R Project file (.Rproj).

By letting your IDE manage the project and the working directory, you can set relative file paths ensuring that everything will work even if the project is opened on another computer. In general, the working directory should be the top level project directory, and all paths should be relative to that.

In the video below, we create a new project from the repository that we cloned earlier.

(Optional) Set up a virtual environment

A virtual environment is a fancy way to describe a system of managing package versions and other dependencies. In R, there is a library called renv that can keep track of libraries installed and used in a project, and can restore the packages when the project is opened on a different computer. This can greatly contribute to reproducibility.

Other resources

Create the project structure

When all project folders are organized in the same way, the time spent searching for files is greatly reduced.

├── LICENSE
├── README.md      <- The top-level README for developers using this project.
├── data           
│   ├── external   <- Data from third party sources.
│   ├── interim    <- Intermediate data that has been transformed.
│   ├── processed  <- The final, canonical data sets for modeling.
│   └── raw        <- The original, immutable data dump.
│                  
├── docs           <- A default Sphinx project; see sphinx-doc.org for details
│                  
├── models         <- Trained and serialized models, model predictions, 
│                     or model summaries
│                  
├── notebooks      <- R notebooks. Naming convention is a number (for ordering),
│                     the creator's initials, and a short `-` delimited 
│                     description, e.g. `1.0-jqp-initial-data-exploration`.
│                  
├── references     <- Data dictionaries, manuals, and all other explanatory materials.
│                  
├── reports        <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures    <- Generated graphics and figures to be used in reporting
│                  
├── R              <- R scripts
│                  
└── scratch        <- Other scratch files for testing, etc.

Committing Changes to GitHub

When you create new files and folders in your github project, they are not automatically tracked. Files need to be “staged” and then “committed”. Finally, commits need to be “pushed” to the remote location (GitHub) in order to be backed up.

This leads to a simple workflow:

write code, add files, etc.
stage the changes in GitHub Desktop
write a useful commit message
repeat steps 1-3 throughout the day
push the changes to remote

It’s a good idea to commit early and commit often. Git works by tracking the differences between commits, so there’s no need to worry about having too many commits that take up a lot of memory.

Coding practices

Re-using code

functions
creating a user library
implement simple, composable functions

Writing human-readable code

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. – Brian Kernighan

Writing comments

In a perfect world, code should be self-documenting. For example, the following code should be (relatively) easy to reason about:

df_raw   <- read_raw_temperature_data("path/to/file.txt")
df_clean <- parse_raw_temperature_data(df_raw)

write.csv(df_clean, file = "data/processed/temperature_2020.csv")

p <- plot_temperature_by_month(df_clean)

The specifics of what each function is doing is hidden from us, the reader, but it’s clear that this program reads in raw data, cleans it up in some way, writes the clean data back to a new file, and then produces a plot. No comments are required in this code.

However, sometimes comments are necessary to further explain the reasoning behind code. For example, a perfect power number is an integer that can be written in the form of \(a^b = m\) where \(a\), \(b\), and \(m\) are integers, and \(b \ge 2\). Here is a function that checks if a number is a perfect power:

is_perfect_power <- function(n) {
  p <- numbers::primeFactors(n)
  e <- table(p)
  
  if (length(e) == 1 && e > 1)
    return(TRUE)
  
  if (!all(e > 1))
    return(FALSE)
  
  if (numbers::mGCD(e) == 1)
    return(FALSE)
  
  TRUE
}

is_perfect_power(81)  # 3^4

## [1] TRUE

is_perfect_power(72)  # 2^3 * 3^2

## [1] FALSE

is_perfect_power(144) # 2^4*3^2 -> 2^2*2^2*3^2

## [1] TRUE

As it is, it is difficult to understand what it is doing or why it is doing it. Let’s break down the logic. If \(m\) is a perfect power, then it can be written as \(a^b\). If \(a\) is not prime, then it can further be written as \((f \cdot g)^b \Rightarrow f^b \cdot g^b\). We can keep breaking down \(a\) into its prime factors, and each prime factor will have an exponent associated with it (number of times the prime factor is present). For example,

\[48 \rightarrow 6\cdot 8 \rightarrow 2 \cdot 3 \cdot 2 \cdot 2 \cdot 2 \rightarrow 2^4 \cdot 3^1\]

numbers::primeFactors(48)

## [1] 2 2 2 2 3

table(numbers::primeFactors(48))

## 
## 2 3 
## 4 1

So we see that the first two lines in the function are computing the prime factors, and then counting each prime (getting the exponents).

The first test checks to see if there is only one prime factor of the number (like the number 27). If so, then check to see if the exponent is greater than or equal to 2 (see the definition of perfect power above). In this case, the number must be a perfect power, and we can return TRUE.

The second test then checks to see if all the exponents are greater than or equal to 2. If not, then we can eliminate the given number as a possible perfect power.

The final check is tricky. If the greatest common divisor between two (or more) numbers is 1, then they are relatively prime (share no common factor). In this case, the number cannot be written as a perfect power. Take 144 as an example.

\[144 \rightarrow 2^4 \cdot 3^2\]

The exponents are not all the same, but the GCD of 4 and 2 is 2, so the number can be written as

\[2^2 \cdot 2^2 \cdot 3^2 \rightarrow (2\cdot 2\cdot 3)^2 \rightarrow 12^2\]

which we can then see is a perfect power. We cannot do this same process if the GCD is 1.

Let’s now add some comments to the function to make it more understandable:

# a perfect power is a number that can be written as `a^b` where `a` and `b` are positive
# integers and `b >= 2`
is_perfect_power <- function(n) {
  # computes the prime factorization and gets the exponent associated with each prime
  p <- numbers::primeFactors(n)
  e <- table(p)
  
  # if there is only one prime factor, then its exponent must be >= 2
  # e.g. 81 -> 3^4 -> perfect power
  # e.g. 23 -> 23^1 -> not a perfect power
  if (length(p) == 1 && e >= 2)
    return(TRUE)
  
  # if there are more than one prime factors, all exponents must be >= 2 by definition
  if (!all(e >= 2))
    return(FALSE)
  
  # if the GCD of the exponents is 1, then the number cannot be written in a way where the
  # exponents of all factors are the same
  # e.g.  144 -> 2^4*3^2 GCD(4, 2) = 2 -> 2^2*2^2*3^2 -> (2*2*3)^2 -> perfect power
  # e.g. 1944 -> 2^3*3^5 -> GCD(3, 5) = 1 -> not a perfect power
  if (numbers::mGCD(e) == 1)
    return(FALSE)
  
  # if we've made it to here, then by process of elimination the number is a perfect power
  TRUE
}

At this point, our function now has more comments than code, which means that we may be better off writing a longer description above the function and cutting out some of examples in the code. The main idea is that your future self should be able to come and read the code you’ve written and understand what is being done and why.

From the Tidyverse style guide:

In data analysis code, use comments to record important findings and analysis decisions. If you need comments to explain what your code is doing, consider rewriting your code to be clearer. If you discover that you have more comments than code, consider switching to R Markdown.

Utilizing existing libraries

Avoid reinventing the wheel
Keeps projects “modular”
Others may already be familiar with code from popular packages

Great packages to be aware of

Tidyverse
- “metapackage”
- installs a collection of libraries that all have similar syntax
- data import, cleaning, visualization, and wrangling
- useful programming tools (glue, magrittr, purrr)
fable
- used for time series modeling and analysis
- has an associated online book: Forecasting: Principles and Practice
- replaces the older forecast library (fpp2)
zoo
- a different and popular time series package
caret
- a set of functions that attempt to streamline the process for creating predictive models
rstanarm and rstan
- comprehensive Bayesian modeling and inference
- visualization packages like bayesplot
- cross validation via loo
rmarkdown, knitr, bookdown, posterdown, …
- packages for creating elegant reports, slides, posters, and more using R and Markdown
- kableExtra: for creating pretty tables
- patchwork: for arranging ggplots
Great R packages for data import, wrangling and visualization

Data is immutable

What does immutable mean to you?

Raw data should STAY raw data.

[example given during lecture using psychometric data]

Large File Storage

github is great at tracking changes made to code and documents, but what if that code produces large image or video files? Or you have massive Illumina datasets that you canot store on github?

Git Large File Storage Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.

If you have git, using their large file storage works very similarly. First, set up your account to use LFS by running (from the command line):

git lfs install

In each Git repository where you want to use Git LFS, select the file types you’d like Git LFS to manage (or directly edit your .gitattributes). You can configure additional file extensions at anytime.

git lfs track "*.psd"
# make sure .gitattributes is tracked
git add .gitattributes

Note: That defining the file types Git LFS should track will not, by itself, convert any pre-existing files to Git LFS, such as files on other branches or in your prior commit history

Now, just commit and push to github as you normally would:

git add file.psd
git commit -m "Add design file"
git push origin main

Box

If you have a UNR email address, you automatically have box storage that is available to you. Box is great for storing all of your files in one place, and does actively track changes made to your files that can easily be reversed by you or someone that you have shared the directory with.

This type of storage can reliably track any changes that have been made to your large files over the course of your project and helps in avoiding a directory that looks like: gif_v1, gif_v2, gif_v3, etc.

Dynamic documents and reports

R Markdown

[in session example from scratch]

R Markdown Presentations

https://github.com/adknudson/UNR-Masters-Thesis/blob/master/presentation/defense_presentation.Rmd

Bookdown

[in session overview including examples]

This workshop series
My UNR thesis

Posterdown

[in session example]

https://github.com/adknudson/tramm