Introduction to R: Part Two

Introductions and Overview

What is RStudio

RStudio is an integrated development environment (IDE) for the R programming language
You can create, open, and edit scripts and other text file types
You can create and manage project files
You can build libraries, documents, and other projects
View plots, tables, web apps, and more
Help documentation explorer
Package manager
Debugger

Opening RStudio

Open in Windows

start menu
desktop shortcut
task bar shortcut

Open in Linux

application menu
dock shortcut

Navigating RStudio

Four separate regions with individual panes
Main panes are
- source editor
- console
- environment and history
- files, plots, packages, help, etc.

Creating a new project

Why would we even bother with creating a new project folder/file?

manages setting the working directory
contributes to portable projects
can specify project-specific settings
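
As a rough illustration of the first two points: when a project is open, RStudio sets the working directory to the project folder, so file paths can be written relative to it. The sketch below assumes a hypothetical data/measurements.csv file inside the project; both names are made up for illustration.

```r
# Where does R currently resolve relative paths from?
# (inside an open RStudio project this is the project folder)
project_root <- getwd()

# Build a path relative to the project root instead of hard-coding an
# absolute path that only exists on one machine.
# "data" and "measurements.csv" are hypothetical names.
data_path <- file.path("data", "measurements.csv")
print(data_path)

# The absolute location is derived from the working directory, so the
# same relative path keeps working when the project folder is moved
full_path <- file.path(project_root, data_path)
```

Because nothing in the script mentions a user-specific folder, the project can be copied to another computer and still find its files.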

Writing and running an R script

Writing a script allows one to re-use and re-write code. It means that you can also share your code, keep track of changes, and string together more complex commands to carry out an analysis.

Create a new file (ctrl+shift+N)
Type in an R command

You can run the entire script at once using the “source” button (ctrl+shift+S or ctrl+shift+enter), or you can run line-by-line (ctrl+enter).
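
Running a whole script corresponds roughly to calling base R's source() function on the file. A minimal sketch, using a temporary file as a stand-in for a real script (the script contents and the name `total` are made up for illustration):

```r
# Write a tiny two-line script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines(c("total <- 1 + 1", "print(total)"), script_path)

# Run the whole script at once, much like the "source" button does
source(script_path)

# Objects created by the script are now available in the session
total
```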

Control flow

A control flow statement is a statement whose result determines which of two or more paths the program follows.

Conditionals

The simplest control flow is the if-else kind.

# An if statement can stand alone
if (7 > 3) {
  cat("hello from inside the 'if' block\n")
}

## hello from inside the 'if' block

# You can have an if and an else
do_first_condition <- FALSE
if (do_first_condition) {
  cat("This shouldn't print\n")
} else {
  cat("This is the fallback option\n")
}

## This is the fallback option

# If, else if, else
x <- 4
if (x < 0) {
  cat(x, " is less than 0\n")
} else if (isTRUE(all.equal(x, 0))) {
  cat(x, " is equal to 0\n")
} else {
  cat(x, " is greater than 0\n") # is this true? ;)
}

## 4  is greater than 0

# An `else` block is not required
if (FALSE) {
  cat("This won't print\n")
} else if (7 < 4) {
  cat("Neither will this\n")
}
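
Two details of the examples above are worth a short sketch. First, `if` expects a single TRUE or FALSE, so a vector of comparisons should be collapsed with any() or all() before it is used as a condition. Second, the isTRUE(all.equal(...)) idiom compares numbers up to a small tolerance, which is safer than == for floating-point values. The variable names here are illustrative only.

```r
temperatures <- c(-2.5, 0.0, 3.1)

# Collapse a logical vector to one value before handing it to `if`
if (any(temperatures < 0)) {
  cat("At least one value is below zero\n")
}

# Floating-point arithmetic is not exact, so `==` can be surprising
exact_match <- (0.1 + 0.2 == 0.3)
close_enough <- isTRUE(all.equal(0.1 + 0.2, 0.3))

cat("exact comparison:   ", exact_match, "\n")
cat("tolerant comparison:", close_enough, "\n")
```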

Loops

There are two commonly used types of loops: the while-loop and the for-loop. (Technically one would suffice, since any for-loop can be rewritten as a while-loop, but in practice both are useful.)

while-loop

some_condition_is_true <- TRUE
some_counter <- 0
number_of_iterations <- 12

while (some_condition_is_true) {
  # do stuff
  cat("This is iteration:\t", some_counter, "\n")
  
  if (some_counter == 7) {
    some_condition_is_true <- FALSE
  }
  
  some_counter <- some_counter + 1
}

## This is iteration:    0 
## This is iteration:    1 
## This is iteration:    2 
## This is iteration:    3 
## This is iteration:    4 
## This is iteration:    5 
## This is iteration:    6 
## This is iteration:    7

We can also further refine the behavior of a loop with the next and break commands.

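# Skip the number if it is divisible by 7
# Print "fizzbuzz" if the number is divisible by both 3 and 5
# Print "fizz" if the number is divisible by 3
# Print "buzz" if the number is divisible by 5
# Otherwise print the number itself
# End the loop when the number reaches 23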
number <- 0
while (TRUE) {
  number <- number + 1
  
  if (number == 23) {
    break
  }

  if (number %% 7 == 0) {
    next
  } else if (number %% 15 == 0) {
    print("fizzbuzz")
  } else if (number %% 3 == 0) {
    print("fizz")
  } else if (number %% 5 == 0) {
    print("buzz")
  } else {
    print(number)
  }
  
}

## [1] 1
## [1] 2
## [1] "fizz"
## [1] 4
## [1] "buzz"
## [1] "fizz"
## [1] 8
## [1] "fizz"
## [1] "buzz"
## [1] 11
## [1] "fizz"
## [1] 13
## [1] "fizzbuzz"
## [1] 16
## [1] 17
## [1] "fizz"
## [1] 19
## [1] "buzz"
## [1] 22

Please note that the above code is a poor implementation of the fizzbuzz test.
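
One possible alternative, sketched under the same rules as the loop above (skip multiples of 7, stop before 23), is to use R's vectorized indexing instead of an explicit loop. The function name fizzbuzz() is chosen here for illustration.

```r
# Label a whole vector of numbers at once instead of looping
fizzbuzz <- function(n) {
  out <- as.character(n)
  out[n %% 3 == 0] <- "fizz"
  out[n %% 5 == 0] <- "buzz"
  # Assigned last so that multiples of 15 override "fizz" and "buzz"
  out[n %% 15 == 0] <- "fizzbuzz"
  out
}

numbers <- 1:22
numbers <- numbers[numbers %% 7 != 0]  # drop multiples of 7

result <- fizzbuzz(numbers)
print(result)
```

The order of the three assignments matters: the most specific rule is applied last, which replaces the chain of else-if branches.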

for-loop

When we know ahead of time how many iterations we are going to do, we can instead use a for-loop which will take care of incrementing for us. We can also use a for-loop to iterate over elements in a vector.

x <- 3:7
# print the number doubled
# note that the alias for each element in 'x' can be any variable name
for (some_alias_for_element in x) {
  print(some_alias_for_element * 2)
}

## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
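
When the results need to be stored rather than printed, a common pattern is to loop over the positions of the vector with seq_along() and fill a pre-allocated result. A minimal sketch, reusing the same `x` (the name `doubled` is illustrative):

```r
x <- 3:7

# Pre-allocate the result so it does not grow inside the loop
doubled <- numeric(length(x))

# seq_along(x) gives 1, 2, ..., length(x)
for (i in seq_along(x)) {
  doubled[i] <- x[i] * 2
}

print(doubled)
```

For a simple operation like this one, `x * 2` gives the same result without a loop; the pattern is mainly useful when each iteration does more work.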

Switches

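A switch statement is a compact way to select one of several alternatives by matching a value against a set of named cases. A final unnamed argument acts as the default when nothing matches.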

x <- rnorm(n = 30, mean = pi, sd = 0.4)
stat <- "mean"
switch (stat,
  mean = mean(x),
  sd = sd(x),
  var = var(x),
  round = round(x, 2),
  "Default option: no match found"
)

## [1] 3.071956
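
A switch is often wrapped in a function so that the choice can be passed in as an argument. The sketch below is one way to do that; the function name describe() and the alias "avg" are made up for illustration. It also shows two features of switch(): an empty case falls through to the next one, and the default can raise an error instead of returning a string.

```r
describe <- function(x, stat) {
  switch(stat,
    avg = ,            # empty case: falls through to `mean`
    mean = mean(x),
    sd = sd(x),
    var = var(x),
    stop("Unknown stat: ", stat)  # default when nothing matches
  )
}

values <- c(1, 2, 3, 4)
describe(values, "mean")
describe(values, "avg")
```

Raising an error in the default branch makes a misspelled option fail loudly rather than silently returning a message.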

The -apply family of functions

apply()
- applies a function over the rows or columns (margins) of a matrix or array
- data frames are coerced to a matrix first
lapply()
- takes a list, vector, or data frame
- always returns a list
sapply()
- “simplified” lapply
- return type depends on the results: a vector or matrix when they can be simplified, otherwise a list
tapply()
- table apply
- applies a function to a vector split into groups by one or more factors
- useful for e.g. counting number of rows for each factor level
vapply()
- like sapply, but with a pre-specified return type
- very strict about the type and length of each result (set with FUN.VALUE)
- useful to ensure a specific return type
rapply()
- recursive lapply
replicate()
- a wrapper for sapply
- evaluates an expression n times
- useful for more complex random sampling
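
A short sketch of the most common members of the family on small, made-up inputs (all object names are illustrative):

```r
m <- matrix(1:6, nrow = 2)        # 2 rows, 3 columns

row_sums <- apply(m, 1, sum)      # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)      # MARGIN = 2: one result per column

scores <- list(a = 1:3, b = 4:6)

as_list   <- lapply(scores, mean) # always a list
as_vector <- sapply(scores, mean) # simplified to a named numeric vector
checked   <- vapply(scores, mean, FUN.VALUE = numeric(1))  # type enforced

groups <- factor(c("ctrl", "ctrl", "trt", "trt", "trt"))
values <- c(1, 3, 10, 20, 30)

group_means  <- tapply(values, groups, mean)    # mean per factor level
group_counts <- tapply(values, groups, length)  # rows per factor level

set.seed(1)                       # make the random draws reproducible
sim_means <- replicate(3, mean(rnorm(10)))      # 3 simulated sample means
```

vapply() would stop with an error if any call to mean() returned something other than a single number, which is why it is often preferred inside functions.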

Writing functions

We’ve already seen some functions like math functions
functions are necessary to write flexible and reusable code
rule of thumb: if you copy and paste a block of code more than once, turn it into a function
- this applies to making plots as well
functions can have arguments
- positional arguments
- optional arguments
- named arguments

The Huber loss function (or just Huber function, for short) is defined as:

\[ \psi(x) = \begin{cases} x^2 & \text{if } |x| \leq 1 \\ 2|x| - 1 & \text{if } |x| > 1 \end{cases} \]

# write a function `huber()` that takes as an input a number, x,
# and returns the Huber value
huber <- function(x) {
  if (abs(x) <= 1) {
    x^2
  } else {
    2*abs(x) - 1
  }
}

The Huber function can be modified so that the transition from quadratic to linear happens at an arbitrary cutoff value \(a\), as in:

\[ \psi_a(x) = \begin{cases} x^2 & \text{if } |x| \leq a \\ 2a|x| - a^2 & \text{if } |x| > a \end{cases} \]

Starting with the code above, update the huber() function so that it takes two arguments: \(x\), a number at which to evaluate the loss, and \(a\), a number representing the cutoff value.

It should now return \(\psi_a(x)\), as defined above. Check that huber(3, 2) returns 8, and huber(3, 4) returns 9.

huber <- function(x, a) {
  if (abs(x) <= a) {
    x^2
  } else {
    2 * a * abs(x) - a^2
  }
}

huber(3, 2)

## [1] 8

huber(3, 4)

## [1] 9

Update the huber() function so that the default value of the cutoff \(a\) is 1. Check that huber(3) returns 5.

huber <- function(x, a = 1) {
  if (abs(x) <= a) {
    x^2
  } else {
    2 * a * abs(x) - a^2
  }
}

huber(3)

## [1] 5

Check that huber(a=1, x=3) returns 5. Check that huber(1, 3) returns 1. Why are these different?

huber(a = 1, x = 3)

## [1] 5

huber(1, 3)

## [1] 1
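
The difference comes from how R matches arguments: arguments given by name are matched first, and any remaining unnamed arguments are then filled in by position. A brief sketch of the three cases (the result variable names are illustrative):

```r
huber <- function(x, a = 1) {
  if (abs(x) <= a) {
    x^2
  } else {
    2 * a * abs(x) - a^2
  }
}

# Named: the order does not matter, so x = 3 and a = 1
named_call <- huber(a = 1, x = 3)

# Positional: the first value goes to x, the second to a, so x = 1 and a = 3
positional_call <- huber(1, 3)

# Mixed: x is matched by name, the leftover 2 fills the next free slot (a)
mixed_call <- huber(2, x = 3)

c(named_call, positional_call, mixed_call)
```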

Finally, we can vectorize this function over a set of inputs in two different ways:

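# ifelse() evaluates the condition element by element
huber_ifelse <- function(x, a = 1) {
  ifelse(abs(x) <= a, x^2, 2 * a * abs(x) - a^2)
}
# Vectorize() wraps the scalar huber() so that it is applied to each element of `x`
huber_vec <- Vectorize(huber, vectorize.args = "x", USE.NAMES = TRUE)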
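
A quick way to convince yourself that the two approaches agree is to evaluate both on the same vector. The sketch below redefines the functions so that it runs on its own; the test vector is made up.

```r
# Scalar version: `if` only handles one number at a time
huber <- function(x, a = 1) {
  if (abs(x) <= a) x^2 else 2 * a * abs(x) - a^2
}

huber_ifelse <- function(x, a = 1) {
  ifelse(abs(x) <= a, x^2, 2 * a * abs(x) - a^2)
}

huber_vec <- Vectorize(huber, vectorize.args = "x")

test_points <- c(-3, -0.5, 0, 0.5, 3)

from_ifelse <- huber_ifelse(test_points, a = 1)
from_vec    <- huber_vec(test_points, a = 1)

print(from_ifelse)
all.equal(from_ifelse, from_vec)
```

ifelse() works on the whole vector in one pass, while Vectorize() calls huber() once per element, so the former is usually the faster choice for long vectors.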

Working with libraries

On the subject of code reuse, we can stand on the shoulders of giants (or other researchers) and use the code that they’ve decided to share on CRAN or GitHub (or Bioconductor, or …). We use their code through something called a package (often loosely called a library; strictly speaking, a library is the folder where installed packages live).

# load the `readxl` package using the `library()` function
library(readxl)
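
library() stops with an error if the package is not installed. A minimal sketch of checking first with requireNamespace(); the helper name has_package() is hypothetical, and the installation step is left as a commented-out line so that the example has no side effects.

```r
# TRUE if the package can be loaded, FALSE otherwise (no error raised)
has_package <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (has_package("readxl")) {
  library(readxl)
} else {
  message("readxl is not installed; install it first")
  # install.packages("readxl")
}

# `stats` ships with R, so this check always succeeds
has_package("stats")
```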

Sometimes two packages will export two different functions by the same name. When this happens, we get what is called a namespace conflict. In those cases (and in general), it is best to be explicit about which package’s function you are using:

# load the `dplyr` package and observe the 'onload' message
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

The message says that dplyr has masked the object filter from package:stats
stats is a default R package that is attached on startup
dplyr also exports an object called filter, which now takes precedence on the search path (the stats version is hidden, not overwritten)

# This should work for time series
x <- 1:100
filter(x, rep(1, 3))

## Error in UseMethod("filter"): no applicable method for 'filter' applied to an object of class "c('integer', 'numeric')"

# fix the above error by being explicit about the namespace
stats::filter(x, rep(1, 3))

## Time Series:
## Start = 1 
## End = 100 
## Frequency = 1 
##   [1]  NA   6   9  12  15  18  21  24  27  30  33  36  39  42  45  48  51  54
##  [19]  57  60  63  66  69  72  75  78  81  84  87  90  93  96  99 102 105 108
##  [37] 111 114 117 120 123 126 129 132 135 138 141 144 147 150 153 156 159 162
##  [55] 165 168 171 174 177 180 183 186 189 192 195 198 201 204 207 210 213 216
##  [73] 219 222 225 228 231 234 237 240 243 246 249 252 255 258 261 264 267 270
##  [91] 273 276 279 282 285 288 291 294 297  NA
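
To see which packages currently provide a given name, find() from the utils package lists them in search-path order, so the first entry is the one a bare call would use. A small sketch on a shorter input (the name `smoothed` is illustrative):

```r
# Call the function by its full name; this works no matter what else is attached
smoothed <- stats::filter(1:5, rep(1, 3))
print(smoothed)

# List every attached package that provides something called "filter"
providers <- utils::find("filter")
print(providers)
```

With dplyr attached, "package:dplyr" appears ahead of "package:stats" in that list, which is exactly the masking reported in the startup message.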

Best practices and style

Take some time to read through the first three sections of the Tidyverse Style Guide. The developers of RStudio also develop a suite of packages called the Tidyverse, and they use this style guide to make code more readable and uniform.
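
As a small taste of the kind of conventions the guide covers, the sketch below contrasts two ways of writing the same computation. It is only a summary of the guide's advice on naming, spacing, and assignment; the guide itself is the authority, and the variable names are made up.

```r
# Harder to read: no spacing, `=` for assignment, cryptic names
x=c(1,2,3,4)
m=mean(x)

# Closer to the style guide: snake_case names, `<-` for assignment,
# spaces around operators and after commas
sample_values <- c(1, 2, 3, 4)
sample_mean <- mean(sample_values)
```

Both versions compute the same value; only the readability differs.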