Introductions and Overview

The majority of this workshop is based on the Tidy Data paper by Hadley Wickham and the corresponding chapter in the R for Data Science e-book. I highly recommend reading both the paper and the e-book to get a deeper understanding of R and tidy data.

Tidy Data

Tidy data is a set of definitions and guidelines for organizing data in order to facilitate data analysis. Data can be organized in many different ways, and it may seem like there is no single “best” way to arrange it. Take the following imaginary data. Table @ref(tab:040-imaginary-data-1) has two columns and three rows in a commonly seen format:

Each row has patient data
               Treatment A   Treatment B
John Smith              NA             2
Jane Doe                16            11
Mary Johnson             3             1

The same data can also be presented as in table @ref(tab:040-imaginary-data-2), this time a table with three columns and two rows (a transposition of the first table). Neither is more correct than the other, but necessity may dictate that one is easier to work with than the other.

Each row is the treatment data
              John Smith   Jane Doe   Mary Johnson
Treatment A           NA         16              3
Treatment B            2         11              1

Tidy data defines a few terms to “describe the underlying semantics … of the values displayed in [a] table.”

Consider now the same imaginary data as above, but now in a tidy format:

Each row is an observation, and each column is a variable.
Name           Treatment   Outcome
John Smith     A                NA
John Smith     B                 2
Jane Doe       A                16
Jane Doe       B                11
Mary Johnson   A                 3
Mary Johnson   B                 1

In tidy data:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Tidy data is now well defined, and messy data is any data in another format. Messy data is not useless, though; certain analyses and operations require data in a format other than tidy. But tidy data can serve as a launching point (or target) when cleaning, plotting, and analyzing a new data set.
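
As a preview of the pivoting tools covered later in this workshop, here is a minimal sketch of how the wide patient table could be reshaped into the tidy layout. The data frame name patients and its construction with tribble() are assumptions for illustration:

library(tidyverse)

# hypothetical re-creation of the wide patient table
patients <- tribble(
  ~Name,          ~`Treatment A`, ~`Treatment B`,
  "John Smith",   NA,             2,
  "Jane Doe",     16,             11,
  "Mary Johnson", 3,              1
)

# reshape so that each row is one observation (one patient-treatment pair)
patients %>%
  pivot_longer(cols = c(`Treatment A`, `Treatment B`),
               names_to = "Treatment",
               names_prefix = "Treatment ",
               values_to = "Outcome")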

Tidying Messy Data

The five most common problems with messy data sets are:

  1. Column headers are values, not variable names
  2. Multiple variables are stored in one column
  3. Variable names are stored in both rows and columns
  4. Multiple types of observational units are stored in the same table
  5. A single observational unit is stored in multiple tables

A small set of tools can be used to fix most of these problems:

  1. Melting (pivot longer)
  2. String manipulation (splitting)
  3. Casting (pivot wider)

Data in the column headers

Messy

Column header with variables
religion                     <$10k   $10-20k   $20-30k   $30-40k   $40-50k   $50-75k
Agnostic                        27        34        60        81        76       137
Atheist                         12        27        37        52        35        70
Buddhist                        27        21        30        34        33        58
Catholic                       418       617       732       670       638      1116
Don’t know/refused              15        14        15        11        10        35
Evangelical Prot               575       869      1064       982       881      1486
Hindu                            1         9         7         9        11        34
Historically Black Prot        228       244       236       238       197       223
Jehovah’s Witness               20        27        24        24        21        30
Jewish                          19        19        25        25        30        95

Tidy

Pew data in a tidy format
religion   income               freq
Agnostic   <$10k                  27
Agnostic   $10-20k                34
Agnostic   $20-30k                60
Agnostic   $30-40k                81
Agnostic   $40-50k                76
Agnostic   $50-75k               137
Agnostic   $75-100k              122
Agnostic   $100-150k             109
Agnostic   >150k                  84
Agnostic   Don’t know/refused     96
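
A minimal sketch of this reshaping with pivot_longer() (covered in more detail below), assuming the messy Pew table has been loaded as a data frame named pew (a hypothetical name):

library(tidyverse)

# every column except religion is really a value of an "income" variable
pew %>%
  pivot_longer(cols = -religion,
               names_to = "income",
               values_to = "freq")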

Multiple variables in one column

Messy

Multiple data in individual columns
country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu f014
AD 2000 0 0 1 0 0 0 0
AE 2000 2 4 4 6 5 12 10 3
AF 2000 52 228 183 149 129 94 80 93
AG 2000 0 0 0 0 0 0 1 1
AL 2000 2 19 21 14 24 19 16 3
AM 2000 2 152 130 131 63 26 21 1
AN 2000 0 0 1 2 0 0 0 0
AO 2000 186 999 1003 912 482 312 194 247
AR 2000 97 278 594 402 419 368 330 121
AS 2000 1 1

Tidy

country year sex age cases
AD 2000 m 0-14 0
AD 2000 m 15-24 0
AD 2000 m 25-34 1
AD 2000 m 35-44 0
AD 2000 m 45-54 0
AD 2000 m 55-64 0
AD 2000 m 65+ 0
AE 2000 f 0-14 3
AE 2000 f 15-24 16
AE 2000 f 25-34 1
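
A minimal sketch of this tidying step, assuming the messy table has been loaded as a data frame named tb (a hypothetical name). Column names such as m014 hold two variables (sex and age group), so melting is followed by a string split; recoding "014" into a label like "0-14" would be a further step:

library(tidyverse)

tb %>%
  pivot_longer(cols = -c(country, year),
               names_to = "sex_age",
               values_to = "cases",
               values_drop_na = TRUE) %>%
  # split after the first character: "m014" -> sex = "m", age = "014"
  separate(col = sex_age,
           into = c("sex", "age"),
           sep = 1)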

Variable names in both rows and columns

Messy

id year month element d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20
MX17004 2010 1 tmax
MX17004 2010 1 tmin
MX17004 2010 2 tmax 27.3 24.1
MX17004 2010 2 tmin 14.4 14.4
MX17004 2010 3 tmax 32.1
MX17004 2010 3 tmin 14.2
MX17004 2010 4 tmax
MX17004 2010 4 tmin
MX17004 2010 5 tmax
MX17004 2010 5 tmin

Tidy

id date tmax tmin
MX17004 2010-02-05 27.3 14.4
MX17004 2010-02-09 24.1 14.4
MX17004 2010-03-17 32.1 14.2
MX17004 2010-07-09 28.6 17.5
MX17004 2010-08-17 29.6 15.8
MX17004 2010-08-29 29.0 17.3
MX17004 2010-10-17 27.0 14.0
MX17004 2010-10-25 28.1 12.9
MX17004 2010-11-05 31.3 16.3
MX17004 2010-11-13 27.2 12.0
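
A minimal sketch of this tidying step, assuming the messy table has been loaded as a data frame named weather (a hypothetical name) with day-of-month columns d1, d2, …. The day columns are melted into rows, a proper date is assembled, and the element column (tmax/tmin) is cast back out into variables:

library(tidyverse)

weather %>%
  pivot_longer(cols = starts_with("d"),
               names_to = "day",
               values_to = "value",
               values_drop_na = TRUE) %>%
  mutate(day  = readr::parse_number(day),
         date = as.Date(paste(year, month, day, sep = "-"))) %>%
  pivot_wider(id_cols = c(id, date),
              names_from = element,
              values_from = value)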

Multiple types of observational units are stored in the same table

Messy

year   artist           track                             time     date.entered   wk1   wk2   wk3
2000   2 Pac            Baby Don’t Cry                    4M 22S   2000-02-26      87    82    72
2000   2Ge+her          The Hardest Part Of Breaking Up   3M 15S   2000-09-02      91    87    92
2000   3 Doors Down     Kryptonite                        3M 53S   2000-04-08      81    70    68
2000   3 Doors Down     Loser                             4M 24S   2000-10-21      76    76    72
2000   504 Boyz         Wobble Wobble                     3M 35S   2000-04-15      57    34    25
2000   98^0             Give Me Just One Night            3M 24S   2000-08-19      51    39    34
2000   A*Teens          Dancing Queen                     3M 44S   2000-07-08      97    97    96
2000   Aaliyah          I Don’t Wanna                     4M 15S   2000-01-29      84    62    51
2000   Aaliyah          Try Again                         4M 3S    2000-03-18      59    53    38
2000   Adams, Yolanda   Open My Heart                     5M 30S   2000-08-26      76    76    74

Half Tidy

year   artist    track                             time     date.entered   week   rank
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        1     87
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        2     82
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        3     72
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        4     77
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        5     87
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        6     94
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        7     99
2000   2Ge+her   The Hardest Part Of Breaking Up   3M 15S   2000-09-02        1     91
2000   2Ge+her   The Hardest Part Of Breaking Up   3M 15S   2000-09-02        2     87
2000   2Ge+her   The Hardest Part Of Breaking Up   3M 15S   2000-09-02        3     92
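
A minimal sketch of the melting step that produces the half-tidy table, assuming the messy Billboard table has been loaded as a data frame named billboard_raw (a hypothetical name) with week columns wk1, wk2, …:

library(tidyverse)

half_tidy <- billboard_raw %>%
  pivot_longer(cols = starts_with("wk"),
               names_to = "week",
               names_prefix = "wk",
               names_transform = list(week = as.integer),
               values_to = "rank",
               values_drop_na = TRUE)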

Tidy

## Joining with `by = join_by(artist, track, time)`
id   artist           track                             time
 1   2 Pac            Baby Don’t Cry                    4M 22S
 2   2Ge+her          The Hardest Part Of Breaking Up   3M 15S
 3   3 Doors Down     Kryptonite                        3M 53S
 4   3 Doors Down     Loser                             4M 24S
 5   504 Boyz         Wobble Wobble                     3M 35S
 6   98^0             Give Me Just One Night            3M 24S
 7   A*Teens          Dancing Queen                     3M 44S
 8   Aaliyah          I Don’t Wanna                     4M 15S
 9   Aaliyah          Try Again                         4M 3S
10   Adams, Yolanda   Open My Heart                     5M 30S

id date.entered rank
1 2000-02-26 87
1 2000-02-26 82
1 2000-02-26 72
1 2000-02-26 77
1 2000-02-26 87
1 2000-02-26 94
1 2000-02-26 99
2 2000-09-02 91
2 2000-09-02 87
2 2000-09-02 92
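
The half-tidy table still mixes two observational units: facts about each song (artist, track, time) and facts about each week’s chart position. A minimal sketch of splitting it apart, continuing from the hypothetical half_tidy object in the previous sketch:

library(tidyverse)

# one row per song, with a surrogate id
songs <- half_tidy %>%
  distinct(artist, track, time) %>%
  mutate(id = row_number())

# one row per song-week, keyed by the surrogate id
ranks <- half_tidy %>%
  left_join(songs, by = join_by(artist, track, time)) %>%
  select(id, date.entered, week, rank)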

A single observational unit is stored in multiple tables

It is common to find data values for individual observational units spread out over multiple tables or files. An example is the US census data, where each data set is split up by the year the census was conducted (e.g. census_2020, census_2010, etc.).
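
A minimal sketch of combining such files into a single table, assuming hypothetical files census_2010.csv and census_2020.csv that share the same columns:

library(tidyverse)

# stack the files and record which file each row came from
census <- bind_rows(
  "2010" = read_csv("census_2010.csv"),
  "2020" = read_csv("census_2020.csv"),
  .id = "census_year"
)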

Formatting to Tidy Data

Melting/Pivoting Longer/Stacking

# cases
table4a %>%
  pivot_longer(cols = `1999`:`2000`,
               names_to = "year",
               values_to = "cases")
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
# population
table4b %>%
  pivot_longer(cols = `1999`:`2000`,
               names_to = "year",
               values_to = "population")
## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <dbl>
## 1 Afghanistan 1999    19987071
## 2 Afghanistan 2000    20595360
## 3 Brazil      1999   172006362
## 4 Brazil      2000   174504898
## 5 China       1999  1272915272
## 6 China       2000  1280428583

Joins

tidy4a <- table4a %>%
  pivot_longer(cols = `1999`:`2000`,
               names_to = "year",
               values_to = "cases")
# population
tidy4b <- table4b %>%
  pivot_longer(cols = `1999`:`2000`,
               names_to = "year",
               values_to = "population")

# join
left_join(tidy4a, tidy4b)
## Joining with `by = join_by(country, year)`
## # A tibble: 6 × 4
##   country     year   cases population
##   <chr>       <chr>  <dbl>      <dbl>
## 1 Afghanistan 1999     745   19987071
## 2 Afghanistan 2000    2666   20595360
## 3 Brazil      1999   37737  172006362
## 4 Brazil      2000   80488  174504898
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

See the dplyr cheatsheet for a more visual depiction of the various types of joins.

  • bind_rows() (Base equivalent is rbind())
  • bind_cols() (Base equivalent is cbind())
  • left_join(), right_join()
  • full_join(), inner_join()

Casting/Pivoting Wider/Unstacking

table2 %>%
  pivot_wider(id_cols = c(country, year),
              names_from = type,
              values_from = count)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

String Manipulation

separate() (the opposite is unite())

table3 %>%
  separate(col = rate, 
           into = c("cases", "population"))
## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
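
unite() reverses the operation by pasting columns back together. A minimal sketch that rebuilds the original rate column from the separated pieces:

table3 %>%
  separate(col = rate,
           into = c("cases", "population")) %>%
  # paste the two columns back into a single "rate" column
  unite(col = "rate", cases, population, sep = "/")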

package:stringr

The stringr package is a collection of consistent functions that make operations on strings easier. It is largely a wrapper for the stringi package.

ls("package:stringr")
##  [1] "%>%"               "boundary"          "coll"             
##  [4] "fixed"             "fruit"             "invert_match"     
##  [7] "regex"             "sentences"         "str_c"            
## [10] "str_conv"          "str_count"         "str_detect"       
## [13] "str_dup"           "str_ends"          "str_equal"        
## [16] "str_escape"        "str_extract"       "str_extract_all"  
## [19] "str_flatten"       "str_flatten_comma" "str_glue"         
## [22] "str_glue_data"     "str_interp"        "str_length"       
## [25] "str_like"          "str_locate"        "str_locate_all"   
## [28] "str_match"         "str_match_all"     "str_order"        
## [31] "str_pad"           "str_rank"          "str_remove"       
## [34] "str_remove_all"    "str_replace"       "str_replace_all"  
## [37] "str_replace_na"    "str_sort"          "str_split"        
## [40] "str_split_1"       "str_split_fixed"   "str_split_i"      
## [43] "str_squish"        "str_starts"        "str_sub"          
## [46] "str_sub_all"       "str_sub<-"         "str_subset"       
## [49] "str_to_lower"      "str_to_sentence"   "str_to_title"     
## [52] "str_to_upper"      "str_trim"          "str_trunc"        
## [55] "str_unique"        "str_view"          "str_view_all"     
## [58] "str_which"         "str_width"         "str_wrap"         
## [61] "word"              "words"
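
A few quick examples of common stringr functions, using made-up strings:

library(stringr)

x <- c("  tidy Data ", "messy   data")
str_trim(x)                       # drop leading/trailing whitespace
str_squish(x)                     # also collapse internal runs of whitespace
str_to_title(str_squish(x))       # "Tidy Data" "Messy Data"
str_detect(x, "messy")            # FALSE TRUE
str_replace(x, "data", "tables")  # only the lower-case "data" is replaced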

Manipulating Tidy Data

Select, transform, summarize

Group by and Summarize

library(gapminder)
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
gapminder %>%
  filter(year > 1960) %>%
  group_by(continent) %>%
  summarise(avg_life_exp = mean(lifeExp)) %>%
  arrange(desc(avg_life_exp))
## # A tibble: 5 × 2
##   continent avg_life_exp
##   <fct>            <dbl>
## 1 Oceania           75.2
## 2 Europe            73.2
## 3 Americas          66.7
## 4 Asia              62.5
## 5 Africa            50.6
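
The same pattern combines with column selection and transformation. A sketch that computes total GDP per continent and year from the same gapminder data:

gapminder %>%
  select(country, continent, year, pop, gdpPercap) %>%
  mutate(gdp = pop * gdpPercap) %>%           # transform: per-country GDP
  group_by(continent, year) %>%
  summarise(total_gdp = sum(gdp), .groups = "drop") %>%
  arrange(desc(total_gdp))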

Minimizing Repetition

package:purrr

purrr is part of the tidyverse; its basic function map() is very similar to the apply family of functions in base R.

  • The goal is to minimize repetition while maximizing replication
  • Keep your code D.R.Y. (Don’t Repeat Yourself)

The basic arguments for map() are:

  • .x: A list or vector
  • .f: A function or formula

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
# notice you do not need the () after the function name as you would when calling it from the console
map(df, mean)
## $a
## [1] 0.3480497
## 
## $b
## [1] -0.06649011
## 
## $c
## [1] 0.1056875
## 
## $d
## [1] -0.6421021
# the input can also be piped in to make the process even easier
df %>% map(mean)
## $a
## [1] 0.3480497
## 
## $b
## [1] -0.06649011
## 
## $c
## [1] 0.1056875
## 
## $d
## [1] -0.6421021

You may have noticed that the output is not necessarily in the format we might expect: the map() function always returns a list.

There are helper functions that are explicit about the return types of a map function: map_lgl(), map_int(), map_dbl() and map_chr() return an atomic vector of the indicated type (or die trying).

# Should return a numerical vector rather than a list
map_dbl(df, mean)
##           a           b           c           d 
##  0.34804966 -0.06649011  0.10568747 -0.64210213
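
Additional arguments can be passed to .f after the function name, and .f can also be written as a one-sided formula (where .x stands for the current element). A short sketch using the same df:

map_dbl(df, mean, trim = 0.1)    # trimmed mean of each column
map_dbl(df, ~ sum(.x > 0))       # count of positive values in each column
map_chr(df, ~ class(.x)[1])      # class of each column, as a character vector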