Introductions and Overview

The majority of this workshop is based on the Tidy Data paper by Hadley Wickham and the corresponding chapter in the R for Data Science e-book. I highly recommend reading both the paper and the e-book to get a deeper understanding of R and tidy data.

Tidy Data

Tidy data is a set of definitions and guidelines for organizing data in order to facilitate data analysis. Data can be organized in many different ways, and it may seem like there is no single “best” way to arrange it. Take the following imaginary data. Table @ref(tab:040-imaginary-data-1) has two columns and three rows in a commonly seen format:

Each row has patient data
               Treatment A   Treatment B
John Smith              NA             2
Jane Doe                16            11
Mary Johnson             3             1

The same data can also be presented as in table @ref(tab:040-imaginary-data-2), this time a table with three columns and two rows (a transposition of the first table). Neither is more correct than the other, but necessity may dictate that one is easier to work with than the other.

Each row is the treatment data
              John Smith   Jane Doe   Mary Johnson
Treatment A           NA         16              3
Treatment B            2         11              1

Tidy data defines a few terms to “describe the underlying semantics … of the values displayed in [a] table.”

Consider now the same imaginary data as above, but now in a tidy format:

Each row is an observation, and each column is a variable.
Name           Treatment   Outcome
John Smith     A                NA
John Smith     B                 2
Jane Doe       A                16
Jane Doe       B                11
Mary Johnson   A                 3
Mary Johnson   B                 1

In tidy data:

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Tidy data is now well defined, and messy data is any data in another format. Messy data is not useless, though; certain analyses and operations require data in a format other than tidy. But tidy data can serve as a launching point (or target) when cleaning, plotting, and analyzing a new data set.
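
As a preview of the pivoting tools covered later in this workshop, here is a minimal sketch of how the wide patient table could be reshaped into the tidy layout. The data frame name patients and its construction with tribble() are assumptions for illustration:

library(tidyverse)

# hypothetical re-creation of the wide patient table
patients <- tribble(
  ~Name,          ~`Treatment A`, ~`Treatment B`,
  "John Smith",   NA,             2,
  "Jane Doe",     16,             11,
  "Mary Johnson", 3,              1
)

# reshape so that each row is one observation (one patient-treatment pair)
patients %>%
  pivot_longer(cols = c(`Treatment A`, `Treatment B`),
               names_to = "Treatment",
               names_prefix = "Treatment ",
               values_to = "Outcome")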

Tidying Messy Data

The five most common problems with messy data sets are:

  1. Column headers are values, not variable names
  2. Multiple variables are stored in one column
  3. Variable names are stored in both rows and columns
  4. Multiple types of observational units are stored in the same table
  5. A single observational unit is stored in multiple tables

A small set of tools can be used to fix most of these problems:

  1. Melting (pivot longer)
  2. String manipulation (splitting)
  3. Casting (pivot wider)

Data in the column headers

Messy

Column header with variables
religion                     <$10k   $10-20k   $20-30k   $30-40k   $40-50k   $50-75k
Agnostic                        27        34        60        81        76       137
Atheist                         12        27        37        52        35        70
Buddhist                        27        21        30        34        33        58
Catholic                       418       617       732       670       638      1116
Don’t know/refused              15        14        15        11        10        35
Evangelical Prot               575       869      1064       982       881      1486
Hindu                            1         9         7         9        11        34
Historically Black Prot        228       244       236       238       197       223
Jehovah’s Witness               20        27        24        24        21        30
Jewish                          19        19        25        25        30        95

Tidy

Pew data in a tidy format
religion   income               freq
Agnostic   <$10k                  27
Agnostic   $10-20k                34
Agnostic   $20-30k                60
Agnostic   $30-40k                81
Agnostic   $40-50k                76
Agnostic   $50-75k               137
Agnostic   $75-100k              122
Agnostic   $100-150k             109
Agnostic   >150k                  84
Agnostic   Don’t know/refused     96
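
A minimal sketch of this reshaping with pivot_longer() (covered in more detail below), assuming the messy Pew table has been loaded as a data frame named pew (a hypothetical name):

library(tidyverse)

# every column except religion is really a value of an "income" variable
pew %>%
  pivot_longer(cols = -religion,
               names_to = "income",
               values_to = "freq")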

Multiple variables in one column

Messy

Multiple data in individual columns
country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu f014
AD 2000 0 0 1 0 0 0 0
AE 2000 2 4 4 6 5 12 10 3
AF 2000 52 228 183 149 129 94 80 93
AG 2000 0 0 0 0 0 0 1 1
AL 2000 2 19 21 14 24 19 16 3
AM 2000 2 152 130 131 63 26 21 1
AN 2000 0 0 1 2 0 0 0 0
AO 2000 186 999 1003 912 482 312 194 247
AR 2000 97 278 594 402 419 368 330 121
AS 2000 1 1

Tidy

country year sex age cases
AD 2000 m 0-14 0
AD 2000 m 15-24 0
AD 2000 m 25-34 1
AD 2000 m 35-44 0
AD 2000 m 45-54 0
AD 2000 m 55-64 0
AD 2000 m 65+ 0
AE 2000 f 0-14 3
AE 2000 f 15-24 16
AE 2000 f 25-34 1
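
A minimal sketch of this tidying step, assuming the messy table has been loaded as a data frame named tb (a hypothetical name). Column names such as m014 hold two variables (sex and age group), so melting is followed by a string split; recoding "014" into a label like "0-14" would be a further step:

library(tidyverse)

tb %>%
  pivot_longer(cols = -c(country, year),
               names_to = "sex_age",
               values_to = "cases",
               values_drop_na = TRUE) %>%
  # split after the first character: "m014" -> sex = "m", age = "014"
  separate(col = sex_age,
           into = c("sex", "age"),
           sep = 1)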

Variable names in both rows and columns

Messy

id year month element d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20
MX17004 2010 1 tmax
MX17004 2010 1 tmin
MX17004 2010 2 tmax 27.3 24.1
MX17004 2010 2 tmin 14.4 14.4
MX17004 2010 3 tmax 32.1
MX17004 2010 3 tmin 14.2
MX17004 2010 4 tmax
MX17004 2010 4 tmin
MX17004 2010 5 tmax
MX17004 2010 5 tmin

Tidy

id date tmax tmin
MX17004 2010-02-05 27.3 14.4
MX17004 2010-02-09 24.1 14.4
MX17004 2010-03-17 32.1 14.2
MX17004 2010-07-09 28.6 17.5
MX17004 2010-08-17 29.6 15.8
MX17004 2010-08-29 29.0 17.3
MX17004 2010-10-17 27.0 14.0
MX17004 2010-10-25 28.1 12.9
MX17004 2010-11-05 31.3 16.3
MX17004 2010-11-13 27.2 12.0
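
A minimal sketch of this tidying step, assuming the messy table has been loaded as a data frame named weather (a hypothetical name) with day-of-month columns d1, d2, …. The day columns are melted into rows, a proper date is assembled, and the element column (tmax/tmin) is cast back out into variables:

library(tidyverse)

weather %>%
  pivot_longer(cols = starts_with("d"),
               names_to = "day",
               values_to = "value",
               values_drop_na = TRUE) %>%
  mutate(day  = readr::parse_number(day),
         date = as.Date(paste(year, month, day, sep = "-"))) %>%
  pivot_wider(id_cols = c(id, date),
              names_from = element,
              values_from = value)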

Multiple types of observational units are stored in the same table

Messy

year   artist           track                             time     date.entered   wk1   wk2   wk3
2000   2 Pac            Baby Don’t Cry                    4M 22S   2000-02-26      87    82    72
2000   2Ge+her          The Hardest Part Of Breaking Up   3M 15S   2000-09-02      91    87    92
2000   3 Doors Down     Kryptonite                        3M 53S   2000-04-08      81    70    68
2000   3 Doors Down     Loser                             4M 24S   2000-10-21      76    76    72
2000   504 Boyz         Wobble Wobble                     3M 35S   2000-04-15      57    34    25
2000   98^0             Give Me Just One Night            3M 24S   2000-08-19      51    39    34
2000   A*Teens          Dancing Queen                     3M 44S   2000-07-08      97    97    96
2000   Aaliyah          I Don’t Wanna                     4M 15S   2000-01-29      84    62    51
2000   Aaliyah          Try Again                         4M 3S    2000-03-18      59    53    38
2000   Adams, Yolanda   Open My Heart                     5M 30S   2000-08-26      76    76    74

Half Tidy

year   artist    track                             time     date.entered   week   rank
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        1     87
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        2     82
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        3     72
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        4     77
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        5     87
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        6     94
2000   2 Pac     Baby Don’t Cry                    4M 22S   2000-02-26        7     99
2000   2Ge+her   The Hardest Part Of Breaking Up   3M 15S   2000-09-02        1     91
2000   2Ge+her   The Hardest Part Of Breaking Up   3M 15S   2000-09-02        2     87
2000   2Ge+her   The Hardest Part Of Breaking Up   3M 15S   2000-09-02        3     92
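
A minimal sketch of the melting step that produces the half-tidy table, assuming the messy Billboard table has been loaded as a data frame named billboard_raw (a hypothetical name) with week columns wk1, wk2, …:

library(tidyverse)

half_tidy <- billboard_raw %>%
  pivot_longer(cols = starts_with("wk"),
               names_to = "week",
               names_prefix = "wk",
               names_transform = list(week = as.integer),
               values_to = "rank",
               values_drop_na = TRUE)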

Tidy

## Joining with `by = join_by(artist, track, time)`
id   artist           track                             time
 1   2 Pac            Baby Don’t Cry                    4M 22S
 2   2Ge+her          The Hardest Part Of Breaking Up   3M 15S
 3   3 Doors Down     Kryptonite                        3M 53S
 4   3 Doors Down     Loser                             4M 24S
 5   504 Boyz         Wobble Wobble                     3M 35S
 6   98^0             Give Me Just One Night            3M 24S
 7   A*Teens          Dancing Queen                     3M 44S
 8   Aaliyah          I Don’t Wanna                     4M 15S
 9   Aaliyah          Try Again                         4M 3S
10   Adams, Yolanda   Open My Heart                     5M 30S

id date.entered rank
1 2000-02-26 87
1 2000-02-26 82
1 2000-02-26 72
1 2000-02-26 77
1 2000-02-26 87
1 2000-02-26 94
1 2000-02-26 99
2 2000-09-02 91
2 2000-09-02 87
2 2000-09-02 92
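
The half-tidy table still mixes two observational units: facts about each song (artist, track, time) and facts about each week’s chart position. A minimal sketch of splitting it apart, continuing from the hypothetical half_tidy object in the previous sketch:

library(tidyverse)

# one row per song, with a surrogate id
songs <- half_tidy %>%
  distinct(artist, track, time) %>%
  mutate(id = row_number())

# one row per song-week, keyed by the surrogate id
ranks <- half_tidy %>%
  left_join(songs, by = join_by(artist, track, time)) %>%
  select(id, date.entered, week, rank)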

A single observational unit is stored in multiple tables

It is common to find data values for individual observational units spread out over multiple tables or files. An example is the US census data, where each data set is split up by the year the census was conducted (e.g. census_2020, census_2010, etc.).
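
A minimal sketch of combining such files into a single table, assuming hypothetical files census_2010.csv and census_2020.csv that share the same columns:

library(tidyverse)

# stack the files and record which file each row came from
census <- bind_rows(
  "2010" = read_csv("census_2010.csv"),
  "2020" = read_csv("census_2020.csv"),
  .id = "census_year"
)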

Formatting to Tidy Data

Melting/Pivoting Longer/Stacking

# cases
table4a %>%
  pivot_longer(cols = `1999`:`2000`,
               names_to = "year",
               values_to = "cases")
## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <dbl>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
# population
table4b %>%
  pivot_longer(cols = `1999`:`2000`,
               names_to = "year",
               values_to = "population")
## # A tibble: 6 × 3
##   country     year  population
##   <chr>       <chr>      <dbl>
## 1 Afghanistan 1999    19987071
## 2 Afghanistan 2000    20595360
## 3 Brazil      1999   172006362
## 4 Brazil      2000   174504898
## 5 China       1999  1272915272
## 6 China       2000  1280428583

Joins

tidy4a <- table4a %>%
  pivot_longer(cols = `1999`:`2000`,
               names_to = "year",
               values_to = "cases")
# population
tidy4b <- table4b %>%
  pivot_longer(cols = `1999`:`2000`,
               names_to = "year",
               values_to = "population")

# join
left_join(tidy4a, tidy4b)
## Joining with `by = join_by(country, year)`
## # A tibble: 6 × 4
##   country     year   cases population
##   <chr>       <chr>  <dbl>      <dbl>
## 1 Afghanistan 1999     745   19987071
## 2 Afghanistan 2000    2666   20595360
## 3 Brazil      1999   37737  172006362
## 4 Brazil      2000   80488  174504898
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

See the dplyr cheatsheet for a more visual depiction of the various types of joins.

  • bind_rows() (Base equivalent is rbind())
  • bind_cols() (Base equivalent is cbind())
  • left_join(), right_join()
  • full_join(), inner_join()

Casting/Pivoting Wider/Unstacking

table2 %>%
  pivot_wider(id_cols = c(country, year),
              names_from = type,
              values_from = count)
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

String Manipulation

separate() (the opposite is unite())

table3 %>%
  separate(col = rate, 
           into = c("cases", "population"))
## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
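
unite() reverses the operation by pasting columns back together. A minimal sketch that rebuilds the original rate column from the separated pieces:

table3 %>%
  separate(col = rate,
           into = c("cases", "population")) %>%
  # paste the two columns back into a single "rate" column
  unite(col = "rate", cases, population, sep = "/")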

package:stringr

The stringr package is a collection of consistent functions that make operations on strings easier. It is largely a wrapper for the stringi package.

ls("package:stringr")
##  [1] "%>%"               "boundary"          "coll"             
##  [4] "fixed"             "fruit"             "invert_match"     
##  [7] "regex"             "sentences"         "str_c"            
## [10] "str_conv"          "str_count"         "str_detect"       
## [13] "str_dup"           "str_ends"          "str_equal"        
## [16] "str_escape"        "str_extract"       "str_extract_all"  
## [19] "str_flatten"       "str_flatten_comma" "str_glue"         
## [22] "str_glue_data"     "str_interp"        "str_length"       
## [25] "str_like"          "str_locate"        "str_locate_all"   
## [28] "str_match"         "str_match_all"     "str_order"        
## [31] "str_pad"           "str_rank"          "str_remove"       
## [34] "str_remove_all"    "str_replace"       "str_replace_all"  
## [37] "str_replace_na"    "str_sort"          "str_split"        
## [40] "str_split_1"       "str_split_fixed"   "str_split_i"      
## [43] "str_squish"        "str_starts"        "str_sub"          
## [46] "str_sub_all"       "str_sub<-"         "str_subset"       
## [49] "str_to_lower"      "str_to_sentence"   "str_to_title"     
## [52] "str_to_upper"      "str_trim"          "str_trunc"        
## [55] "str_unique"        "str_view"          "str_view_all"     
## [58] "str_which"         "str_width"         "str_wrap"         
## [61] "word"              "words"
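
A few quick examples of common stringr functions, using made-up strings:

library(stringr)

x <- c("  tidy Data ", "messy   data")
str_trim(x)                       # drop leading/trailing whitespace
str_squish(x)                     # also collapse internal runs of whitespace
str_to_title(str_squish(x))       # "Tidy Data" "Messy Data"
str_detect(x, "messy")            # FALSE TRUE
str_replace(x, "data", "tables")  # only the lower-case "data" is replaced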

Manipulating Tidy Data

Select, transform, summarize

Group by and Summarize

library(gapminder)
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
gapminder %>%
  filter(year > 1960) %>%
  group_by(continent) %>%
  summarise(avg_life_exp = mean(lifeExp)) %>%
  arrange(desc(avg_life_exp))
## # A tibble: 5 × 2
##   continent avg_life_exp
##   <fct>            <dbl>
## 1 Oceania           75.2
## 2 Europe            73.2
## 3 Americas          66.7
## 4 Asia              62.5
## 5 Africa            50.6
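
The same pattern combines with column selection and transformation. A sketch that computes total GDP per continent and year from the same gapminder data:

gapminder %>%
  select(country, continent, year, pop, gdpPercap) %>%
  mutate(gdp = pop * gdpPercap) %>%           # transform: per-country GDP
  group_by(continent, year) %>%
  summarise(total_gdp = sum(gdp), .groups = "drop") %>%
  arrange(desc(total_gdp))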

Minimizing Repetition

package:purrr

purrr is part of the tidyverse; its basic function map() is very similar to the apply family of functions in base R.

  • The goal is to minimize repetition while maximizing replication
  • Keep your code D.R.Y. (Don’t Repeat Yourself)

The basic arguments for map() are:

  • .x: A list or vector
  • .f: A function or formula

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
# notice you do not need the () after the function name as you would when calling it from the console
map(df, mean)
## $a
## [1] 0.3480497
## 
## $b
## [1] -0.06649011
## 
## $c
## [1] 0.1056875
## 
## $d
## [1] -0.6421021
# the input can also be piped in to make the process even easier
df %>% map(mean)
## $a
## [1] 0.3480497
## 
## $b
## [1] -0.06649011
## 
## $c
## [1] 0.1056875
## 
## $d
## [1] -0.6421021

You may have noticed that the output is not necessarily in the format we might expect: the map() function always returns a list.

There are helper functions that are explicit about the return types of a map function: map_lgl(), map_int(), map_dbl() and map_chr() return an atomic vector of the indicated type (or die trying).

# Should return a numerical vector rather than a list
map_dbl(df, mean)
##           a           b           c           d 
##  0.34804966 -0.06649011  0.10568747 -0.64210213
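
Additional arguments can be passed to .f after the function name, and .f can also be written as a one-sided formula (where .x stands for the current element). A short sketch using the same df:

map_dbl(df, mean, trim = 0.1)    # trimmed mean of each column
map_dbl(df, ~ sum(.x > 0))       # count of positive values in each column
map_chr(df, ~ class(.x)[1])      # class of each column, as a character vector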