A majority of this workshop is based on the Tidy Data paper by Hadley Wickham and the corresponding chapter in the R for Data Science e-book. I highly recommend reading both the paper and the e-book to get a deeper understanding of using R with tidy data.
Tidy data is a set of definitions and guidelines for organizing data in order to facilitate data analysis. Data can be organized in many different ways, and it may seem like there is no single “best” way to arrange it. Take the following imaginary data. Table @ref(tab:040-imaginary-data-1) has two columns and three rows in a commonly seen format:
Name | Treatment A | Treatment B |
---|---|---|
John Smith | — | 2 |
Jane Doe | 16 | 11 |
Mary Johnson | 3 | 1 |
The same data can also be presented as in table @ref(tab:040-imaginary-data-2), this time as a table with three columns and two rows (a transposition of the first table). Neither is more correct than the other, but necessity may dictate that one is easier to work with than the other.
Treatment | John Smith | Jane Doe | Mary Johnson |
---|---|---|---|
Treatment A | — | 16 | 3 |
Treatment B | 2 | 11 | 1 |
Tidy data defines a few terms to “describe the underlying semantics … of the values displayed in [a] table.”
Consider now the same imaginary data as above, but in a tidy format:
Name | Treatment | Outcome |
---|---|---|
John Smith | A | — |
John Smith | B | 2 |
Jane Doe | A | 16 |
Jane Doe | B | 11 |
Mary Johnson | A | 3 |
Mary Johnson | B | 1 |
In tidy data:

- each variable forms a column,
- each observation forms a row, and
- each type of observational unit forms a table.

Tidy data is now well-defined, and messy data is data in any other format. Messy data is not useless, though, and certain analyses and operations require data in a format other than tidy. But tidy data can serve as a launching point (or target) for cleaning, plotting, and analyzing a new data set.
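As a first taste of the tools covered later in this section, here is a minimal sketch of reshaping the first imaginary table into this tidy form, assuming the tidyverse is loaded (the column names are invented for illustration):

library(tidyverse)

# the messy layout: one row per person, one column per treatment
messy <- tribble(
  ~name,          ~treatment_a, ~treatment_b,
  "John Smith",   NA,           2,
  "Jane Doe",     16,           11,
  "Mary Johnson", 3,            1
)

# gather the treatment columns into a treatment/outcome pair of columns
messy %>%
  pivot_longer(cols = starts_with("treatment"),
               names_to = "treatment",
               names_prefix = "treatment_",
               values_to = "outcome")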
The five most common problems of messy data sets are:

- column headers are values, not variable names
- multiple variables are stored in one column
- variables are stored in both rows and columns
- multiple types of observational units are stored in the same table
- a single observational unit is stored in multiple tables
A small set of tools can be used to clean up most of the problems of messy data:

- lengthening and widening with `pivot_longer()` and `pivot_wider()`
- splitting and combining columns with `separate()` and `unite()`
- combining tables with binds and joins (`bind_rows()`, `left_join()`, etc.)

The examples below pair each messy layout with its tidy equivalent.
Messy
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k |
---|---|---|---|---|---|---|
Agnostic | 27 | 34 | 60 | 81 | 76 | 137 |
Atheist | 12 | 27 | 37 | 52 | 35 | 70 |
Buddhist | 27 | 21 | 30 | 34 | 33 | 58 |
Catholic | 418 | 617 | 732 | 670 | 638 | 1116 |
Don’t know/refused | 15 | 14 | 15 | 11 | 10 | 35 |
Evangelical Prot | 575 | 869 | 1064 | 982 | 881 | 1486 |
Hindu | 1 | 9 | 7 | 9 | 11 | 34 |
Historically Black Prot | 228 | 244 | 236 | 238 | 197 | 223 |
Jehovah’s Witness | 20 | 27 | 24 | 24 | 21 | 30 |
Jewish | 19 | 19 | 25 | 25 | 30 | 95 |
Tidy
religion | income | freq |
---|---|---|
Agnostic | <$10k | 27 |
Agnostic | $10-20k | 34 |
Agnostic | $20-30k | 60 |
Agnostic | $30-40k | 81 |
Agnostic | $40-50k | 76 |
Agnostic | $50-75k | 137 |
Agnostic | $75-100k | 122 |
Agnostic | $100-150k | 109 |
Agnostic | >150k | 84 |
Agnostic | Don’t know/refused | 96 |
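This messy table stores values (the income brackets) in its column headers. A minimal sketch of the fix with `pivot_longer()`, assuming the `relig_income` data set that ships with tidyr:

# every column except religion is really a value of an income variable
relig_income %>%
  pivot_longer(cols = !religion,
               names_to = "income",
               values_to = "freq")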
Messy
country | year | m014 | m1524 | m2534 | m3544 | m4554 | m5564 | m65 | mu | f014 |
---|---|---|---|---|---|---|---|---|---|---|
AD | 2000 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | — | — |
AE | 2000 | 2 | 4 | 4 | 6 | 5 | 12 | 10 | — | 3 |
AF | 2000 | 52 | 228 | 183 | 149 | 129 | 94 | 80 | — | 93 |
AG | 2000 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | — | 1 |
AL | 2000 | 2 | 19 | 21 | 14 | 24 | 19 | 16 | — | 3 |
AM | 2000 | 2 | 152 | 130 | 131 | 63 | 26 | 21 | — | 1 |
AN | 2000 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | — | 0 |
AO | 2000 | 186 | 999 | 1003 | 912 | 482 | 312 | 194 | — | 247 |
AR | 2000 | 97 | 278 | 594 | 402 | 419 | 368 | 330 | — | 121 |
AS | 2000 | — | — | — | — | 1 | 1 | — | — | — |
Tidy
country | year | sex | age | cases |
---|---|---|---|---|
AD | 2000 | m | 0-14 | 0 |
AD | 2000 | m | 15-24 | 0 |
AD | 2000 | m | 25-34 | 1 |
AD | 2000 | m | 35-44 | 0 |
AD | 2000 | m | 45-54 | 0 |
AD | 2000 | m | 55-64 | 0 |
AD | 2000 | m | 65+ | 0 |
AE | 2000 | f | 0-14 | 3 |
AE | 2000 | f | 15-24 | 16 |
AE | 2000 | f | 25-34 | 1 |
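Here the column headers encode two variables at once (sex and age). A hedged sketch, assuming a data frame `tb` with the columns shown in the messy table above:

# split each header like "m014" into a sex ("m") and an age range ("014")
tb %>%
  pivot_longer(cols = -c(country, year),
               names_to = c("sex", "age"),
               names_pattern = "([mf])(.*)",
               values_to = "cases",
               values_drop_na = TRUE)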
Messy
id | year | month | element | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9 | d10 | d11 | d12 | d13 | d14 | d15 | d16 | d17 | d18 | d19 | d20 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MX17004 | 2010 | 1 | tmax | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
MX17004 | 2010 | 1 | tmin | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
MX17004 | 2010 | 2 | tmax | — | — | — | — | 27.3 | — | — | — | 24.1 | — | — | — | — | — | — | — | — | — | — | — |
MX17004 | 2010 | 2 | tmin | — | — | — | — | 14.4 | — | — | — | 14.4 | — | — | — | — | — | — | — | — | — | — | — |
MX17004 | 2010 | 3 | tmax | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 32.1 | — | — | — |
MX17004 | 2010 | 3 | tmin | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | 14.2 | — | — | — |
MX17004 | 2010 | 4 | tmax | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
MX17004 | 2010 | 4 | tmin | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
MX17004 | 2010 | 5 | tmax | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
MX17004 | 2010 | 5 | tmin | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — | — |
Tidy
id | date | tmax | tmin |
---|---|---|---|
MX17004 | 2010-02-05 | 27.3 | 14.4 |
MX17004 | 2010-02-09 | 24.1 | 14.4 |
MX17004 | 2010-03-17 | 32.1 | 14.2 |
MX17004 | 2010-07-09 | 28.6 | 17.5 |
MX17004 | 2010-08-17 | 29.6 | 15.8 |
MX17004 | 2010-08-29 | 29.0 | 17.3 |
MX17004 | 2010-10-17 | 27.0 | 14.0 |
MX17004 | 2010-10-25 | 28.1 | 12.9 |
MX17004 | 2010-11-05 | 31.3 | 16.3 |
MX17004 | 2010-11-13 | 27.2 | 12.0 |
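This table has variables in both rows (`element`) and columns (the days), so it takes a lengthen-then-widen pair. A sketch, assuming a data frame `weather` shaped like the messy table above and the lubridate package installed:

weather %>%
  # lengthen the day columns, dropping the many missing cells
  pivot_longer(cols = starts_with("d"),
               names_to = "day",
               names_prefix = "d",
               values_to = "value",
               values_drop_na = TRUE) %>%
  # build a proper date from the year, month, and day pieces
  mutate(date = lubridate::make_date(year, month, as.integer(day))) %>%
  # widen tmax/tmin back out into their own columns
  pivot_wider(names_from = element, values_from = value) %>%
  select(id, date, tmax, tmin)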
Messy
year | artist | track | time | date.entered | wk1 | wk2 | wk3 |
---|---|---|---|---|---|---|---|
2000 | 2 Pac | Baby Don’t Cry | 4M 22S | 2000-02-26 | 87 | 82 | 72 |
2000 | 2Ge+her | The Hardest Part Of Breaking Up | 3M 15S | 2000-09-02 | 91 | 87 | 92 |
2000 | 3 Doors Down | Kryptonite | 3M 53S | 2000-04-08 | 81 | 70 | 68 |
2000 | 3 Doors Down | Loser | 4M 24S | 2000-10-21 | 76 | 76 | 72 |
2000 | 504 Boyz | Wobble Wobble | 3M 35S | 2000-04-15 | 57 | 34 | 25 |
2000 | 98^0 | Give Me Just One Night | 3M 24S | 2000-08-19 | 51 | 39 | 34 |
2000 | A*Teens | Dancing Queen | 3M 44S | 2000-07-08 | 97 | 97 | 96 |
2000 | Aaliyah | I Don’t Wanna | 4M 15S | 2000-01-29 | 84 | 62 | 51 |
2000 | Aaliyah | Try Again | 4M 3S | 2000-03-18 | 59 | 53 | 38 |
2000 | Adams, Yolanda | Open My Heart | 5M 30S | 2000-08-26 | 76 | 76 | 74 |
Half Tidy
year | artist | track | time | date.entered | week | rank |
---|---|---|---|---|---|---|
2000 | 2 Pac | Baby Don’t Cry | 4M 22S | 2000-02-26 | 1 | 87 |
2000 | 2 Pac | Baby Don’t Cry | 4M 22S | 2000-02-26 | 2 | 82 |
2000 | 2 Pac | Baby Don’t Cry | 4M 22S | 2000-02-26 | 3 | 72 |
2000 | 2 Pac | Baby Don’t Cry | 4M 22S | 2000-02-26 | 4 | 77 |
2000 | 2 Pac | Baby Don’t Cry | 4M 22S | 2000-02-26 | 5 | 87 |
2000 | 2 Pac | Baby Don’t Cry | 4M 22S | 2000-02-26 | 6 | 94 |
2000 | 2 Pac | Baby Don’t Cry | 4M 22S | 2000-02-26 | 7 | 99 |
2000 | 2Ge+her | The Hardest Part Of Breaking Up | 3M 15S | 2000-09-02 | 1 | 91 |
2000 | 2Ge+her | The Hardest Part Of Breaking Up | 3M 15S | 2000-09-02 | 2 | 87 |
2000 | 2Ge+her | The Hardest Part Of Breaking Up | 3M 15S | 2000-09-02 | 3 | 92 |
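The half-tidy step is a plain lengthening of the `wk*` columns. A sketch using tidyr's bundled `billboard` data set, whose columns are similar to the messy table above:

billboard %>%
  pivot_longer(cols = starts_with("wk"),
               names_to = "week",
               names_prefix = "wk",                        # keep only the week number
               names_transform = list(week = as.integer),
               values_to = "rank",
               values_drop_na = TRUE)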
Tidy
## Joining with `by = join_by(artist, track, time)`
It is common to find data values for individual observational units spread out over multiple tables or files. An example is the US census data, where each data set is split up by the year the census was conducted (e.g. `census_2020`, `census_2010`, etc.).
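A hedged sketch of stacking such per-year tables back together with `bind_rows()` (the table names and values here are made up for illustration):

# hypothetical per-year tables that share the same columns
census_2010 <- tibble(state = c("OH", "PA"), pop = c(11536504, 12702379))
census_2020 <- tibble(state = c("OH", "PA"), pop = c(11799448, 13002700))

# stack the rows; .id records which table each row came from
bind_rows("2010" = census_2010, "2020" = census_2020, .id = "year")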
pivot_longer()

# cases
table4a %>%
pivot_longer(cols = `1999`:`2000`,
names_to = "year",
values_to = "cases")
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
# population
table4b %>%
pivot_longer(cols = `1999`:`2000`,
names_to = "year",
values_to = "population")
## # A tibble: 6 × 3
## country year population
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 19987071
## 2 Afghanistan 2000 20595360
## 3 Brazil 1999 172006362
## 4 Brazil 2000 174504898
## 5 China 1999 1272915272
## 6 China 2000 1280428583
tidy4a <- table4a %>%
pivot_longer(cols = `1999`:`2000`,
names_to = "year",
values_to = "cases")
# population
tidy4b <- table4b %>%
pivot_longer(cols = `1999`:`2000`,
names_to = "year",
values_to = "population")
# join
left_join(tidy4a, tidy4b)
## Joining with `by = join_by(country, year)`
## # A tibble: 6 × 4
## country year cases population
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
See the dplyr cheatsheet for a more visual depiction of the various types of joins. The key functions for combining tables are:
- `bind_rows()` (base equivalent is `rbind()`)
- `bind_cols()` (base equivalent is `cbind()`)
- `left_join()`, `right_join()`, `full_join()`, `inner_join()`
pivot_wider()

table2 %>%
pivot_wider(id_cols = c(country, year),
names_from = type,
values_from = count)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
`separate()` (the opposite is `unite()`)
table3 %>%
separate(col = rate,
into = c("cases", "population"))
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
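Note that the new `cases` and `population` columns come back as character (`<chr>` above). The `convert` argument of `separate()` will re-type them; a small follow-up sketch:

table3 %>%
  separate(col = rate,
           into = c("cases", "population"),
           convert = TRUE)  # run type.convert() on the new columns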
package:stringr

The `stringr` package is a collection of consistent functions that make operations on strings easier. It is largely a wrapper for the `stringi` package.
ls("package:stringr")
## [1] "%>%" "boundary" "coll"
## [4] "fixed" "fruit" "invert_match"
## [7] "regex" "sentences" "str_c"
## [10] "str_conv" "str_count" "str_detect"
## [13] "str_dup" "str_ends" "str_equal"
## [16] "str_escape" "str_extract" "str_extract_all"
## [19] "str_flatten" "str_flatten_comma" "str_glue"
## [22] "str_glue_data" "str_interp" "str_length"
## [25] "str_like" "str_locate" "str_locate_all"
## [28] "str_match" "str_match_all" "str_order"
## [31] "str_pad" "str_rank" "str_remove"
## [34] "str_remove_all" "str_replace" "str_replace_all"
## [37] "str_replace_na" "str_sort" "str_split"
## [40] "str_split_1" "str_split_fixed" "str_split_i"
## [43] "str_squish" "str_starts" "str_sub"
## [46] "str_sub_all" "str_sub<-" "str_subset"
## [49] "str_to_lower" "str_to_sentence" "str_to_title"
## [52] "str_to_upper" "str_trim" "str_trunc"
## [55] "str_unique" "str_view" "str_view_all"
## [58] "str_which" "str_width" "str_wrap"
## [61] "word" "words"
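As a small taste, a sketch of a few of these functions (the `fruit` character vector ships with stringr):

library(stringr)

# all stringr functions take the string(s) as the first argument
str_detect(c("apple", "banana"), "an")   # FALSE TRUE
str_replace_all("tidy data", " ", "_")   # "tidy_data"
str_to_upper("tidy")                     # "TIDY"
str_subset(fruit, "berry")               # every fruit whose name contains "berry"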
Select, transform, summarize
library(gapminder)
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
gapminder %>%
filter(year > 1960) %>%
group_by(continent) %>%
summarise(avg_life_exp = mean(lifeExp)) %>%
arrange(desc(avg_life_exp))
## # A tibble: 5 × 2
## continent avg_life_exp
## <fct> <dbl>
## 1 Oceania 75.2
## 2 Europe 73.2
## 3 Americas 66.7
## 4 Asia 62.5
## 5 Africa 50.6
package:purrr
purrr
: part of the tidyverse, the basic function
map()
is very similar to the apply
family of
functions
The basic arguments for map()
are:
.x
: A list or vector.f
: A function or formuladf <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
# notice you do not need the () after the function name, as you would when calling it from the console
map(df, mean)
## $a
## [1] 0.3480497
##
## $b
## [1] -0.06649011
##
## $c
## [1] 0.1056875
##
## $d
## [1] -0.6421021
# you can also pipe the input in, making the process even easier
df %>% map(mean)
## $a
## [1] 0.3480497
##
## $b
## [1] -0.06649011
##
## $c
## [1] 0.1056875
##
## $d
## [1] -0.6421021
You may have noticed that the output is not necessarily in the format we might expect: the `map()` function always outputs a list.

There are helper functions that are explicit about the return type of a map function: `map_lgl()`, `map_int()`, `map_dbl()`, and `map_chr()` return an atomic vector of the indicated type (or die trying).
# Should return a numerical vector rather than a list
map_dbl(df, mean)
## a b c d
## 0.34804966 -0.06649011 0.10568747 -0.64210213
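`.f` can also be a one-sided formula (purrr's anonymous-function shorthand), and extra arguments pass through `...`; a small sketch on the same `df`:

# formula shorthand: .x stands for the current element (here, each column)
map_dbl(df, ~ mean(.x, trim = 0.1))

# equivalently, pass extra arguments through ...
map_dbl(df, mean, trim = 0.1)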