MATH/COSC 3570 Introduction to Data Science
Have data organised in an unideal way for our analysis
Want to re-organise the data to carry on with our analysis
The goal of tidyr is to help you tidy your data via
pivoting for going between wide and long data
separating and uniting character columns
clarifying how NA
s should be treated
nesting and unnesting columns
To tidy your data, (1) figure out what the (column) variables and (row) observations are; (2) resolve one of two common problems:
One (column) variable spreads across multiple columns
customers <- read_csv("./data/sales/customers.csv")
more columns than we want!
customers
# A tibble: 2 Ă— 4
customer_id item_1 item_2 item_3
<dbl> <chr> <chr> <chr>
1 1 bread milk banana
2 2 milk toilet paper <NA>
item
showing all purchased times.One (row) subject is scattered across multiple rows
more rows than we want!
# A tibble: 6 Ă— 3
customer_id item_no item
<dbl> <chr> <chr>
1 1 item_1 bread
2 1 item_2 milk
3 1 item_3 banana
4 2 item_1 milk
5 2 item_2 toilet paper
6 2 item_3 <NA>
We may want each row corresponds to one single customer, not one single purchased item.
Which data format we adopt depends on our own research question.
pivot_longer()
and pivot_wider()
To transform our data to the one we want, we use pivot_longer()
and pivot_wider()
.
Starts with a data set,
pivot_longer()
``lengthens” data, adding more rows and decreasing the number of columns.
pivot_wider()
``widens” data, adding more columns and decreasing the number of rows.
pivot_longer()
and pivot_wider()
customers <- read_csv("data/sales/customers.csv")
more columns
customers
# A tibble: 2 Ă— 4
customer_id item_1 item_2 item_3
<dbl> <chr> <chr> <chr>
1 1 bread milk banana
2 2 milk toilet paper <NA>
more rows by pivot_longer()
# A tibble: 6 Ă— 3
customer_id item_no item
<dbl> <chr> <chr>
1 1 item_1 bread
2 1 item_2 milk
3 1 item_3 banana
4 2 item_1 milk
5 2 item_2 toilet paper
6 2 item_3 <NA>
pivot_longer()
data
: data frame
cols
: columns to pivot into longer format (1960, 1970, 2010)
pivot_longer()
names_to
: name of the column where column names of pivoted variables go (year)pivot_longer()
values_to
: name of the column where data values in pivoted variables go (metric)customers
# A tibble: 2 Ă— 4
customer_id item_1 item_2 item_3
<dbl> <chr> <chr> <chr>
1 1 bread milk banana
2 2 milk toilet paper <NA>
purchases <- customers |> pivot_longer(
# variables item_1 to item_3
# to be pivoted into longer format
cols = item_1:item_3,
# col name of the names of item_1:item_3
# item_1 item_2 and item_3
names_to = "item_no",
# col name of the values of item_1:item_3
values_to = "item"
)
purchases
# A tibble: 6 Ă— 3
customer_id item_no item
<dbl> <chr> <chr>
1 1 item_1 bread
2 1 item_2 milk
3 1 item_3 banana
4 2 item_1 milk
5 2 item_2 toilet paper
6 2 item_3 <NA>
In customers data,
Names item_1, item_2, item_3 are values of variable item_no
in purchases
Values bread, milk, etc are values of variable item
in purchases
purchases
data set and the prices
data can now be joined together with the common key variable item
.
prices <- read_csv("./data/sales/prices.csv")
prices
# A tibble: 5 Ă— 2
item price
<chr> <dbl>
1 avocado 0.5
2 banana 0.15
3 bread 1
4 milk 0.8
5 toilet paper 3
purchases |>
left_join(prices) #<<
# A tibble: 6 Ă— 4
customer_id item_no item price
<dbl> <chr> <chr> <dbl>
1 1 item_1 bread 1
2 1 item_2 milk 0.8
3 1 item_3 banana 0.15
4 2 item_1 milk 0.8
5 2 item_2 toilet paper 3
6 2 item_3 <NA> NA
pivot_wider()
data
: data frame
names_from
: which column variable in the long format contains the what should be column names in the wide format (year)
pivot_wider()
data
: data frame
values_from
: which column variable in the long format contains the what should be values in the new columns in the wide format (metric)
purchases
# A tibble: 6 Ă— 3
customer_id item_no item
<dbl> <chr> <chr>
1 1 item_1 bread
2 1 item_2 milk
3 1 item_3 banana
4 2 item_1 milk
5 2 item_2 toilet paper
6 2 item_3 <NA>
purchases |>
pivot_wider(
names_from = item_no,
values_from = item
)
# A tibble: 2 Ă— 4
customer_id item_1 item_2 item_3
<dbl> <chr> <chr> <chr>
1 1 bread milk banana
2 2 milk toilet paper <NA>
17-tidyr
In lab.qmd ## Lab 17
section,
Import trump.csv
. Call it trump_data
as below on the left.
Use pivot_longer()
to transform trump_data
into the data set trump_longer
on the right.
trump_data
# A tibble: 2,702 Ă— 4
subgroup date approval disapproval
<chr> <date> <dbl> <dbl>
1 Voters 2020-10-04 44.7 52.2
2 Adults 2020-10-04 43.2 52.6
3 Adults 2020-10-03 43.2 52.6
4 Voters 2020-10-03 45.0 51.7
5 Adults 2020-10-02 43.3 52.4
6 Voters 2020-10-02 44.5 52.1
# ℹ 2,696 more rows
# A tibble: 5,404 Ă— 4
subgroup date rating_type rating_value
<chr> <date> <chr> <dbl>
1 Voters 2020-10-04 approval 44.7
2 Voters 2020-10-04 disapproval 52.2
3 Adults 2020-10-04 approval 43.2
4 Adults 2020-10-04 disapproval 52.6
5 Adults 2020-10-03 approval 43.2
6 Adults 2020-10-03 disapproval 52.6
# ℹ 5,398 more rows
BONUS đź’·: Use trump_longer
to generate a plot like the one below.
customer_id item_1 item_2 item_3
0 1 bread milk banana
1 2 milk toilet paper NaN
purchases = customers.melt(id_vars=['customer_id'],
value_vars=['item_1', 'item_2', 'item_3'],
var_name='item_no', value_name='item')
purchases
customer_id item_no item
0 1 item_1 bread
1 2 item_1 milk
2 1 item_2 milk
3 2 item_2 toilet paper
4 1 item_3 banana
5 2 item_3 NaN
customer_id item_no item
0 1 item_1 bread
1 2 item_1 milk
2 1 item_2 milk
3 2 item_2 toilet paper
4 1 item_3 banana
5 2 item_3 NaN
item_no item_1 item_2 item_3
customer_id
1 bread milk banana
2 milk toilet paper NaN