Tidying Data đź§ą

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Grammar of Data Tidying

Grammar of Data Tidying

  • Have data organised in an unideal way for our analysis

  • Want to re-organise the data to carry on with our analysis


The goal of tidyr is to help you tidy your data via

  • pivoting for going between wide and long data

  • separating and uniting character columns

  • clarifying how NAs should be treated

  • nesting and unnesting columns

Wide Data

To tidy your data, (1) figure out what the (column) variables and (row) observations are; (2) resolve one of two common problems:

One (column) variable spreads across multiple columns

customers <- read_csv("./data/sales/customers.csv")

wider (\(2 \times 4\))

more columns than we want!

customers
# A tibble: 2 Ă— 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>  
  • We may want one single column variable item showing all purchased times.

Long Data

One (row) subject is scattered across multiple rows

longer (\(6 \times 3\))

more rows than we want!

# A tibble: 6 Ă— 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>        
  • We may want each row corresponds to one single customer, not one single purchased item.

  • Which data format we adopt depends on our own research question.

pivot_longer() and pivot_wider()

  • To transform our data to the one we want, we use pivot_longer() and pivot_wider().

  • Starts with a data set,

    • pivot_longer() ``lengthens” data, adding more rows and decreasing the number of columns.

    • pivot_wider() ``widens” data, adding more columns and decreasing the number of rows.

pivot_longer() and pivot_wider()

Data: sales/customers.csv

customers <- read_csv("data/sales/customers.csv")

wider (\(2 \times 4\))

more columns

customers
# A tibble: 2 Ă— 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>  

longer (\(6 \times 3\))

more rows by pivot_longer()

# A tibble: 6 Ă— 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>        

pivot_longer()

  • data: data frame

  • cols: columns to pivot into longer format (1960, 1970, 2010)

data |> pivot_longer(
    cols, 
    names_to = "name", 
    values_to = "value")

Source: https://tavareshugo.github.io/r-intro-tidyverse-gapminder/09-reshaping/index.html

pivot_longer()

  • names_to: name of the column where column names of pivoted variables go (year)
data |> pivot_longer(
    cols, 
    names_to = "name", 
    values_to = "value")

Source: https://tavareshugo.github.io/r-intro-tidyverse-gapminder/09-reshaping/index.html

pivot_longer()

  • values_to: name of the column where data values in pivoted variables go (metric)
data |> pivot_longer(
    cols, 
    names_to = "name", 
    values_to = "value")

Source: https://tavareshugo.github.io/r-intro-tidyverse-gapminder/09-reshaping/index.html

customers \(\rightarrow\) purchases

customers
# A tibble: 2 Ă— 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>  
purchases <- customers |> pivot_longer(
    # variables item_1 to item_3 
    # to be pivoted into longer format 
    cols = item_1:item_3,
    
    # col name of the names of item_1:item_3
    # item_1 item_2 and item_3 
    names_to = "item_no",
    
    # col name of the values of item_1:item_3
    values_to = "item"
    )
purchases
# A tibble: 6 Ă— 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>        

In customers data,

  • Names item_1, item_2, item_3 are values of variable item_no in purchases

  • Values bread, milk, etc are values of variable item in purchases

Why Pivot?

  • The next step of your analysis needs it.
  • The new purchases data set and the prices data can now be joined together with the common key variable item.
prices <- read_csv("./data/sales/prices.csv")
prices
# A tibble: 5 Ă— 2
  item         price
  <chr>        <dbl>
1 avocado       0.5 
2 banana        0.15
3 bread         1   
4 milk          0.8 
5 toilet paper  3   
purchases |> 
    left_join(prices) #<<
# A tibble: 6 Ă— 4
  customer_id item_no item         price
        <dbl> <chr>   <chr>        <dbl>
1           1 item_1  bread         1   
2           1 item_2  milk          0.8 
3           1 item_3  banana        0.15
4           2 item_1  milk          0.8 
5           2 item_2  toilet paper  3   
6           2 item_3  <NA>         NA   

pivot_wider()

  • data: data frame

  • names_from: which column variable in the long format contains the what should be column names in the wide format (year)

data |> pivot_wider(
    names_from = column_variable_1, 
    values_from = column_variable_2)

Source: https://tavareshugo.github.io/r-intro-tidyverse-gapminder/09-reshaping/index.html

pivot_wider()

  • data: data frame

  • values_from: which column variable in the long format contains the what should be values in the new columns in the wide format (metric)

data |> pivot_wider(
    names_from = column_variable_1, 
    values_from = column_variable_2)

Source: https://tavareshugo.github.io/r-intro-tidyverse-gapminder/09-reshaping/index.html

purchases \(\rightarrow\) customers

purchases
# A tibble: 6 Ă— 3
  customer_id item_no item        
        <dbl> <chr>   <chr>       
1           1 item_1  bread       
2           1 item_2  milk        
3           1 item_3  banana      
4           2 item_1  milk        
5           2 item_2  toilet paper
6           2 item_3  <NA>        
purchases |> 
    pivot_wider(              
        names_from = item_no, 
        values_from = item    
    ) 
# A tibble: 2 Ă— 4
  customer_id item_1 item_2       item_3
        <dbl> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>  

Source: FiveThirtyEight

17-tidyr

In lab.qmd ## Lab 17 section,

  • Import trump.csv. Call it trump_data as below on the left.

  • Use pivot_longer() to transform trump_data into the data set trump_longer on the right.

trump_data
# A tibble: 2,702 Ă— 4
  subgroup date       approval disapproval
  <chr>    <date>        <dbl>       <dbl>
1 Voters   2020-10-04     44.7        52.2
2 Adults   2020-10-04     43.2        52.6
3 Adults   2020-10-03     43.2        52.6
4 Voters   2020-10-03     45.0        51.7
5 Adults   2020-10-02     43.3        52.4
6 Voters   2020-10-02     44.5        52.1
# ℹ 2,696 more rows
trump_longer <- ______________
    pivot_longer(
        cols = ____________,
        names_to = _______________,
        values_to = _______________
    ) 
# A tibble: 5,404 Ă— 4
  subgroup date       rating_type rating_value
  <chr>    <date>     <chr>              <dbl>
1 Voters   2020-10-04 approval            44.7
2 Voters   2020-10-04 disapproval         52.2
3 Adults   2020-10-04 approval            43.2
4 Adults   2020-10-04 disapproval         52.6
5 Adults   2020-10-03 approval            43.2
6 Adults   2020-10-03 disapproval         52.6
# ℹ 5,398 more rows

BONUS đź’·: Use trump_longer to generate a plot like the one below.

Pivoting

pd.melt()

pd.pivot()

pd.melt()

import numpy as np
import pandas as pd
customers = pd.read_csv('./data/sales/customers.csv')
customers
   customer_id item_1        item_2  item_3
0            1  bread          milk  banana
1            2   milk  toilet paper     NaN
purchases = customers.melt(id_vars=['customer_id'], 
                           value_vars=['item_1', 'item_2', 'item_3'],
                           var_name='item_no', value_name='item')
purchases
   customer_id item_no          item
0            1  item_1         bread
1            2  item_1          milk
2            1  item_2          milk
3            2  item_2  toilet paper
4            1  item_3        banana
5            2  item_3           NaN

pd.pivot()

purchases
   customer_id item_no          item
0            1  item_1         bread
1            2  item_1          milk
2            1  item_2          milk
3            2  item_2  toilet paper
4            1  item_3        banana
5            2  item_3           NaN
purchases.pivot(index = "customer_id", columns = "item_no", values = "item")
item_no     item_1        item_2  item_3
customer_id                             
1            bread          milk  banana
2             milk  toilet paper     NaN