Data Wrangling - two data frames 🛠

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

Joining data frames

Have multiple data frames
Want to bring them together
SQL-like functions
- left_join(x, y)
- right_join(x, y)
- full_join(x, y)
- inner_join(x, y)
- semi_join(x, y)
- anti_join(x, y)
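
The last two functions in the list are filtering joins: they keep or drop rows of x depending on whether a match exists in y, and they add no columns from y. As a minimal sketch of that idea in pandas (which has no dedicated semi/anti join method), assuming two small made-up tables x and y that share an id column:

```python
import pandas as pd

# Small made-up tables for illustration; "id" is the shared key.
x = pd.DataFrame({"id": ["01", "02", "03"], "var_x": ["x1", "x2", "x3"]})
y = pd.DataFrame({"id": ["01", "02", "04"], "var_y": ["y1", "y2", "y4"]})

# True for each row of x whose key also appears in y.
has_match = x["id"].isin(y["id"])

semi = x[has_match]    # like semi_join(x, y): matched rows of x, x's columns only
anti = x[~has_match]   # like anti_join(x, y): rows of x with no match in y

print(semi)
print(anti)
```

Using isin() on the key column is one common way to express these two joins; it assumes a single key column.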

Setup

Data sets x and y share the same variable id.

x <- tibble(
    id = c("01", "02", "03"),
    var_x = c("x1", "x2", "x3")
    )

# A tibble: 3 × 2
  id    var_x
  <chr> <chr>
1 01    x1   
2 02    x2   
3 03    x3

y <- tibble(
    id = c("01", "02", "04"),
    var_y = c("y1", "y2", "y4")
    )

# A tibble: 3 × 2
  id    var_y
  <chr> <chr>
1 01    y1   
2 02    y2   
3 04    y4

`left_join(x, y)`: all rows from x

## by = keys
left_join(x, y, by = "id")

# A tibble: 3 × 3
  id    var_x var_y
  <chr> <chr> <chr>
1 01    x1    y1   
2 02    x2    y2   
3 03    x3    <NA>

var_y is NA for the id (03) that does not appear in y.

left_join(x, y) keeps all the rows (observations) of x and all the variables of both x and y: id, var_x, and var_y.
The variables used to connect the two tables are called keys, and the by argument tells dplyr which variables are the keys.
By default, the function uses every variable that appears in both tables as a key. Here the default key is also "id", because "id" is the only variable shared by the two data sets.
In the result, left_join() keeps the entire data set x and attaches the columns of y to it by matching on "id".
Because y has no row with id 03, var_y for that row is the missing value NA.
In a Venn diagram of the two tables, x is the main set A and y is B: we keep everything in A and add the information from B only for observations that are also in A.
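
The same default-key behavior can be sketched in pandas, where merge() with no on argument also joins on the columns common to both tables. This is an illustration only, using pandas copies of x and y built here:

```python
import pandas as pd

x = pd.DataFrame({"id": ["01", "02", "03"], "var_x": ["x1", "x2", "x3"]})
y = pd.DataFrame({"id": ["01", "02", "04"], "var_y": ["y1", "y2", "y4"]})

# Columns present in both tables; these act as the default keys.
common = [c for c in x.columns if c in y.columns]

explicit = x.merge(y, how="left", on="id")   # key named explicitly
default = x.merge(y, how="left")             # key inferred from the common columns

print(common)
print(explicit)
```

Naming the key explicitly is usually the safer habit, since two tables may share a column name that is not meant to be a key.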

`left_join()` Example

pop_x

       state population
1    Alabama    4779736
2     Alaska     710231
3    Arizona    6392017
4   Arkansas    2915918
5 California   37253956
6   Colorado    5029196

elec_vote_y

        state elec_vote
1  California        55
2     Arizona        11
3     Alabama         9
4 Connecticut         7
5      Alaska         3
6    Delaware         3

pop_x |> 
    left_join(elec_vote_y)

       state population elec_vote
1    Alabama    4779736         9
2     Alaska     710231         3
3    Arizona    6392017        11
4   Arkansas    2915918        NA
5 California   37253956        55
6   Colorado    5029196        NA

Connecticut and Delaware in elec_vote_y will not be shown in the left-joined data because they are not in pop_x.

Code for generating the data sets

library(tidyverse)
library(dslabs)
pop_x <- murders |> 
    slice(1:6) |>
    select(state, population)

elec_vote_y <- results_us_election_2016 |> 
    filter(state %in% c("Alabama", "Alaska", "Arizona", 
                        "California", "Connecticut", "Delaware")) |> 
    select(state, electoral_votes) |> 
    rename(elec_vote = electoral_votes)

`right_join(x, y)`: all rows from y

right_join(x, y)

Joining with `by = join_by(id)`

# A tibble: 3 × 3
  id    var_x var_y
  <chr> <chr> <chr>
1 01    x1    y1   
2 02    x2    y2   
3 04    <NA>  y4

NA is in the column coming from x.

`right_join()` Example

pop_x

       state population
1    Alabama    4779736
2     Alaska     710231
3    Arizona    6392017
4   Arkansas    2915918
5 California   37253956
6   Colorado    5029196

elec_vote_y

        state elec_vote
1  California        55
2     Arizona        11
3     Alabama         9
4 Connecticut         7
5      Alaska         3
6    Delaware         3

pop_x |> 
    right_join(elec_vote_y)

        state population elec_vote
1     Alabama    4779736         9
2      Alaska     710231         3
3     Arizona    6392017        11
4  California   37253956        55
5 Connecticut         NA         7
6    Delaware         NA         3

Arkansas and Colorado in pop_x will not be shown in the right-joined data because they are not in elec_vote_y.
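
A right join is a left join with the roles of the two tables swapped; only the column order differs. A small pandas sketch of that relationship, assuming made-up tables x and y with a shared id key:

```python
import pandas as pd

x = pd.DataFrame({"id": ["01", "02", "03"], "var_x": ["x1", "x2", "x3"]})
y = pd.DataFrame({"id": ["01", "02", "04"], "var_y": ["y1", "y2", "y4"]})

right = x.merge(y, how="right", on="id")    # all rows from y
swapped = y.merge(x, how="left", on="id")   # all rows from y, tables swapped


def tidy(df):
    # Put columns and rows in a common order so only the contents are compared.
    return df[["id", "var_x", "var_y"]].sort_values("id").reset_index(drop=True)


same_contents = tidy(right).equals(tidy(swapped))

print(list(right.columns))    # x's columns come first
print(list(swapped.columns))  # y's columns come first
print(same_contents)
```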

`full_join(x, y)`: all rows from both x and y

full_join(x, y)

Joining with `by = join_by(id)`

# A tibble: 4 × 3
  id    var_x var_y
  <chr> <chr> <chr>
1 01    x1    y1   
2 02    x2    y2   
3 03    x3    <NA> 
4 04    <NA>  y4

Keep all the rows and fill the missing parts with NAs.

`full_join()` Example

pop_x

       state population
1    Alabama    4779736
2     Alaska     710231
3    Arizona    6392017
4   Arkansas    2915918
5 California   37253956
6   Colorado    5029196

elec_vote_y

        state elec_vote
1  California        55
2     Arizona        11
3     Alabama         9
4 Connecticut         7
5      Alaska         3
6    Delaware         3

pop_x |> 
    full_join(elec_vote_y)

        state population elec_vote
1     Alabama    4779736         9
2      Alaska     710231         3
3     Arizona    6392017        11
4    Arkansas    2915918        NA
5  California   37253956        55
6    Colorado    5029196        NA
7 Connecticut         NA         7
8    Delaware         NA         3

full_join() takes the union of observations of x and y, so it produces the data set with the most rows.
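
To see which table each row of a full join came from, pandas offers the indicator argument of merge(), which adds a _merge column. A minimal sketch, assuming pandas copies of the small x and y tables:

```python
import pandas as pd

x = pd.DataFrame({"id": ["01", "02", "03"], "var_x": ["x1", "x2", "x3"]})
y = pd.DataFrame({"id": ["01", "02", "04"], "var_y": ["y1", "y2", "y4"]})

# how="outer" is the pandas counterpart of full_join();
# indicator=True records the origin of each row in a "_merge" column.
full = x.merge(y, how="outer", on="id", indicator=True)

# Map each id to "both", "left_only", or "right_only".
origin = dict(zip(full["id"], full["_merge"].astype(str)))

print(full)
print(origin)
```

Rows marked left_only or right_only are exactly the ones that receive missing values from the other table.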

`inner_join(x, y)`: only rows w/ keys in both x and y

inner_join(x, y)

Joining with `by = join_by(id)`

# A tibble: 2 × 3
  id    var_x var_y
  <chr> <chr> <chr>
1 01    x1    y1   
2 02    x2    y2

Keep only the rows that have information in both tables.
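
Put differently, the keys of an inner join are the intersection of the keys of the two tables. A short pandas sketch of that check, assuming made-up x and y tables in which every id is unique:

```python
import pandas as pd

x = pd.DataFrame({"id": ["01", "02", "03"], "var_x": ["x1", "x2", "x3"]})
y = pd.DataFrame({"id": ["01", "02", "04"], "var_y": ["y1", "y2", "y4"]})

# Keys that appear in both tables.
shared_keys = set(x["id"]) & set(y["id"])

inner = x.merge(y, how="inner", on="id")

print(sorted(shared_keys))
print(inner)
```

With duplicated keys an inner join can return more rows than there are shared key values, so the row count matches the intersection only when keys are unique.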

`inner_join()` Example

pop_x

       state population
1    Alabama    4779736
2     Alaska     710231
3    Arizona    6392017
4   Arkansas    2915918
5 California   37253956
6   Colorado    5029196

elec_vote_y

        state elec_vote
1  California        55
2     Arizona        11
3     Alabama         9
4 Connecticut         7
5      Alaska         3
6    Delaware         3

pop_x |> 
    inner_join(elec_vote_y)

       state population elec_vote
1    Alabama    4779736         9
2     Alaska     710231         3
3    Arizona    6392017        11
4 California   37253956        55

16-Joining tables

In lab.qmd ## Lab 16 section

Import the data at https://www.jaredlander.com/data/DiamondColors.csv. Call it diamond_color.

diamond_color <- readr::read_csv("the url")

Use left_join() to combine the data set diamonds in ggplot2 and diamond_color by the key variable color.

Select the variables carat, color, Description, Details.

## Variable "color" in diamonds but "Color" in diamond_color

joined_df <- diamonds |>  
    _______(_______, by = c('color' = 'Color')) |>  ## join
    _______(_________________________________________)  ## select

Create a bar chart of the variable color.

# A tibble: 53,940 × 4
  carat color Description    Details                  
  <dbl> <chr> <chr>          <chr>                    
1  0.23 E     Colorless      Minute traces of color   
2  0.21 E     Colorless      Minute traces of color   
3  0.23 E     Colorless      Minute traces of color   
4  0.29 I     Near Colorless Slightly detectable color
5  0.31 J     Near Colorless Slightly detectable color
6  0.24 J     Near Colorless Slightly detectable color
# ℹ 53,934 more rows
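
The by = c('color' = 'Color') pattern in this lab handles key columns whose names differ between the two tables. For comparison, a pandas sketch of the same idea using left_on and right_on. The tables stones and color_info are hypothetical miniature stand-ins for diamonds and diamond_color, filled with a few values from the output shown above:

```python
import pandas as pd

# Hypothetical miniature stand-ins for diamonds and diamond_color.
stones = pd.DataFrame({"carat": [0.23, 0.29, 0.31], "color": ["E", "I", "J"]})
color_info = pd.DataFrame({
    "Color": ["E", "I", "J"],
    "Description": ["Colorless", "Near Colorless", "Near Colorless"],
})

# The key is "color" in the left table and "Color" in the right table.
joined = stones.merge(color_info, how="left", left_on="color", right_on="Color")

# Both key columns survive the merge; keep only the ones we want.
joined = joined[["carat", "color", "Description"]]

print(joined)
```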

Joining Data Frames

pd.merge()

Left join

import numpy as np
import pandas as pd

pop_x

        state  population
0     Alabama     4779736
1      Alaska      710231
2     Arizona     6392017
3    Arkansas     2915918
4  California    37253956
5    Colorado     5029196

elec_vote_y

          state  electoral_votes
21      Alabama                9
43       Alaska                3
13      Arizona               11
0    California               55
26  Connecticut                7
44     Delaware                3

## dplyr::left_join()
pop_x.merge(right=elec_vote_y, how='left', on='state')

        state  population  electoral_votes
0     Alabama     4779736              9.0
1      Alaska      710231              3.0
2     Arizona     6392017             11.0
3    Arkansas     2915918              NaN
4  California    37253956             55.0
5    Colorado     5029196              NaN
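
In the output above, electoral_votes is printed as 9.0 rather than 9: the unmatched rows hold NaN, which forces pandas to store the column as floats. A small sketch of this effect and one possible remedy, assuming a nullable integer column is wanted. The tables here are three-state subsets typed in by hand from the values shown above:

```python
import pandas as pd

# Hand-typed subsets of the tables shown above.
pop_x = pd.DataFrame({
    "state": ["Alabama", "Arkansas", "California"],
    "population": [4779736, 2915918, 37253956],
})
elec_vote_y = pd.DataFrame({
    "state": ["Alabama", "California"],
    "electoral_votes": [9, 55],
})

joined = pop_x.merge(elec_vote_y, how="left", on="state")

# NaN for Arkansas turns the integer column into float64.
dtype_before = joined["electoral_votes"].dtype
print(dtype_before)

# Assumption: whole numbers are preferred; "Int64" is pandas' nullable
# integer dtype, which stores the missing value as <NA>.
joined["electoral_votes"] = joined["electoral_votes"].astype("Int64")
print(joined)
```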

Code for generating the data sets

murders = pd.read_csv('./data/murders.csv')
## first six states, keeping only state and population
pop_x = murders[['state', 'population']].head(6)

election = pd.read_csv('./data/results_us_election_2016.csv')
states = ["Alabama", "Alaska", "Arizona", "California", "Connecticut", "Delaware"]
df = election[["state", "electoral_votes"]]
## position of each state's row, in the order listed above
## (assumes every state appears exactly once in the file)
rows = [np.flatnonzero(df["state"] == s)[0] for s in states]
elec_vote_y = df.iloc[rows]

## dplyr::right_join()
pop_x.merge(elec_vote_y, how = 'right')

         state  population  electoral_votes
0      Alabama   4779736.0                9
1       Alaska    710231.0                3
2      Arizona   6392017.0               11
3   California  37253956.0               55
4  Connecticut         NaN                7
5     Delaware         NaN                3

## dplyr::full_join()
pop_x.merge(elec_vote_y, how = 'outer')

         state  population  electoral_votes
0      Alabama   4779736.0              9.0
1       Alaska    710231.0              3.0
2      Arizona   6392017.0             11.0
3     Arkansas   2915918.0              NaN
4   California  37253956.0             55.0
5     Colorado   5029196.0              NaN
6  Connecticut         NaN              7.0
7     Delaware         NaN              3.0

## dplyr::inner_join()
pop_x.merge(elec_vote_y, how = 'inner')

        state  population  electoral_votes
0     Alabama     4779736                9
1      Alaska      710231                3
2     Arizona     6392017               11
3  California    37253956               55
