MATH/COSC 3570 Introduction to Data Science
Have multiple data frames
Want to bring them together
SQL-like functions
left_join(x, y)
right_join(x, y)
full_join(x, y)
inner_join(x, y)
semi_join(x, y)
anti_join(x, y)
Data sets x
and y
share the same variable id
.
left_join(x, y)
: all rows from x## by = keys
left_join(x, y, by = "id")
# A tibble: 3 Ă— 3
id var_x var_y
<chr> <chr> <chr>
1 01 x1 y1
2 02 x2 y2
3 03 x3 <NA>
NA
is added to the id
not appearing in y
.left_join()
Examplepop_x
state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 Arkansas 2915918
5 California 37253956
6 Colorado 5029196
elec_vote_y
state elec_vote
1 California 55
2 Arizona 11
3 Alabama 9
4 Connecticut 7
5 Alaska 3
6 Delaware 3
pop_x |>
left_join(elec_vote_y) #<<
state population elec_vote
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 Arkansas 2915918 NA
5 California 37253956 55
6 Colorado 5029196 NA
Connecticut
and Delaware
in elec_vote_y
will not be shown in the left-joined data because they are not in pop_x
.library(tidyverse)
library(dslabs)
pop_x <- murders |>
slice(1:6) |>
select(state, population)
elec_vote_y <- results_us_election_2016 |>
filter(state %in% c("Alabama", "Alaska", "Arizona",
"California", "Connecticut", "Delaware")) |>
select(state, electoral_votes) |>
rename(elec_vote = electoral_votes)
right_join(x, y)
: all rows from yright_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 3 Ă— 3
id var_x var_y
<chr> <chr> <chr>
1 01 x1 y1
2 02 x2 y2
3 04 <NA> y4
NA
is in the column coming from x
.right_join()
Examplepop_x
state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 Arkansas 2915918
5 California 37253956
6 Colorado 5029196
elec_vote_y
state elec_vote
1 California 55
2 Arizona 11
3 Alabama 9
4 Connecticut 7
5 Alaska 3
6 Delaware 3
pop_x |>
right_join(elec_vote_y) #<<
state population elec_vote
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 California 37253956 55
5 Connecticut NA 7
6 Delaware NA 3
Arkansas
and Colorado
in pop_x
will not be shown in the right-joined data because they are not in elec_vote_y
.full_join(x, y)
: all rows from both x and yfull_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 4 Ă— 3
id var_x var_y
<chr> <chr> <chr>
1 01 x1 y1
2 02 x2 y2
3 03 x3 <NA>
4 04 <NA> y4
NA
s.full_join()
Examplepop_x
state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 Arkansas 2915918
5 California 37253956
6 Colorado 5029196
elec_vote_y
state elec_vote
1 California 55
2 Arizona 11
3 Alabama 9
4 Connecticut 7
5 Alaska 3
6 Delaware 3
pop_x |>
full_join(elec_vote_y) #<<
state population elec_vote
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 Arkansas 2915918 NA
5 California 37253956 55
6 Colorado 5029196 NA
7 Connecticut NA 7
8 Delaware NA 3
full_join()
takes the union of observations of x
and y
, so it produces the data set with the most rows.inner_join(x, y)
: only rows w/ keys in both x and yinner_join(x, y)
Joining with `by = join_by(id)`
# A tibble: 2 Ă— 3
id var_x var_y
<chr> <chr> <chr>
1 01 x1 y1
2 02 x2 y2
inner_join()
Examplepop_x
state population
1 Alabama 4779736
2 Alaska 710231
3 Arizona 6392017
4 Arkansas 2915918
5 California 37253956
6 Colorado 5029196
elec_vote_y
state elec_vote
1 California 55
2 Arizona 11
3 Alabama 9
4 Connecticut 7
5 Alaska 3
6 Delaware 3
pop_x |>
inner_join(elec_vote_y) #<<
state population elec_vote
1 Alabama 4779736 9
2 Alaska 710231 3
3 Arizona 6392017 11
4 California 37253956 55
16-Joining tables
In lab.qmd ## Lab 16
section
diamond_color
.diamond_color <- readr::read_csv("the url")
left_join()
to combine the data set diamonds
in ggplot2 and diamond_color
by the key variable color
.carat
, color
, Description
, Details
.color
.# A tibble: 53,940 Ă— 4
carat color Description Details
<dbl> <chr> <chr> <chr>
1 0.23 E Colorless Minute traces of color
2 0.21 E Colorless Minute traces of color
3 0.23 E Colorless Minute traces of color
4 0.29 I Near Colorless Slightly detectable color
5 0.31 J Near Colorless Slightly detectable color
6 0.24 J Near Colorless Slightly detectable color
# ℹ 53,934 more rows
murders = pd.read_csv('./data/murders.csv')
pop_x = murders[0:6][['state','population']]
election = pd.read_csv('./data/results_us_election_2016.csv')
raws1 = ["Alabama", "Alaska", "Arizona", "California", "Connecticut", "Delaware"]
cols1 = ["state", "electoral_votes"]
df = election[cols1]
pop = []
for i in raws1:
mask = df["state"] == i
pos = np.flatnonzero(mask)
pop.append(pos)
pop = np.array(pop)
pop = np.resize(pop, 6)
elec_vote_y = df.iloc[pop]