MATH/COSC 3570 Introduction to Data Science
mutate
: create new columns from the existing1
filter
: pick rows matching criteria
slice
: pick rows using index(es)
distinct
: filter for unique rows
select
: pick columns by name
summarise
: reduce variables to valuesgroup_by
: for grouped operationsarrange
: reorder rowsFirst argument is always a data frame
Subsequent arguments say what to do with that data frame
Always return a data frame
Don’t modify in place
(murders <- read_csv("./data/murders.csv"))
# A tibble: 51 Ă— 5
state abb region population total
<chr> <chr> <chr> <dbl> <dbl>
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
# ℹ 45 more rows
mutate()
dplyr::mutate()
takes
name = values
.(murders <- murders |>
mutate(rate = total / population * 100000)) #<<
# A tibble: 51 Ă— 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Alabama AL South 4779736 135 2.82
2 Alaska AK West 710231 19 2.68
3 Arizona AZ West 6392017 232 3.63
4 Arkansas AR South 2915918 93 3.19
5 California CA West 37253956 1257 3.37
6 Colorado CO West 5029196 65 1.29
# ℹ 45 more rows
total
and population
inside the function are not defined in our R environment.
dplyr
functions know to look for variables in the data frame provided in the 1st argument.
filter()
dplyr::filter()
takes a
# filter the data table to only show the entries for which
# the murder rate is lower than 0.7
murders |>
filter(rate < 0.7) #<<
# A tibble: 5 Ă— 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Hawaii HI West 1360301 7 0.515
2 Iowa IA North Central 3046355 21 0.689
3 New Hampshire NH Northeast 1316470 5 0.380
4 North Dakota ND North Central 672591 4 0.595
5 Vermont VT Northeast 625741 2 0.320
filter()
for Many Conditions at Oncemurders |>
filter(rate > 0.1 & rate < 0.7, #<<
region == "Northeast") #<<
# A tibble: 2 Ă— 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 New Hampshire NH Northeast 1316470 5 0.380
2 Vermont VT Northeast 625741 2 0.320
operator | definition | operator | definition |
---|---|---|---|
< |
less than |
x | y
|
x OR y
|
<= |
less than or equal to | is.na(x) |
if x is NA
|
> |
greater than | !is.na(x) |
if x is not NA
|
>= |
greater than or equal to | x %in% y |
if x is in y
|
== |
exactly equal to | !(x %in% y) |
if x is not in y
|
!= |
not equal to | !x |
not x
|
x & y |
x AND y
|
slice()
for Certain Rows using Indexes# 3rd to 6th row
murders |>
slice(3:6)
# A tibble: 4 Ă— 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Arizona AZ West 6392017 232 3.63
2 Arkansas AR South 2915918 93 3.19
3 California CA West 37253956 1257 3.37
4 Colorado CO West 5029196 65 1.29
How do we subset rows using matrix indexing?
murders[3:6, ]
# A tibble: 4 Ă— 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Arizona AZ West 6392017 232 3.63
2 Arkansas AR South 2915918 93 3.19
3 California CA West 37253956 1257 3.37
4 Colorado CO West 5029196 65 1.29
distinct()
to Filter for Unique Rows# Select only unique/distinct rows from a data frame
murders |> distinct(region) ## default
# A tibble: 4 Ă— 1
region
<chr>
1 South
2 West
3 Northeast
4 North Central
murders |> distinct(region, .keep_all = TRUE) ## keep all other variables
# A tibble: 4 Ă— 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Alabama AL South 4779736 135 2.82
2 Alaska AK West 710231 19 2.68
3 Connecticut CT Northeast 3574097 97 2.71
4 Illinois IL North Central 12830632 364 2.84
distinct()
Grabs First Row of The Unique Valuemurders |> distinct(region, .keep_all = TRUE)
# A tibble: 4 Ă— 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Alabama AL South 4779736 135 2.82
2 Alaska AK West 710231 19 2.68
3 Connecticut CT Northeast 3574097 97 2.71
4 Illinois IL North Central 12830632 364 2.84
murders |> slice(1:5)
# A tibble: 5 Ă— 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Alabama AL South 4779736 135 2.82
2 Alaska AK West 710231 19 2.68
3 Arizona AZ West 6392017 232 3.63
4 Arkansas AR South 2915918 93 3.19
5 California CA West 37253956 1257 3.37
select()
In dplyr::select()
, the 1st argument is a data frame, followed by variable names being selected in the data.
The order of variable names matters!
names(murders)
[1] "state" "abb" "region" "population" "total"
[6] "rate"
# select three columns, assign this to a new object
murders |> select(region, rate, state)
# A tibble: 51 Ă— 3
region rate state
<chr> <dbl> <chr>
1 South 2.82 Alabama
2 West 2.68 Alaska
3 West 3.63 Arizona
4 South 3.19 Arkansas
5 West 3.37 California
6 West 1.29 Colorado
# ℹ 45 more rows
select()
to Exclude Variables## exclude variable population
murders |> select(-population)
# A tibble: 51 Ă— 5
state abb region total rate
<chr> <chr> <chr> <dbl> <dbl>
1 Alabama AL South 135 2.82
2 Alaska AK West 19 2.68
3 Arizona AZ West 232 3.63
4 Arkansas AR South 93 3.19
5 California CA West 1257 3.37
6 Colorado CO West 65 1.29
# ℹ 45 more rows
select()
a Range of Variablesnames(murders)
[1] "state" "abb" "region" "population" "total"
[6] "rate"
## from region to rate
murders |> select(region:rate)
# A tibble: 51 Ă— 4
region population total rate
<chr> <dbl> <dbl> <dbl>
1 South 4779736 135 2.82
2 West 710231 19 2.68
3 West 6392017 232 3.63
4 South 2915918 93 3.19
5 West 37253956 1257 3.37
6 West 5029196 65 1.29
# ℹ 45 more rows
select()
Variables with Certain Characteristicsstarts_with()
is a tidy-select helper function.murders |> select(starts_with("r"))
# A tibble: 51 Ă— 2
region rate
<chr> <dbl>
1 South 2.82
2 West 2.68
3 West 3.63
4 South 3.19
5 West 3.37
6 West 1.29
# ℹ 45 more rows
select()
Variables with Certain Characteristicsends_with()
is a tidy-select helper function.murders |> select(ends_with("ion"))
# A tibble: 51 Ă— 2
region population
<chr> <dbl>
1 South 4779736
2 West 710231
3 West 6392017
4 South 2915918
5 West 37253956
6 West 5029196
# ℹ 45 more rows
starts_with()
: Starts with a prefixends_with()
: Ends with a suffixcontains()
: Contains a literal stringnum_range()
: Matches a numerical range like x01, x02, x03one_of()
: Matches variable names in a character vectoreverything()
: Matches all variableslast_col()
: Select last variable, possibly with an offsetmatches()
: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)How do we show three variables (state, region, rate) for states that have murder rates below 0.7?
new_table
new_table <- select(murders, state, region, rate)
filter(new_table, rate < 0.7)
# A tibble: 5 Ă— 3
state region rate
<chr> <chr> <dbl>
1 Hawaii West 0.515
2 Iowa North Central 0.689
3 New Hampshire Northeast 0.380
4 North Dakota North Central 0.595
5 Vermont Northeast 0.320
How do we show three variables (state, region, rate) for states that have murder rates below 0.7?
## not so easy to read and understand
filter(select(murders, state, region, rate), rate < 0.7)
# A tibble: 5 Ă— 3
state region rate
<chr> <chr> <dbl>
1 Hawaii West 0.515
2 Iowa North Central 0.689
3 New Hampshire Northeast 0.380
4 North Dakota North Central 0.595
5 Vermont Northeast 0.320
data > select() > data after selecting > filter() > data after selecting and filtering
murders |>
select(state, region, rate) |>
filter(rate < 0.7)
# A tibble: 5 Ă— 3
state region rate
<chr> <chr> <dbl>
1 Hawaii West 0.515
2 Iowa North Central 0.689
3 New Hampshire Northeast 0.380
4 North Dakota North Central 0.595
5 Vermont Northeast 0.320
summarize()
summarize()
provides a data frame that summarizes the statistics we compute.heights <- read_csv("./data/heights.csv")
glimpse(heights)
Rows: 1,050
Columns: 2
$ sex <chr> "Male", "Male", "Male", "Male", "Male", "Female", "Female", "Fe…
$ height <dbl> 75, 70, 68, 74, 61, 65, 66, 62, 66, 67, 72, 72, 69, 68, 69, 66,…
summarize()
(s <- heights |>
filter(sex == "Female") |>
summarize(avg = mean(height),
stdev = sd(height),
med = median(height),
min = min(height)))
# A tibble: 1 Ă— 4
avg stdev med min
<dbl> <dbl> <dbl> <dbl>
1 64.9 3.76 65.0 51
s$avg
[1] 64.9
s$min
[1] 51
summarise()
produces a new data frame that is not any variant of the original data frame.
summarize()
quans
that has 3 values. The output is a 3 by 1 data frame.group_by()
(heights_group <- heights |>
group_by(sex)) #<<
# A tibble: 1,050 Ă— 2
# Groups: sex [2]
sex height
<chr> <dbl>
1 Male 75
2 Male 70
3 Male 68
4 Male 74
5 Male 61
6 Female 65
# ℹ 1,044 more rows
class(heights_group)
[1] "grouped_df" "tbl_df" "tbl" "data.frame"
heights_group
is a grouped data frame.
Tibbles are similar, but see Groups: sex [2]
after grouping data by sex
.
summarize()
behaves differently when acting on grouped_df
.
group_by()
+ summarize()
summarize()
applies the summarization to each group separately.heights |>
group_by(sex) |>
summarize(avg = mean(height), stdev = sd(height),
med = median(height), min = min(height))
# A tibble: 2 Ă— 5
sex avg stdev med min
<chr> <dbl> <dbl> <dbl> <dbl>
1 Female 64.9 3.76 65.0 51
2 Male 69.3 3.61 69 50
murders |>
group_by(region) |>
summarize(median_rate = median(rate))
# A tibble: 4 Ă— 2
region median_rate
<chr> <dbl>
1 North Central 1.97
2 Northeast 1.80
3 South 3.40
4 West 1.29
arrange()
arrange()
orders entire data tables.## order the states by population size
murders |>
arrange(population)
# A tibble: 51 Ă— 6
state abb region population total rate
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Wyoming WY West 563626 5 0.887
2 District of Columbia DC South 601723 99 16.5
3 Vermont VT Northeast 625741 2 0.320
4 North Dakota ND North Central 672591 4 0.595
5 Alaska AK West 710231 19 2.68
6 South Dakota SD North Central 814180 8 0.983
# ℹ 45 more rows
15-dplyr
In lab.qmd ## Lab 15
section, import the murders.csv
data and
Add (mutate) the variable rate = total / population * 100000
to murders
data (as I did).
Filter states that are in region Northeast or West and their murder rate is less than 1.
Select variables state
, region
, rate
.
Print the output table after you do 1. to 3., and save it as object my_states
.
Group my_states
by region
. Then summarize data by creating variables avg
and stdev
that compute the mean and standard deviation of rate
.
Arrange the summarized table by avg
.
state region rate
1 Hawaii West 0.515
2 Idaho West 0.766
3 Maine Northeast 0.828
4 New Hampshire Northeast 0.380
5 Oregon West 0.940
6 Utah West 0.796
7 Vermont Northeast 0.320
8 Wyoming West 0.887
# A tibble: 2 Ă— 3
region avg stdev
<fct> <dbl> <dbl>
1 West 0.781 0.164
2 Northeast 0.509 0.278
state abb region population total
0 Alabama AL South 4779736 135
1 Alaska AK West 710231 19
2 Arizona AZ West 6392017 232
3 Arkansas AR South 2915918 93
4 California CA West 37253956 1257
5 Colorado CO West 5029196 65
6 Connecticut CT Northeast 3574097 97
7 Delaware DE South 897934 38
8 District of Columbia DC South 601723 99
9 Florida FL South 19687653 669
10 Georgia GA South 9920000 376
11 Hawaii HI West 1360301 7
12 Idaho ID West 1567582 12
13 Illinois IL North Central 12830632 364
14 Indiana IN North Central 6483802 142
15 Iowa IA North Central 3046355 21
16 Kansas KS North Central 2853118 63
17 Kentucky KY South 4339367 116
18 Louisiana LA South 4533372 351
19 Maine ME Northeast 1328361 11
20 Maryland MD South 5773552 293
21 Massachusetts MA Northeast 6547629 118
22 Michigan MI North Central 9883640 413
23 Minnesota MN North Central 5303925 53
24 Mississippi MS South 2967297 120
25 Missouri MO North Central 5988927 321
26 Montana MT West 989415 12
27 Nebraska NE North Central 1826341 32
28 Nevada NV West 2700551 84
29 New Hampshire NH Northeast 1316470 5
30 New Jersey NJ Northeast 8791894 246
31 New Mexico NM West 2059179 67
32 New York NY Northeast 19378102 517
33 North Carolina NC South 9535483 286
34 North Dakota ND North Central 672591 4
35 Ohio OH North Central 11536504 310
36 Oklahoma OK South 3751351 111
37 Oregon OR West 3831074 36
38 Pennsylvania PA Northeast 12702379 457
39 Rhode Island RI Northeast 1052567 16
40 South Carolina SC South 4625364 207
41 South Dakota SD North Central 814180 8
42 Tennessee TN South 6346105 219
43 Texas TX South 25145561 805
44 Utah UT West 2763885 22
45 Vermont VT Northeast 625741 2
46 Virginia VA South 8001024 250
47 Washington WA West 6724540 93
48 West Virginia WV South 1852994 27
49 Wisconsin WI North Central 5686986 97
50 Wyoming WY West 563626 5
.assign
Have to use murders.total
and murders.population
instead of total
and popution
.
.query
Conditions must be a string to be evaluated!
Cannot write murders.rate
, and should use rate
.
.filter
Have to be strings
region rate state
0 South 2.82 Alabama
1 West 2.68 Alaska
2 West 3.63 Arizona
3 South 3.19 Arkansas
4 West 3.37 California
5 West 1.29 Colorado
6 Northeast 2.71 Connecticut
7 South 4.23 Delaware
8 South 16.45 District of Columbia
9 South 3.40 Florida
10 South 3.79 Georgia
11 West 0.51 Hawaii
12 West 0.77 Idaho
13 North Central 2.84 Illinois
14 North Central 2.19 Indiana
15 North Central 0.69 Iowa
16 North Central 2.21 Kansas
17 South 2.67 Kentucky
18 South 7.74 Louisiana
19 Northeast 0.83 Maine
20 South 5.07 Maryland
21 Northeast 1.80 Massachusetts
22 North Central 4.18 Michigan
23 North Central 1.00 Minnesota
24 South 4.04 Mississippi
25 North Central 5.36 Missouri
26 West 1.21 Montana
27 North Central 1.75 Nebraska
28 West 3.11 Nevada
29 Northeast 0.38 New Hampshire
30 Northeast 2.80 New Jersey
31 West 3.25 New Mexico
32 Northeast 2.67 New York
33 South 3.00 North Carolina
34 North Central 0.59 North Dakota
35 North Central 2.69 Ohio
36 South 2.96 Oklahoma
37 West 0.94 Oregon
38 Northeast 3.60 Pennsylvania
39 Northeast 1.52 Rhode Island
40 South 4.48 South Carolina
41 North Central 0.98 South Dakota
42 South 3.45 Tennessee
43 South 3.20 Texas
44 West 0.80 Utah
45 Northeast 0.32 Vermont
46 South 3.12 Virginia
47 West 1.38 Washington
48 South 1.46 West Virginia
49 North Central 1.71 Wisconsin
50 West 0.89 Wyoming
.groupby
+ .agg
dplyr::group_by() + dplyr::summarize()
.sort_values
state abb region population total rate
50 Wyoming WY West 563626 5 0.89
8 District of Columbia DC South 601723 99 16.45
45 Vermont VT Northeast 625741 2 0.32
34 North Dakota ND North Central 672591 4 0.59
1 Alaska AK West 710231 19 2.68
dplyr::arrange(desc())
state abb region population total rate
8 District of Columbia DC South 601723 99 16.45
18 Louisiana LA South 4533372 351 7.74
25 Missouri MO North Central 5988927 321 5.36
20 Maryland MD South 5773552 293 5.07
40 South Carolina SC South 4625364 207 4.48