MATH/COSC 3570 Introduction to Data Science
Function | Format | Typical suffix |
---|---|---|
read_table() | whitespace-separated values | txt |
read_csv() | comma-separated values | csv |
read_csv2() | semicolon-separated values | csv |
read_tsv() | tab-separated values | tsv |
read_fwf() | fixed-width files | txt |
read_delim() | general text file format, must define delimiter | txt |
Be careful: the suffix usually tells us what type of file it is, but there is no guarantee that the two always match.
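Since the suffix can be misleading, one way to check is to look at the raw lines before choosing a reader. The following is a minimal sketch; the temporary file and its contents are made up for illustration and are not part of the course data.

```r
library(readr)

# Hypothetical file: a .txt suffix, but the fields are separated by semicolons
path <- tempfile(fileext = ".txt")
write_lines(c("name;score", "Ann;90", "Bob;85"), path)

# Peek at the raw lines first to see which delimiter the file really uses
read_lines(path, n_max = 2)

# Then read it with that delimiter instead of trusting the suffix
scores <- read_delim(path, delim = ";", show_col_types = FALSE)
scores
```

Here read_delim() is used because the delimiter is stated explicitly; read_csv2() would be another option for semicolon-separated files.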
readr::read_lines("./data/murders.csv", n_max = 3) ## there is a header
[1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"
[3] "Alaska,AK,West,710231,19"
read_csv() prints out a column specification that gives the delimiter and the name and type of each column.
murders_csv <- read_csv(file = "./data/murders.csv")
# Rows: 51 Columns: 5
# ── Column specification ─────────────
# Delimiter: ","
# chr (3): state, abb, region
# dbl (2): population, total
head(murders_csv)
# A tibble: 6 × 5
state abb region population total
<chr> <chr> <chr> <dbl> <dbl>
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
## View data in RStudio
view(murders_csv)
Which type is the column vector x? Why?
By default, read_csv() only recognizes "" and NA as missing values. Use the na argument to specify which other values should be treated as missing.
read_csv("./data/df-na.csv",
         na = c("", "NA", ".", "9999", "Not applicable"))
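To see why the na argument matters for column types, here is a small sketch using made-up inline data (not the contents of df-na.csv), where "." stands for a missing value. It assumes readr 2.0 or later, which accepts literal data wrapped in I().

```r
library(readr)

# Hypothetical data: "." marks a missing value in column x
csv_text <- "x,y\n1,a\n.,b\n3,c\n"

# Default: "." is not treated as missing, so x is guessed as character
df_default <- read_csv(I(csv_text), show_col_types = FALSE)
class(df_default$x)

# With "." listed in na, it becomes NA and x is guessed as double
df_na <- read_csv(I(csv_text), na = c("", "NA", "."), show_col_types = FALSE)
class(df_na$x)
is.na(df_na$x)
```

A single stray symbol is enough to turn a numeric column into a character one, so it is worth checking the column specification after every import.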
problems()
# A tibble: 1 × 5
# row col expected actual file
# <int> <int> <chr> <chr> <chr>
# 1 7 1 a double . ""
type function | data type |
---|---|
col_character() | character |
col_date() | date |
col_datetime() | POSIXct (date-time) |
col_double() | double (numeric) |
col_factor() | factor |
col_guess() | let readr guess (default) |
col_integer() | integer |
col_logical() | logical |
col_number() | numbers mixed with non-number characters |
col_numeric() | double or integer |
col_skip() | do not read |
col_time() | time |
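The type functions above are typically passed to read_csv() through the col_types argument. The sketch below uses made-up inline data and hypothetical column names (id, score, group, note) to show how a few of them can be combined.

```r
library(readr)

# Hypothetical data with four columns
csv_text <- "id,score,group,note\n1,3.5,a,skip me\n2,4.0,b,skip me\n"

df_typed <- read_csv(
  I(csv_text),
  col_types = cols(
    id    = col_integer(),                     # read as integer, not double
    score = col_double(),                      # read as double
    group = col_factor(levels = c("a", "b")),  # read as factor with given levels
    note  = col_skip()                         # do not read this column
  )
)
df_typed
```

Supplying col_types also suppresses the column specification message, since nothing is guessed.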
## Create tibbles using a row-by-row layout
(df <- tribble(
~x, ~y,
1, "a",
2, "b",
3, "c"
))
# A tibble: 3 × 2
x y
<dbl> <chr>
1 1 a
2 2 b
3 3 c
## same as tibble(x = c(1, 2, 3), y = c("a", "b", "c"))
## save data to "./data/df.csv"
df |> write_csv(file = "./data/df.csv")
Use read_rds() and write_rds() to read and save a single R object as an .Rds file in the R binary file format.
readr::write_rds(cars,
file = "./data/cars.rds")
# fs::dir_ls(path = "./data") |> head(10)
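As a quick check that an object survives the round trip, the sketch below writes the built-in cars data frame to a temporary .rds file (a hypothetical path, not "./data/cars.rds") and reads it back.

```r
library(readr)

# Hypothetical temporary path so nothing in ./data is overwritten
rds_path <- tempfile(fileext = ".rds")

# Save the built-in cars data frame, then read it back
write_rds(cars, file = rds_path)
cars_back <- read_rds(rds_path)

# The restored object should match the original, including column types
identical(cars, cars_back)
```

Unlike a CSV file, an .rds file stores the R object itself, so column types do not have to be guessed again on import.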
10-Import Data
Load the tidyverse package.
In lab.qmd ## Lab 10 section, import the male and female data files with read_csv() and call them ssa_male and ssa_female, respectively.
Plot Age (x-axis) vs. LifeExp (y-axis) for Female. The type should be "line", and the line color is red. Add x-label, y-label and title to your plot.
Use lines() to add a line of Age (x-axis) vs. LifeExp (y-axis) for Male to the plot. The color is blue.
Function | Format | Typical suffix |
---|---|---|
read_excel() | auto detect the format | xls, xlsx |
read_xls() | original format | xls |
read_xlsx() | new format | xlsx |
excel_sheets() gives us the names of all the sheets in an Excel file.
library(readxl)
excel_sheets("./data/2010_bigfive_regents.xls")
[1] "Sheet1" "Sheet2" "Sheet3"
Use the sheet argument to read sheets other than the first.
excel_sheets("./data/2010_bigfive_regents.xls")
[1] "Sheet1" "Sheet2" "Sheet3"
(data_xls <- read_xls(path = "./data/2010_bigfive_regents.xls",
sheet = "Sheet3",
skip = 1))
# A tibble: 19 × 6
Scores `131024` `113804` `104201` `103886` `91756`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10 NA 64 8 227 34
2 11 6 83 11 217 58
3 12 23 87 7 28 67
4 13 1 54 16 230 42
5 14 3 145 18 303 57
6 15 58 151 50 192 98
7 16 1 129 13 156 125
8 17 73 214 59 163 115
# ℹ 11 more rows
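The same pattern can be tried without the course file. This sketch assumes the example workbook datasets.xlsx that ships with readxl, located with readxl_example(); it lists the sheets first and then reads one that is not the first.

```r
library(readxl)

# Path to an example workbook bundled with readxl (assumed to be installed)
xlsx_path <- readxl_example("datasets.xlsx")

# List the sheet names before deciding which one to read
sheets <- excel_sheets(xlsx_path)
sheets

# Read the second sheet by name; a position such as sheet = 2 also works
second_sheet <- read_excel(xlsx_path, sheet = sheets[2])
head(second_sheet)
```

read_excel() detects the xls or xlsx format from the file, so the same call works for either.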
pd.read_csv
pd.DataFrame.to_csv