Data Importing

MATH/COSC 3570 Introduction to Data Science

Dr. Cheng-Han Yu
Department of Mathematical and Statistical Sciences
Marquette University

DBI, jsonline, xml2, httr

R Data Importing

Rectangular Data

readr 📦 Functions

Function	Format	Typical suffix
`read_table()`	white space separated values	txt
`read_csv()`	comma separated values	csv
`read_csv2()`	semicolon separated values	csv
`read_tsv()`	tab delimited separated values	tsv
`read_fwf()`	fixed width files	txt
`read_delim()`	general text file format, must define delimiter	txt

Be careful: The suffix usually tells us what type of file it is, but no guarantee that these always match.

readr::read_lines("./data/murders.csv", n_max = 3)  ## there is a header

[1] "state,abb,region,population,total" "Alabama,AL,South,4779736,135"     
[3] "Alaska,AK,West,710231,19"

readr provides the following functions to read your data into R.
Be careful: The suffix usually tells us what type of file it is, but no guarantee that these always match.
If you don’t want to open your data file, you can read several lines of your data in R using the function read_lines().
For example, I check the first three lines of the murders.csv file. And yes, it is a comma-separated-value or csv file.
Fixed width text files are special cases of text files where the format is specified by column widths, pad character and left/right alignment. Column widths are measured in units of characters. For example, if you have data in a text file where the first column always has exactly 10 characters, and the second column has exactly 5, the third has exactly 12 (and so on), this would be categorized as a fixed width text file.

Reading Data

read_csv() prints out a column specification giving us delimiter, name and type of each column.

murders_csv <- read_csv(file = "./data/murders.csv")
# Rows: 51 Columns: 5
# ── Column specification ─────────────
# Delimiter: ","
# chr (3): state, abb, region
# dbl (2): population, total
head(murders_csv)

# A tibble: 6 × 5
  state      abb   region population total
  <chr>      <chr> <chr>       <dbl> <dbl>
1 Alabama    AL    South     4779736   135
2 Alaska     AK    West       710231    19
3 Arizona    AZ    West      6392017   232
4 Arkansas   AR    South     2915918    93
5 California CA    West     37253956  1257
6 Colorado   CO    West      5029196    65

## View data in RStudio
view(murders_csv)

Missing Values

Which type is the column vector x? Why?

Type coercion¹ happens and all column elements are transformed to character type.

read_csv("./data/df-na.csv")

# A tibble: 9 × 3
  x     y              z     
  <chr> <chr>          <chr> 
1 1     a              hi    
2 <NA>  b              hello 
3 3     Not applicable 9999  
4 4     d              ola   
5 5     e              hola  
6 .     f              whatup
7 7     g              wassup
8 8     h              sup   
9 9     i              <NA>

Solution 1: Explicit NAs

By default, read_csv() only recognizes ” “ and NA as a missing value.
Specify the values that are used to represent missing values by argument na.

read_csv("./data/df-na.csv", 
         na = c("", "NA", ".", "9999", "Not applicable"))

# A tibble: 9 × 3
      x y     z     
  <dbl> <chr> <chr> 
1     1 a     hi    
2    NA b     hello 
3     3 <NA>  <NA>  
4     4 d     ola   
5     5 e     hola  
6    NA f     whatup
7     7 g     wassup
8     8 h     sup   
9     9 i     <NA>

Solution 2: Specify Column Types

read_csv("./data/df-na.csv", 
         col_types = cols(col_double(), 
                     col_character(), 
                     col_character()))

# A tibble: 9 × 3
      x y              z     
  <dbl> <chr>          <chr> 
1     1 a              hi    
2    NA b              hello 
3     3 Not applicable 9999  
4     4 d              ola   
5     5 e              hola  
6    NA f              whatup
7     7 g              wassup
8     8 h              sup   
9     9 i              <NA>

# Warning message:
# One or more parsing issues, 
# call `problems()` 
# on your data frame for details

problems()
# A tibble: 1 × 5
#     row   col expected actual file 
#   <int> <int> <chr>    <chr>  <chr>
# 1     7     1 a double .      ""

Another solution is to specify column types when you import your data.
This might be handy when you know what your column types should be.
For example, here I can specify the column types using the col_types argument.
All column types are wrapped up in the cols() command, that x is double, y is character and z is character as well.
Ans now you can see x becomes double.
And here we actually get a warning. In row 6 and col x, we expect a double but it is actually a dot or period, which is a character, and so R treats it as NA, which is exactly what I want to do.
But R send a message telling us about this, and make sure this is what we want to do.
This option might be preferable if you cannot scan your data file, or don;t know what convention is being used for missing values in your data file, but you happen to know your column types.
The downside of this is that, here, Not applicable and 9999, they are all treated as valid character values, and they are not missing values. So be careful about this. You may need to manually clean your data.

Column Types

type function	data type
`col_character()`	character
`col_date()`	date
`col_datetime()`	POSIXct (date-time)
`col_double()`	double (numeric)
`col_factor()`	factor
`col_guess()`	let readr guess (default)
`col_integer()`	integer
`col_logical()`	logical
`col_number()`	numbers mixed with non-number characters
`col_numeric()`	double or integer
`col_skip()`	do not read
`col_time()`	time

Writing Data

## Create tibbles using a row-by-row layout
(df <- tribble(
  ~x, ~y,
  1,  "a",
  2,  "b",
  3,  "c"
))

# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 b    
3     3 c

## same as tibble(x = 1:3, y = c(a, b, c))

## save data to "./data/df.csv"
df |> write_csv(file = "./data/df.csv")

`read_rds()` and `write_rds()`

We save an R object (usually a data set) in .Rds in the R binary file format. ¹

readr::write_rds(cars, 
                 file = "./data/cars.rds") 
# fs::dir_ls(path = "./data") |> head(10)

my_car <- readr::read_rds(file = "./data/cars.rds") 
head(my_car, 3)

  speed dist
1     4    2
2     4   10
3     7    4

10-Import Data

If you haven’t, install and load the tidyverse package.

In lab.qmd ## Lab 10 section,

Import ssa_male_prob.csv and ssa_female_prob.Rds in the data folder using read_csv() and call them ssa_male and ssa_female, respectively.

ssa_male <- readr::read____(____________)
ssa_female <- readr::read____(____________)

Plot Age (x-axis) vs. LifeExp (y-axis) for Female. The type should be “line”, and the line color is red. Add x-label, y-label and title to your plot.

plot(x = _____, y = _____, type = ______, col = ______,
     xlab = ______, ylab = _______, main = ____________)

Use lines() to add a line of Age (x-axis) vs. LifeExp (y-axis) for Male to the plot. The color is blue.

lines(x = _____, y = _____, col = ______)

library(tidyverse)
ssa <- read_csv(file = "./data/ssa-death-probability.csv")
ssa_male <- ssa[ssa$Sex == "Male", ]
ssa_female <- ssa[ssa$Sex == "Female", ]
plot(x = ssa_female$Age, y = ssa_female$LifeExp, 
     type = "l", col = 2, lwd = 3,
     xlab = "Age", ylab = "Life Exp",
     main = "Age vs. Life Exp by Gender")
lines(ssa_male$Age, ssa_male$LifeExp, col = 4, lwd = 3)

Read Excel Data

readxl 📦 Functions

Function	Format	Typical suffix
`read_excel()`	auto detect the format	xls, xlsx
`read_xls()`	original format	xls
`read_xlsx()`	new format	xlsx

The Microsoft Excel can have more than one sheet in one file.
The functions above read the first sheet by default.
The excel_sheets() gives us the names of all the sheets in an Excel file.

library(readxl)
excel_sheets("./data/2010_bigfive_regents.xls")

[1] "Sheet1" "Sheet2" "Sheet3"

Sheet Names

The sheet names can be passed to the sheet argument to read sheets other than the first.

excel_sheets("./data/2010_bigfive_regents.xls")

[1] "Sheet1" "Sheet2" "Sheet3"

(data_xls <- read_xls(path = "./data/2010_bigfive_regents.xls", 
                      sheet = "Sheet3", 
                      skip = 1))

# A tibble: 19 × 6
  Scores `131024` `113804` `104201` `103886` `91756`
   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
1     10       NA       64        8      227      34
2     11        6       83       11      217      58
3     12       23       87        7       28      67
4     13        1       54       16      230      42
5     14        3      145       18      303      57
6     15       58      151       50      192      98
7     16        1      129       13      156     125
8     17       73      214       59      163     115
# ℹ 11 more rows

`pd.read_csv`

`pd.DataFrame.to_csv`

`pd.read_csv`

import numpy as np
import pandas as pd

py_df = pd.read_csv('./data/murders.csv')
py_df.head()

        state abb region  population  total
0     Alabama  AL  South     4779736    135
1      Alaska  AK   West      710231     19
2     Arizona  AZ   West     6392017    232
3    Arkansas  AR  South     2915918     93
4  California  CA   West    37253956   1257

`pd.DataFrame.to_csv`

w = {"x":[1, 2, 3], 
     "y":['a', 'b','c']}
wdf = pd.DataFrame(w)

wdf.to_csv("./data/wdf.csv")
mydf = pd.read_csv('./data/wdf.csv')
mydf.head()

   Unnamed: 0  x  y
0           0  1  a
1           1  2  b
2           2  3  c

## index = False means don't write row names
wdf.to_csv("./data/wdf.csv", index = False)
mydf = pd.read_csv('./data/wdf.csv')
mydf.head()

Big Data in R and Python

Apache Arrow for R and Python

Apache Arrow develops big data systems to process and move data fast.

Data Importing

DBI, jsonline, xml2, httr

R Data Importing

Rectangular Data

readr 📦 Functions

Reading Data

Missing Values

Solution 1: Explicit NAs

Solution 2: Specify Column Types

Column Types

Writing Data

read_rds() and write_rds()

Read Excel Data

readxl 📦 Functions

Sheet Names

pd.read_csv

pd.DataFrame.to_csv

pd.read_csv

pd.DataFrame.to_csv

Big Data in R and Python

Apache Arrow for R and Python

`read_rds()` and `write_rds()`

`pd.read_csv`

`pd.DataFrame.to_csv`

`pd.read_csv`

`pd.DataFrame.to_csv`