Homework 3: Probability, Statistics and Machine Learning

Spring 2025 MATH/COSC 3570 Introduction to Data Science by Dr. Cheng-Han Yu

Author

Insert Your Name!!

Published

May 1, 2025

Note: For any simulation or random sampling, set the random seed to your student ID number, for example set.seed(6145678) in R or train_test_split(X, y, test_size=0.2, random_state=6145678) in Python.
Please code in R for the problems starting with [R], and code in Python for those starting with [Python].

1 Probability and Statistics

1.1 Monte Carlo Simulation

1. [R] We are in a classroom with 50 people. If we assume this is a randomly selected group of 50 people, what is the chance that at least two people have the same birthday? Here we use a Monte Carlo simulation. For simplicity, we assume nobody was born on February 29.

Note that birthdays can be represented as numbers between 1 and 365, so a sample of 50 birthdays can be obtained like this:

n <- 50
bdays <- sample(x = 1:365, size = n, replace = TRUE)

To check if in this particular set of 50 people we have at least two with the same birthday, we can use the function duplicated(), which returns TRUE whenever an element of a vector is a duplicate. Here is an example:

duplicated(c(1, 2, 3, 1, 4, 3, 5))

[1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE

The second time 1 and 3 appear, we get a TRUE.

To check whether any two birthdays in the sample are the same, we combine the any() and duplicated() functions like this:

any(duplicated(bdays))

[1] TRUE

In this case, we see that it did happen. At least two people had the same birthday.

To estimate the probability of a shared birthday in the group, repeat this experiment by sampling sets of 50 birthdays 10000 times, and find the relative frequency of the event that at least two people had the same birthday.

## code
# set.seed(your ID number)
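The simulation described above can be cross-checked against the exact answer, which is available in closed form for this problem. The following is a minimal Python sketch (the exercise itself asks for R), using only the standard library. The function names are hypothetical, and the seed 6145678 is simply the example value from the note at the top of this assignment.

```python
import math
import random


def shared_birthday(n, rng):
    """Return True if at least two of n random birthdays coincide."""
    bdays = [rng.randint(1, 365) for _ in range(n)]  # sampling with replacement
    return len(set(bdays)) < n  # fewer distinct days than people means a duplicate


def estimate_shared_birthday(n=50, reps=10_000, seed=6145678):
    """Monte Carlo estimate: relative frequency of a shared birthday."""
    rng = random.Random(seed)
    hits = sum(shared_birthday(n, rng) for _ in range(reps))
    return hits / reps


def exact_shared_birthday(n=50):
    """Exact probability: 1 minus the chance that all n birthdays differ."""
    return 1 - math.prod((365 - k) / 365 for k in range(n))


estimate = estimate_shared_birthday()
exact = exact_shared_birthday()
print(f"Monte Carlo estimate: {estimate:.4f}")
print(f"Exact probability:    {exact:.4f}")
```

With 10000 repetitions the estimate should land within about a percentage point of the exact value, which is roughly 0.97 for 50 people.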

1.2 Central Limit Theorem

Suppose the random variables $X_1, X_2, \dots, X_n$ are independent and each follows a chi-squared distribution with 1 degree of freedom (df), $\chi^2_{df=1}$.

2. [R] Use dchisq() to plot the density of the $\chi^2_{df=1}$ distribution. A $\chi^2_{df=1}$ random variable can take any positive value, but for plotting, consider only values between 0 and 5, $x\in (0, 5)$.

## code
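For reference, the density that dchisq(x, df = 1) evaluates has a simple closed form, written out here as a check on the shape of the plot:

$$
f(x) = \frac{1}{\sqrt{2\pi x}}\, e^{-x/2}, \qquad x > 0.
$$

Because $f(x) \to \infty$ as $x \to 0^+$, the curve is unbounded near zero, which is one reason to plot on the open interval $(0, 5)$ rather than starting exactly at 0.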

3. [R] Consider three sample sizes $n = 2, 8, 100$, and for each one draw $1000$ realizations of the sample mean $\overline{X}_n$. Show that the sampling distribution of $\overline{X}_n$, i.e., the collection $\{\overline{X}_n^{(m)}, m=1, 2, \dots, 1000\}$, looks more and more Gaussian as $n$ increases by making histograms of the $\overline{X}_n$ samples for $n = 2, 8, 100$.

The procedure is as follows. For each $n = 2, 8, 100$:

i. Draw $n$ values $x_1, x_2, \dots, x_n$ using rchisq(n, df = 1).
ii. Compute the mean of the $n$ values, which is $\overline{x}_n$.
iii. Repeat i. and ii. 1000 times to obtain 1000 values of $\overline{x}_n$.
iv. Plot the histogram of these 1000 values of $\overline{x}_n$.

## code
# set.seed(your ID number)
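The procedure above can also be checked numerically without a plot. The sketch below is in Python rather than R and uses only the standard library. It relies on the fact that a $\chi^2_{df=1}$ draw is the square of a standard normal draw, and it reports the mean, variance, and skewness of the 1000 sample means for each $n$. The helper names are hypothetical, and the seed is the example value from the note at the top.

```python
import random
import statistics


def sample_means(n, reps, rng):
    """Return `reps` sample means, each computed from n chi-squared(df=1) draws."""
    means = []
    for _ in range(reps):
        draws = [rng.gauss(0, 1) ** 2 for _ in range(n)]  # Z^2 is chi-squared with df = 1
        means.append(statistics.fmean(draws))
    return means


def skewness(values):
    """Sample skewness: the average cubed standardized deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])


rng = random.Random(6145678)
results = {}
for n in (2, 8, 100):
    means = sample_means(n, reps=1000, rng=rng)
    results[n] = (statistics.fmean(means), statistics.variance(means), skewness(means))
    print(f"n={n:3d}  mean={results[n][0]:.3f}  "
          f"var={results[n][1]:.4f}  skew={results[n][2]:.3f}")
```

Since $\chi^2_{df=1}$ has mean 1 and variance 2, the sample means should center near 1 with variance near $2/n$, and the skewness should shrink toward 0 as $n$ grows, which is the numerical counterpart of the histograms becoming more symmetric and bell-shaped.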

2 Machine Learning

2.1 Linear Regression

A pharmaceutical firm would like to obtain information on the relationship between the dose level and potency of a drug product. To do this, each of 15 test tubes is inoculated with a virus culture and incubated for 5 days at 30°C. Three test tubes are randomly assigned to each of the five different dose levels to be investigated (2, 4, 8, 16, and 32 mg). Each tube is injected with only one dose level, and the response of interest is obtained.

4. [R] Import dose.csv into your working session. The data set is not in tidy format. Use pivot_longer() to reshape it into the tidy tibble shown below. Call the tidy data set dose_tidy for later regression analysis.

## code

## # A tibble: 15 × 3
##    dose_level tube  response
##         <dbl> <chr>    <dbl>
##  1          2 tube1        5
##  2          2 tube2        7
##  3          2 tube3        3
##  4          4 tube1       10
##  5          4 tube2       12
##  6          4 tube3       14
##  7          8 tube1       15
##  8          8 tube2       17
##  9          8 tube3       18
## 10         16 tube1       20
## 11         16 tube2       21
## 12         16 tube3       19
## 13         32 tube1       23
## 14         32 tube2       24
## 15         32 tube3       29

5. [R] Fit a simple linear regression with the predictor $\texttt{dose level}$ for response. Print the fitted result.

## code

6. [R] Using the model in (5), plot the data together with a $95\%$ confidence interval for the mean response.

## code

7. [R] Fit a simple linear regression model with the predictor $\texttt{ln(dose level)}$ for response, where $\ln = \log_e$. Print the fitted result.

## code

8. [R] Using the model in (7), plot the data $(\ln(\text{dose level})_i, \text{response}_i), i = 1, \dots, 15$ together with a $95\%$ confidence interval for the mean response.

## code

9. [R] Draw residual plots for the models in (5) and (7). According to the plots, which model do you think is better?

## code

10. [Python] Import dose_tidy.csv and redo (5). Show the slope and intercept.


# code

11. [Python] Use the fitted model to predict the response value when the dose level is 10 and when it is 30.


# code

2.2 Binary Logistic Regression

12. [R] Import body.csv. Split the data into a training set and a test set using an 80:20 split. Set the random seed to your student ID number.

# code
# set.seed(your ID number)

13. [R] Fit a logistic regression with the predictor HEIGHT using the training sample data. Find the probability that the subject is male given HEIGHT = 165.

# code
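For reference, the probability requested here comes from the fitted logistic curve. Assuming the response is coded so that $Y = 1$ means male (an assumption about how body.csv is encoded, not something stated in this assignment), the model with a single predictor $x$ is

$$
\log\frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)} = \beta_0 + \beta_1 x
\quad\Longleftrightarrow\quad
P(Y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}.
$$

Plugging the estimates $\hat\beta_0, \hat\beta_1$ from the training data and $x = 165$ into the right-hand expression gives the estimated probability. The same formula applies to the BMI model with $x = 25$.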

14. [R] Fit a logistic regression with the predictor BMI using the training sample data. Find the probability that the subject is male given BMI = 25.

# code

15. [R] Classify the observations in the test set using the models in (13) and (14), and compute the test accuracy rate for each. Which model gives the higher accuracy rate?

# code

16. [Python] Split the body data into a training set and a test set.


## code
# set random_state
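To make clear what a seeded random split does, here is a standard-library sketch of the idea: shuffle the row indices with a fixed seed, then cut them 80:20. It does not reproduce the exact rows that train_test_split would select. The function name and the row count of 300 are hypothetical placeholders, not properties of body.csv.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=6145678):
    """Shuffle row indices reproducibly and return (train_indices, test_indices)."""
    rng = random.Random(seed)           # a fixed seed makes the split repeatable
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_size)  # 20% of the rows go to the test set
    return indices[n_test:], indices[:n_test]


# Hypothetical data set with 300 rows.
train_idx, test_idx = split_indices(300)
print(len(train_idx), len(test_idx))
```

Calling split_indices again with the same seed returns the same partition, which is why the assignment asks for a fixed seed.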

17. [Python] Fit a logistic regression with the predictor BMI using the training sample data. Find the probability that the subject is male given BMI = 25.


# code

18. [Python] Classify the observations in the test set and compute the test accuracy rate.


# code

2.3 Multinomial Logistic Regression

19. [Python] Logistic regression can also be used for responses with more than two categories.
- Import penguins.csv.
- Use the method .dropna() to remove observations that have missing values (NaN).


# code

20. [Python] Use .value_counts() and .plot.bar() to get the frequency distribution and bar chart of species.


# code

21. [Python] Use the complete-case penguins data and the variable flipper_length_mm to classify the species. Save the fitted result as clf_pen. Compute the training accuracy rate.


# code

22. [Python] Construct the confusion matrix, and save it as the object cm. Then use the code below to show the confusion matrix as a plot.


# code

from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
disp = ConfusionMatrixDisplay(cm, display_labels=clf_pen.classes_)
disp.plot()
plt.show()
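To clarify what a confusion matrix contains and how the accuracy rate follows from it, here is a small standard-library sketch. It uses the common convention of rows for true classes and columns for predicted classes. The labels below are toy values made up for illustration and are not results from penguins.csv. The result is a plain list of lists named cm_toy, so it is not a drop-in replacement for the cm object used in the plotting code above.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count matrix with rows = true class and columns = predicted class."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def accuracy(matrix):
    """Proportion of correct predictions: diagonal total over grand total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy labels for illustration only.
labels = ["Adelie", "Chinstrap", "Gentoo"]
y_true = ["Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo", "Adelie"]
y_pred = ["Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Adelie", "Adelie"]

cm_toy = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, cm_toy):
    print(f"{label:10s} {row}")
print(f"accuracy = {accuracy(cm_toy):.3f}")
```

The diagonal entries count correct classifications, so the accuracy rate is the diagonal sum divided by the number of observations.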

2.4 K-Nearest Neighbors (KNN)

23. [R] or [Python] (Choose one) Fit a KNN classifier with $K=10$ using BMI on the training data, and classify the same test set used for logistic regression. Obtain the confusion matrix for the test data and the test accuracy rate.

# R code


# Python code
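The KNN rule itself is short enough to write out by hand, which can help when interpreting a library's output. The sketch below uses only the standard library, a single numeric predictor, and majority voting among the nearest neighbors. The BMI values and labels are made up for illustration and are not taken from body.csv. It uses $K = 3$ so that a vote between two classes cannot tie, whereas an even value such as $K = 10$ allows ties that an implementation has to break somehow.

```python
from collections import Counter


def knn_predict(train_x, train_y, x_new, k):
    """Predict a label by majority vote among the k nearest training points."""
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x_new))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy training and test data for illustration only.
train_bmi = [18.5, 20.1, 22.0, 24.5, 26.0, 28.3, 23.0, 27.1]
train_sex = ["F", "F", "F", "M", "M", "M", "F", "M"]
test_bmi = [21.0, 26.5, 23.5]
test_sex = ["F", "M", "M"]

predictions = [knn_predict(train_bmi, train_sex, x, k=3) for x in test_bmi]
test_accuracy = sum(p == t for p, t in zip(predictions, test_sex)) / len(test_sex)
print(predictions, f"accuracy = {test_accuracy:.3f}")
```

In this toy example the third test point is misclassified because two of its three nearest neighbors carry the other label.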
