<- 50
n <- sample(x = 1:365, size = n, replace = TRUE) bdays
Homework 3: Probability, Statistics and Machine Learning
Spring 2025 MATH/COSC 3570 Introduction to Data Science by Dr. Cheng-Han Yu
Note: For any simulation or random sampling, set the random seed at your student ID number, for example
set.seed(6145678)
in R ortrain_test_split(X, y, test_size=0.2, random_state=6145678)
in PythonPlease code in R for the problems starting with [R], and code in python for those starting with [Python].
1 Probability and Statistics
1.1 Monte Carlo Simulation
- [R] We are in a classroom with 50 people. If we assume this is a randomly selected group of 50 people, what is the chance that at least two people have the same birthday? Here we use a Monte Carlo simulation. For simplicity, we assume nobody was born on February 29.
- Note that birthdays can be represented as numbers between 1 and 365, so a sample of 50 birthdays can be obtained like this:
- To check if in this particular set of 50 people we have at least two with the same birthday, we can use the function
duplicated()
, which returnsTRUE
whenever an element of a vector is a duplicate. Here is an example:
duplicated(c(1, 2, 3, 1, 4, 3, 5))
[1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE
The second time 1 and 3 appear, we get a TRUE
.
- To check if two birthdays were the same, we simply use the
any()
andduplicated()
functions like this:
any(duplicated(bdays))
[1] TRUE
In this case, we see that it did happen. At least two people had the same birthday.
To estimate the probability of a shared birthday in the group, repeat this experiment by sampling sets of 50 birthdays 10000 times, and find the relative frequency of the event that at least two people had the same birthday.
## code
# set.seed(your ID number)
1.2 Central Limit Theorem
Suppose random variables \(X_1, X_2, \dots, X_n\) are independent and follow Chi-squared distribution with its parameter degrees of freedom (df) 1, \(\chi^2_{df=1}\).
- [R] Use
dchisq()
to plot \(\chi^2_{df=1}\) distribution. \(\chi^2_{df=1}\) takes any positive value. But let’s just consider values between 0 and 5, \(x\in (0, 5)\) for plotting.
## code
- [R] Consider three sample sizes \(n = 2, 8, 100\), and set the sample size of the sample mean \(\overline{X}_n\) be \(1000\). Show the sampling distribution of \(\overline{X}_n\), i.e., the collection \(\{\overline{X}_n^{(m)}, m=1, 2, \dots, 1000\}\), looks more and more like Gaussian as \(n\) increases by making histograms of \(\overline{X}_n\) samples with \(n = 2, 8, 100\).
The procedure is the following: For each \(n = 2, 8, 100\),
- Draw \(n\) values \(x_1, x_2, \dots, x_n\) using
rchisq(n, df = 1)
. - Compute the mean of the \(n\) values, which is \(\overline{x}_n\).
- Repeat i. and ii. 1000 times to obtain 1000 \(\overline{x}_n\)s.
- Plot the histogram of these 1000 \(\overline{x}_n\)s.
## code
# set.seed(your ID number)
2 Machine Learning
2.1 Linear Regression
A pharmaceutical firm would like to obtain information on the relationship between the dose level and potency of a drug product. To do this, each of 15 test tubes is inoculated with a virus culture and incubated for 5 days at 30°C. Three test tubes are randomly assigned to each of the five different dose levels to be investigated (2, 4, 8, 16, and 32 mg). Each tube is injected with only one dose level, and the response of interest is obtained.
- [R] Import
dose.csv
into your working session. The data set is not tidy for us. Usepivot_longer()
to make it tidy as the shown tibble below. Call the tidy data setdose_tidy
for later regression analysis.
## code
## # A tibble: 15 × 3
## dose_level tube response
## <dbl> <chr> <dbl>
## 1 2 tube1 5
## 2 2 tube2 7
## 3 2 tube3 3
## 4 4 tube1 10
## 5 4 tube2 12
## 6 4 tube3 14
## 7 8 tube1 15
## 8 8 tube2 17
## 9 8 tube3 18
## 10 16 tube1 20
## 11 16 tube2 21
## 12 16 tube3 19
## 13 32 tube1 23
## 14 32 tube2 24
## 15 32 tube3 29
- [R] Fit a simple linear regression with the predictor \(\texttt{dose level}\) for
response
. Print the fitted result.
## code
- [R] With (5), plot the data with a \(95\%\) confidence interval for the mean response.
## code
- [R] Fit a simple linear regression model with the predictor \(\texttt{ln(dose level)}\) for
response
, where \(\ln = \log_e\). Print the fitted result.
## code
- [R] With (7), plot the data \((\ln(\text{dose level})_i, \text{response}_i), i = 1, \dots, 15\) with a \(95\%\) confidence interval for the mean response.
## code
- [R] Draw residual plots of Model in (5) and (7). According to the plots, which model you think is better?
## code
- [Python] Import
dose_tidy.csv
and redo (5). Show the slope and intercept.
# code
- [Python] to predict the response value when the dose level is 10 and 30.
# code
2.2 Binary Logistic Regression
- [R] Import
body.csv
. Split the data into a training set and a test set. Set the random seed at your student ID number. Use 80:20 rule.
# code
# set.seed(your ID number)
- [R] Fit a logistic regression with the predictor
HEIGHT
using the training sample data. Find the probability that the subject is male givenHEIGHT = 165
.
# code
- [R] Fit a logistic regression with the predictor
BMI
using the training sample data. Find the probability that the subject is male givenBMI = 25
.
# code
- [R] Do the classification on the test set for the model (13) and (14), and compute the test accuracy rate. Which model gives us higher accuracy rate?
# code
- [Python] Split the
body
data into a training set and a test set.
## code
# set random_state
- [Python] Fit a logistic regression with the predictor
BMI
using the training sample data. Find the probability that the subject is male givenBMI = 25
.
# code
- [Python] Do the classification on the test set. Compute the test accuracy rate.
# code
2.3 Multinomial Logistic Regression
- [Python] We can actually use logistic regression for the responses with more than 2 categories.
- Import
penguins.csv
. - Use method
.dropna()
to remove observations having missing valuesNaN
.
- Import
# code
- [Python] Use
.value_counts()
and.plot.bar()
to get the frequency distribution and bar chart ofspecies
.
# code
- [Python] Use the camplete-case penguins data and variable
flipper_length_mm
to classify thespecies
. Save the fitted result asclf_pen
. Compute the training accuracy rate.
# code
- [Python] Construct the confusion matrix, and save it as the object
cm
. Then use the code below to show the confusion matrix as a plot.
# code
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
= ConfusionMatrixDisplay(cm, display_labels=clf_pen.classes_)
disp
disp.plot() plt.show()
2.4 K-Nearest Neighbors (KNN)
- [R] or [Python] (Choose one) Fit the KNN with \(K=10\) using
BMI
on the training data and do the classification on the same test set used in logistic regression. Obtain the confusion matrix of the test data, and test accuracy rate.
# R code
# Python code