Homework 1: Machine Learning Introduction

Due date: Wednesday, February 1, 2023 at 11:55pm

Turn in a PDF submission. You may include photos of handwritten solutions within your PDF, but make sure they are legible. Points will be taken off for illegible solutions. Full credit on non-coding problems will be given only if your work is shown.

Question 1: Model classes (1 point)

For each situation below, would you use Regression, Classification, or Clustering models?

Predicting the inches of rainfall tomorrow given the inches of rainfall over the past week (0.2 points)
Predicting the type of skin cancer from an image of the skin (0.2 points)
Determining the best grouping of students into grade buckets of A+, A, A-, B+, B, B-, C+, C, C-, and F (0.2 points)
Forecasting the number of COVID-19 cases in 1 month given prior history (0.2 points)
Face ID on a smartphone (0.2 points)

Question 2: ROC Curve (3 points)

We have a logistic regression model which has been trained with the following weights and bias:

\[\hat{Y} = {1 \over 1 + e^{-(3x_1 - 4x_2 + 3)}}\]

Create an ROC curve for the following test data (2 points):

Input 1: \(x_1\)	Input 2: \(x_2\)	Expected Output: \(y\)
0	0	1
0	1	1
0	2	0
1	0	1
1	1	1
1	2	0
2	0	1
2	1	1
2	2	0
-1	0	0

Use the following decision thresholds: 0, 0.2, 0.4, 0.6, 0.8, and 1.

What is the AUROC? (1 point)

Show your work.
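
As a rough illustration of the mechanics, the sketch below computes one ROC point per threshold and then estimates the area under the resulting curve with the trapezoidal rule. The labels and scores are made-up toy values, not the test data above, and `roc_point` is a hypothetical helper name. The sketch assumes the convention "predict 1 if the score is greater than or equal to the threshold", which matches the rule stated in Question 3.

```python
import numpy as np

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting 1 if score >= threshold.

    Assumes y_true contains at least one positive and one negative label;
    otherwise one of the divisions below would be by zero.
    """
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical toy labels and model scores (NOT the homework test data).
toy_y = np.array([1, 1, 0, 1, 0, 0])
toy_scores = np.array([0.9, 0.7, 0.65, 0.3, 0.25, 0.1])
thresholds = [0, 0.2, 0.4, 0.6, 0.8, 1]

# One (FPR, TPR) point per threshold, sorted by increasing FPR.
points = sorted(roc_point(toy_y, toy_scores, t) for t in thresholds)
fpr = np.array([p[0] for p in points])
tpr = np.array([p[1] for p in points])

# Trapezoidal rule: sum of width * average height over consecutive points.
auroc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))
print(points)
print(auroc)  # 7/9, about 0.778, for this toy data
```

With only six thresholds the curve is a coarse piecewise-linear approximation, so the area obtained this way can differ from the value computed over every distinct score.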

Question 3: Evaluation metrics (1 point)

Using the learned model and test data from Question 2, calculate the following metrics when predicting 1 if \(\hat{Y} \geq 0.5\) and 0 otherwise:

Accuracy (0.25 points)
Precision (0.25 points)
Recall (0.25 points)
Specificity (0.25 points)

Write each answer as a percentage. Show your work.
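
For reference, the standard definitions of these metrics in terms of confusion-matrix counts are given below. The symbols are assumptions of this note rather than notation from the course: \(TP\) and \(TN\) are the numbers of correctly predicted positive (class 1) and negative (class 0) examples, and \(FP\) and \(FN\) are the numbers of examples incorrectly predicted as positive and negative, respectively.

\[\text{Accuracy} = {TP + TN \over TP + TN + FP + FN}\]

\[\text{Precision} = {TP \over TP + FP}\]

\[\text{Recall} = {TP \over TP + FN}\]

\[\text{Specificity} = {TN \over TN + FP}\]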

Question 4: Cross validation implementation (5 points)

(a) Implement a function for calculating accuracy. Fill in the calculate_accuracy(y_true, y_pred) function template. y_true is a NumPy array containing the true class (integer) for each data point and y_pred is a NumPy array containing the corresponding predicted class (integer). (1 point)

(b) Implement a function for 10-fold cross validation using any provided dataset. Fill in the run_ten_fold_cross_validation(X, y) function template. X and y are NumPy arrays, where X is the entire list of data points and y is the list of corresponding labels. Your function should return the mean and standard deviation of the accuracy on the test set for the 10 folds. (4 points)

You may not import any Python library other than those already imported in the notebook (you can use NumPy, which is already imported, for calculating the mean and standard deviation). This restriction includes sub-libraries (e.g., you cannot use sklearn beyond the svm module used in the helper functions).

To get credit for this problem, we should be able to run your notebook from top to bottom without any errors.

Extra credit (0.5 points): Modify the function to work for any number of folds (not just 10).
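
One possible way to partition a dataset into folds using only NumPy is sketched below. It covers only the index bookkeeping, not model training or the notebook's function templates. The helper name `make_fold_indices`, the `seed` parameter, and the decision to shuffle before splitting are all assumptions of this sketch, not requirements of the assignment.

```python
import numpy as np

def make_fold_indices(n_samples, n_folds=10, seed=0):
    """Split the indices 0..n_samples-1 into n_folds nearly equal groups.

    Hypothetical helper: shuffling is an assumption made here so that
    folds do not depend on the original ordering of the data.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # array_split tolerates n_samples not being divisible by n_folds:
    # fold sizes then differ by at most one.
    return np.array_split(shuffled, n_folds)

folds = make_fold_indices(n_samples=25, n_folds=10)
for i, test_idx in enumerate(folds):
    # Every fold except the i-th forms the training set for this round.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(i, len(train_idx), len(test_idx))
```

Each index appears in exactly one test fold, so every data point is used for testing once and for training in the remaining rounds. The per-fold accuracies collected in such a loop could then be summarized with np.mean and np.std.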

Create a copy of this Colab notebook. Submit a publicly accessible link to your notebook copy for this problem in your submitted homework PDF.

Submission instructions

Submit a PDF on Laulima. Make sure that your Colab notebook is publicly accessible.