Homework 4: Practical Issues

Due date: Monday, April 10, 2023 at 11:55pm

Turn in a PDF submission. You may include photos of handwritten solutions within your PDF, but make sure they are legible. Points will be taken off for illegible solutions. Full credit on non-coding problems will only be given if you show your work.

Question 1: Recommendation Systems (0.5 points)

You aim to build a recommendation system for an online book store. The website has over 1 million items in its catalog but only 2,000 user ratings so far. Which of the following recommendation system choices would be best in this situation and why?

(A) User-user collaborative filtering

(B) Item-item collaborative filtering

(C) Content-based recommendation

Question 2: Effect of Tuning Parameters (0.75 points)

We have learned about several models in class. For each of the following models, we list a tuning parameter which aims to protect against overfitting. For each of the situations below, if you increase the parameter, does it lead to MORE or LESS regularization? Why? (0.1 points each)

Question 3: Unsupervised Learning (1.25 points)

We want to perform hierarchical clustering in a one-dimensional space for the following data points: 1, 4, 9, 16, 25. Show what happens at each step until there are two clusters. What are the 2 clusters?

Your answer should be a table with a row for each step. The table should contain one column for the members of the new cluster formed and another column for the corresponding centroid.
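As an illustration of the procedure only, the sketch below runs bottom-up (agglomerative) clustering on a different, made-up set of one-dimensional points. It assumes that the distance between two clusters is the distance between their centroids; the problem statement does not name a linkage rule, so treat that choice, the function name, and the sample points as assumptions.

```python
def centroid_agglomerative_1d(points, target_clusters):
    """Merge the two clusters with the closest centroids until
    only `target_clusters` clusters remain (1-D data only)."""
    # Start with every point in its own cluster, sorted left to right.
    clusters = [[p] for p in sorted(points)]
    steps = []
    while len(clusters) > target_clusters:
        centroids = [sum(c) / len(c) for c in clusters]
        # In 1-D, with clusters kept in sorted order, the closest pair
        # of centroids is always a pair of neighbors.
        i = min(range(len(clusters) - 1),
                key=lambda k: centroids[k + 1] - centroids[k])
        merged = clusters[i] + clusters[i + 1]
        clusters[i:i + 2] = [merged]
        # Record the new cluster and its centroid (one table row per step).
        steps.append((merged, sum(merged) / len(merged)))
    return steps, clusters


# Made-up example points (NOT the homework data).
steps, clusters = centroid_agglomerative_1d([2, 3, 7, 11, 20], 2)
for members, centroid in steps:
    print(members, centroid)
```

Each printed line corresponds to one row of the kind of table requested above: the members of the newly formed cluster and its centroid. Ties are broken by taking the leftmost pair, which is also an assumption.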

Question 4: Dimensionality Reduction (Extra Credit)

Consider a dataset with two features, x1 and x2. The data is given by the following table:

Sample x1 x2
A 4 1
B 2 3
C 5 4
D 1 0

(a) What are the eigenvectors and eigenvalues of the first and second principal components of this dataset? Perform the steps of PCA and show your work. You may use an online eigenvalue/eigenvector calculator. (1 point)

(b) Plot the original data points on a 2-dimensional plot. Draw the first principal component on this plot. Then, draw the projections of the 4 points onto the first principal component. (1 point)
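For reference, the steps of PCA can be sketched with NumPy as follows. The data here are made-up values, not the table above, and the sketch assumes the sample covariance (dividing by n - 1); if the course convention divides by n instead, the eigenvalues scale accordingly while the eigenvectors stay the same.

```python
import numpy as np

# Made-up illustrative data (NOT the homework table): rows are samples.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 8.0], [7.0, 6.0]])

# Step 1: center each feature at zero mean.
mean = X.mean(axis=0)
Xc = X - mean

# Step 2: sample covariance matrix (assumption: divide by n - 1).
cov = np.cov(Xc, rowvar=False)

# Step 3: eigendecomposition of the symmetric covariance matrix.
# eigh returns eigenvalues in ascending order, so reorder to descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Step 4: project the centered points onto the first principal component,
# then map the projections back into the original coordinate system.
pc1 = eigvecs[:, 0]
scores = Xc @ pc1
projected = mean + np.outer(scores, pc1)

print("eigenvalues:", eigvals)
print("first principal component:", pc1)
print("projected points:\n", projected)
```

The sign of an eigenvector is arbitrary, so a calculator may report the same direction with the opposite sign. The projected points do not depend on that sign.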

Question 5: Feature Engineering Contest (7.5 points)

In this problem, you will engineer the input features of a logistic regression model that identifies the person shown in an image (face recognition). We will be using the Labeled Faces in the Wild (LFW) dataset. Each picture is centered on a single face. More information about the dataset is here: http://vis-www.cs.umass.edu/lfw/.

Create a copy of this Colab notebook. While you should understand the entire notebook, your task is to fill in the transform_input_data() function only.
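To make the task concrete, here is a minimal sketch of the kind of transformation such a function could apply. The signature is hypothetical: it assumes the input is a NumPy array with one flattened grayscale image per row, which may not match the notebook, so check the notebook's actual definition before adapting anything. The transformation shown (standardizing each image to zero mean and unit variance) is only one simple example, not a recommended solution.

```python
import numpy as np


def transform_input_data(X):
    """Hypothetical sketch: X is assumed to have shape
    (n_samples, n_pixels), one flattened grayscale image per row."""
    X = np.asarray(X, dtype=np.float64)
    # Standardize each image separately to reduce the effect of
    # overall brightness and contrast differences between photos.
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid dividing by zero for constant images
    return (X - mean) / std


# Tiny made-up input: two "images" of four pixels each.
demo = np.array([[0.0, 50.0, 100.0, 150.0],
                 [10.0, 10.0, 10.0, 10.0]])
features = transform_input_data(demo)
print(features)
```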

Grading:

Submission instructions

Submit a PDF on Laulima. Make sure that your Colab notebook is publicly accessible. All of your code must be included in your submitted PDF.