Homework 4: Practical Issues

Due date: Monday, April 10, 2023 at 11:55pm

Turn in a PDF submission. You may include photos of handwritten solutions within your PDF, but make sure they are legible. Points will be taken off for illegible solutions. Full credit on non-coding problems will only be given if you show your work.

Question 1: Recommendation Systems (0.5 points)

You aim to build a recommendation system for an online book store. The website has over 1 million items in its catalog but only 2,000 user ratings so far. Which of the following recommendation system choices would be best in this situation and why?

(A) User-user collaborative filtering

(B) Item-item collaborative filtering

Question 2: Effect of Tuning Parameters (0.75 points)

We have learned about several models in class. For each of the following models, we list a tuning parameter which aims to protect against overfitting. For each of the situations below, if you increase the parameter, does it lead to MORE or LESS regularization? Why? (0.1 points each)

Logistic regression with \(\lambda \sum_j |w_j|\) penalty in the loss function. Higher \(\lambda\) leads to ___ regularization.
Linear regression with \(\lambda \sum_j w_j^2\) penalty in the loss function. Higher \(\lambda\) leads to ___ regularization.
Feature selection with mutual information scoring: Include a feature in the model only if its MI(feat, class) is higher than a threshold t. Higher t leads to ___ regularization.
Decision tree: n, an upper limit on the number of nodes in the tree. Higher n leads to ___ regularization.
Boosting: number of iterations, n. Higher n leads to ___ regularization.
Dimension reduction as preprocessing: Take the first k principal components from PCA. Higher k leads to ___ regularization.

Question 3: Unsupervised Learning (1.25 points)

We want to perform hierarchical clustering in a one-dimensional space for the following data points: 1, 4, 9, 16, 25. Show what happens at each step until there are two clusters. What are the 2 clusters?

Your answer should be a table with a row for each step. The table should contain one column for the members of the new cluster formed and another column for the corresponding centroid.
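To illustrate the expected table format, here is a minimal sketch of the merge procedure run on a different, hypothetical set of points (0, 1, 5, 7, 20). It assumes centroid linkage, meaning the two clusters with the closest centroids are merged at each step. The function name and the example points are illustrative only and are not part of the assignment.

```python
def centroid_agglomerative_1d(points, target_clusters=2):
    """Agglomerative clustering of 1-D points, assuming centroid linkage.

    Returns (steps, clusters), where steps holds one (members, centroid)
    row per merge and clusters is the final list of clusters.
    """
    clusters = [[p] for p in sorted(points)]
    steps = []
    while len(clusters) > max(target_clusters, 1):
        best = None  # (centroid distance, index i, index j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                centroid_i = sum(clusters[i]) / len(clusters[i])
                centroid_j = sum(clusters[j]) / len(clusters[j])
                distance = abs(centroid_i - centroid_j)
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        steps.append((merged, sum(merged) / len(merged)))
        # Replace the two merged clusters with the new one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return steps, clusters


# Hypothetical example points, deliberately different from the question's data.
steps, final_clusters = centroid_agglomerative_1d([0, 1, 5, 7, 20])
for members, centroid in steps:
    print(members, centroid)
```

Each printed row corresponds to one row of the table: the members of the newly formed cluster followed by its centroid.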

Question 4: Dimensionality Reduction (Extra Credit)

Consider a dataset with two features, x1 and x2. The data is given by the following table:

Sample	x1	x2
A	4	1
B	2	3
C	5	4
D	1	0

(a) What are the eigenvectors and eigenvalues of the first and second principal components of this dataset? Perform the steps of PCA and show your work. You may use an online eigenvalue/eigenvector calculator. (1 point)

(b) Plot the original data points on a 2-dimensional plot. Draw the first principal component on this plot. Then, draw the projections of the 4 points onto the first principal component. (1 point)
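As a way to check hand calculations, the PCA steps for two features can be sketched in plain Python: center the data, form the covariance matrix, solve the 2x2 eigenproblem in closed form, and project onto the first component. This is a sketch under stated assumptions: it uses the sample covariance (dividing by n - 1; some texts divide by n, which rescales the eigenvalues but not the eigenvectors), and the function name and example points are hypothetical and differ from the table above.

```python
import math


def pca_2d(points):
    """PCA for 2-D points; assumes sample covariance (divide by n - 1).

    Returns (eigenvalues, eigenvectors, pc1_scores), with eigenvalues in
    decreasing order and unit-length eigenvectors in matching order.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2
    det = sxx * syy - sxy * sxy
    root = math.sqrt(max(half_trace ** 2 - det, 0.0))
    eigenvalues = [half_trace + root, half_trace - root]

    # Solve (C - lam * I) v = 0, which gives v = (sxy, lam - sxx) when sxy != 0.
    if abs(sxy) > 1e-12:
        raw = [(sxy, lam - sxx) for lam in eigenvalues]
    elif sxx >= syy:
        raw = [(1.0, 0.0), (0.0, 1.0)]
    else:
        raw = [(0.0, 1.0), (1.0, 0.0)]
    eigenvectors = [
        (vx / math.hypot(vx, vy), vy / math.hypot(vx, vy)) for vx, vy in raw
    ]

    # Coordinate of each centered point along the first principal component.
    v1x, v1y = eigenvectors[0]
    pc1_scores = [x * v1x + y * v1y for x, y in centered]
    return eigenvalues, eigenvectors, pc1_scores


# Hypothetical example points, deliberately different from the question's data.
eigenvalues, eigenvectors, pc1_scores = pca_2d([(0, 1), (1, 0), (3, 4), (4, 3)])
print(eigenvalues)
print(eigenvectors)
print(pc1_scores)
```

An eigenvector is only defined up to sign, so a calculator may report the same direction with both components negated.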

Question 5: Feature Engineering Contest (7.5 points)

In this problem, you will engineer the input features of a logistic regression model that identifies the person shown in an image (face recognition). We will be using the Labeled Faces in the Wild (LFW) dataset. Each picture is centered on a single face. More information about the dataset is here: http://vis-www.cs.umass.edu/lfw/.

Create a copy of this Colab notebook. While you should understand the entire notebook, your task is to fill in the transform_input_data() function only.

Grading:

Implementation of a new feature engineering strategy: 1.25 points
At least three sentences explaining why your strategy does or does not perform better than the baseline solution provided in the notebook: 1.25 points
Code cleanliness and organization: 1.25 points
Code comments: 1.25 points
One paragraph (at least 7 sentences) description of your feature engineering strategy in your HW4 PDF report: 2.5 points
Extra credit: 0.5 extra points for each 3 percentage points of F1 score above 79% on a held-out test set not provided to you (minimum score for extra credit is therefore 82%)
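The extra credit rule can be read as the following arithmetic. This is a sketch of one interpretation, not the official grading script: it assumes that only complete 3-point steps above 79% count (consistent with the stated 82% minimum) and that there is no upper cap. The function and constant names are hypothetical.

```python
import math

BASELINE_F1_PERCENT = 79.0  # threshold stated in the rubric
STEP_PERCENT = 3.0          # size of each step above the threshold
POINTS_PER_STEP = 0.5       # extra points per complete step


def extra_credit_points(f1_percent):
    """Extra credit for a held-out F1 score given in percent (e.g. 85.0).

    Assumes partial steps earn nothing and there is no upper cap.
    """
    # The small epsilon guards against floating-point error at exact steps.
    full_steps = math.floor(
        (f1_percent - BASELINE_F1_PERCENT) / STEP_PERCENT + 1e-9
    )
    return max(full_steps, 0) * POINTS_PER_STEP


print(extra_credit_points(82.0))  # one complete step
print(extra_credit_points(86.5))  # two complete steps, partial third ignored
```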

Submission instructions

Submit a PDF on Laulima. Make sure that your Colab notebook is publicly accessible. All of your code must be included in your submitted PDF.