Housekeeping

  • Website organization update. Added materials page.
  • Start to bring headphones to class. You’ll be watching short videos & taking notes during class.
  • ISLR Logistics
    • We will use a single ISLR repo for all work. I will add files to it regularly and review your work using the feedback branch. Don’t merge this branch. Push and pull regularly.
    • Before class: Watch videos and prepare answers to selected questions
    • During class: Discuss and compare answers to questions with classmates, sometimes will watch additional videos and share out.
    • After class
      • Finish answering any remaining questions
      • Complete any practice problems and/or activities assigned
  • Project Logistics
    • Weekly wednesday updates to the project page on the ADS website.
    • First (10 min) project report out on Thursday the 10th. Then every 2 weeks. Present overhead to class and possibly client via Zoom.
  • Reading logistics
    • Reading discussion every other week
    • Learning journal updates on off weeks.

Learning Path

Where we’ve been

  • Getting orientated with the semester long project
  • Practicing data wrangling and report writing for a professional audience.

Where we’re at

  • Learning how to balance textbook learning and project based learning while keeping the broader ethical implications in mind.
  • If you didn’t have an organization schedule for your classes yet, you should do so asap. The workload in this class is going to ramp up a bit.

Where we’re going

  • Digging into mathematical models of statistical learning.
  • Learning new R code methods, practicing building models.

Learning Objectives

  • Describe the difference between training and testing data sets
  • Describe the difference between a parametric and non-parametric model
  • Identify and describe situations where classification, regression, and clustering models are appropriate.
  • Explain the concept of overfitting, and bias-variance tradeoff.

Tuesday

Prepare

👥 Discuss these questions in your group and write the answers in the ch2-statistical-learning.Rmd file in your ISRL repo. You may not finish all questions in the time allotted during class, you will have to finish outside of class.

Statistical Learning and Regression (11:41)

  1. What is f?
  2. Why do we care about estimating f?
  3. Describe the two types of errors in a model.
  4. What concept in 456 does irreducible error portion of the model correspond to?

Curse of Dimensionality and Parametric Models (11:40)

  1. Summarize the curse of dimensionality.
  2. What is the difference between a parametric & non-parametric model? Give an example of each.
  3. What are the advantages & disadvantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)?
  4. Why would we ever choose to use a more restrictive method instead of a very flexible approach?
  5. What is the difference between supervised & unsupervised learning? Give an example of each.

Thursday

Prepare

  • Finish ISLR Chapter 2
  • Open project work time

Assessing Model Accuracy and Bias-Variance Trade-off (10:04)

  1. What is the primary measure of model accuracy for regression models?
  2. Compare and contrast a smoothing spline to a linear regression line. (What is the same, what is different)
  3. What’s the difference between training and testing data? Why do we need both?
  4. What is overfitting?
  5. If we don’t have a testing data set, what method can we use to estimate the MSE of the testing data?
  6. What is the bias-variance trade-off?

Classification Problems and K-Nearest Neighbors (15:37)

  1. What is the primary measure of model accuracy for classification models?
  2. Describe the Bayes Classifier
  3. What is the Bayes error rate?
  4. What is a limitation of the Bayes classifier?
  5. Describe how the K-Nearest Neighbors classifier works.
  6. Name a benefit of using a KNN model.
  7. What happens to the accuracy/bias of the model as the K increases? Why?

Assignments

Full details in your ISLR repo.