INFO-I 415 Introduction to Statistical Learning
3 credits
- Prerequisite(s): None
- Delivery: On-Campus, Online
- Semesters offered: Fall, Spring, Summer (Check the schedule to confirm.)
Description
This course applies statistical learning methods for data mining and for inferential and predictive analytics to informatics-related fields. The course also covers techniques for exploring and visualizing data, assessing model accuracy, and weighing the merits of different methods for a given real-world application. This course provides an essential toolset for transforming large, complex informatics datasets into actionable knowledge.
Program Learning Outcomes Supported
- A2: Data Literacy - Recognize that data can have value and play a key role in society by providing opportunities to expand knowledge, to innovate, and to influence.
- B1: Data Science - Organize, visualize, and analyze large, complex datasets using descriptive statistics and graphs to make decisions.
- B5: Data Science - Identify, assess, and select appropriately among data analytics methods and models for solving real-world problems, weighing their advantages and disadvantages.
- B6: Data Science - Understand data science concepts, techniques, and tools to support big data analytics.
- B7: Data Science - Analyze datasets with the following supervised learning methods: for functional approximation, multiple linear regression, splines, and local regression; for classification, logistic regression, linear discriminant analysis, decision trees, bagging, random forests, boosting, and support vector machines.
- B8: Data Science - Analyze datasets with the following unsupervised learning methods: for dimensionality reduction, principal components analysis; for grouping, k-means clustering and hierarchical clustering.
- B9: Data Science - Explore, transform, and visualize large, complex datasets with graphs in R.
- B12: Data Science - Write programs to perform data analytics on large, complex datasets in R.
- C5: Information Science - Understand critical issues associated with the storage, backup, and security of data.
Learning Outcomes
- Analyze datasets with the following supervised learning methods: for functional approximation, multiple linear regression, splines, and local regression; for classification, logistic regression, linear discriminant analysis, decision trees, bagging, random forests, boosting, and support vector machines.
- Analyze datasets with the following unsupervised learning methods: for dimensionality reduction, principal components analysis; for grouping, k-means clustering and hierarchical clustering.
- Explore, transform, and visualize large, complex datasets with graphs in R.
- Solve real-world problems by adapting and applying statistical learning methods to large, complex datasets.
- Identify and select appropriately among statistical learning methods for a particular real-world problem; analyze each method with respect to a given dataset or research question in terms of modeling accuracy and the bias-variance tradeoff; perform model assessment (i.e., estimate test error rates) and selection by resampling (cross-validation and bootstrapping); identify overfitting and underfitting; perform model selection and regularization by subset selection and shrinkage methods (ridge regression and the lasso); explain the relative advantages and disadvantages of each statistical learning method for the real-world problem.
- Write programs to perform data analytics on large, complex datasets in R.
- Analyze data from case studies in informatics-related fields (e.g., digital media, human-computer interaction, health informatics, bioinformatics, and business intelligence).
Profiles of Learning for Undergraduate Success (PLUS) Alignment
Instructors align their courses with the Profiles of Learning for Undergraduate Success. The profiles provide students various opportunities to deepen disciplinary understanding, participate in engaged learning, and refine what it means to be a well-rounded, well-educated person prepared for lifelong learning and success.
- P2.1 Problem Solver – Thinks critically.
- P2.3 Problem Solver – Analyzes, synthesizes, and evaluates.
- P3.2 Innovator – Creates/designs.
- P3.1 Innovator – Investigates.
Course Overview
Module 0: Introduction to the Course / Getting Started
- Course Basics and Course Navigation
- Course Structure and Schedule
- Accessibility Acknowledgement
- Writing Resources and Student Engagement Roster
- What is Zoom @IU?
- How to Create a Video
Module 1: Overview of Statistical Learning
- Understand and use the terminology of statistical learning.
- Estimate f, the unknown function relating the predictors to the response.
- Establish the difference between supervised and unsupervised learning.
- Establish the difference between regression and classification.
- Assess the accuracy of models to select the best approach to perform statistical learning.
Module 2: Introduction to R - Part 1
- Installing R
- Running R in Interactive Mode or Batch Mode
- Vectors
- Functions
Module 3: Introduction to R - Part 2
- Introduction to MapReduce
- MapReduce Example(s)
- Application of MapReduce
Module 4: Introduction to Linear Regression
- Linear Regression as a Concept
- Simple Linear Regression
- Multiple Linear Regression
- Least Squares
Module 5: Implementation of Linear Regression
- R Implementation of Linear Regression
Module 6: Introduction to Classification – Part 1
- Classification as a Concept
- A First Classification Rule
- Why Not Linear Regression?
Module 7: Introduction to Classification – Part 2
- Linear Discriminant Analysis
- Quadratic Discriminant Analysis
- K-Nearest Neighbors
- Different situations, different methods
Module 8: Resampling Methods
- Modern Computational Era
- Cross-Validation
- Validation Set Approach
- Types of Cross-Validation
- Bias vs. Variance
- Cross-Validation in Classification
- The Bootstrap
Module 9: Linear Model Selection and Regularization
- Alternative Fitting Methodologies
- Stepwise Selection Methods
- Regularization/Shrinkage
- Dimension Reduction
Module 10: Moving Beyond Linearity
- The Move to Non-linear Forms
- Polynomial Regression
- Step Functions
- Splines
- Piecewise Polynomials
Module 11: Tree-Based Methods
- Tree-Based Methods
- Regression Trees
- Classification Trees
Module 12: Hadoop Architecture
- Ecosystem
- Hadoop Components
- Hadoop Schedulers
- Cluster Management
Module 13: Support Vector Machines
- Concept of SVMs
- Maximal Margin Classifier
- Support Vector Classifier (non-separable classes)
- Support Vector Machines (non-linear kernels)
Policies and Procedures
Please be aware of the following linked policies and procedures. Note that instructors will have stipulations specific to their individual courses.