Courses

INFO-I 415 Introduction to Statistical Learning

3 credits

Prerequisite(s): None
Delivery: On-Campus, Online
Semesters offered: Fall, Spring, Summer (Check the schedule to confirm.)

Description

This course applies statistical learning methods for data mining and inferential and predictive analytics to informatics-related fields. The course also covers techniques for exploring and visualizing data, assessing model accuracy, and weighing the merits of different methods for a given real-world application. Together, these methods form an essential toolset for transforming large, complex informatics datasets into actionable knowledge.

Program Learning Outcomes Supported

A2: Data Literacy - Recognize that data can have value and play a key role in society by providing opportunities to expand knowledge, to innovate, and to influence.
B1: Data Science - Organize, visualize, and analyze large, complex datasets using descriptive statistics and graphs to make decisions.
B5: Data Science - Identify, assess, and select appropriately among data analytics methods and models for solving real-world problems, weighing their advantages and disadvantages.
B6: Data Science - Understand data science concepts, techniques, and tools to support big data analytics.
B7: Data Science - Analyze datasets with the following supervised learning methods: for functional approximation, multiple linear regression, splines, and local regression; for classification, logistic regression, linear discriminant analysis, decision trees, bagging, random forests, boosting, and support vector machines.
B8: Data Science - Analyze datasets with the following unsupervised learning methods: for dimensionality reduction, principal components analysis; for grouping, k-means clustering and hierarchical clustering.
B9: Data Science - Explore, transform, and visualize large, complex datasets with graphs in R.
B12: Data Science - Write programs to perform data analytics on large, complex datasets in R.
C5: Information Science - Understand critical issues associated with the storage, backup, and security of data.

Learning Outcomes

Analyze datasets with the following supervised learning methods: for functional approximation, multiple linear regression, splines, and local regression; for classification, logistic regression, linear discriminant analysis, decision trees, bagging, random forests, boosting, and support vector machines.
Analyze datasets with the following unsupervised learning methods: for dimensionality reduction, principal components analysis; for grouping, k-means clustering and hierarchical clustering.
Explore, transform, and visualize large, complex datasets with graphs in R.
Solve real-world problems by adapting and applying statistical learning methods to large, complex datasets.
Identify and select appropriately among statistical learning methods for a particular real-world problem; analyze each method with respect to a given dataset or research question in terms of modeling accuracy and the bias-variance tradeoff; perform model assessment (i.e., estimate test error rates) and selection by resampling (cross-validation and bootstrapping); identify overfitting and underfitting; perform model selection and regularization by subset selection and shrinkage methods (ridge regression and the lasso); explain the relative advantages and disadvantages of each statistical learning method for the real-world problem.
Write programs to perform data analytics on large, complex datasets in R.
Analyze data from case studies in informatics-related fields (e.g., digital media, human-computer interaction, health informatics, bioinformatics, and business intelligence).

Profiles of Learning for Undergraduate Success (PLUS) Alignment

Instructors align their courses with the Profiles of Learning for Undergraduate Success. The profiles provide students various opportunities to deepen disciplinary understanding, participate in engaged learning, and refine what it means to be a well-rounded, well-educated person prepared for lifelong learning and success.

P2.1 Problem Solver – Thinks critically.
P2.3 Problem Solver – Analyzes, synthesizes, and evaluates.
P3.1 Innovator – Investigates.
P3.2 Innovator – Creates/designs.

Course Overview

Module 0: Introduction to the Course / Getting Started

Course Basics and Course Navigation
Course Structure and Schedule
Accessibility Acknowledgement
Writing Resources and Student Engagement Roster
What is Zoom @IU?
How to Create a Video

Module 1: Overview of Statistical Learning

Understand and use the terminology of statistical learning
Estimate f, the unknown function relating the predictors to the response
Distinguish between supervised and unsupervised learning
Distinguish between regression and classification
Assess the accuracy of models to select the best statistical learning approach

Module 2: Introduction to R - Part 1

Installing R
Running R in Interactive Mode or Batch Mode
Vectors
Functions

Module 3: Introduction to R - Part 2

Introduction to MapReduce
MapReduce Example(s)
Application of MapReduce

Module 4: Introduction to Linear Regression

Linear Regression as a Concept
Simple Linear Regression
Multiple Linear Regression
Least Squares
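
As a minimal sketch of the least squares idea listed above for simple linear regression: the intercept and slope are chosen to minimize the sum of squared residuals. The course itself works in R; Python is used here only for illustration, and the function name and toy data are hypothetical.

```python
def fit_simple_least_squares(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x, and cross-deviations of x and y.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy data lying exactly on the line y = 1 + 2x.
intercept, slope = fit_simple_least_squares(
    [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
)
print(intercept, slope)
```

Because the toy points lie exactly on a line, the fit recovers that line; with noisy data the same formulas give the line with the smallest residual sum of squares.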

Module 5: Implementation of Linear Regression

R Implementation of Linear Regression

Module 6: Introduction to Classification – Part 1

Classification as a Concept
A First Classification Rule
Why Not Linear Regression?

Module 7: Introduction to Classification – Part 2

Linear Discriminant Analysis
Quadratic Discriminant Analysis
K-Nearest Neighbors
Different Situations, Different Methods

Module 8: Resampling Methods

Modern Computational Era
Cross-Validation
Validation Set Approach
Types of Cross-Validation
Bias vs. Variance
Cross-Validation in Classification
The Bootstrap
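
The cross-validation topic above can be sketched as follows: the data are split into k folds, each fold is held out once while a model is fit on the rest, and the held-out errors are averaged to estimate test error. This is an illustrative sketch only, not course material. The course itself works in R, and the helper names, the mean-only baseline model, and the toy data are all hypothetical.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is held out exactly once."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cv_mse(xs, ys, fit, predict, k=5, seed=0):
    """Average the held-out mean squared error across k folds."""
    fold_errors = []
    for train_idx, test_idx in k_fold_splits(len(xs), k, seed):
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        sq_errs = [(ys[j] - predict(model, xs[j])) ** 2 for j in test_idx]
        fold_errors.append(sum(sq_errs) / len(sq_errs))
    return sum(fold_errors) / k


# Hypothetical baseline model: always predict the training-set mean of y.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
estimate = cv_mse(xs, ys, fit_mean, predict_mean, k=5)
print(estimate)
```

Passing `fit` and `predict` as arguments keeps the resampling loop independent of the model, so the same loop could compare several candidate methods on one dataset.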

Module 9: Linear Model Selection and Regularization

Alternative Fitting Methodologies
Stepwise Selection Methods
Regularization/Shrinkage
Dimension Reduction

Module 10: Moving Beyond Linearity

The Move to Non-linear Forms
Polynomial Regression
Step Functions
Splines
Piecewise Polynomials

Module 11: Tree-Based Methods

Tree-Based Methods
Regression Trees
Classification Trees

Module 12: Hadoop Architecture

Ecosystem
Hadoop Components
Hadoop Schedulers
Cluster Management

Module 13: Support Vector Machines

Concept of SVMs
Maximal Margin Classifier
Support Vector Classifier (non-separable classes)
Support Vector Machines (non-linear kernels)

Policies and Procedures

Please be aware of the following linked policies and procedures. Note that in individual courses instructors will have stipulations specific to their course.

Luddy School of Informatics, Computing, and Engineering
