INFO-H 510 Statistics for Data Science
3 credits
- Prerequisite(s): None
- Delivery: On-Campus
Description
This course introduces statistical inference for big data. It covers distributions, confidence intervals, hypothesis testing, ANOVA, linear models, bias, model critique, and effective data communication. Students learn data analysis, wrangling, and visualization through hands-on programming projects.
Topics
Data basics
- Understanding data types
- Data collection methods
- Sample vs. population
Programming in R
- Data frames and vectors
- Tidyverse for data summarization
- Basic visualization
Probability concepts
- Sets and combinatorics
- Frequentist vs. Bayesian perspectives
- Monty Hall problem
Statistical distributions
- Random variables
- Normality
- Central limit theorem
Ethics and epistemology
- Discussing biases
- Outliers and missing values
- Ethics of statistical practice
Inferential statistics
- Confidence intervals
- Covariance and correlation
- Multivariate analysis
Hypothesis testing
- Type I/II errors
- P-values
- T-tests and sample size calculations
- Bonferroni correction
- F-test
- ANOVA
Regression analysis
- Ordinary least squares
- Simpson's paradox
- Model diagnostics
Generalized linear models
- Variable transformation
- Link functions
- Interpretation
Model critique
- Evaluating diagnostic plots
- Ethical considerations
- Peer reviews
Data communication
- Problem formulation
- Business use cases
- Risk evaluation
Learning Outcomes
- Assess the limitations of data as a representation of nature, and evaluate how statistics can mitigate those limitations (e.g., biases).
- Write programs in R that load, summarize, and effectively visualize data represented in all supported data types.
- Analyze a dataset using various methods of statistical inference, such as bootstrapping; investigate useful relations among variables; and address any epistemological or ethical concerns with the results.
- Devise a testable hypothesis about some phenomenon, and apply an appropriate statistical test to measure the strength of evidence against it.
- Evaluate the results of performing an inferential regression analysis on various types of response variables (e.g., continuous, count, or proportional/rate-based data), including the model's reliability and validity.
- Draw valuable and actionable conclusions from statistical analyses, and present results to an audience of stakeholders with transparency and confidence.
- Critically review the statistical analyses of others, evaluate the efficacy of each component of their analysis, and propose how it could be improved.
Policies and Procedures
Please be aware of the following linked policies and procedures. Note that individual instructors may have additional stipulations specific to their courses.