CSCI-B 365 Introduction to Data Analysis and Mining
3 credits
- Prerequisite(s): CSCI-C 310 or CSCI-C 343
- Delivery:
Description
This course studies the computational aspects of discovering patterns and relationships in large datasets. It introduces the fundamental concepts of data mining and provides hands-on experience in data collection, preprocessing, analysis, clustering, and prediction.
Topics
Data representation
- Types of variables
- Data matrix
- Boxplots
- Scatterplots and pairs plots
- Visualizing spatial data and maps
- R coding
Probability theory
- Basic discrete probability
- Probability through counting and simulation
- Joint probability and conditional probability
- Independence and conditional independence
- Bayes' rule
- Tables and Simpson's paradox
Classification
- Bayes classifier
- Nearest neighbor classifiers
- Decision tree classifiers
- Quantifying classifier performance
Regression
Clustering
- K-means clustering algorithm
- Vector quantization
Model evaluation
Learning Outcomes
- Understand the fundamental concepts and principles of data mining, including various types of data, data preprocessing, and data transformation techniques. CS 4
- Apply various data visualization techniques to effectively represent and interpret complex datasets. CS 4
- Apply counting, simulation, and probability theory to solve data-driven problems. CS 4
- Analyze joint and conditional probabilities, apply Bayes’ rule, reason about independence and conditional independence, and recognize instances of Simpson’s paradox in data analysis. CS 4
- Implement and evaluate classification algorithms, such as the Bayes classifier, nearest neighbor classifiers, and decision tree classifiers. CS 4
- Quantify and interpret classifier performance metrics to assess the accuracy and reliability of classification results. CS 4
- Apply regression analysis to model and predict relations between variables within datasets. CS 4
- Explain the concepts of clustering and different clustering algorithms, particularly the k-means clustering algorithm. CS 4
- Use R coding to implement data mining techniques and algorithms, enhancing practical skills in data analysis. CS 4
- Evaluate the appropriateness of different data mining techniques for various real-world applications. CS 4
Policies and Procedures
Please be aware of the following linked policies and procedures. Note that individual instructors may have additional stipulations specific to their courses.