INFO-I 428 Web Mining
3 credits
- Prerequisite(s): INFO-B 210 or CSCI-A 204 or CSCI 23000 or CSCI-C 200
- Delivery: On-Campus, Online
- Semesters offered: Fall, Spring (Check the schedule to confirm.)
Description
This course covers concepts and methods used to search the web and other sources of unstructured text from a human-centered standpoint. These include document indexing, crawling, classification, and clustering; distance metrics; analyzing streaming data, such as social media; link analysis; and system evaluation.
Program Learning Outcomes Supported
- A1: Data Literacy - Distinguish between data, information, and knowledge.
- A2: Data Literacy - Recognize that data can have value and play a key role in society by providing opportunities to expand knowledge, to innovate, and to influence.
- B1: Data Science - Organize, visualize, and analyze large, complex datasets using descriptive statistics and graphs to make decisions.
- B5: Data Science - Identify, assess, and select appropriately among data analytics methods and models for solving real-world problems, weighing their advantages and disadvantages.
- B6: Data Science - Understand data science concepts, techniques, and tools to support big data analytics.
Learning Outcomes
- Implement web search concepts and methods to return documents automatically based on user queries.
- Design and implement a crawler application to collect and index documents from the web.
- Design computational methods to classify documents by topic.
- Use distance metrics to compute the similarity of pairs of documents.
- Create a system to collect and analyze streaming data.
- Use link analysis to rank web search results.
- Evaluate the performance of web search systems.
- Analyze text to determine the reliability of the information, including potential bias.
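As an illustration of the distance-metric outcome above, the sketch below computes cosine similarity between two documents represented as bags of words. It is a minimal example and not taken from the course materials: the function names `term_counts` and `cosine_similarity`, the whitespace tokenization, and the sample strings are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase the text, split on whitespace, and count each term."""
    return Counter(text.lower().split())


def cosine_similarity(doc_a, doc_b):
    """Return the cosine of the angle between two term-count vectors.

    The result lies in [0, 1] for count vectors: 0 means no shared terms,
    and values near 1 mean the term distributions are nearly identical.
    """
    counts_a = term_counts(doc_a)
    counts_b = term_counts(doc_b)

    # Dot product over the terms the two documents share.
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared_terms)

    # Euclidean length of each count vector.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty document has no direction; treat its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical sample documents, for demonstration only.
    print(cosine_similarity("web mining and search", "web search engines"))
```

A practical system would usually add steps such as stop-word removal, stemming, and TF-IDF weighting before comparing documents; they are omitted here to keep the sketch short.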
Profiles of Learning for Undergraduate Success (PLUS) Alignment
Instructors align their courses with the Profiles of Learning for Undergraduate Success. The profiles provide students various opportunities to deepen disciplinary understanding, participate in engaged learning, and refine what it means to be a well-rounded, well-educated person prepared for lifelong learning and success.
- P2.1 Problem Solver – Thinks critically.
- P2.3 Problem Solver – Analyzes, synthesizes, and evaluates.
- P3.2 Innovator – Creates/designs.
Course Overview
Module 0: Introduction to Course / Getting Started
- Course Basics and Course Navigation
- Course Structure and Schedule
- Accessibility Acknowledgement
- Writing Resources and Student Engagement Roster
- What is Zoom @IU?
- How to Create a Video
Module 1: Overview of Data Mining
- Why Data Mining?
- Predictive User Modeling
- Types of Web Mining
- The KDD Process
Module 2: Introduction to Web Mining
- What is Web Mining?
- Types of Web Mining
- Web Content and Web Structure Mining
- E-Commerce Data
Module 3: The KDD Process (Preparation & Preprocessing)
- Data Preprocessing and Cleaning
- Smoothing Noisy Data
- Data Integration & Normalization
- Data Discretization Methods
- Principal Component Analysis
Module 4: Understanding Basic Characteristics of Data
Module 5: Mining Frequent Patterns
- Association Rule Discovery
- The Apriori Algorithm
Module 6: Mining Sequential Patterns
- Sequential Pattern Mining
- GSP Mining Algorithm
Module 7: Intro to Classification and Prediction
- What Is Classification?
- Prediction, Clustering, Classification
Module 8: Classification Methods
- Decision Tree Learning
- Bayesian Classification
Module 9: Prediction Modeling for Personalization & Recommender Systems
- What Is Prediction?
- Recommender Systems: Common Approaches
- Content-Based Recommendation
- Collaborative Recommender Systems
Module 10: Clustering
- Basic Concepts and Algorithms
- Applications of Cluster Analysis
- Major Clustering Approaches
- K-Means Algorithm
- Hierarchical Clustering Algorithms
Module 11: Applications in Web Mining and Web Personalization
- PACT
- Content Enhanced Transactions
- Clustering and Collaborative Filtering
Module 12: Data Preparation for Web Usage Analysis
- Web Usage Mining Revisited
- Common Clickstream Data Sources
- Simplified Web Access Layout
- HTTP Protocol
- User Tracking via Cookies & Web Bugs
- Sessionization Heuristics
Module 13: E-Metrics and E-Business Analytics
- Web Usage Mining & E-Business Analytics
- Different Levels of Analysis
- Basic Site Metrics
- Shopping Pipeline Analysis
- E-Customer Life Cycle
Module 14: Personalizing & Recommender Systems
- Content-Based Recommenders
- Personalized Search Agents
- Social / Collaborative Tags
- Collaborative Filtering Process
- Data Sparsity Problems
Policies and Procedures
Please be aware of the following linked policies and procedures. Note that instructors of individual courses will have stipulations specific to their course.