INFO-I 416 Cloud Computing for Data Science
3 credits
- Prerequisite(s): INFO-B 211 OR CSCI-A 205 OR CSCI-C 200 OR CSCI 23000; Recommended: INFO-I 308
- Delivery: On-Campus, Online
- Semesters offered: Fall, Spring (Check the schedule to confirm.)
Description
This course covers data science concepts, techniques, and tools to support big data analytics, including cloud computing, parallel algorithms, nonrelational databases, and high-level language support. The course applies the MapReduce programming model and virtual-machine utility computing environments to data-driven discovery and scalable data processing for scientific applications.
Topics
- Data-intensive sciences and the data center model
- Clouds with infrastructure, platform, and software as a service
- Virtualization technologies and tools
- Introduction to FutureGrid (or OpenStack) as an experimental testbed
- Parallel programming using MapReduce vs. MPI
- MapReduce and data parallel applications using Hadoop
- Iterative MapReduce and data mining algorithms using Twister (expectation maximization, clustering, multidimensional scaling, latent Dirichlet allocation, Bayes networks)
- MapReduce on multicore CPUs and graphics processing units (CUDA)
- NoSQL databases (Google Bigtable and Hadoop HBase) and parallel query processing
- High-level languages (Hive and Pig)
- Amazon EC2 and Microsoft Azure and their applications
Program Learning Outcomes Supported
- B1: Data Science - Organize, visualize, and analyze large, complex datasets using descriptive statistics and graphs to make decisions.
- B5: Data Science - Identify, assess, and select appropriately among data analytics methods and models for solving real-world problems, weighing their advantages and disadvantages.
- B6: Data Science - Understand data science concepts, techniques, and tools to support big data analytics.
- B13: Data Science - Analyze data from case studies in informatics-related fields.
- C5: Information Science - Understand critical issues associated with the storage, backup, and security of data.
Learning Outcomes
- Explain the main concepts, models, technologies, and services of cloud computing, the reasons for the shift to this model, and its advantages and disadvantages.
- Examine the technical capabilities and commercial benefits of hardware virtualization.
- Analyze tradeoffs for data centers in performance, efficiency, cost, scalability, and flexibility.
- Explain the core challenges of cloud computing deployments, including public, private, and community clouds, in terms of privacy, security, and interoperability.
- Create cloud computing infrastructure models.
- Demonstrate and compare the use of cloud storage offerings, such as Amazon S3, Microsoft Azure, OpenStack, and the Hadoop Distributed File System (HDFS).
- Develop, install, and configure cloud-computing applications under software-as-a-service principles, employing Pig, Hive, and other cloud-computing frameworks and libraries.
- Apply the MapReduce programming model to data analytics in informatics-related domains.
- Enhance MapReduce performance by redesigning the system architecture (e.g., provisioning and cluster configurations).
Profiles of Learning for Undergraduate Success (PLUS) Alignment
Instructors align their courses with the Profiles of Learning for Undergraduate Success. The profiles provide students various opportunities to deepen disciplinary understanding, participate in engaged learning, and refine what it means to be a well-rounded, well-educated person prepared for lifelong learning and success.
- P2.1 Problem Solver – Thinks critically.
- P2.3 Problem Solver – Analyzes, synthesizes, and evaluates.
- P3.2 Innovator – Creates/designs.
- P4.3 Community Contributor – Behaves ethically.
Course Overview
Module 0: Introduction to Course / Getting Started
- Course Basics and Course Navigation
- Course Structure and Schedule
- Accessibility Acknowledgement
- Writing Resources and Student Engagement Roster
- What is Zoom @IU?
- How to Create a Video
Module 1: Introduction to Cloud Computing
- Introduction to Cloud Computing
- Characteristics of Cloud Computing
- Cloud Computing Models
Module 2: Python Programming Review
- Intensive review of Python programming
- Introduction to Lambda Functions
- Using Google Colaboratory
Module 3: Introduction to MapReduce
- Introduction to MapReduce
- MapReduce example(s)
- Application of MapReduce
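As a rough illustration of the MapReduce programming model named in this module, the sketch below counts words in plain Python. It is not course material: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample input are hypothetical, and a real framework such as Hadoop or Spark would distribute these phases across a cluster rather than run them in one process.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    In a real framework this grouping happens between the map and
    reduce phases; here it is simulated with an in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the map, shuffle, and reduce steps in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    sample = ["the cloud stores data", "the cluster processes data"]
    print(word_count(sample))
```

The map and reduce functions are kept free of shared state, which is the property that lets a framework run them in parallel on separate partitions of the input.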
Module 4: Introduction to Apache Spark
- What is Apache Spark?
- Spark's design philosophy
- Spark's Unified Analytics
- Spark's Distributed Execution
Module 5: Getting Started with Spark’s Applications
- Introduction to PySpark
- Spark’s Directories and Files
- Understanding Spark Application Concepts
- Transformations, Actions, and Lazy Evaluation
Module 6: Introduction to RDDs
- What are RDDs?
- Why RDDs? Partitions
- Creating RDDs
- RDD Operations
- Working with Key-Value Pairs
Module 7: DataFrames and Spark SQL
- What are DataFrames?
- What is Spark SQL?
- Spark SQL: Tables
- Spark SQL: Creating SQL Databases and Tables
- Spark SQL Application (Structured data)
- Spark SQL Application (Unstructured data)
Module 8: Spark MLlib (Part 1)
- What is Spark MLlib?
- Overview of Machine Learning
- Supervised Learning
- Unsupervised Learning
Module 9: Spark MLlib (Part 2)
- Applications of Spark MLlib
- Linear Regression
Module 10: Spark GraphX
- What is GraphX?
- What are graphs?
- Types of Graphs
- Degree of Vertices
- Graph Representations
- Application of GraphX
Module 11: Spark Streaming
- Introduction to Spark Streaming
- Streaming Operation
- Application of Spark Streaming
Module 12: Hadoop Architecture
- Ecosystem
- Hadoop Components
- Hadoop Schedulers
- Cluster Management
Module 13: Amazon Web Services
Module 14: Final Project Preparation
Module 15: Final Project Presentation and Final Exam
Policies and Procedures
Please be aware of the following linked policies and procedures. Note that in individual courses instructors will have stipulations specific to their course.