INFO-H 516 Cloud Computing for Data Science
3 credits
- Prerequisite(s): CSCI 54100, LIS S511, INFO B512, or INFO B556; prior programming experience required
- Delivery: On-Campus
Description
This course covers data science concepts, techniques, and tools to support big data analytics, including cloud computing, parallel algorithms, nonrelational databases, and high-level language support. The course applies the MapReduce programming model and virtual-machine utility computing environments to data-driven discovery and scalable data processing for scientific applications.
Topics
- Cloud service models: infrastructure, platform, and software as a service (IaaS, PaaS, SaaS)
- Virtualization technologies and tools
- MapReduce and data-parallel applications using Apache Spark
- Apache Hadoop Distributed File System
- YARN cluster resource management and the Mesos distributed systems kernel
- Large-scale data storage: NoSQL databases (Google Bigtable and Apache HBase) and parallel query processing
- Large-scale machine learning: classification, regression, and clustering using Spark MLlib
- Spark Streaming
- Amazon Web Services (EC2 and S3) and its applications
- Exploring large spatiotemporal datasets
Learning Outcomes
- Research the main concepts, models, technologies, and services of cloud computing, the reasons for the shift to this model, and its advantages and disadvantages.
- Examine the technical capabilities and commercial benefits of hardware virtualization.
- Analyze tradeoffs for data centers in performance, efficiency, cost, scalability, and flexibility.
- Evaluate the core challenges of cloud computing deployments, including public, private, and community clouds, with respect to privacy, security, and interoperability.
- Create cloud computing infrastructure models.
- Demonstrate and compare the use of cloud storage vendor offerings.
- Develop, install, and configure cloud computing applications under software-as-a-service principles, employing cloud computing frameworks and libraries.
- Apply the MapReduce programming model to data analytics in informatics-related domains.
- Enhance MapReduce performance by redesigning the system architecture (e.g., provisioning and cluster configurations).
- Overcome difficulties in managing very large datasets, both structured and unstructured, using nonrelational data storage and retrieval (NoSQL), parallel algorithms, and cloud computing.
- Apply the MapReduce programming model to data-driven discovery and scalable data processing for scientific applications.
Policies and Procedures
Please be aware of the following linked policies and procedures. Note that individual instructors will have stipulations specific to their courses.