Data Science with Apache Spark

Description: Apache Spark is an open-source processing engine for large-scale data analysis. It has been adopted by organizations across a wide range of industries, including Facebook, Hotels.com, Cisco, Microsoft and Netflix, and has been deployed at massive scale, e.g. processing multiple petabytes of data on clusters of over 8,000 nodes. The Spark analytics platform unifies data science and engineering across the machine-learning lifecycle, from data preparation to experimentation and deployment of machine-learning applications. This course provides an overview of Spark and of parallelizing machine-learning algorithms at a conceptual level. While limiting the time spent on machine-learning theory and the internal workings of Spark, we will focus on using Spark to explore large datasets, develop machine-learning pipelines, and use the algorithms available in the Spark MLlib DataFrame-based API. We will work through examples that show how to apply Spark to iterate faster and develop models on large datasets. These building blocks will enable you to use Spark, together with its documentation, to solve a variety of data-analysis and machine-learning tasks. We will also learn how to set up and run Spark jobs on Canada’s national HPC resources, e.g. Graham on SHARCNET. Although Spark supports applications written in Java, Scala, Python, R and SQL, the examples and demos in this course will be provided in Python.

Instructor: Jinhui Qin, SHARCNET, Western University

Prerequisites: Some programming experience in Python; a background in statistics and machine learning would be helpful.