Big Data Modeling

Description: Big Data is a term that describes large volumes of structured or unstructured data. Such data arrives from multiple sources with high velocity, volume, and variety, and extracting meaningful value from it requires an efficient way to process it. In this course, we will use Apache Spark as the data processing engine to analyze large amounts of data. Spark can manipulate both structured and unstructured datasets, and it extends the MapReduce concept. Spark also has a machine learning (ML) library that can help us predict outcomes from the data we are analyzing; we will use Spark ML to make predictions on real data. We will learn how to use Spark from the Unix command line and in notebooks, using SHARCNET resources. Other aspects of Big Data will also be discussed, such as Hadoop and applications in science and industry.
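
For readers new to the MapReduce concept mentioned above, the following is a minimal sketch in plain Python (no Spark required) of the map, shuffle, and reduce phases applied to a word count. It is an illustration only, not course material: the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count`, as well as the sample sentences, are hypothetical and chosen for this example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, small enough to check by hand.
    sample = ["big data is big", "spark extends mapreduce"]
    print(word_count(sample))
```

In PySpark the same pattern is commonly written with the RDD operations `flatMap`, `map`, and `reduceByKey`, with Spark distributing the work across a cluster instead of running it in a single process as this sketch does.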

Instructor: Jose Nandez, SHARCNET, Western University.

Prerequisites: Basic Python and Unix.

Course materials: Course materials can be downloaded here.