Course description

The emphasis of this course is on mastering the most important big data technology: Spark 2 and its various application programming interfaces (APIs). Spark is an evolution of Hadoop and MapReduce that offers massive improvements in speed and scalability. The explosion of social media and the computerization of every aspect of social and economic activity result in the creation of large volumes of semi-structured data: web logs, videos, speech recordings, photographs, e-mails, Tweets, and similar data. In a parallel development, computers keep getting more powerful and storage keeps getting cheaper. Today, with Spark 2, we can reliably and cheaply store huge volumes of data, analyze it efficiently, and extract business and socially relevant information. In this course, students learn to use the Spark Core, Spark SQL, and Spark Streaming APIs. They learn how to organize data in massive Delta Lakes (data lakes) and build large-scale data pipelines, using SQL and Spark in batch mode or in flight. Students learn how to analyze highly connected data using GraphX, Spark's in-memory graph processing library. Students acquire practical skills in scalable messaging systems like Kafka and learn to integrate Spark with NoSQL systems. Students conduct exercises in Amazon Web Services (AWS) and/or Google Cloud Platform (GCP), and master the most important AWS and GCP services.
