Streaming Big Data with Spark Streaming, Scala, and Spark 3!


Highlights

  • On-Demand course
  • 6 hours 21 minutes
  • All levels

Description

In this course, we will process massive streams of real-time data using Spark Streaming and create Spark applications using the Scala programming language (v2.12). We will also get hands-on with some real live Twitter data, simulated streams of Apache access logs, and even data used to train machine learning models.

In this course, you will learn the basics of the Scala programming language and how Apache Spark operates on a cluster. You will also learn how to:

  • Set up discretized streams with Spark Streaming and transform them as data is received
  • Analyze streaming data over sliding windows of time
  • Maintain stateful information across streams of data
  • Connect Spark Streaming with highly scalable sources of data, including Kafka, Flume, and Kinesis
  • Dump streams of data in real time to NoSQL databases such as Cassandra
  • Run SQL queries on streamed data in real time
  • Train machine learning models in real time with streaming data, and use them to make predictions that keep getting better over time
  • Package, deploy, and run self-contained Spark Streaming code on a real Hadoop cluster using Amazon Elastic MapReduce

This course is very hands-on, filled with achievable activities and exercises to reinforce your learning. By the end of this course, you will be confidently creating Spark Streaming scripts in Scala and be prepared to tackle massive streams of data in a whole new way. You will be surprised at how easy Spark Streaming makes it! All the code and supporting files for this course are available at https://github.com/packtpublishing/streaming-big-data-with-spark-streaming-scala-and-spark-3-

What You Will Learn

Process large amounts of real-time data using the Spark Streaming module
Create efficient Spark applications using the Scala programming language
Integrate Spark Streaming with various data sources
Integrate Spark Streaming with Spark SQL to query your data in real time
Train machine learning models with streaming data, and use them for real-time predictions
Maintain stateful data across a continuous stream of input data

Audience

If you are a student who wants to learn how to use Apache Spark or a big data professional who wants to process large amounts of data on a real-time basis, this course is for you. Some basic programming and scripting experience is required to get the most out of the course.

Approach

This course is very hands-on, filled with achievable activities and exercises to reinforce your learning. We will work with real live Twitter data, simulated streams of Apache access logs, and even data used to train machine learning models! You will write and run real Spark Streaming jobs right at home on your own PC.

Key Features

  • Process streams of real-time data from various sources with Spark Streaming
  • Query your streaming data in real time using Spark SQL
  • A comprehensive tutorial with practical examples to help you develop real-time Spark applications

GitHub Repo

https://github.com/packtpublishing/streaming-big-data-with-spark-streaming-scala-and-spark-3-

About the Author

Frank Kane

Frank Kane has spent nine years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers all the time. He holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology and teaches others about big data analysis.

Course Outline

1. Getting Started

In this first section, we will introduce the course and set up the development environment for Spark and Scala. We will then create our first Spark Streaming application to display live Twitter feeds.

1. Introduction, and Getting Set Up

A brief introduction to the course, and then we will get your development environment for Spark and Scala all set up on your desktop. A quick test application will confirm Spark is working on your system!

2. [Activity] Stream Live Tweets with Spark Streaming!

Get set up with a Twitter developer account and run your first Spark Streaming application to listen to and print out live Tweets as they happen!
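
As a taste of what this first application looks like, here is a minimal sketch, assuming the spark-streaming-twitter connector (now maintained under Apache Bahir) is on the classpath and your Twitter credentials are set as twitter4j system properties:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    object PrintTweets {
      def main(args: Array[String]): Unit = {
        // Credentials are read from twitter4j.oauth.* system properties.
        val conf = new SparkConf().setMaster("local[*]").setAppName("PrintTweets")
        val ssc = new StreamingContext(conf, Seconds(1))

        // Open a DStream of live statuses and print the text of each one.
        val tweets = TwitterUtils.createStream(ssc, None)
        tweets.map(_.getText).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }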


2. A Crash Course in Scala

In this section, we will take a deep dive into Scala. Even if you've never used Scala before, we will cover all of the syntax and structure that you will need to know in order to begin mining your data. We will run through flow control, functions, and data structures, with numerous examples throughout so that you can see everything in action and make sure you understand it.

1. [Activity] Scala Basics

We start our crash course in the Scala programming language by covering some basics of the language: types and variables, printing, and Boolean comparisons.
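
For orientation, a quick illustrative sketch of the kinds of constructs this lecture covers:

    val greeting: String = "Hello, Scala!"   // immutable value (preferred)
    var counter: Int = 0                     // mutable variable
    counter += 1
    println(s"$greeting count=$counter")     // printing with string interpolation
    println(counter >= 1)                    // Boolean comparison: prints true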

2. [Exercise] Flow Control in Scala

Our Scala crash course continues, illustrating various means of flow control in Scala: for loops, while loops, do/while loops, and so on.
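
To give a flavor of the syntax (an illustrative sketch, not the course's exact exercise):

    for (i <- 1 to 3) println(s"for: $i")   // range-based for loop

    var x = 10
    while (x >= 8) {                        // while loop
      println(s"while: $x")
      x -= 1
    }

    do {                                    // do/while: body runs at least once
      println(s"do/while: $x")
    } while (x > 100)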

3. [Exercise] Functions in Scala

Scala is a functional programming language, and so understanding how functions work and are treated in Scala is hugely important! This lecture covers the fundamentals, and lets you put it into practice.
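
A small sketch of the ideas in play (the names here are illustrative):

    def squareIt(x: Int): Int = x * x                    // a named function
    val cubeIt = (x: Int) => x * x * x                   // a function literal (lambda)

    // Functions are first-class values, so they can be passed to other functions:
    def transformInt(x: Int, f: Int => Int): Int = f(x)
    println(transformInt(3, squareIt))   // 9
    println(transformInt(3, cubeIt))     // 27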

4. [Exercise] Data Structures in Scala

We wrap up our Scala crash course with data structures commonly used in Spark with Scala: tuples, lists, and maps.
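
For example (an illustrative sketch):

    val ship = ("Enterprise", 1701, true)                  // tuple: fixed size, mixed types
    println(ship._1)                                       // fields accessed by position

    val ships = List("Enterprise", "Defiant", "Voyager")   // immutable list
    println(ships.map(_.length))                           // List(10, 7, 7)

    val captains = Map("Enterprise" -> "Kirk", "Voyager" -> "Janeway")
    println(captains.getOrElse("Defiant", "Unknown"))      // safe lookup with a default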


3. Spark Streaming Concepts

In this section, we will introduce the core concepts of Spark and Spark Streaming. We will define and work with RDDs (Resilient Distributed Datasets) and show you examples of what they can do. RDDs are the key to understanding Spark, and therefore also Spark Streaming. This is also where we will dive deeper into Spark Streaming itself, understanding its inner workings and how the concept of windowing will enable you to aggregate and analyze data over time. Finally, we will take a look at reliability and fault tolerance with regard to Spark Streaming and a few ways in which you can maximize reliability.

1. Introduction to Spark

Before you can learn about Spark Streaming, you need to understand how Spark itself works at a high level! This video covers the why and how of Apache Spark, of which Spark Streaming is a component.

2. The Resilient Distributed Dataset (RDD)

The fundamental object of Spark programming is the Resilient Distributed Dataset (RDD), and this is used not just in Spark but also within Spark Streaming scripts. This lecture explains what they are and what you can do with them.

3. [Activity] RDDs in Action: Simple Word Count Application

Let's walk through and actually run a simple Spark script that counts the number of occurrences of each word in a book.
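
The core of such a script is only a few lines. Here is a minimal sketch (the input path "book.txt" is a placeholder for any plain-text file):

    import org.apache.spark.SparkContext

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[*]", "WordCount")
        // Split each line into words, then count occurrences of each word.
        val words = sc.textFile("book.txt").flatMap(_.split("\\W+"))
        val counts = words.map(w => (w.toLowerCase, 1)).reduceByKey(_ + _)
        counts.collect().sortBy(-_._2).take(10).foreach(println)
        sc.stop()
      }
    }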

4. Introduction to Spark Streaming

We finally have all the prerequisite knowledge to start talking about Spark Streaming itself in more detail! We will cover how it works, what it's for, and its architecture.

5. [Activity] Revisiting the PrintTweets application

Now that we know more, let's go revisit that first Spark Streaming application we ran in lecture two, and dive into how it really works.

6. Windowing: Aggregating data over longer time spans

Windowing allows you to analyze streaming data over a sliding window of time, which lets you do much more than just transform streaming data and store it someplace else. We will cover the concepts of the batch, window, and slide intervals, and how they work together to let you aggregate streaming data over some period of time.
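
In code, the three intervals come together in a call like reduceByKeyAndWindow. A sketch, assuming an upstream DStream of (hashtag, 1) pairs named hashtagKeyValues and a StreamingContext with a one-second batch interval:

    // The inverse-reduce form requires ssc.checkpoint("checkpointDir") to be set.
    val hashtagCounts = hashtagKeyValues.reduceByKeyAndWindow(
      (a: Int, b: Int) => a + b,   // fold in values entering the window
      (a: Int, b: Int) => a - b,   // subtract values sliding out of it
      Seconds(300),                // window interval: aggregate the last 5 minutes
      Seconds(1))                  // slide interval: recompute every second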

7. Fault Tolerance in Spark Streaming

How can Spark Streaming do so much work continuously in a reliable manner? We will uncover some of its tricks for reliability, as well as tips for configuring Spark Streaming to be as reliable as possible.


4. Spark Streaming Examples with Twitter

Now that we have seen the theory behind Spark Streaming, this section is dedicated to putting all of that into action. We will look at three practical examples of taking Twitter data and applying our new learning to it, with file output, windowing, and live tracking and analysis.

1. [Exercise] Saving Tweets to Disk

We will build on our "print tweets" example to actually store the incoming tweets to disk and illustrate how Spark Streaming can handle file output.

2. [Exercise] Tracking the Average Tweet Length

Compute the average length of a tweet, using windowing in Spark Streaming.

3. [Exercise] Tracking the Most Popular Hashtags

This is a fun one! We will track the most popular hashtags on Twitter over time and watch how they change in real time!


5. Spark Streaming Examples with Clickstream / Apache Access Log Data

This section will continue to look at some of the more advanced aspects of Spark Streaming, this time using Apache access logs to monitor website traffic in real time, showing the most accessed pages and flagging log errors as they occur. We will also use Spark SQL to query our data live. Finally, we will look at Spark 2's new structured streaming approach, and how we can use DataSets to streamline our analysis.

1. [Exercise] Tracking the Top URLs Requested

We will simulate an incoming stream of Apache access logs and use Spark Streaming to keep track of the most-requested web pages in real time!

2. [Exercise] Alarming on Log Errors

This example will listen to an Apache access log stream and raise an alarm in real time if the server returns too many errors.

3. [Exercise] Integrating Spark Streaming with Spark SQL

We will integrate Spark Streaming with Spark SQL, allowing us to run SQL queries on data as it is streamed in! Again, we will use Apache logs as an example.
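
The usual pattern is to convert each micro-batch RDD to a DataFrame and register it as a temporary view. A sketch, assuming a hypothetical case class Record produced by an upstream log-parsing step:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.dstream.DStream

    case class Record(url: String, status: Int)   // hypothetical parsed log record

    def queryEachBatch(records: DStream[Record]): Unit =
      records.foreachRDD { rdd =>
        val spark = SparkSession.builder()
          .config(rdd.sparkContext.getConf).getOrCreate()
        import spark.implicits._
        rdd.toDF().createOrReplaceTempView("requests")
        // Run ordinary SQL against the data received in this batch.
        spark.sql("SELECT status, COUNT(*) AS hits FROM requests GROUP BY status").show()
      }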

4. Introduction to Structured Streaming

Spark 2.0 introduced experimental support for Structured Streaming, a new DataSet-based API for Spark Streaming that is bound to become increasingly important. Learn how it works.

5. [Activity] Analyzing Apache Log files with Structured Streaming

As an example, we will stream Apache access logs in from a directory and use Structured Streaming to count up status codes over a one-hour moving window.
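
A condensed sketch of the approach, assuming logs are dropped as text files into a logs/ directory and that the three-digit status code can be pulled out with a regular expression (the pattern and paths are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder()
      .appName("StructuredLogs").master("local[*]").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream.text("logs/")   // watch the directory for new files
    val statuses = lines.select(
      regexp_extract($"value", """\s(\d{3})\s""", 1).as("status"),
      current_timestamp().as("ts"))              // processing-time timestamp

    // Count status codes over a one-hour window that slides every minute.
    val counts = statuses.groupBy(window($"ts", "1 hour", "1 minute"), $"status").count()
    counts.writeStream.outputMode("complete").format("console").start().awaitTermination()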


6. Integrating with Other Systems

In this section, we will examine how Spark Streaming can be integrated with other popular data science systems. We will see how it can work with Apache Kafka to listen to and process messages in real time; how Apache Flume can work with Spark Streaming to process huge amounts of log data; how Amazon Kinesis can be used with Spark Streaming to process live data on an AWS/EC2 cluster; and how Cassandra can store massive datasets and serve them up at speed. We will also cover how to write your own Receiver class in order to connect Spark Streaming to other proprietary systems.

1. Integrating with Apache Kafka

Apache Kafka is a popular and robust technology for publishing messages across a cluster on a large scale. We will show how to get Spark Streaming to listen to Kafka topics and process them in real time.
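
With the spark-streaming-kafka-0-10 connector, subscribing looks roughly like this (the broker address, group id, and topic name are placeholders; ssc is an existing StreamingContext):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-streaming-demo",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("testLogs"), kafkaParams))
    val lines = stream.map(_.value)   // each record's value is one message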

2. Integrating with Apache Flume

Flume is a popular technology for publishing log information at large scale, especially on a Hadoop cluster. We will illustrate how to set up both push-based and pull-based Flume configurations with Spark Streaming and discuss the tradeoffs of each.

3. Integrating with Amazon Kinesis

Amazon's Kinesis Streaming service is basically Kafka on AWS. If you're working with an AWS/EC2 cluster, you will want to know how to integrate Spark Streaming with Kinesis - and that's what this lecture covers.

4. [Activity] Writing Custom Data Receivers

What if you need to integrate Spark Streaming with some proprietary system that does not have an existing connection library? Well, you can always write your own Receiver class. This example shows you how and actually lets you build and run one.
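
The skeleton of a custom receiver is small: extend Receiver, start a thread in onStart, and call store() for each incoming record. A toy sketch (the once-per-second counter source is invented for illustration):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class CounterReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      override def onStart(): Unit = {
        new Thread("CounterReceiver") {
          override def run(): Unit = {
            var i = 0
            while (!isStopped()) {   // loop until Spark asks us to stop
              store(s"event-$i")     // hand each record to Spark Streaming
              i += 1
              Thread.sleep(1000)
            }
          }
        }.start()
      }
      override def onStop(): Unit = {}   // the thread exits once isStopped() is true
    }

    // Usage: val events = ssc.receiverStream(new CounterReceiver)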

5. Integrating with Cassandra

Cassandra is a popular "NoSQL" database that can give real-time applications fast access to massive data sets. Dumping data transformed by Spark Streaming into a Cassandra database can expose that data to your larger real-time services. We will show you how, and actually run a simple example.
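
With the DataStax spark-cassandra-connector, writing each batch out is a one-liner per RDD. A sketch, assuming spark.cassandra.connection.host is set on the SparkConf and the keyspace and table below already exist (the names are illustrative):

    import com.datastax.spark.connector._
    import org.apache.spark.streaming.dstream.DStream

    case class LogEntry(ip: String, url: String, status: Int)   // hypothetical schema

    def saveStream(entries: DStream[LogEntry]): Unit =
      entries.foreachRDD { rdd =>
        rdd.saveToCassandra("logdata", "accesslog", SomeColumns("ip", "url", "status"))
      }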


7. Advanced Spark Streaming Examples

In this section, we will run through some more advanced real-world examples of Spark Streaming. We will examine tracking stateful information, machine learning with K-Means clustering, and feeding data into linear regression models.

1. [Exercise] Stateful Information in Spark Streams

Spark has the ability to track arbitrary state across streams of data as they come in, such as web sessions, running totals, and so on. This example shows you how it all works, and challenges you to track your own state using our example as a baseline.
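
The classic mechanism is updateStateByKey, which folds each batch's new values into per-key state. A sketch that keeps a running request total per IP address (the upstream DStream of (ip, 1) pairs, ipKeyValues, is assumed):

    // Requires ssc.checkpoint("checkpointDir") so state survives failures.
    def updateTotal(newHits: Seq[Int], runningTotal: Option[Int]): Option[Int] =
      Some(runningTotal.getOrElse(0) + newHits.sum)

    val totalsByIp = ipKeyValues.updateStateByKey(updateTotal)
    totalsByIp.print()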

2. [Activity] Streaming K-Means Clustering

Spark Streaming integrates with some of Spark's MLlib (Machine Learning Library) capabilities. This lecture builds a real-time K-Means clustering example: unsupervised machine learning that continually gets better as more training data feeds into it.
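
The MLlib API for this is StreamingKMeans. A sketch, assuming trainingData and testData are DStreams of MLlib Vectors parsed from incoming text:

    import org.apache.spark.mllib.clustering.StreamingKMeans

    val model = new StreamingKMeans()
      .setK(5)                     // number of clusters to maintain
      .setDecayFactor(1.0)         // weight all past data equally
      .setRandomCenters(2, 0.0)    // 2-dimensional data, random initial centers

    model.trainOn(trainingData)         // refine cluster centers with each batch
    model.predictOn(testData).print()   // assign incoming points to clusters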

3. [Activity] Streaming Linear Regression

Spark Streaming can also feed data in real time to linear regression models that get better over time as more data is fed into them. This example shows linear regression in action with Spark Streaming.
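
MLlib's streaming regressor is StreamingLinearRegressionWithSGD. A sketch, assuming trainingData and testData are DStreams of LabeledPoints parsed upstream:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

    val model = new StreamingLinearRegressionWithSGD()
      .setInitialWeights(Vectors.zeros(1))   // a single feature, weights start at zero

    model.trainOn(trainingData)   // refine the model with each incoming batch
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()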


8. Spark Streaming in Production

This section covers how to move your applications into a production environment, using the spark-submit command to process scripts from the command line. We will also look at packaging scripts and dependencies with SBT and running code on a real cluster with Amazon's Elastic MapReduce (EMR). Finally, we will troubleshoot some code. We will see what can go wrong, and how you can identify and debug issues and tune your Spark applications for peak performance.

1. [Activity] Packaging and Running Spark Code in Production

Your production applications won't be run from within the Scala IDE; you will need to run them from a command line, and potentially on a cluster. The spark-submit command is used for this. We will show you how to package up your application and run it using spark-submit from a command prompt.
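
A typical invocation looks like this (the class and JAR names are placeholders for your own build artifacts):

    spark-submit --class com.example.WordCount --master local[*] WordCount.jar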

2. [Activity] Packaging Your Code with SBT

If your Spark Streaming application has external library dependencies that might not be already present on every machine in your cluster, the SBT tool can manage those dependencies for you, and package them into the JAR file you run with spark-submit. We will show you how it works with a real example.
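
A minimal build.sbt for this workflow might look like the following sketch (versions are illustrative; marking Spark itself as "provided" keeps it out of the assembled JAR, since the cluster supplies it):

    name := "SparkStreamingExamples"
    version := "1.0"
    scalaVersion := "2.12.15"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "3.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming" % "3.0.0" % "provided",
      // Non-"provided" dependencies get bundled by the sbt-assembly plugin.
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.0.0")

Run sbt assembly and pass the resulting JAR to spark-submit.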

3. Running on a Real Hadoop Cluster with EMR

We will run our simple word count example on a real cluster, using Amazon's Elastic MapReduce service! This shows you what's involved in running a Spark Streaming job on a real cluster as opposed to your desktop: there are a few spark-submit parameters you need to worry about, and getting your scripts and data into the right place is also something you need to deal with.

4. Troubleshooting and Tuning Spark Jobs

Spark jobs rarely run perfectly on the first try, if at all; some tuning and debugging is usually required, and arriving at the right scale for your cluster is also necessary. We will cover some performance tips, and how to troubleshoot what's going on with a Spark Streaming job running on a cluster.


9. You Made It!

This final section runs through some recommendations on where to learn more about Spark Streaming.

1. Learning More

Want to learn more about Spark Streaming? Here are a few books and other resources I've found valuable.

About The Provider

Packt
Founded in 2004 in Birmingham, UK, Packt’s mission is to help the world put software to work in new ways, through the delivery of effective learning and information services to IT professionals.
