Description

This course will show you why Hadoop is one of the best tools to work with big data. With the help of some real-world data sets, you will learn how to use Hadoop and its distributed technologies, such as Spark, Flink, Pig, and Flume, to store, analyze, and scale big data.

Understanding Hadoop is a highly valuable skill for anyone working at a company that handles large amounts of data. Companies such as Amazon, eBay, Facebook, Google, LinkedIn, IBM, Spotify, Twitter, and Yahoo use Hadoop in some way to process huge chunks of data. This video course will familiarize you with Hadoop's ecosystem and help you understand how to apply Hadoop skills in the real world.

The course starts by taking you through the installation of Hadoop on your desktop. Next, you will manage big data on a cluster with the Hadoop Distributed File System (HDFS) and MapReduce, and use Pig and Spark to analyze data on Hadoop. Moving along, you will learn how to store and query your data using technologies such as Sqoop, Hive, MySQL, Phoenix, and MongoDB. You will then design real-world systems using the Hadoop ecosystem and learn how to manage clusters with Yet Another Resource Negotiator (YARN), Mesos, ZooKeeper, Oozie, Zeppelin, and Hue. Towards the end, you will uncover techniques to handle and stream data in real time using Kafka, Flume, Spark Streaming, Flink, and Storm.

By the end of this course, you will be well-versed in the Hadoop ecosystem and will have developed the skills required to store, analyze, and scale big data using Hadoop.

All the code and supporting files for this course are available at https://github.com/packtpublishing/the-ultimate-hands-on-hadoop

What You Will Learn

Become familiar with Hortonworks and the Ambari User Interface (UI)
Use Pig and Spark to create scripts to process data on a Hadoop cluster
Analyze non-relational data using HBase, Cassandra, and MongoDB
Query data interactively with Drill, Phoenix, and Presto
Publish data to your Hadoop cluster using Kafka, Sqoop, and Flume
Consume streaming data using Spark Streaming, Flink, and Storm

Audience

This video course is designed for people at every level, whether you are a software engineer or programmer who wants to understand the Hadoop ecosystem, a project manager who wants to become familiar with Hadoop's lingo, or a system architect who wants to understand the components available in the Hadoop system. To get started with this course, a basic understanding of Python or Scala and ground-level knowledge of the Linux command line are recommended.

Approach

With the help of a real-world movie dataset, hands-on activities, and interesting exercises, this video course will help you to learn Hadoop and its associated distributed systems. You will find out how easy it is to analyze, store, and scale big data using Hadoop technology.

Key Features

Get to grips with the high-level architecture of Hadoop
Understand the components available in the Hadoop ecosystem, and how they fit together
Get ready to manage big data using Hadoop and related technologies

GitHub Repo

https://github.com/packtpublishing/the-ultimate-hands-on-hadoop

About the Author

Frank Kane

Frank Kane spent nine years at Amazon and IMDb, developing and managing the technology that automatically delivers product and movie recommendations to hundreds of millions of customers. He holds 17 issued patents in the fields of distributed computing, data mining, and machine learning. In 2012, Frank left to start his own successful company, Sundog Software, which focuses on virtual reality environment technology and on teaching others about big data analysis.

Course Outline

1. Learning All the Buzzwords and Installing the Hortonworks Data Platform Sandbox

1. Introduction and Installation of Hadoop

This video introduces Hadoop. You will learn how to install the Hortonworks Sandbox in a virtual machine on a PC, which is the quickest way to get up and running with Hadoop, so you can start learning and experimenting with it. You will also learn how to download some real movie ratings data and use Hive to analyze it.

2. The Hortonworks and Cloudera Merger and its Effects on the Course

This video explains why the course uses the Hortonworks Data Platform (HDP) and how the merger of Hortonworks and Cloudera, which now offers the Cloudera Data Platform (CDP), affects the course.

3. Hadoop Overview and History

This video presents an overview of Hadoop history to explain what it is used for and the type of problem it solves.

4. Overview of the Hadoop Ecosystem

This video highlights the concept of the Hadoop ecosystem. You will learn about all the technologies and how they all fit in together, along with the buzzwords used in this course.

2. Using Hadoop's Core: Hadoop Distributed File System (HDFS) and MapReduce

1. Hadoop Distributed File System (HDFS): What it is and How it Works

This video explains how the Hadoop Distributed File System (HDFS) stores massive data sets across a cluster of commodity computers in a reliable and scalable manner.
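
As a rough illustration of the idea (not material taken from the course), HDFS splits a large file into fixed-size blocks and keeps several copies of each block on different machines. The Python sketch below only models that bookkeeping. The function name `plan_blocks`, the 128 MB block size, the replication factor of 3, and the round-robin placement are all assumptions made for the example; real HDFS uses a rack-aware placement policy.

```python
# Hypothetical model of block splitting and replica placement.
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (128 MB)
REPLICATION = 3                 # assumed replication factor


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, replica_nodes) tuples."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    plan = []
    offset = 0
    index = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        # Round-robin placement: a simplification, not HDFS's real policy.
        replicas = [nodes[(index + r) % len(nodes)] for r in range(replication)]
        plan.append((index, length, replicas))
        offset += length
        index += 1
    return plan
```

Because every block lives on several nodes, losing one machine does not lose data, and adding machines adds both storage and read capacity.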

2. Installing the MovieLens Dataset

This video demonstrates how to import real movie ratings data into HDFS using the web-based UI provided by Ambari.

3. Activity - Installing the MovieLens Dataset into Hadoop's Distributed File System (HDFS) using the Command Line

This video demonstrates how to import data from the command prompt and interact with HDFS through the command line interface.

4. MapReduce: What it is and How it Works

This video explains how mappers and reducers offer a clever way to analyze massively distributed datasets quickly and reliably.

5. How MapReduce Distributes Processing

This video demonstrates how MapReduce scales horizontally across a cluster of computers, which is where its power comes from.

6. MapReduce Example: Breaking Down the Movie Ratings by Rating Score

This video demonstrates how to count each rating type that exists in the movie ratings data.
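
To make the counting idea concrete, here is a minimal sketch in plain Python that imitates the map, shuffle, and reduce phases in a single process. It is not the course's MRJob code. It assumes each input line holds tab-separated user ID, movie ID, rating, and timestamp fields, and the sample rows and function names are illustrative only.

```python
from collections import defaultdict

# Assumed input format: user_id <TAB> movie_id <TAB> rating <TAB> timestamp
SAMPLE_LINES = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t51\t2\t880606923",
    "166\t346\t1\t886397596",
]


def mapper(line):
    """Emit (rating, 1) for one input line."""
    user_id, movie_id, rating, timestamp = line.split("\t")
    yield rating, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Sum the counts collected for one rating value."""
    return key, sum(values)


def rating_histogram(lines):
    """Run the three phases in order and return {rating: count}."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())
```

On a real cluster the mapper calls run in parallel on different slices of the input, and the framework performs the shuffle before handing each key to a reducer.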

7. Activity - Installing Python, MRJob, and Nano

This video demonstrates how to get set up for MapReduce using Python's MRJob package, which lets you use MapReduce's streaming feature to write MapReduce code in Python instead of Java.

8. Activity - Coding Up and Running the Ratings Histogram MapReduce Job

This video demonstrates how to write the code that builds a breakdown of movie ratings and run it on your system.

9. Exercise - Ranking Movies by Their Popularity

This video demonstrates how to write a MapReduce script that sorts movies by how many ratings they have received.
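
One way to think about this exercise is as two chained steps: first count ratings per movie, then sort movies by that count. The plain-Python sketch below shows only that logic and is not an MRJob script or the author's solution. It assumes tab-separated lines whose second field is the movie ID, and the function names are hypothetical.

```python
from collections import Counter


def count_ratings(lines):
    """Step 1: count how many ratings each movie ID received."""
    counts = Counter()
    for line in lines:
        fields = line.split("\t")
        counts[fields[1]] += 1  # fields[1] is the assumed movie ID column
    return counts


def rank_by_popularity(counts):
    """Step 2: order (movie_id, count) pairs from most to fewest ratings.

    Ties are broken by movie ID so the output is deterministic.
    """
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

In MapReduce terms, the second step typically becomes a second job whose key is the count, so that the framework's sort does the ranking.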

10. Activity - Checking Results

This video demonstrates how to change the script to count movies instead of ratings, then review and run a solution that sorts movies by rating count.

3. Programming Hadoop with Pig

1. Introducing Ambari

This video introduces you to Ambari, Hortonworks' web-based UI, as an easy way to experiment with Pig.

2. Introducing Pig

This video presents an overview of what Pig is used for, who it is for, and how it works.

3. Example - Finding the Oldest Movie with a Five-Star Rating Using Pig

This video demonstrates how to use Pig to script a chain of queries on MovieLens to solve a more complex problem.

4. Activity - Finding the Old Five-Star Movies with Pig

This video demonstrates how to run a program on the Hadoop sandbox to find old, highly rated movies.

5. More Pig Latin

This video focuses on what else Pig Latin can do.

6. Exercise - Finding the Most-Rated One-Star Movie

This video demonstrates how to write a Pig script that finds the most popular "really bad" movie.

7. Pig Challenge - Comparing Results

This video demonstrates how to compare your results with the author's.

4. Programming Hadoop with Spark

1. Why Spark?

This video explains how efficiency and versatility make Apache Spark one of the hottest Hadoop-related technologies right now, and how Spark achieves them under the hood.

2. Resilient Distributed Datasets (RDDs)

This video presents an overview of Resilient Distributed Datasets (RDDs) and the functions available on them.

3. Activity - Finding the Movie with the Lowest Average Rating with Resilient Distributed Datasets (RDDs)

This video demonstrates how to write a Spark script to find the movie with the lowest average rating using RDDs.
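
A common pattern for this kind of problem is to map each rating to a (movie, (rating, 1)) pair, combine the pairs per movie into (sum, count), and then divide. The sketch below mimics that pattern in plain Python so it runs without Spark. `reduce_by_key` is a hypothetical stand-in written for this example, not a Spark call, and the tab-separated input format is an assumption.

```python
def reduce_by_key(pairs, combine):
    """Plain-Python stand-in for a reduce-by-key step (not Spark's API)."""
    grouped = {}
    for key, value in pairs:
        grouped[key] = combine(grouped[key], value) if key in grouped else value
    return grouped


def lowest_average(lines):
    """Return (movie_id, average_rating) for the lowest-rated movie."""
    # Map each line to (movie_id, (rating, 1)).
    pairs = []
    for line in lines:
        fields = line.split("\t")
        pairs.append((fields[1], (float(fields[2]), 1)))
    # Combine per movie into (rating_sum, rating_count).
    totals = reduce_by_key(pairs, lambda a, b: (a[0] + b[0], a[1] + b[1]))
    averages = {movie: total / count for movie, (total, count) in totals.items()}
    # Ties are broken by movie ID so the result is deterministic.
    return min(averages.items(), key=lambda item: (item[1], item[0]))
```

Carrying the count alongside the sum matters because averages cannot be combined directly across partitions, while sums and counts can.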

4. Datasets and Spark 2.0

This video explains how Datasets can make your Spark scripts even faster and easier to write.

5. Activity - Finding the Movie with the Lowest Average Rating with DataFrames

This video demonstrates how to write a Spark script to find the movie with the lowest average rating using DataFrames.

6. Activity - Recommending a Movie with Spark's Machine Learning Library (MLlib)

This video demonstrates how to use Spark's machine learning library to produce movie recommendations using the Alternating Least Squares (ALS) algorithm.

7. Exercise - Filtering the Lowest-Rated Movies by Number of Ratings

This video demonstrates how to filter the results of movies with a given number of ratings.

8. Activity - Checking Results

This video demonstrates how to review the solution and run the resulting scripts.

5. Using Relational Datastores with Hadoop

1. What is Hive?

This video introduces Apache Hive. You will learn how it enables relational queries on the HDFS-hosted data.

2. Activity - Using Hive to Find the Most Popular Movie

This video demonstrates how to import the MovieLens data set into Hive using the Ambari UI and run a simple query to find the most popular movies.

3. How Hive Works

This video explains how Hive works under the hood when you run SQL commands.

4. Exercise - Using Hive to Find the Movie with the Highest Average Rating

This video demonstrates how to use the same Hive database to find the best-rated movie.

5. Comparing Solutions

This video demonstrates how to compare your solution with the author's.

6. Integrating MySQL with Hadoop

This video presents an overview of MySQL and explains how it might fit into a Hadoop-based workflow.

7. Activity - Installing MySQL and Importing Movie Data

This video demonstrates how to import the MovieLens data set into MySQL and run a query.

8. Activity - Using Sqoop to Import Data from MySQL to HDFS/Hive

This video demonstrates how to use Sqoop to transfer data from an existing Relational Database Management System (RDBMS), such as MySQL, into Hadoop.

9. Activity - Using Sqoop to Export Data from Hadoop to MySQL

This video demonstrates how to build a new table with Hive and export it back into MySQL.

6. Using Non-Relational Data Stores with Hadoop

1. Why NoSQL?

This video provides an explanation of why NoSQL databases are important for efficiently and scalably vending data.

2. What is HBase?

This video presents an overview of HBase, a NoSQL columnar data store, covering what it is used for and how it works.

3. Activity - Importing Movie Ratings into HBase

This video demonstrates how to use the RESTful service interface to import movie ratings into HBase.

4. Activity - Using HBase with Pig to Import Data at Scale

This video demonstrates how to integrate HBase with Pig to store big data into HBase.

5. Cassandra - Overview

This video presents an overview of Cassandra, a popular NoSQL database that is appropriate for vending data at massive scale outside of Hadoop.

6. Activity - Installing Cassandra

This video demonstrates how to install Cassandra, which is not included with Hortonworks.

7. Activity - Writing Spark Output into Cassandra

This video demonstrates how to modify an HBase example to write results into a Cassandra database.

8. MongoDB - Overview

This video presents an overview of MongoDB, a popular alternative to Cassandra with different features.

9. Activity - Installing MongoDB and Integrating Spark with MongoDB

This video demonstrates how to install MongoDB on a virtual machine using Ambari, then study and run a script that loads a Spark DataFrame of user data and stores it in MongoDB.

10. Activity - Using the MongoDB Shell

This video demonstrates how to set up an index using MongoDB's command-line interface.

11. Choosing Database Technology

This video demonstrates how to make an informed decision by analyzing the various attributes of databases.

12. Exercise - Choosing a Database for a Given Problem

This video demonstrates how to compare your choice of database with the author's.

7. Querying Data Interactively

1. Overview of Drill

This video presents an overview of Drill and the problems it solves.

2. Activity - Setting Up Drill

This video demonstrates how to install Drill.

3. Activity - Querying Across Multiple Databases with Drill

This video demonstrates how to use Drill to execute a query that spans data on MongoDB and Hive at the same time.

4. Overview of Phoenix

This video presents an overview of Phoenix.

5. Activity - Installing Phoenix and Querying HBase

This video demonstrates how to work with Phoenix and use it to query an HBase database.

6. Activity - Integrating Phoenix with Pig

This video demonstrates how to use Phoenix with Pig to store and load the MovieLens user data and accelerate queries on it.

7. Overview of Presto

This video presents an overview of Presto and compares it with Drill and Phoenix.

8. Activity - Installing Presto and Querying Hive

This video demonstrates how to install Presto and issue queries on Hive through it.

9. Activity - Querying Both Cassandra and Hive Using Presto

This video demonstrates how to perform a JOIN query that spans data in both Cassandra and Hive.

8. Managing Your Cluster

1. Yet Another Resource Negotiator (YARN)

This video provides an in-depth explanation of the workings of YARN, which controls and allocates the resources of your Hadoop cluster.

2. Tez

This video explains how Tez uses Directed Acyclic Graphs (DAGs) to optimize tasks on a cluster.
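
To show why a DAG representation helps a scheduler, the sketch below groups a small task graph into "waves" of tasks whose inputs are all ready, so every task in a wave could run in parallel. It uses Python's standard `graphlib` module (Python 3.9+) and has nothing to do with Tez's actual API or planner. The task names and the `execution_waves` helper are invented for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def execution_waves(dependencies):
    """Group tasks into waves; every task in a wave has all inputs ready."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock dependents
    return waves


# Hypothetical query plan: each task maps to the tasks it depends on.
PLAN = {
    "scan_ratings": set(),
    "scan_titles": set(),
    "aggregate": {"scan_ratings"},
    "join": {"aggregate", "scan_titles"},
    "sort": {"join"},
}
```

Here the two scans have no dependencies on each other, so a scheduler that sees the whole graph can start them together instead of running the steps strictly one after another.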

3. Activity - Using Hive on Tez and Measuring the Performance Benefit

This video demonstrates how to run Hive on Tez and measure the performance benefit.

4. Mesos

This video introduces Mesos, an alternative cluster manager, and explains how it differs from Hadoop YARN.

5. ZooKeeper

This video focuses on the workings of ZooKeeper and the systems that depend on ZooKeeper for reliable operation.

6. Activity - Simulating a Failing Master with ZooKeeper

This video demonstrates how to use ZooKeeper's command line interface to explore its workings.

7. Oozie

This video introduces Oozie, which allows you to set up complex workflows on a cluster using multiple technologies and schedule them. You will also see a few examples of how it works.

8. Activity - Setting Up a Simple Oozie Workflow

This video demonstrates how to use Oozie to import movie data into the Hadoop Distributed File System (HDFS) from MySQL using Sqoop and analyze that data using Hive.

9. Zeppelin - Overview

This video presents an overview of Apache Zeppelin, which provides a notebook-based environment for importing, transforming, and analyzing data.

10. Activity - Using Zeppelin to Analyze Movie Ratings - Part 1

This is the first part of a two-part video that demonstrates how to set up a Zeppelin notebook to load movie ratings and titles into Spark DataFrames and interactively query and visualize them.

11. Activity - Using Zeppelin to Analyze Movie Ratings - Part 2

This is the second part of a two-part video that demonstrates how to set up a Zeppelin notebook to load movie ratings and titles into Spark DataFrames and interactively query and visualize them.

12. Hue - Overview

This video presents an overview of what Hue offers and how it differs from Ambari.

13. Other Technologies Worth Mentioning

This video focuses on Chukwa and Ganglia.

9. Feeding Data to Your Cluster

1. Kafka

This video provides an explanation of how Kafka provides a scalable and reliable means for collecting data across a cluster of computers and broadcasting it for further processing.

2. Activity - Setting Up Kafka and Publishing Data

This video demonstrates how to set up Kafka to publish and consume data from a new topic.

3. Activity - Publishing Web Logs with Kafka

This video demonstrates how to use a Kafka connector to monitor the Apache log files.

4. Flume

This video presents an overview of Flume's architecture and its sinks. You will also learn how Flume differs from Kafka.

5. Activity - Setting Up Flume and Publishing Logs

This video demonstrates how to connect a source listening to a telnet connection to a sink that logs the information received.

6. Activity - Setting Up Flume to Monitor a Directory and Store its Data in Hadoop Distributed File System (HDFS)

This video demonstrates how to configure Flume to monitor a directory on a local file system for new files, and publish their data into HDFS, organized by the time the data was received.

10. Analyzing Streams of Data

1. Spark Streaming: Introduction

This video provides an explanation of how Spark Streaming allows you to write "continuous applications" that process micro-batches of information in real time.
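
The micro-batch idea can be pictured as chopping a continuous stream into small, fixed-length time windows and processing each window as an ordinary batch. The plain-Python sketch below shows only that grouping step. It is not Spark Streaming code, and the `micro_batches` function and its (timestamp, payload) event shape are assumptions made for illustration.

```python
from collections import defaultdict


def micro_batches(events, batch_seconds):
    """Group (timestamp, payload) events into fixed-length batches.

    Returns a list of (batch_start, payloads) pairs ordered by batch_start.
    """
    batches = defaultdict(list)
    for timestamp, payload in events:
        # Round the timestamp down to the start of its batch interval.
        start = (timestamp // batch_seconds) * batch_seconds
        batches[start].append(payload)
    return sorted(batches.items())
```

A shorter interval lowers latency but leaves less work per batch to amortize scheduling overhead, which is the basic trade-off when choosing a batch interval.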

2. Activity - Analyzing Web Logs Published with Flume using Spark Streaming

This video demonstrates how to write and run a Spark Streaming application that analyzes web logs as they are streamed in from Flume.

3. Exercise - Monitor Flume-Published Logs for Errors in Real Time

This video demonstrates how to look for status codes in the web log and aggregate how often different status codes appear.
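
The core of this exercise, extracting the status code from each log line and counting occurrences, can be sketched in plain Python without any streaming framework. The example assumes Apache-style access log lines in which the quoted request is followed by a three-digit status code. The regular expression and the `status_counts` helper are illustrative and are not the author's solution.

```python
import re
from collections import Counter

# Assumed layout: ... "GET /path HTTP/1.0" 200 2326
# Captures the three digits that follow the first quoted field.
STATUS_PATTERN = re.compile(r'"[^"]*" (\d{3})\b')


def status_counts(lines):
    """Count how often each HTTP status code appears in the given lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_PATTERN.search(line)
        if match:  # silently skip lines that do not look like access logs
            counts[match.group(1)] += 1
    return counts
```

In a streaming job the same extraction would run on each incoming batch, with the per-code counts combined over a time window.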

4. Exercise Solution: Aggregating the Hypertext Transfer Protocol (HTTP) Access Codes with Spark Streaming

This video demonstrates how to review the author's solution to the previous exercise.

5. Apache Storm: Introduction

This video introduces Apache Storm, an alternative to Spark Streaming for processing streams of data.

6. Activity - Counting Words with Storm

This video demonstrates how to walk through and run the word count topology sample included with Storm.

7. Flink: Overview

This video presents an overview of Apache Flink and how it differs from Storm.

8. Activity - Counting Words with Flink

This video demonstrates how to install Flink and use it to count words.

11. Designing Real-World Systems

1. The Best of the Rest

This video focuses on other systems, such as Impala, NiFi, Falcon, Accumulo, Amazon Web Services (AWS), Kinesis, Redis, Ignite, Elasticsearch, and Slider.

2. Review: How the Pieces Fit Together

This video provides an explanation of how other systems are categorized and fit together.

3. Understanding Your Requirements

This video focuses on asking the right questions about performance, reliability, and complexity.

4. Sample Application: Consuming Web Server Logs and Keeping Track of Top-Sellers

This video presents an overview of the technologies needed. You will also learn how they will fit together for the sample application.

5. Sample Application: Serving Movie Recommendations to a Website

This video introduces you to the tools needed for the sample application.

6. Exercise - Designing a System to Report Web Sessions Per Day

This video demonstrates how to handle web analytics data for large websites.

7. Exercise Solution: Designing a System to Count Daily Sessions

This video demonstrates how to compare your system design with that of the author.

12. Learning More

1. Books and Online Resources

This video focuses on some of the author's favorite tools and websites for reference.

The Ultimate Hands-On Hadoop

By Packt
