4 to 8 Weeks Practical Hands-On Big Data Hadoop Developer Certification Training in Rochester, MN | Big Data Training | Hadoop Training | Big Data Analytics Training | Hortonworks, Cloudera, HDFS, MapReduce, YARN, Pig, Hive, Sqoop, Flume, Ambari Training

The first 16 hours of this course cover foundational Big Data technical essentials: the foundations of Hadoop, the Big Data technology stack, HDFS, Hive, Pig, Sqoop, how to set up a Hadoop cluster, how to store Big Data using Hadoop (HDFS), and how to process and analyze Big Data using MapReduce programming or other Hadoop ecosystem tools.

The next 16 hours of the course cover all the course topics in depth, with hands-on lab exercises, as described in the comprehensive course outline below.

Course Schedule for First 8 sessions (First 16 Hours)

Course Schedule for Next 8 sessions (Next 16 Hours)

  • Weekdays October 1 - October 24, 2019 US Pacific Time

  • The class sessions will be held on Tuesday and Thursday every week

  • 6:30-8:30 PM US Pacific time, each day.

View Detailed Weekly Training Schedule at the bottom of this event listing.

Course Objectives

  • Knowledge of Hadoop components such as MapReduce, Sqoop, HBase, Hive, Pig, HDFS, Flume, ZooKeeper, Oozie, etc.

  • Ability to work on Hadoop related Projects as an individual contributor or as part of a team.

  • Set up, install, and configure Hadoop in different environments - development, test, and support

  • Hadoop architecture and various operations performed on it

  • Familiarity with various Hadoop Solutions.


Desired but not required: exposure to or working proficiency with BI, SQL, scripting, handling and managing data and databases, Excel, the Java programming language, and basic UNIX commands.

Course Features

  • 4-8 weeks, 8-16 sessions, 16-32 hours of total LIVE Instruction

  • Training material, instructor handouts and access to useful resources on the cloud provided

  • Practical Hands on Lab exercises on cloud workstations provided

  • Actual code and scripts provided

  • Real-life Scenarios

Course Outline

  • This is a comprehensive course outline. It is also a guideline, indicative of the topics that may be covered during the class. The outline and the actual course content may be adjusted by the instructors based on the skills, experience and background of the students, as determined during introductions at the beginning of the first session.

  • We strive to teach and cover as many topics from this course outline as possible during this training. If enough students are interested in learning additional topics beyond the 32 hours of training, in an even more comprehensive and in-depth manner, we can hold additional sessions for an extra charge. One-on-one tutoring is also available, which may be slightly more expensive than group training.

Big Data Basics

  • An Introduction to Big Data

  • Why Big Data? Why Now?

  • The Three Dimensions of Big Data (Three Vs)

  • Evolution of Big Data

  • Big Data versus Traditional RDBMS Databases

  • Big Data versus Traditional BI and Analytics

  • Big Data versus Traditional Storage

  • Key Challenges in Big Data adoption

  • Benefits of adoption of Big Data

  • Introduction to Big Data Technology Stack

  • Apache Hadoop Framework

  • Introduction to Microsoft HDInsight – Microsoft’s Big Data Service

  • Hands-On Lab Exercises

The Big Data Technology Stack

  • Basics of Hadoop Distributed File System (HDFS)

  • Basics of Hadoop Distributed Processing (Map Reduce Jobs)

  • Hands-On Lab Exercises

Deep dive into Hadoop Distributed File System (HDFS) 

  • HDFS

  • Reading files with HDFS

  • Writing files with HDFS

  • Error Handling

  • Design and Concepts of HDFS

  • Blocks, Name nodes, Data nodes

  • HDFS High-Availability

  • HDFS Federation

  • HDFS Command-Line Interface

  • Basic File System Operations

  • Anatomy of File Read and Write

  • Block Placement Policy and Modes

  • Configuration files - Detailed explanation

  • Metadata

  • FS image

  • Edit log

  • Secondary Name Node

  • Safe Mode

  • How to add New Data Node dynamically

  • How to decommission Data Nodes dynamically without stopping cluster

  • FSCK Utility

  • How to override default configuration at Programming level and system level

  • ZOOKEEPER Leader Election Algorithm

  • Hands-On Lab Exercises
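
To make the block concepts above concrete, here is a plain-Python sketch (not actual HDFS code) of how a file is split into fixed-size blocks and how replicas might be spread across data nodes. The block size matches the common HDFS default; the node names and round-robin placement are simplifying assumptions - the real NameNode also considers racks, load, and locality.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the common HDFS default
REPLICATION = 3                 # default HDFS replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) boundaries a file is split into."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Round-robin replica placement sketch (the real policy is smarter)."""
    return {i: [data_nodes[(i + r) % len(data_nodes)]
                for r in range(replication)]
            for i in range(num_blocks)}

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))                              # 3 blocks: 128 + 128 + 44 MB
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```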

Processing Big Data - MapReduce and YARN

  • How MapReduce works

  • Handling Common Errors

  • Bottlenecks with MapReduce

  • How YARN (MapReduceV2) works

  • Difference between MR1 and MR2

  • Error Handling

  • Running a simple MapReduce application (word count)

  • Running a custom MapReduce application (census data)

  • Running MapReduce via PowerShell

  • Running a MapReduce application using PowerShell

  • Monitoring application status

  • Hands-On Lab Exercises
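
The word-count application listed above can be sketched in the Hadoop Streaming style: the mapper emits (word, 1) pairs and the reducer sums counts per word. This is a local simulation for illustration only - on a real cluster the mapper and reducer would run as stdin/stdout scripts submitted via the hadoop-streaming jar, and the shuffle/sort step simulated here would be done by the framework.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line
    for word in line.strip().split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Sum all counts seen for one word
    return (word, sum(counts))

def run_job(lines):
    # Map phase
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle and sort: group pairs by key, as the framework does
    pairs.sort(key=itemgetter(0))
    # Reduce phase
    return dict(reducer(k, (c for _, c in group))
                for k, group in groupby(pairs, key=itemgetter(0)))

print(run_job(["the quick brown fox", "the lazy dog", "The fox"]))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```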

Big Data Development Framework

  • Introduction to HIVE

  • Introduction to PIG

  • HBase

  • Loading the data into HIVE

  • Submitting Pig jobs using HDInsight

  • Submitting Pig jobs via PowerShell

  • Hands-On Lab Exercises

Big Data Integration and Management

  • Big Data Integration using Polybase

  • Big Data Management using Ambari

  • Fetching HDInsight data into SQL

  • Using Ambari for managing HDInsight cluster

  • Hands-On Lab Exercises

Map Reduce

  • Basics of Functional Programming

  • Map Reduce Basics

  • How Map Reduce Works

  • Anatomy of Map Reduce Job

  • Legacy Architecture: Job Submission, Job Initialization, Task Assignment, Task Execution, Progress

  • Status Updates

  • Job Completions and Failures

  • Shuffling, Sorting

  • Splits, Record reader, Partition, Types of partitions and Combiner

  • Optimization Techniques -> Speculative Execution, JVM Reuse

  • Schedulers, Counters

  • Comparisons between Old, New API at code and Architecture Level

  • Getting data from RDBMS into HDFS using Custom data types

  • Distributed Cache and Hadoop Streaming (Python, Ruby, and R)

  • Hands-On Lab Exercises
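
As a taste of the partitioning and combiner topics above, here is a hedged sketch: hash_partition mirrors the behavior of Hadoop's default HashPartitioner (non-negative hash of the key modulo the number of reduce tasks), and combine shows why a map-side combiner shrinks the data shuffled across the network. The keys and counts are made up for illustration.

```python
def hash_partition(key, num_reducers):
    # Same shape as Hadoop's HashPartitioner:
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    return (hash(key) & 0x7FFFFFFF) % num_reducers

def combine(pairs):
    """Map-side pre-aggregation: sum counts per key before the shuffle."""
    out = {}
    for key, value in pairs:
        out[key] = out.get(key, 0) + value
    return list(out.items())

map_output = [("apple", 1), ("apple", 1), ("pear", 1), ("apple", 1)]
combined = combine(map_output)   # fewer records leave the mapper
print(len(map_output), "->", len(combined), "records shuffled")
for key, value in combined:
    # All records with the same key go to the same reducer
    print(key, "-> reducer", hash_partition(key, 2))
```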


  • Sequential Files and Map Files

  • Enabling Compression Codec’s

  • Map side Join with distributed Cache

  • Types of I/O Formats: Multiple outputs, NLINE input format

  • Handling small files using Combine File Input Format

  • Hands-On Lab Exercises
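
The map-side join with Distributed Cache listed above can be illustrated with a small local sketch: the small table (which real Hadoop would ship to every mapper via the Distributed Cache) is held in memory, so each large-table record is joined inside the mapper with no shuffle at all. The table contents and field names here are hypothetical.

```python
# Small dimension table, as it would arrive via the Distributed Cache
city_by_user = {"u1": "Rochester", "u2": "Austin"}

def map_join(record, lookup):
    """Join one large-table record against the in-memory small table."""
    user, amount = record
    return (user, lookup.get(user, "UNKNOWN"), amount)

orders = [("u1", 30), ("u2", 15), ("u3", 99)]  # the "large" table
for rec in orders:
    print(map_join(rec, city_by_user))
```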

Map Reduce and Java Programming

  • Hands-on “Word Count” in Map Reduce in standalone and Pseudo distribution Mode

  • Sorting files using Hadoop Configuration API discussion

  • Emulating “grep” for searching inside a file in Hadoop

  • DBInput Format

  • Job Dependency API discussion

  • Input Format API discussion, Split API discussion

  • Custom Data type creation in Hadoop

  • Hands-On Lab Exercises

NoSQL Databases

  • ACID in RDBMS and BASE in NoSQL

  • CAP Theorem and Types of Consistency

  • Types of NoSQL Databases in detail

  • Columnar Databases in Detail (HBASE and CASSANDRA)

  • TTL, Bloom Filters and Compaction

  • Hands-On Lab Exercises
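
Here is a minimal, illustrative Bloom filter - the structure HBase uses to skip store files that cannot contain a given row key. It can return false positives but never false negatives; the bit-array size and hash construction below are assumptions chosen for readability, not production values.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, item):
        # Derive num_hashes bit positions deterministically from the item
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True means "maybe present"; False means "definitely absent"
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-001")
print(bf.might_contain("row-001"))  # True - never a false negative
print(bf.might_contain("row-999"))  # almost certainly False
```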

HBase

  • Concepts

  • Installation

  • Data Model of HBase and Comparison between RDBMS and NOSQL

  • Master and Regional Servers

  • DDL and DML HBase Operations

  • Architecture of HBase

  • HBase Catalog Tables

  • HBase Block Cache and sharding

  • HBase SPLITS

  • HBase DATA Modeling (Sequential, Salted, Promoted and Random Keys)

  • Java APIs and REST Interface

  • Client-Side Buffering and Process 1 million records using Client-side Buffering

  • HBase Counters

  • Enabling Replication and HBase RAW Scans

  • HBase Filters

  • Bulk Loading and Coprocessors (Endpoints and Observers with programs)

  • Hands-On Lab Exercises
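
The salted-key data modeling topic above can be sketched briefly: sequential row keys (such as timestamps) all land in one region and hotspot it, so a small deterministic salt prefix spreads writes across regions. The bucket count and salt function below are illustrative assumptions, not an HBase API.

```python
NUM_BUCKETS = 4  # roughly one salt bucket per region server (assumed)

def salted_key(key, num_buckets=NUM_BUCKETS):
    # Cheap, deterministic salt derived from the key itself, so reads
    # can reconstruct the same prefix when looking the row back up
    salt = sum(key.encode()) % num_buckets
    return f"{salt:02d}-{key}"

# Sequential keys now scatter across salt buckets instead of piling
# onto a single region
for key in ["20190101", "20190102", "20190103", "20190104"]:
    print(salted_key(key))
```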

Hive

  • Introduction to Hive

  • Hive Architecture

  • Hive Installation

  • Hive Services, Shell, Server, Web Interface (HWI)

  • Metastore, HiveQL

  • OLTP vs. OLAP

  • Working with Tables

  • Primitive data types

  • Complex data types

  • Working with Partitions

  • User-Defined Functions

  • Hive Bucketed Tables and Sampling

  • External partitioned tables

  • Map the data to the partition in the table

  • Write the output of one query to another table, Multiple inserts

  • Dynamic Partition

  • Differences between ORDER BY, DISTRIBUTE BY and SORT BY

  • Bucketing and Sorted Bucketing with Dynamic partition

  • RC File

  • Compression on Hive tables and Migrating Hive tables

  • Dynamic substitution in Hive and different ways of running Hive

  • How to enable Update in HIVE

  • Log Analysis on Hive

  • Access HBASE tables using Hive

  • Hands-on Lab Exercises
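
The bucketed-tables topic above boils down to one rule: a row's bucket is the hash of its bucketing column modulo the declared bucket count, so equal keys always land in the same bucket file - which is what makes bucket-map joins and efficient sampling possible. In this sketch Python's built-in hash merely stands in for Hive's internal hashing.

```python
def hive_bucket(value, num_buckets):
    # Same shape as Hive's bucket assignment:
    # non-negative hash of the bucketing column, modulo bucket count
    return (hash(value) & 0x7FFFFFFF) % num_buckets

rows = ["alice", "bob", "alice", "carol"]
for user in rows:
    # Equal keys ("alice") always map to the same bucket
    print(user, "-> bucket", hive_bucket(user, 4))
```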

Pig

  • Installation

  • Execution Types

  • Grunt Shell

  • Pig Latin

  • Data Processing

  • Schema on read

  • Primitive data types and complex data types

  • Tuple schema, BAG Schema, and MAP Schema

  • Loading and Storing

  • Filtering, Grouping, and Joining

  • Debugging commands (Illustrate and Explain)

  • Validations, Type casting in PIG

  • Working with Functions

  • User-Defined Functions

  • Types of JOINS in pig and Replicated Join in detail

  • SPLITS and Multiquery execution

  • Error Handling, FLATTEN and ORDER BY

  • Parameter Substitution

  • Nested For Each

  • User-Defined Functions, Dynamic Invokers, and Macros

  • How to access HBASE using PIG, Load and Write JSON DATA using PIG

  • Piggy Bank

  • Hands-on Lab Exercises

Sqoop

  • Installation

  • Import Data (full table, subset only, target directory, protecting the password, file formats other than CSV, compression, controlling parallelism, all-tables import)

  • Incremental Import (Import only New data, Last Imported data, storing Password in Metastore, Sharing Metastore between Sqoop Clients)

  • Free Form Query Import

  • Export data to RDBMS, HIVE, and HBASE

  • Hands-on Lab Exercises

HCatalog

  • Introduction

  • Installation

  • Using HCatalog with Pig, Hive, and MapReduce

  • Hands-on Lab Exercises

Flume

  • Introduction and Overview

  • Installation

  • Flume Agents: Sources, Channels, and Sinks

  • Log User information using Java program into HDFS using LOG4J and Avro Source, Tail Source

  • Log User information using Java program into HBASE using LOG4J and Avro Source, Tail Source

  • Flume Commands

  • Hands-on Lab Exercises

Different Hadoop Ecosystems

  • Hortonworks

  • Cloudera

Oozie and Other Ecosystem Tools

  • Oozie Workflow (Start, Action, End, Kill, Join, and Fork), Schedulers, Coordinators, and Bundles; how to schedule Sqoop, Hive, MapReduce, and Pig jobs

  • Real-world Use case which will find the top websites used by users of certain ages and will be scheduled to run for every one hour

  • Zoo Keeper

  • HBASE Integration with HIVE and PIG

  • Phoenix

  • Proof of concept (POC)

  • Hands-on Lab Exercises

Apache Spark

  • Spark Overview

  • Linking with Spark, Initializing Spark

  • Using the Shell

  • Resilient Distributed Datasets (RDDs)

  • Parallelized Collections

  • External Datasets

  • RDD Operations

  • Basics, Passing Functions to Spark

  • Working with Key-Value Pairs

  • Transformations

  • Actions

  • RDD Persistence

  • Which Storage Level to Choose?

  • Removing Data

  • Shared Variables

  • Broadcast Variables

  • Accumulators

  • Deploying to a Cluster

  • Unit Testing

  • Migrating from pre-1.0 Versions of Spark
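
The RDD topics above hinge on one idea: transformations (map, filter) are lazy and only describe the computation, while actions (collect, reduce) actually trigger it. This plain-Python mock imitates the shape of the PySpark RDD API to show that behavior locally - it is not Spark, and the class name is made up.

```python
from functools import reduce as _reduce

class FakeRDD:
    def __init__(self, data_fn):
        self._data_fn = data_fn          # deferred computation

    def map(self, f):
        # Transformation: returns a new RDD, computes nothing yet
        return FakeRDD(lambda: (f(x) for x in self._data_fn()))

    def filter(self, pred):
        # Transformation: also lazy
        return FakeRDD(lambda: (x for x in self._data_fn() if pred(x)))

    def collect(self):
        # Action: materializes the whole pipeline
        return list(self._data_fn())

    def reduce(self, f):
        # Action: folds the materialized data
        return _reduce(f, self._data_fn())

rdd = FakeRDD(lambda: iter(range(1, 6)))      # like sc.parallelize(range(1, 6))
squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(squares.collect())                       # [1, 9, 25]
print(squares.reduce(lambda a, b: a + b))      # 35
```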

Detailed Weekly Schedule for First 8 sessions (1st 16 Hours)

Detailed Weekly Schedule for Next 8 sessions (Additional 16 Hours)

Refund Policy

  • All Sales are Final. There are no Refunds.

  • If a student is not happy with the training experience, we strive to listen, take the feedback and implement honest and sincere measures to meet and exceed student expectations. 

  • If a class is rescheduled/cancelled by the organizer, registered students will be offered a credit towards any future course.