Apache Hadoop - Big Data
May 27, 2017

What is Hadoop?
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
How does Hadoop help us?
Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process the data, Hadoop ships packaged code to the nodes, which then work in parallel on the portions of the data they hold. This approach takes advantage of data locality (nodes manipulating the data they have local access to), allowing the dataset to be processed faster and more efficiently than in a conventional supercomputer architecture, which relies on a parallel file system where computation and data are distributed via high-speed networking.
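The block-splitting and placement idea above can be sketched in a few lines of plain Java. This is a conceptual illustration only: the class and method names are hypothetical, and simple round-robin placement stands in for HDFS's actual placement policy.

```java
import java.util.*;

// Conceptual sketch (not the real Hadoop API): split a file into fixed-size
// blocks and assign each block to a cluster node round-robin, the way HDFS
// spreads blocks so computation can later run where the data lives.
public class BlockPlacementSketch {
    static final int BLOCK_SIZE = 4; // real HDFS defaults to 128 MB; tiny here for demo

    // Returns a map of node name -> list of block indices stored on that node.
    public static Map<String, List<Integer>> placeBlocks(int fileSize, List<String> nodes) {
        int numBlocks = (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
        Map<String, List<Integer>> placement = new LinkedHashMap<>();
        for (String n : nodes) placement.put(n, new ArrayList<>());
        for (int b = 0; b < numBlocks; b++) {
            String node = nodes.get(b % nodes.size()); // round-robin for illustration
            placement.get(node).add(b);
        }
        return placement;
    }

    public static void main(String[] args) {
        // A 10-byte "file" with 4-byte blocks yields 3 blocks, one per node here.
        System.out.println(placeBlocks(10, Arrays.asList("node1", "node2", "node3")));
    }
}
```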
Future of Hadoop: where is it going?
Another area where Big Data technologies are headed is an architectural revamp of data security, allowing massive data sets to be collected, streamed, or analyzed in a relatively secure way. Many organizations that need to implement real-time analytics will undoubtedly require such advanced data-security layers and capabilities.
Our Course Structure:
1. Introduction to Big Data & Hadoop - 6 Hrs
• Introduction to Big Data
• Introduction to Hadoop
• Why Hadoop?
• History of Hadoop
• Components of Hadoop
• Brief overview of HDFS, MapReduce, PIG, Hive, SQOOP, HBASE, OOZIE, Flume, ZooKeeper and so on…
• Scope of Hadoop
2. HDFS (Storing the Data) - 12 Hrs
• Introduction of HDFS
• HDFS Design
• HDFS role in Hadoop
• Features of HDFS
• Daemons of Hadoop and their functionality: NameNode, Secondary NameNode, JobTracker, DataNode, TaskTracker
• Anatomy of a File Write
• Anatomy of a File Read
• Network Topology: Nodes, Racks, Data Centers
• Parallel Copying using DistCp
• Basic Configuration for HDFS
• Data Organization: Blocks, Replication
• Rack Awareness
• Heartbeat Signal
• How to Store the Data into HDFS
• How to Read the Data from HDFS
• Accessing HDFS (Introduction to Basic UNIX Commands)
• CLI commands
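Several of the topics above (Blocks, Replication, Rack Awareness) come together in HDFS's default replica-placement policy: first replica on the writer's node, second on a node in a different rack, third on another node in that second rack. The sketch below is a simplified, hypothetical model in plain Java, not the Hadoop source; all names are illustrative.

```java
import java.util.*;

// Conceptual sketch of HDFS's default rack-aware replica placement.
// The rule balances write cost against rack-failure tolerance: one local
// copy, plus two copies together on a different rack.
public class RackAwarenessSketch {
    // rackOf maps node -> rack id, e.g. "n1" -> "rack1"
    public static List<String> chooseReplicas(String writer, Map<String, String> rackOf) {
        List<String> replicas = new ArrayList<>();
        replicas.add(writer);                              // replica 1: the writer's own node
        String localRack = rackOf.get(writer);
        String remote = null;
        for (String n : rackOf.keySet())                   // replica 2: first node on a different rack
            if (!rackOf.get(n).equals(localRack)) { remote = n; break; }
        replicas.add(remote);                              // (null if the cluster has one rack)
        String remoteRack = rackOf.get(remote);
        for (String n : rackOf.keySet())                   // replica 3: another node on that second rack
            if (rackOf.get(n).equals(remoteRack) && !n.equals(remote)) { replicas.add(n); break; }
        return replicas;
    }

    public static void main(String[] args) {
        Map<String, String> racks = new LinkedHashMap<>();
        racks.put("n1", "rack1"); racks.put("n2", "rack1");
        racks.put("n3", "rack2"); racks.put("n4", "rack2");
        System.out.println(chooseReplicas("n1", racks)); // [n1, n3, n4]
    }
}
```

Losing rack1 entirely still leaves two replicas on rack2, which is the point of the policy.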
3. MapReduce using Java (Processing the Data) - 15 Hrs
• Introduction to MapReduce
• MapReduce Architecture
• Data Flow in MapReduce: Splits, Mapper, Partitioning, Sort and Shuffle, Combiner, Reducer
• Understanding the Difference Between a Block and an InputSplit
• Role of RecordReader
• Basic Configuration of MapReduce
• MapReduce Life Cycle: Driver Code, Mapper, Reducer
• How MapReduce Works
• Writing and Executing the Basic MapReduce Program using Java
• Submission & Initialization of MapReduce Job
• File Input/Output Formats in MapReduce Jobs: Text Input Format, Key-Value Input Format, Sequence File Input Format, NLine Input Format
• Joins: Map-side Joins, Reduce-side Joins
• Word Count Example
• Partition MapReduce Program
• Side Data Distribution: Distributed Cache (with Program)
• Counters (with Program): Types of Counters (Task Counters, Job Counters, User-Defined Counters), Propagation of Counters
• Job Scheduling
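The map → shuffle/sort → reduce data flow covered in this module can be illustrated with the classic word count in plain Java, without the Hadoop API. All names below are hypothetical; this is a sketch of the phases, not the real `Mapper`/`Reducer` classes.

```java
import java.util.*;

// Conceptual word count in plain Java (no Hadoop dependency) showing the
// three MapReduce phases: map emits (word, 1) pairs, the shuffle groups
// pairs by key in sorted order, and reduce sums each group.
public class WordCountSketch {
    public static SortedMap<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit (word, 1) for every word in every input line/split.
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    emitted.add(new AbstractMap.SimpleEntry<>(word, 1));

        // Shuffle & sort phase: group the emitted values by key, keys sorted.
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : emitted)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

        // Reduce phase: sum the grouped values for each word.
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            counts.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("hello hadoop", "hello world")));
        // {hadoop=1, hello=2, world=1}
    }
}
```

In real Hadoop the same three steps run distributed: each mapper processes one InputSplit, the framework performs the shuffle across the network, and reducers receive sorted key groups.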
4. PIG - 10 Hrs
• Introduction to Apache PIG
• Introduction to PIG Data Flow Engine
• MapReduce vs. PIG
• When should PIG be used?
• Data Types in PIG
• Basic PIG programming
• Modes of Execution in PIG: Local Mode, MapReduce Mode
• Execution Mechanisms: Grunt Shell, Script, Embedded
• Operators/Transformations in PIG
• PIG UDFs with Program
• Word Count Example in PIG
• MapReduce and PIG: Comparison
5. SQOOP - 6 Hrs
• Introduction to SQOOP
• Use of SQOOP
• Connecting to a MySQL Database
• SQOOP Commands: Import, Export, Eval, Codegen, etc…
• Joins in SQOOP
• Export to MySQL
• Export to HBase
6. HIVE - 8 Hrs
• Introduction to HIVE
• HIVE Meta Store
• HIVE Architecture
• Tables in HIVE: Managed Tables, External Tables
• Hive Data Types: Primitive Types, Complex Types
• Partition
• Joins in HIVE
• HIVE UDFs and UDAFs with Programs
• Word Count Example
7. HBASE - 12 Hrs
• Introduction to HBASE
• Basic Configurations of HBASE
• Fundamentals of HBase
• NoSQL
• HBase Data Model: Table and Row, Column Family and Column Qualifier, Cell and Its Versioning
• Categories of NoSQL Databases: Key-Value Database, Document Database, Column-Family Database
• HBASE Architecture: HMaster, Region Servers, Regions, MemStore, Store
• SQL vs. NoSQL
• HBASE and RDBMS: Comparison
• HDFS vs. HBase
• Client-side buffering or bulk uploads
• HBase Designing Tables
• HBase Operations: Get, Scan, Put, Delete
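The data-model topics above (row, column family and qualifier, cell versioning) can be modeled with nested sorted maps in plain Java. This sketch is not the HBase client API; it only illustrates how a Get returns the newest timestamped version of a cell, and all names are hypothetical.

```java
import java.util.*;

// Conceptual sketch of the HBase data model with cell versioning: each row
// key maps to columns ("family:qualifier"), and each cell keeps multiple
// timestamped versions; a Get returns the most recent version.
public class HBaseModelSketch {
    // row key -> column -> (timestamp -> value), newest timestamp first
    private final Map<String, Map<String, NavigableMap<Long, String>>> table = new TreeMap<>();

    public void put(String row, String column, long ts, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>(Comparator.reverseOrder()))
             .put(ts, value);
    }

    // Get returns the most recent version of a cell, or null if absent.
    public String get(String row, String column) {
        Map<String, NavigableMap<Long, String>> cols = table.get(row);
        if (cols == null || !cols.containsKey(column)) return null;
        return cols.get(column).firstEntry().getValue(); // newest timestamp wins
    }

    public static void main(String[] args) {
        HBaseModelSketch t = new HBaseModelSketch();
        t.put("user1", "info:name", 100L, "Alice");
        t.put("user1", "info:name", 200L, "Alicia"); // newer version of the same cell
        System.out.println(t.get("user1", "info:name")); // Alicia
    }
}
```

Sorted row keys are also why HBase Scans over a row-key range are cheap: rows are physically stored in key order.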
8. MongoDB - 6 Hrs
• What is MongoDB?
• Where to Use It?
• Configuration on Windows