- Big Data and Hadoop for Beginners
- Big Data Overview
- Hadoop Fundamentals
- Hadoop Distributed Components
- Hadoop Data Processing Blocks
- Name Node / Secondary Name Node
Big Data and Hadoop for Beginners
These notes are based on:
https://www.udemy.com/big-data-and-hadoop-for-beginners/learn/v4/overview
Big Data Overview
- Structured data (databases, Excel), semi-structured data (XML, JSON), unstructured data (logs)
- 5 Vs of big data: Volume (terabytes to zettabytes); Velocity (the speed at which data is generated or moves around); Variety (structured to unstructured); Veracity (data quality); Value (value extracted from the data)
Big data use cases
- mobile advertisement companies
- telcos, finance, retailers
Big data jobs
big data analyst; Hadoop administrator; big data engineer; big data scientist;
big data manager; big data solution architect; chief data officer
ETL vs ELT
- Traditional: ETL: Extract -> Transform -> Load -> data warehouse
- Hadoop: ELT: Extract -> Load -> Transform
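The contrast between the two orderings can be sketched on toy data (the names and rows below are hypothetical, for illustration only):

```python
raw_rows = ["alice,30", "bob,25", "carol,41"]

def transform(row):
    """Parse a raw CSV line into a cleaned (name, age) tuple."""
    name, age = row.split(",")
    return (name, int(age))

# ETL: transform first, then load only the cleaned rows into the warehouse.
warehouse = [transform(r) for r in raw_rows]

# ELT: load the raw rows untouched (e.g. into HDFS), transform later on demand.
data_lake = list(raw_rows)
transformed_later = [transform(r) for r in data_lake]

assert warehouse == transformed_later
```

The practical difference: in ELT the raw data is kept, so you can re-run or change the transformation later without re-extracting from the source.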
Major Commercial Distributors of Hadoop
- Amazon Elastic MapReduce (EMR)
- Cloudera
- Hortonworks
- MapR Technologies (supports network file system (NFS) access)
- Pivotal
- Teradata
Hadoop Fundamentals
- HDFS
- MapReduce
In Hadoop, data stays distributed across nodes; the network carries the processing code and the processed results rather than the raw data, which avoids the cost of moving large datasets.
A mapper locates and processes its data locally, then sends the result to a reducer.
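The mapper/reducer flow can be sketched as a single-process word count (this is just the MapReduce pattern in plain Python, not the Hadoop API):

```python
from collections import defaultdict

def mapper(line):
    # Each mapper processes its local chunk and emits (key, 1) pairs.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # The framework groups values by key before they reach the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Each reducer aggregates all values for one key.
    return (key, sum(values))

lines = ["big data big hadoop", "hadoop big"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
# counts == {"big": 3, "data": 1, "hadoop": 2}
```

In a real cluster, each `mapper` call runs on the node that already holds that chunk of data, and only the small (key, value) pairs travel over the network to the reducers.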
Hadoop Ecosystem
Hadoop Distributed Components
- Components are distributed (note which daemons run on the same node)
- The Task Tracker acts as a slave to the Job Tracker
- The Data Node daemon acts as a slave to the Name Node
- The Name Node stores metadata (which data lives where); it runs on the master node.
Hadoop Data Processing Blocks
- How Hadoop splits big data and spreads it across different nodes
- Unix file systems default to 4 KB data blocks; Hadoop's default data block is 64 MB.
Why does Hadoop define the data block unit as 64 MB? To cut the network overhead of requesting and locating data.
By default, Hadoop keeps 3 replicas of each data block.
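A quick back-of-envelope calculation shows what these defaults mean for a 1 GB file:

```python
import math

BLOCK_SIZE_MB = 64     # Hadoop 1.x default (Hadoop 2+ raised this to 128 MB)
REPLICATION = 3        # default replication factor

file_size_mb = 1024    # a 1 GB file
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)   # blocks to locate: 16
raw_storage_mb = file_size_mb * REPLICATION            # disk consumed: 3072 MB
```

With 4 KB blocks the same file would need over 260,000 block lookups, which is why Hadoop trades fine-grained blocks for far less metadata and network overhead.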
Name Node / Secondary Name Node
- Failover
- Runs in the background to merge the edit log into the FS image, a sync that the Name Node otherwise performs only at startup.