- Big Data and Hadoop for Beginners
- Big Data Overview
- Hadoop Fundamentals
- Hadoop Distributed Components
- Hadoop Data Processing Blocks
- Name Node / Secondary Name Node
Big Data and Hadoop for Beginners
These notes are based on:
https://www.udemy.com/big-data-and-hadoop-for-beginners/learn/v4/overview
Big Data Overview
- Structured data (databases, Excel), semi-structured data (XML, JSON), unstructured data (logs)
- 5 Vs of big data: Volume (terabytes to zettabytes); Velocity (the speed at which data is generated or moves around); Variety (structured to unstructured); Veracity (data quality); Value (value extracted from the data)
Big data use cases
- mobile advertisement companies
- telcos, finance, retailers
Big data jobs
big data analyst; Hadoop administrator; big data engineer; big data scientist;
big data manager; big data solution architect; chief data officer
ETL vs ELT
- Traditional: ETL: Extract -> Transform -> Load -> data warehouse
- Hadoop: ELT: Extract -> Load -> Transform
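The contrast between the two orderings can be sketched on toy data (the names and rows below are hypothetical, for illustration only):

```python
raw_rows = ["alice,30", "bob,25", "carol,41"]

def transform(row):
    """Parse a raw CSV line into a cleaned (name, age) tuple."""
    name, age = row.split(",")
    return (name, int(age))

# ETL: transform first, then load only the cleaned rows into the warehouse.
warehouse = [transform(r) for r in raw_rows]

# ELT: load the raw rows untouched (e.g. into HDFS), transform later on demand.
data_lake = list(raw_rows)
transformed_later = [transform(r) for r in data_lake]

assert warehouse == transformed_later
```

The practical difference: in ELT the raw data is kept, so you can re-run or change the transformation later without re-extracting from the source.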
Major Commercial Distributors of Hadoop
- Amazon Elastic MapReduce (EMR)
- Cloudera
- Hortonworks
- MapR Technologies (supports network file system (NFS) access)
- Pivotal
- Teradata
Hadoop Fundamentals
- HDFS
- MapReduce
In Hadoop, data stays distributed across nodes; the network carries the processing code and the processed results rather than the raw data, which avoids the cost of moving large datasets.
A mapper locates and processes its data locally, then sends the result to a reducer.
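The mapper/reducer flow can be sketched as a single-process word count (this is just the MapReduce pattern in plain Python, not the Hadoop API):

```python
from collections import defaultdict

def mapper(line):
    # Each mapper processes its local chunk and emits (key, 1) pairs.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # The framework groups values by key before they reach the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Each reducer aggregates all values for one key.
    return (key, sum(values))

lines = ["big data big hadoop", "hadoop big"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
# counts == {"big": 3, "data": 1, "hadoop": 2}
```

In a real cluster, each `mapper` call runs on the node that already holds that chunk of data, and only the small (key, value) pairs travel over the network to the reducers.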
Hadoop Ecosystem
Hadoop Distributed Components
- Components are distributed (note which daemons run on the same node)
- The Task Tracker acts as a slave to the Job Tracker
- The Data Node daemon acts as a slave to the Name Node
- The Name Node stores metadata (which data lives where); it runs on the master node.
Hadoop Data Processing Blocks
- How Hadoop splits big data and spreads it across different nodes
- Unix file systems default to 4 KB data blocks; Hadoop's default data block is 64 MB.
Why does Hadoop define the data block unit as 64 MB? To cut the network overhead of requesting and locating data.
By default, Hadoop keeps 3 replicas of each data block.
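A quick back-of-envelope calculation shows what these defaults mean for a 1 GB file:

```python
import math

BLOCK_SIZE_MB = 64     # Hadoop 1.x default (Hadoop 2+ raised this to 128 MB)
REPLICATION = 3        # default replication factor

file_size_mb = 1024    # a 1 GB file
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)   # blocks to locate: 16
raw_storage_mb = file_size_mb * REPLICATION            # disk consumed: 3072 MB
```

With 4 KB blocks the same file would need over 260,000 block lookups, which is why Hadoop trades fine-grained blocks for far less metadata and network overhead.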
Name Node / Secondary Name Node
- Failover
- Runs in the background to merge the edit log into the FS image, a sync that the Name Node otherwise performs only at startup.