Big Data And Hadoop

Big Data and Hadoop for beginners

These notes are based on:
https://www.udemy.com/big-data-and-hadoop-for-beginners/learn/v4/overview

Big Data Overview

  • Structured data (databases, Excel), semi-structured data (XML, JSON), unstructured data (logs)
  • 5 Vs of big data: Volume (terabytes to zettabytes); Velocity (the speed at which data is generated or moved around); Variety (structured to unstructured); Veracity (data quality); Value (the value extracted from the data)

Big data use cases

  • Mobile advertising companies
  • Telcos, finance, retailers

Big data jobs

Big data analyst; Hadoop administrator; big data engineer; big data scientist;
big data manager; big data solution architect; chief data officer

ETL vs ELT

  • Traditional: ETL: Extract -> Transform -> Load into a data warehouse
  • Hadoop: ELT: Extract -> Load -> Transform

Major Commercial Distributors of Hadoop

  • Amazon Elastic Map Reduce (EMR)
  • Cloudera
  • Hortonworks
  • MapR Technologies (supports NFS / network file access)
  • Pivotal
  • Teradata

Hadoop fundamentals

  • HDFS
  • MapReduce
    In Hadoop, data is distributed across the cluster and the network is used to ship the processing code and the processed results, which saves the cost of moving the data itself.
    A mapper locates its input data locally, processes it, and sends the result to a reducer (see the word-count sketch below).
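To make the mapper/reducer split concrete, below is a minimal word-count sketch against the standard Hadoop MapReduce Java API. This example is not from the course; input and output paths are passed as command-line arguments and the class names are illustrative.

```java
// Minimal WordCount sketch using the Hadoop MapReduce API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs where its input split lives and emits (word, 1) pairs locally.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives all counts for a given word and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation reduces network traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```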

Hadoop Ecosystem

(Image: Hadoop ecosystem diagram)

Hadoop distributed components

  • Components are distributed across nodes (note which daemons run on the same node)
  • The TaskTracker daemon acts as a slave to the JobTracker
  • The DataNode daemon acts as a slave to the NameNode
  • The NameNode stores metadata (what data exists and where it is stored) and runs on the master node

Hadoop data processing blocks

  • How Hadoop splits big data and spreads it across different nodes
  • A Unix file system uses a default block size of 4 KB; in Hadoop the default data block is 64 MB.
    Why does Hadoop define the data block as large as 64 MB? To reduce the network overhead of requesting and locating data.
    By default, Hadoop keeps 3 replicas of each data block (see the sketch after this list).
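As a rough illustration, the Hadoop FileSystem Java API can report the block size and replication factor of a stored file. The path below is hypothetical; any existing HDFS file would work.

```java
// Sketch: read the block size and replication factor of an HDFS file.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // "/data/sample.log" is a hypothetical path used for illustration.
    FileStatus status = fs.getFileStatus(new Path("/data/sample.log"));
    System.out.println("Block size (bytes): " + status.getBlockSize());
    System.out.println("Replication factor: " + status.getReplication());
  }
}
```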

Name Node / Secondary Name Node


    1. Failover
    2. Runs in the background to merge the edit log with the FSImage, a sync that the master node itself only performs at startup.