Introduction
Moving data to AWS --> Data Collection --> Data Aggregation --> Data Processing --> Cost and Performance Optimizations
1. Moving data to AWS
Means moving the bulk of existing data into AWS
- Moving Direction
- Local Storage <--> AWS S3
- Local HDFS --> S3
- using S3DistCp: an extension of DistCp with optimizations for AWS
- using DistCp
- Local Filesystem --> AWS S3
- open-source tools that support parallel transfers: JetS3t / GNU Parallel
- Aspera Direct-to-S3: a UDP-based file transfer protocol with optimizations for AWS
- Device based import/export
- AWS Direct Connect
- One-time direct connection: once the bulk data is transferred, stop the direct connection
- Ongoing direct connection: always connected
- S3 --> Local HDFS
- using S3DistCp or DistCp (the same tools as Local HDFS --> S3, run in the opposite direction)
- AWS S3 --> AWS EMR
- i.e., AWS S3 --> HDFS on the EMR cluster
- Local Storage <--> AWS S3
- with good optimization: several terabytes a day
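The parallel-transfer idea behind tools like JetS3t and GNU Parallel can be sketched as: split the file set across worker threads so several objects move at once. A minimal stdlib-only sketch follows; `upload_file` is a hypothetical stand-in (a real version would call boto3 or shell out to `aws s3 cp`), and the file names are placeholders.

```python
# Sketch: parallelize per-file uploads with a thread pool.
from concurrent.futures import ThreadPoolExecutor

def upload_file(path):
    # Hypothetical stub: a real implementation would push `path` to S3
    # (e.g. via boto3's upload_file) and return the transfer result.
    return (path, "uploaded")

def parallel_upload(paths, workers=8):
    # Each worker transfers one file at a time; pool.map keeps input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(upload_file, paths))

if __name__ == "__main__":
    for path, status in parallel_upload([f"part-{i:05d}" for i in range(4)]):
        print(path, status)
```

Throughput scales with the worker count up to the available bandwidth, which is why multi-threaded uploads matter for reaching the "several terabytes a day" range.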
2. Data Collection
Means streaming data to AWS
- Apache Flume: collected data can be sent to S3, HDFS, and more
- Fluentd: collected data can be sent to S3, SQS, MongoDB, Redis, and more
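As a sketch, a Fluentd output that ships collected events to S3 (via the `s3` output plugin) might look like the following; the bucket name, tag pattern, buffer path, and flush intervals are placeholder assumptions to adapt:

```
<match logs.**>
  @type s3
  s3_bucket my-log-bucket   # assumed bucket name
  s3_region us-east-1
  path logs/
  <buffer time>
    @type file
    path /var/log/fluent/s3
    timekey 3600            # roughly one S3 object per hour
    timekey_wait 10m
  </buffer>
</match>
```

Buffering before flushing is what lets a collector like Fluentd emit reasonably sized S3 objects instead of one tiny file per event, which leads directly into the aggregation step below.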
3. Data Aggregation
Means aggregating the collected data to a proper size before sending it to target storage (S3, HDFS, EMR).
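A minimal sketch of size-based aggregation: buffer small records and flush one larger batch once a size threshold is reached. The 64 MB default and the `sink` callback are assumptions; in practice the sink would write an object to S3 or a file to HDFS, and the threshold would be chosen near the storage block/part size.

```python
# Sketch: buffer records, flush a combined batch once it reaches flush_bytes.
class Aggregator:
    def __init__(self, flush_bytes=64 * 1024 * 1024, sink=None):
        self.flush_bytes = flush_bytes
        # `sink` stands in for an S3/HDFS writer; default discards batches.
        self.sink = sink or (lambda batch: None)
        self.buffer = []
        self.size = 0

    def add(self, record: bytes):
        self.buffer.append(record)
        self.size += len(record)
        if self.size >= self.flush_bytes:
            self.flush()

    def flush(self):
        # Emit everything buffered so far as one batch, then reset.
        if self.buffer:
            self.sink(b"".join(self.buffer))
            self.buffer, self.size = [], 0
```

Aggregating this way avoids the many-small-files problem: both S3 listing/request overhead and HDFS NameNode pressure grow with object count, so fewer, larger objects process faster downstream.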