Introduction
Moving data to AWS --> Data Collection --> Data Aggregation --> Data Processing --> Cost and Performance Optimizations
1. Moving data to AWS
Means moving the bulk of existing data to AWS
- Moving directions
  - Local Storage <--> AWS S3
  - Local HDFS --> S3
    - using S3DistCp: an extension of DistCp with optimizations for AWS
    - using DistCp
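As a sketch, a copy from the local cluster to S3 could look like the following commands, run on a node with the tools installed; the bucket name and paths are illustrative placeholders:

```shell
# Copy a directory from local HDFS to S3 with S3DistCp.
# Bucket name and paths are placeholders.
s3-dist-cp --src hdfs:///data/logs --dest s3://my-bucket/logs

# Plain DistCp works as well, here via the s3a connector:
hadoop distcp hdfs:///data/logs s3a://my-bucket/logs
```

S3DistCp can additionally merge many small files into larger ones during the copy, which suits S3 and later processing better than a large number of tiny objects.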
 
- Local Filesystem --> AWS S3
  - open-source tools that support multi-threading: JetS3t / GNU Parallel
  - Aspera Direct-to-S3: a file transfer protocol based on UDP, with optimizations for AWS
  - device-based import/export
  - AWS Direct Connect
    - one-time connection: once the bulk data is transferred, stop the direct connection
    - ongoing connection: always connected
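As an illustration of a multi-threaded upload from a local filesystem, GNU Parallel can drive several concurrent transfers; the AWS CLI and the bucket name here are assumptions for the sketch, not part of the original notes:

```shell
# Upload every file under ./export with 8 concurrent transfers.
# Assumes the AWS CLI is installed and configured; bucket is a placeholder.
find ./export -type f -print0 | parallel -0 -j 8 aws s3 cp {} s3://my-bucket/import/
```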
 
 
- S3 --> Local HDFS
  - using S3DistCp or DistCp
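The reverse copy, pulling a dataset from S3 into the local cluster, is a single DistCp invocation; bucket and paths are again placeholders:

```shell
# Pull a dataset from S3 into local HDFS with DistCp (s3a connector).
hadoop distcp s3a://my-bucket/logs hdfs:///data/logs
```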
 
 
- AWS S3 --> AWS EMR
- AWS S3 --> HDFS
- Local Storage <--> AWS S3: with good optimization, several terabytes a day
 
2. Data Collection
Means streaming data into AWS
- Apache Flume: collected data can be sent to S3, HDFS, and more
- Fluentd: collected data can be sent to S3, SQS, MongoDB, Redis, and more
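As a sketch of the Flume side, a minimal agent that tails a log file into HDFS could be configured like the properties file below; the agent name, command, and paths are illustrative choices, and the memory channel is only one option:

```properties
# Minimal Flume agent: exec source -> memory channel -> HDFS sink.
# Agent name, command, and paths are illustrative.
agent.sources = tail
agent.channels = mem
agent.sinks = hdfs-out

agent.sources.tail.type = exec
agent.sources.tail.command = tail -F /var/log/app.log
agent.sources.tail.channels = mem

agent.channels.mem.type = memory

agent.sinks.hdfs-out.type = hdfs
agent.sinks.hdfs-out.hdfs.path = hdfs:///flume/events/%Y-%m-%d
agent.sinks.hdfs-out.channel = mem
```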
 
3. Data Aggregation
Means aggregating the collected data into properly sized chunks before sending it to the target storage (S3, HDFS, EMR).
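In Fluentd, for instance, chunk size and flush timing are what control this aggregation: the output buffers events and uploads a chunk once it reaches a size or time boundary. The buffer values below are illustrative, not recommendations from the original notes:

```conf
# Fluentd S3 output: aggregate into hourly chunks, flush early at 256 MiB.
<match logs.**>
  @type s3
  s3_bucket my-bucket
  path logs/
  <buffer time>
    timekey 3600
    chunk_limit_size 256m
  </buffer>
</match>
```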