Introduction
Moving data to AWS --> Data Collection --> Data Aggregation --> Data Processing --> Cost and Performance Optimizations
1. Moving data to AWS
Means moving the bulk of existing data into AWS
- Moving Direction
- Local Storage <--> AWS S3
- Local HDFS --> S3
- using S3DistCp: an extension of DistCp with optimizations for AWS
- using DistCp
- Local Filesystem --> AWS S3
- open-source tools that support parallel transfers: JetS3t / GNU Parallel
- Aspera Direct-to-S3: a UDP-based file transfer protocol with optimizations for AWS
- Device based import/export
- AWS Direct Connect
- One-time direct connection: once the bulk data is transferred, stop the direct connection
- Ongoing direct connection: always connected
- S3 --> Local HDFS
- using S3DistCp or DistCp (the same tools as Local HDFS --> S3, run in the opposite direction)
- AWS S3 --> AWS EMR
- i.e., AWS S3 --> HDFS on the EMR cluster
- Local Storage <--> AWS S3
- with good optimization: several terabytes a day
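The parallel-transfer idea behind tools like JetS3t and GNU Parallel can be sketched as: split the file set across worker threads so several objects move at once. A minimal stdlib-only sketch follows; `upload_file` is a hypothetical stand-in (a real version would call boto3 or shell out to `aws s3 cp`), and the file names are placeholders.

```python
# Sketch: parallelize per-file uploads with a thread pool.
from concurrent.futures import ThreadPoolExecutor

def upload_file(path):
    # Hypothetical stub: a real implementation would push `path` to S3
    # (e.g. via boto3's upload_file) and return the transfer result.
    return (path, "uploaded")

def parallel_upload(paths, workers=8):
    # Each worker transfers one file at a time; pool.map keeps input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(upload_file, paths))

if __name__ == "__main__":
    for path, status in parallel_upload([f"part-{i:05d}" for i in range(4)]):
        print(path, status)
```

Throughput scales with the worker count up to the available bandwidth, which is why multi-threaded uploads matter for reaching the "several terabytes a day" range.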
2. Data Collection
Means streaming data to AWS
- Apache Flume: collected data can be sent to S3, HDFS, and more
- Fluentd: collected data can be sent to S3, SQS, MongoDB, Redis, and more
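As a sketch, a Fluentd output that ships collected events to S3 (via the `s3` output plugin) might look like the following; the bucket name, tag pattern, buffer path, and flush intervals are placeholder assumptions to adapt:

```
<match logs.**>
  @type s3
  s3_bucket my-log-bucket   # assumed bucket name
  s3_region us-east-1
  path logs/
  <buffer time>
    @type file
    path /var/log/fluent/s3
    timekey 3600            # roughly one S3 object per hour
    timekey_wait 10m
  </buffer>
</match>
```

Buffering before flushing is what lets a collector like Fluentd emit reasonably sized S3 objects instead of one tiny file per event, which leads directly into the aggregation step below.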
3. Data Aggregation
Means aggregating the collected data to a proper size before sending it to target storage (S3, HDFS, EMR).
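A minimal sketch of size-based aggregation: buffer small records and flush one larger batch once a size threshold is reached. The 64 MB default and the `sink` callback are assumptions; in practice the sink would write an object to S3 or a file to HDFS, and the threshold would be chosen near the storage block/part size.

```python
# Sketch: buffer records, flush a combined batch once it reaches flush_bytes.
class Aggregator:
    def __init__(self, flush_bytes=64 * 1024 * 1024, sink=None):
        self.flush_bytes = flush_bytes
        # `sink` stands in for an S3/HDFS writer; default discards batches.
        self.sink = sink or (lambda batch: None)
        self.buffer = []
        self.size = 0

    def add(self, record: bytes):
        self.buffer.append(record)
        self.size += len(record)
        if self.size >= self.flush_bytes:
            self.flush()

    def flush(self):
        # Emit everything buffered so far as one batch, then reset.
        if self.buffer:
            self.sink(b"".join(self.buffer))
            self.buffer, self.size = [], 0
```

Aggregating this way avoids the many-small-files problem: both S3 listing/request overhead and HDFS NameNode pressure grow with object count, so fewer, larger objects process faster downstream.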