Project Description:
This project involved building a data lake in Hadoop with real-time data transformation in Kafka. The sources of data were Oracle, MySQL, and MongoDB. The source data was highly transactional and carried a large amount of history.
Project Details:
Number of databases: 50+ (Oracle, MySQL and MongoDB included)
Data Size: 600+ TB
Team size: 40+
Methodology:
Data was streamed in real time into Kafka from the different sources, and the staging data and final output were landed in Hadoop. The data was then processed in real time and moved to the cloud for final consumption.
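The per-record transformation applied in the streaming layer could be sketched as below. This is a minimal illustration only: the record shape, field names, and normalization rules are assumptions for the sketch, not the project's actual logic, and the Kafka consume/produce plumbing is omitted.

```python
import json
from datetime import datetime, timezone

def transform_record(raw: bytes) -> dict:
    """Normalize one source record (hypothetical shape) for the data lake.

    Assumes each Kafka message is a JSON object with at least an 'id'
    and a 'source' field; everything else here is illustrative.
    """
    record = json.loads(raw.decode("utf-8"))
    return {
        "id": str(record["id"]),
        "source": record.get("source", "unknown").lower(),
        # Stamp ingestion time so downstream jobs can measure lag.
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        # Keep the remaining payload intact for the staging zone.
        "payload": {k: v for k, v in record.items() if k not in ("id", "source")},
    }

# Example: a message as it might arrive on a topic fed from Oracle.
msg = json.dumps({"id": 42, "source": "ORACLE", "amount": 19.99}).encode("utf-8")
out = transform_record(msg)
print(out["source"], out["payload"])  # oracle {'amount': 19.99}
```

In a real pipeline this function would sit inside a consumer loop (or a stream-processing job), with the result written to the Hadoop staging zone and to an output topic.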
Challenges:
Multiple heterogeneous sources
High volume of data (600+ TB)
Speed/lag issues in the streaming pipeline
Data quality issues
Divergent data types across sources
Security requirements
Differing character sets at the database level
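To illustrate the character-set challenge above: databases configured with legacy encodings (for example a Latin-1 Oracle instance) can emit bytes that are invalid UTF-8 and break a downstream pipeline. A hedged Python sketch of normalizing such bytes to UTF-8; the specific encodings are assumptions for illustration:

```python
def to_utf8(raw: bytes, source_encoding: str = "latin-1") -> str:
    """Decode bytes using the source database's character set,
    falling back to replacement characters rather than failing,
    so one bad row does not stall the stream."""
    try:
        return raw.decode("utf-8")  # many rows are already valid UTF-8
    except UnicodeDecodeError:
        return raw.decode(source_encoding, errors="replace")

# 'é' encoded as Latin-1 is a single byte 0xE9, which is invalid UTF-8.
legacy = "café".encode("latin-1")
print(to_utf8(legacy))  # café
```

The design choice here is to decode as close to the source as possible and standardize on UTF-8 everywhere downstream, so Kafka, Hadoop, and the cloud consumers never have to guess the original database character set.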
Comments: