This project involved building a data lake in Hadoop with real-time data transformation in Kafka. The data sources were Oracle, MySQL, and MongoDB databases; the source data was highly transactional and carried a large history.
Number of databases: 50+ (Oracle, MySQL, and MongoDB)
Data size: 600+ TB
Team size: 40+
Data was streamed in real time into Kafka from the different sources, and the staging data along with the final outputs was landed in Hadoop. From there it was processed in real time and delivered to the cloud for final consumption.
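The transformation step in such a pipeline typically reshapes each change event pulled off a Kafka topic before it is written to the staging area. The sketch below illustrates that idea only; the event structure, field names, and `transform_change_event` function are all hypothetical, since the source does not describe the actual record format.

```python
import json
from datetime import datetime, timezone

def transform_change_event(raw: bytes) -> dict:
    """Flatten a hypothetical JSON change event into the record shape
    written to staging (all field names here are illustrative)."""
    event = json.loads(raw)
    return {
        "source_db": event["source"]["db"],     # originating database
        "table": event["source"]["table"],      # source table name
        "op": event["op"],                      # c=create, u=update, d=delete
        # for deletes there is no "after" image, so fall back to "before"
        "payload": event.get("after") or event.get("before"),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

# Example: one change event as it might arrive on a Kafka topic
raw = json.dumps({
    "source": {"db": "orders_oracle", "table": "ORDERS"},
    "op": "u",
    "before": {"id": 7, "status": "NEW"},
    "after": {"id": 7, "status": "SHIPPED"},
}).encode()

record = transform_change_event(raw)
print(record["table"], record["op"], record["payload"]["status"])
```

In a real deployment this function would sit inside a Kafka consumer loop or a stream-processing job rather than being called on a hand-built event.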
Key considerations:
- Volume of data
- Character set used at the database level