
Setup of Datalake (Hadoop and Kafka) from Different Sources

Updated: May 9, 2023

Project Description:

This project involved building a data lake in Hadoop with real-time data transformation in Kafka. The sources were Oracle, MySQL, and MongoDB databases holding highly transactional data with a large historical footprint.
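
The original post doesn't name the ingestion tooling. One common way to stream changes from transactional databases such as Oracle, MySQL, and MongoDB into Kafka is change data capture through Kafka Connect, for example with a Debezium source connector. The sketch below registers a hypothetical MySQL connector over the Kafka Connect REST API; all host names, credentials, and table names are placeholders, not the project's actual configuration.

```python
import requests

# Hypothetical Kafka Connect endpoint; the post does not name the
# ingestion tooling, so this assumes Debezium's MySQL source connector.
connect_url = "http://connect.example.com:8083/connectors"

connector = {
    "name": "mysql-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.example.com",   # placeholder host
        "database.port": "3306",
        "database.user": "cdc_user",                # placeholder credentials
        "database.password": "********",
        "database.server.id": "184054",
        "topic.prefix": "mysql-prod",               # Kafka topic namespace
        "table.include.list": "sales.orders",       # tables to capture
        # Debezium keeps schema history in its own Kafka topic
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-history.mysql-prod",
    },
}

resp = requests.post(connect_url, json=connector, timeout=30)
resp.raise_for_status()
print(resp.json())
```

With a connector like this per source database, inserts, updates, and deletes flow into Kafka topics as change events, which matches the highly transactional nature of the sources.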


Project Details:

  • Number of databases: 50+ (Oracle, MySQL, and MongoDB)

  • Data Size: 600+ TB

  • Team size: 40+


Methodology:

Data was streamed in real time from the different sources into Kafka, and both the staging data and the final output were landed in Hadoop. From there it was processed in real time and delivered to the cloud for final consumption.
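
The post doesn't detail how the Kafka data landed in Hadoop. One pattern consistent with the description is a consumer that micro-batches records from a topic into immutable part files under a staging path in HDFS, which downstream jobs then process and push to the cloud. A minimal sketch, assuming the kafka-python and hdfs (WebHDFS) client libraries; the topic, host, and path names are placeholders:

```python
import time

from kafka import KafkaConsumer   # kafka-python
from hdfs import InsecureClient   # WebHDFS client

# Placeholder broker, NameNode, topic, and path names.
consumer = KafkaConsumer(
    "mysql-prod.sales.orders",
    bootstrap_servers=["kafka:9092"],
    group_id="hadoop-staging-writer",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: v.decode("utf-8", errors="replace"),
)
hdfs = InsecureClient("http://namenode.example.com:9870", user="etl")

BATCH_SIZE = 1000
batch = []
for message in consumer:
    batch.append(message.value)
    if len(batch) >= BATCH_SIZE:
        # Land the micro-batch as an immutable part file in the staging
        # zone; downstream jobs process it and push results to the cloud.
        part = f"/datalake/staging/orders/part-{int(time.time() * 1000)}.jsonl"
        hdfs.write(part, data="\n".join(batch) + "\n", encoding="utf-8")
        batch.clear()
```

Writing new part files rather than appending keeps the staging zone append-only, which simplifies reprocessing and avoids HDFS append semantics.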




Challenges:

  • Multiple sources

  • Volume of data

  • Speed/lag issues

  • Data quality

  • Data types

  • Security requirements

  • Character set differences at the database level (see the decoding sketch after this list)
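
Character-set mismatches were one of the concrete pain points above: bytes arriving from 50+ databases cannot be assumed to share one encoding. A minimal sketch of the kind of normalization step this implies, decoding raw bytes into UTF-8 before anything lands in the lake; the encodings chosen here are assumptions, not the project's actual settings:

```python
def to_utf8(raw: bytes) -> str:
    """Decode bytes coming from a source database into UTF-8 text."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every possible byte value, so this fallback always
        # succeeds; a per-source encoding table would be more precise.
        return raw.decode("latin-1")

print(to_utf8("café".encode("latin-1")))  # café
```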


