Many times when working with Kafka I feel tempted to use Kafka itself to store data, but it should never be used as a datastore; instead, we should store the topics in our data lake. I was working on a POC to create a stream that consumes the events coming from Kafka and stores them in our S3 bucket (data lake).

Here I am running the Kafka service on an EC2 instance, and I have written a producer in Scala which takes care of sending messages according to our needs. My message has three parts: txn_id, amount, status. My streaming application…
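As a rough sketch of the message shape described above, the three-part payload (txn_id, amount, status) could be modeled and serialized like this. The case class and helper names are illustrative, not from the original post, and the Kafka producer/consumer wiring itself is omitted:

```scala
// Illustrative only: models the txn_id,amount,status payload from the post.
case class TxnEvent(txnId: String, amount: Double, status: String)

object TxnEvent {
  // Serialize as the comma-separated payload the producer would send to the topic
  def toMessage(e: TxnEvent): String = s"${e.txnId},${e.amount},${e.status}"

  // Parse a payload back into a typed event on the consumer side
  def fromMessage(s: String): TxnEvent = s.split(",", 3) match {
    case Array(id, amt, st) => TxnEvent(id, amt.toDouble, st)
    case _ => throw new IllegalArgumentException(s"bad payload: $s")
  }
}
```

The real producer would hand `toMessage(...)` to a `KafkaProducer[String, String]` as the record value; the streaming job would apply `fromMessage` before writing to S3.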


While working on data pipeline projects, programmers often have to deal with slowly changing dimension data.

In this post I will try to jot down an implementation of SCD Type 2 using Apache Spark. All these operations are done in memory after reading your source and target data. The code is implemented in Scala, but you can implement similar logic in Python too.
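Before looking at the data, the core SCD Type 2 rules can be sketched in plain Scala on in-memory collections: a changed attribute expires the current dimension row and inserts a new current version; a new key simply inserts; an unchanged key is left alone. All names here (`DimRow`, `Scd2`, the date columns) are illustrative assumptions; the real implementation would express the same logic with Spark DataFrames:

```scala
// Minimal in-memory sketch of SCD Type 2 (illustrative names, not the
// post's actual Spark code). Dates are kept as strings for simplicity.
case class DimRow(key: String, attr: String,
                  startDate: String, endDate: String, current: Boolean)

object Scd2 {
  val HighDate = "9999-12-31" // open-ended end date for current rows

  // Apply a batch of source records (key -> latest attribute value)
  // to the existing dimension table, effective as of `loadDate`.
  def apply(target: Seq[DimRow],
            source: Map[String, String],
            loadDate: String): Seq[DimRow] = {
    // 1. Expire current rows whose attribute changed in the source
    val updated = target.map { row =>
      source.get(row.key) match {
        case Some(newAttr) if row.current && newAttr != row.attr =>
          row.copy(endDate = loadDate, current = false)
        case _ => row
      }
    }
    val currentByKey = target.filter(_.current).map(r => r.key -> r.attr).toMap
    // 2. Insert a fresh current version for new keys and changed keys
    val inserts = source.collect {
      case (k, v) if !currentByKey.get(k).contains(v) =>
        DimRow(k, v, loadDate, HighDate, current = true)
    }
    updated ++ inserts
  }
}
```

In Spark the expire step becomes a join of target against source with a `withColumn` on the end date and current flag, and the insert step a `union` of the new versions; the merge semantics are the same.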

Overview of Data:

Source Data

Rajesh

An engineer by profession. Likes to explore new technology.
