Many times when I work with Kafka I feel tempted to use it as a datastore, but Kafka should never be used that way; instead, we should persist the topics to our data lake. I was working on a POC to create a stream that consumes the events coming from Kafka and stores them in our S3 bucket (data lake).
Here I am running the Kafka service on an EC2 instance, and I have written a producer in Scala that takes care of sending messages according to our needs. Each message has three parts: txn_id, amount, status. My streaming application…
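A consumer like the one described can be sketched with Spark Structured Streaming. This is a minimal sketch, not the actual POC code: the broker address, topic name, and bucket paths below are placeholders, and I assume the three-part message arrives as JSON.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object KafkaToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()
    import spark.implicits._

    // Schema matching the three-part message: txn_id, amount, status
    val schema = new StructType()
      .add("txn_id", StringType)
      .add("amount", DoubleType)
      .add("status", StringType)

    // Read the topic as a stream; broker and topic are placeholders
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "ec2-host:9092")
      .option("subscribe", "transactions")
      .load()
      .select(from_json($"value".cast("string"), schema).as("e"))
      .select("e.*")

    // Continuously append Parquet files to the data lake; the
    // checkpoint location is required for fault-tolerant restarts
    events.writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/transactions/")
      .option("checkpointLocation", "s3a://my-bucket/checkpoints/transactions/")
      .start()
      .awaitTermination()
  }
}
```

Writing Parquet rather than raw bytes keeps the lake queryable, and the checkpoint directory is what lets the job resume from the last committed Kafka offsets after a failure.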
While working on data pipeline projects, programmers often have to deal with slowly changing dimension (SCD) data.
Here in this post I will try to jot down an implementation of SCD Type 2 using Apache Spark. All of these operations are done in memory after reading your source and target data. The code is implemented in Scala, but you can implement similar logic in Python too.
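The in-memory approach can be sketched with Spark DataFrames. This is a minimal illustration, not the post's actual code: the column names (`key`, `attr`, `eff_date`, `end_date`, `is_current`) and the sample rows are assumptions chosen to show the Type 2 mechanics of expiring changed rows and inserting new versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ScdType2 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scd2").master("local[*]").getOrCreate()
    import spark.implicits._

    // Target dimension read into memory (illustrative rows)
    val target = Seq(
      ("k1", "old-value", "2020-01-01", "9999-12-31", true)
    ).toDF("key", "attr", "eff_date", "end_date", "is_current")

    // Incoming source snapshot (illustrative rows)
    val source = Seq(
      ("k1", "new-value"), ("k2", "brand-new")
    ).toDF("key", "attr")

    val t = target.as("t")
    val s = source.as("s")

    // Rows present in both where the tracked attribute changed
    val changed = t.filter($"t.is_current")
      .join(s, $"t.key" === $"s.key")
      .where($"t.attr" =!= $"s.attr")

    // Expire the old version of each changed row
    val closedOld = changed.select(
      $"t.key".as("key"), $"t.attr".as("attr"), $"t.eff_date".as("eff_date"),
      current_date().cast("string").as("end_date"), lit(false).as("is_current"))

    // Insert the new version effective today
    val newVersions = changed.select(
      $"t.key".as("key"), $"s.attr".as("attr"),
      current_date().cast("string").as("eff_date"),
      lit("9999-12-31").as("end_date"), lit(true).as("is_current"))

    // Keys appearing only in the source are brand-new inserts
    val inserts = s.join(t, $"s.key" === $"t.key", "left_anti")
      .select($"s.key".as("key"), $"s.attr".as("attr"),
        current_date().cast("string").as("eff_date"),
        lit("9999-12-31").as("end_date"), lit(true).as("is_current"))

    // Target rows untouched by any change pass through as-is
    val changedKeys = changed.select($"t.key".as("key")).distinct()
    val untouched = target.join(changedKeys, Seq("key"), "left_anti")

    val result = untouched
      .unionByName(closedOld)
      .unionByName(newVersions)
      .unionByName(inserts)

    result.show(truncate = false)
  }
}
```

The final DataFrame holds the expired history row for `k1`, its new current version, and the fresh `k2` record; writing `result` back over the target completes the merge.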
Overview of Data:
An engineer by profession. Likes to explore new technology.