Merging small Parquet files in AWS Lambda

Rajesh
2 min read · Oct 20, 2021

I was working on a use case where we needed to capture logs from a data science model, which meant we were getting many small files from Kinesis Data Firehose. We configured the Firehose buffer size to 128 MB and the buffer interval to 900 seconds, since our downstream application can tolerate some latency. The architecture looks like this: the model's logs flow through Firehose into an S3 bucket, where each delivered object triggers a Lambda function that merges the small Parquet files.
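As a hedged reference, here is how those buffering hints might be set when creating the delivery stream with boto3; the stream name, bucket ARN, role ARN, and prefix are placeholders, not values from the original setup:

import boto3

firehose = boto3.client("firehose")

# Placeholder names; the buffering hints mirror the 128 MB / 900 s
# configuration described above.
firehose.create_delivery_stream(
    DeliveryStreamName="model-logs-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "BucketARN": "arn:aws:s3:::my-log-bucket",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "Prefix": "incoming/",
        "BufferingHints": {
            "SizeInMBs": 128,          # flush once the buffer reaches 128 MB...
            "IntervalInSeconds": 900,  # ...or after 900 seconds, whichever comes first
        },
    },
)

Firehose flushes on whichever hint is hit first, so during quiet periods files land every 15 minutes regardless of size.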

Firehose supports attaching a Lambda function for transformation, but Lambda has a hard payload limit of 6 MB while the Firehose buffer can hold up to 128 MB, which would be a problem. So instead we trigger our Lambda function once Firehose puts the files into the S3 bucket.

Lambda Function:

We used the SAM CLI to scaffold the initial Lambda project. The SAM CLI provides a way to pass events that trigger the function inside a Docker container, which closely mirrors triggering it inside the AWS environment. More info on the SAM CLI can be found here. Below is my requirements.txt, which lists the dependencies my Lambda needs:

pyarrow
boto3
s3fs

To make these dependencies available to the function, we used a Lambda layer, since a layer can be reused across functions, and its 250 MB (unzipped) size limit leaves room for bigger dependencies like Apache Arrow.
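If you build the layer yourself, the zip must place the packages under a top-level python/ directory so the runtime finds them on sys.path. Here is a minimal sketch of publishing such a layer with boto3; the layer name, bucket, and key are placeholders:

import boto3

lambda_client = boto3.client("lambda")

# Placeholder bucket/key; the zip is expected to contain
# python/pyarrow/..., python/s3fs/..., etc.
lambda_client.publish_layer_version(
    LayerName="parquet-deps",
    Description="pyarrow, boto3, s3fs for the merge function",
    Content={
        "S3Bucket": "my-artifacts-bucket",
        "S3Key": "layers/parquet-deps.zip",
    },
    CompatibleRuntimes=["python3.9"],
)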

Function:

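The original post embedded the handler as a gist, which is not reproduced here, so below is a minimal sketch of the approach it describes. The BUCKET environment variable, the merged/ prefix, and the rotation scheme are illustrative assumptions, not the original code:

import os
import urllib.parse

import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

BUCKET = os.environ["BUCKET"]            # assumed: output bucket via env var
MERGED_KEY = "merged/current.parquet"    # assumed: rolling merge target
MAX_SIZE = 64 * 1024 * 1024              # rotate once the file reaches 64 MB


def lambda_handler(event, context):
    merged_path = f"{BUCKET}/{MERGED_KEY}"

    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        # object keys in S3 event payloads are URL-encoded
        src_key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # read the newly delivered small file
        new_table = pq.read_table(f"{src_bucket}/{src_key}", filesystem=fs)

        # append it to the existing merged file, if there is one
        # (pa.concat_tables assumes the schemas match)
        if fs.exists(merged_path):
            existing = pq.read_table(merged_path, filesystem=fs)
            new_table = pa.concat_tables([existing, new_table])

        pq.write_table(new_table, merged_path, filesystem=fs)

        # once the merged object crosses 64 MB, close it out and start fresh
        fs.invalidate_cache(merged_path)
        if fs.size(merged_path) >= MAX_SIZE:
            closed_key = f"merged/closed-{context.aws_request_id}.parquet"
            fs.mv(merged_path, f"{BUCKET}/{closed_key}")

Rereading and rewriting the merged file on every invocation does repeated work, but it is the simplest scheme, and with a 64 MB cap the cost stays bounded.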

The function is fairly self-explanatory: we read each new file that arrives via the S3 event notification and merge it into the existing file until the merged file reaches 64 MB.

Finally, we add S3 event notifications on s3:ObjectCreated:Put and s3:ObjectCreated:CompleteMultipartUpload; we need the CompleteMultipartUpload event because bigger files are uploaded to S3 in multiple parts (a sketch of this wiring follows below). And with that we are done. I hope this article has helped you get some insight into dealing with Parquet files in Lambda.
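Here is a hedged sketch of that wiring with boto3; the bucket name and function ARN are placeholders, and the prefix filter (matching the Firehose prefix assumed earlier) keeps the function from re-triggering on its own merged output:

import boto3

s3 = boto3.client("s3")

# Placeholder bucket and ARN. The prefix filter stops the function from
# invoking itself recursively on the merged/ files it writes.
s3.put_bucket_notification_configuration(
    Bucket="my-log-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:merge-parquet",
                "Events": [
                    "s3:ObjectCreated:Put",
                    "s3:ObjectCreated:CompleteMultipartUpload",
                ],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "incoming/"}]}
                },
            }
        ]
    },
)

Before this call succeeds, the function also needs a resource policy (lambda add_permission with principal s3.amazonaws.com) allowing the bucket to invoke it; that step is omitted here.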

Rajesh

An engineer by profession. Likes to explore new technology.