Dockerize Spark Jobs with Databricks Container Services

Rajesh
2 min read · Jul 16, 2021

Many times as developers, after changing code in a Spark job, we need to follow quite a number of steps to test the changes inside a Databricks cluster:

  • Create the jar using sbt/Maven
  • Create a cluster with all the required ARNs and attach the created jar
  • Copy all dependencies to the Databricks cluster

These are quite a number of steps, and we still end up with errors due to missing dependencies or an incorrect ARN.
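Roughly, that manual loop looks something like this. This is only a sketch using the Databricks CLI; the jar name, DBFS path and cluster id are placeholders, and every extra dependency means another upload and install.

# Sketch of the manual loop; jar name, path and cluster id are placeholders.
sbt package                                             # build the jar
databricks fs cp target/scala-2.12/my-job_2.12-0.1.jar \
  dbfs:/FileStore/jars/my-job_2.12-0.1.jar              # upload it
databricks libraries install \
  --cluster-id <cluster-id> \
  --jar dbfs:/FileStore/jars/my-job_2.12-0.1.jar        # attach it to the cluster
# ...and repeat for every dependency not already on the cluster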

In this blog I decided to jot down the small configurable shell script I have created to automate all the above steps and launch a Databricks cluster with all dependencies and the correct ARN.

Before getting into the details, here is some background on Databricks Container Services, which we are using.

Databricks Container Services:

Databricks Container Services lets us specify a Docker image when we create a cluster. Some example use cases include:

  • Library customization: we have full control over the system libraries we want installed.
  • Golden container environment: our Docker image is a locked-down environment that will never change.
  • Docker CI/CD integration: we can integrate Databricks with our Docker CI/CD pipelines.
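To make the idea concrete, here is a minimal sketch of what launching such a cluster looks like through the Databricks CLI and the Clusters API. The image name, ARN and registry credentials are placeholders; the script described later in this post builds this payload from a YAML config instead.

# Sketch only: create a cluster that boots from a custom Docker image.
# Image name, ARN and credentials below are placeholders.
cat > create-cluster.json <<'EOF'
{
  "cluster_name": "spark-job-test",
  "spark_version": "7.3.x-scala2.12",
  "node_type_id": "i3.2xlarge",
  "autoscale": { "min_workers": 2, "max_workers": 8 },
  "autotermination_minutes": 120,
  "aws_attributes": {
    "instance_profile_arn": "arn:aws:iam::<account-id>:instance-profile/<role>",
    "zone_id": "eu-west-1b"
  },
  "docker_image": {
    "url": "demodocker/spark-job:latest",
    "basic_auth": { "username": "<docker-user>", "password": "<docker-token>" }
  }
}
EOF

databricks clusters create --json-file create-cluster.json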

Build Base Image:

There are several requirements for Databricks to launch a cluster successfully. For this, Databricks already provides a Docker base image, databricksruntime/standard. This runtime comes in different flavours; we are using standard for our requirements here. Below is the Dockerfile which takes care of creating the Docker image.

FROM databricksruntime/standard:latest

ARG SCALA_VERSION

# update ubuntu
RUN apt-get update \
  && apt-get install -y \
  build-essential \
  python3-dev \
  && apt-get clean

ADD ./target/$SCALA_VERSION/*.jar /databricks/jars/
ADD ./lib/*.jar /databricks/jars/
ADD ./lib/jars/*.jar /databricks/jars/

# clean up
RUN rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

Inside the Dockerfile we have instructions to copy all the jars in lib as well as all the dependent libraries declared in build.sbt. To gather the dependent jars we have created a task as below:

val copyJarsTask = taskKey[Unit]("copy-jars")
copyJarsTask := {
  val folder = new File("lib/jars")
  // Find the relevant jars: skip test, Spark and scala-library dependencies
  val requiredLib = libraryDependencies.value
    .filter(v => !v.toString().contains("test")
      && !v.toString().contains("org.apache.spark")
      && !v.toString().contains("scala-library"))
    .map { v =>
      val arr = v.toString().split(":")
      (arr(1) + "_" + scalaVersion.value.substring(0, 4) + "-" + arr(2) + ".jar",
        arr(1) + "-" + arr(2) + ".jar")
    }
  // Copy the matching jars from the managed classpath into lib/jars
  (managedClasspath in Compile).value.files.map { f =>
    requiredLib.map { name =>
      if (f.getName.equalsIgnoreCase(name._1) || f.getName.equalsIgnoreCase(name._2))
        IO.copyFile(f, folder / f.getName)
    }
  }
}

val deleteJarsTask = taskKey[Unit]("delete-jars")
deleteJarsTask := {
  val folder = new File("lib/jars")
  IO.delete(folder)
}

In this task we read all the specified dependencies and ignore the Scala and Spark libraries, as these are already part of the Databricks cluster and we don't want a big Docker image.
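With these tasks in build.sbt, preparing and publishing the image is roughly as follows. This is a sketch: the image tag is a placeholder, and the SCALA_VERSION build argument must match sbt's output directory (e.g. scala-2.12).

# Sketch: refresh lib/jars, build the job jar, then build and push the image.
sbt deleteJarsTask copyJarsTask package
docker build --build-arg SCALA_VERSION=scala-2.12 -t demodocker/spark-job:latest .
docker push demodocker/spark-job:latest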

To make it configurable, the user can provide all the configuration in a YAML file like the one below:

dockerusername: "demodocker"
databricks_profile: "default"
roles: {
  instance_profile_arn: "arn:/Development_Role",
  assume_role_arn: ""
}
zone_id: "eu-west-1b"
min_workers: 2
max_workers: 8
spark_version: "7.3.x-scala2.12"
node_type_id: "i3.2xlarge"
autotermination_minutes: 120
driver_node_type_id: "i3.2xlarge"
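For a rough idea of how flat keys in such a file can be pulled into a shell script, here is a sketch. The helper and the config file name are hypothetical; the repo's parse-yaml.sh may work differently.

# Hypothetical helper, not the repo's parse-yaml.sh: read a flat "key: value"
# pair from the config file and strip surrounding quotes.
get_yaml_value() {
  local file="$1" key="$2"
  grep -E "^[[:space:]]*${key}[[:space:]]*:" "$file" \
    | head -n 1 \
    | sed -E 's/^[^:]+:[[:space:]]*//; s/^"//; s/"$//'
}

spark_version=$(get_yaml_value cluster-config.yaml spark_version)
node_type_id=$(get_yaml_value cluster-config.yaml node_type_id)
echo "Launching ${node_type_id} cluster on runtime ${spark_version}"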

We have written three shell scripts (databricks_deploy.sh, deploy.sh and parse-yaml.sh). The user can simply run databricks_deploy.sh from the repo and provide the name of the YAML file when prompted. This makes it easy to maintain one config YAML file per environment, supporting many projects. Below is the full GitHub repo:

https://github.com/Rajesh2015/automated-databricks-deploy
