Apache Spark Internals : As Easy as Baking a Pizza!
A fun take, drawing parallels between Pizza baking and running a Spark application
Even if you aren’t into pizzas, I am sure you have a vague idea of what goes into making them. This quick (and hopefully easy) read is a fun take on how you can draw parallels between baking a pizza and running an Apache Spark application!
Okay! So here we go, let me introduce Dug. Making “DAGs” is his speciality; he is good with schedules and with coming up with a good overall plan for any “job” given to him. He is in the fresh pizza making business with two other partners (Tessy, who is quite the Task Master, and Buster, who is a good “Cluster Manager”).
DAG Scheduler (Planning)
When a pizza job order comes in, Dug, as the planner in-charge, comes up with a set of predefined stages and the operations that go into each stage of making the pizza. For a very simplistic pizza, he will come up with a DAG (Directed Acyclic Graph — just read that aloud again — quite an intuitive name and graph, isn’t it?)
So Dug has optimized the stages for making the pizza:
- what operations should go into a pipeline (Stage 1 and Stage 2 each have a set of sequenced, pipelined operations);
- what can go in parallel (Stage 0, Stage 1, Stage 2);
- what sequencing is involved, for example, Stage 3 and Stage 4 cannot commence unless the previous stages are done.
But hold on, remember he is the planning in-charge, so he has the additional responsibility of creating the specific tasks associated with this job. For example:
- If the job order was for a small pizza, he would perhaps determine that this job can be done by 5 people; so he comes up with 5 tasks (1 person each for Stages 0 through 4).
- If it is a large pizza, he would probably come up with 8 tasks (2 people for Stage 0 + 2 people for Stage 1 + 2 people for Stage 2 + 1 person each for Stage 3 and Stage 4).
So essentially, Dug considers each unit of work as a task; the bigger the pizza, the larger the number of tasks to be done! And so Dug comes up with a set of tasks that can then be handed over to Tessy, the Task Scheduler.
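Dug’s staging logic above can be sketched as a toy scheduler (a pure-Python illustration only, not Spark’s actual code): a stage becomes ready to run only once all of its parent stages are done, and stages with no unmet dependencies run in parallel.

```python
# A toy model of how a DAG scheduler groups stages into parallel "waves".
# Stage numbers mirror the pizza example: Stages 0-2 are independent,
# Stage 3 (assemble) needs all three, Stage 4 (bake) needs Stage 3.
deps = {0: [], 1: [], 2: [], 3: [0, 1, 2], 4: [3]}

def schedule(deps):
    """Return stages grouped into waves that can run in parallel."""
    done, waves = set(), []
    while len(done) < len(deps):
        ready = sorted(s for s in deps
                       if s not in done and all(p in done for p in deps[s]))
        waves.append(ready)
        done.update(ready)
    return waves

print(schedule(deps))  # [[0, 1, 2], [3], [4]]
```

The output shows exactly the sequencing described above: Stages 0–2 run together, then Stage 3, then Stage 4.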
Task Scheduler & Cluster Manager (Execution)
Now that the exact plan is laid out, the actual execution of these tasks comes into the picture.
In the pizza shop, assume you have a pool of worker rooms; each room can hold a limited number of pizza shop chefs, depending on the worker room capacity. (Let us call the chefs executors, as they execute the tasks given to them.) Only when a task needs to be executed is a chef called in to execute it in that room.
Executor1 will probably make the pizza base,
Executor 2 will make the sauce,
Executor 3 will make the toppings,
Executor 4 will come in at the right time to layer the pizza and finally,
Executor 5 will be called in to put the pizza in the oven.
In short, when it finally comes to execution, Tessy the Task Scheduler takes the set of tasks. She takes the help of Buster the Cluster Manager and they work together: Tessy decides when each task gets scheduled, and Buster decides when an “executor” should be summoned into which “worker” room. To reiterate, Tessy only knows how to hand out the tasks in the right order, and Buster only knows which worker rooms are free to hold the executors and how to call upon them.
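The split of responsibilities can be mimicked in a toy simulation (pure Python, not real Spark code): one function decides task order (Tessy’s job), another decides room placement (Buster’s job), and neither does the other’s work.

```python
# A toy split of responsibilities: the "task scheduler" only fixes the
# order of tasks; the "cluster manager" only picks which worker room
# hosts the executor for each task. Names are from the pizza analogy.
tasks = ["make base", "make sauce", "make toppings", "assemble", "bake"]
worker_slots = ["room-A", "room-B"]  # limited capacity, like worker rooms

def run(tasks, slots):
    placements = []
    for i, task in enumerate(tasks):   # Tessy: hands out tasks in order
        room = slots[i % len(slots)]   # Buster: assigns a free room
        placements.append((task, room))
    return placements

for task, room in run(tasks, worker_slots):
    print(f"{task!r} runs on an executor in {room}")
```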
How many tasks can an executor execute?
One employee with one pair of hands (a 1-core vCPU) can execute one task at a time. But let’s say we can attach, say, 4 pairs of hands (a 4-core vCPU) to the employee; they can then execute 4 tasks at the same time!
Stretch this imagery a bit more. Let’s say we have the feature of attaching hands of different sizes to the executor. Now they are capable of handling more dough (data) per task!
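In real Spark terms, the number of hands maps to executor cores and the hand size maps to executor memory. A hedged sketch of how these might be set when building a session (the values are illustrative, not recommendations, and PySpark must be installed):

```python
from pyspark.sql import SparkSession

# Illustrative values only: 4 "pairs of hands" (cores) per executor,
# so each executor can run 4 tasks concurrently, and 4g of memory
# ("hand size") controlling how much data each executor can hold.
spark = (SparkSession.builder
         .appName("pizza-shop")
         .config("spark.executor.cores", "4")    # tasks per executor
         .config("spark.executor.memory", "4g")  # data per executor
         .getOrCreate())
```

On a real cluster these are more commonly passed on the spark-submit command line; the configuration keys are the same either way.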
Repartition and Coalesce
What if, during one stage, one or two executors get to knead 10 kg of dough each for the pizza base while the rest of the executors get to knead 100 g each? Is it efficient? Obviously not: the whole pizza-base stage is stalled by those one or two executors who have a lot more to do. Just do a repartition(), which does a proper redistribution, and we are good.
Consider the opposite case: 100 executors kneading 1 g of dough each. Again, not efficient! It takes the executor longer to walk into the room than to knead the dough. So what should we do? Call coalesce() on the dough/data, and we get the right number of portions (partitions) and therefore executors.
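The two fixes can be sketched as a toy model (pure Python, not Spark’s implementation): repartition() pools all the dough and re-deals it into evenly sized portions (a full shuffle), while coalesce() only merges existing portions down to a smaller count without redistributing the grams inside them.

```python
# Toy illustration of repartition vs coalesce on skewed "dough" data.
def repartition(partitions, n):
    """Pool everything, then re-deal it evenly across n partitions."""
    dough = [g for part in partitions for g in part]
    return [dough[i::n] for i in range(n)]

def coalesce(partitions, n):
    """Merge existing partitions down to n, without redistributing."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

skewed = [[10_000] * 2, [100], [100]]   # two huge lumps, two tiny ones
print([len(p) for p in repartition(skewed, 4)])  # evenly sized portions

tiny = [[1]] * 100                      # 100 executors, 1 gram each
print(len(coalesce(tiny, 4)))           # down to 4 sensible portions
```

In real Spark, coalesce() avoids a full shuffle by merging partitions that are already co-located, which is exactly why it is the cheaper fix for the too-many-tiny-portions case.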
Shuffle Spill, Shuffle Write & Read
Shuffle Spill (To Disk From Memory)
Imagine the executor’s hands are not large enough to knead the dough all at once. They need to keep a bit aside in a bowl, knead as much as they can, and then pull in the part kept aside. So it is better here to have “bigger hands” or smaller portions to avoid the back and forth.
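The two remedies map roughly onto real Spark settings: “bigger hands” is more memory per executor, and “smaller portions” is more (hence smaller) shuffle partitions. An illustrative PySpark configuration sketch (values are examples, not recommendations):

```python
from pyspark.sql import SparkSession

# Illustrative settings only. "Bigger hands": more memory per executor.
# "Smaller portions": more shuffle partitions, so each task kneads less
# data at once and is less likely to spill to disk mid-knead.
spark = (SparkSession.builder
         .appName("no-spill-pizza")
         .config("spark.executor.memory", "8g")
         .config("spark.sql.shuffle.partitions", "400")
         .getOrCreate())
```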
Shuffle Write & Read (Between Stages)
When the “make toppings” task or the “make pizza base” task is completed, the executor keeps the output on a nearby plate, ready to be transferred (the Shuffle Write).
Next, the executor in Stage 2 takes the contents from that plate (the Shuffle Read) and proceeds to the “assemble pizza” task.
Side Note: It is always in this order — Shuffle Write, followed by a Shuffle Read.
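The plate hand-off can be sketched as a toy shuffle boundary (pure Python, not Spark’s implementation): the upstream stage writes its output into keyed buckets (the plates), and only afterwards does the downstream stage read the buckets it is responsible for.

```python
# Toy shuffle: write side groups records into buckets by key hash,
# read side picks up one bucket per downstream task.
def shuffle_write(records, n_buckets):
    """Each upstream task groups its output by hash of the key."""
    buckets = [[] for _ in range(n_buckets)]
    for key, value in records:
        buckets[hash(key) % n_buckets].append((key, value))
    return buckets

def shuffle_read(buckets, bucket_id):
    """A downstream task reads exactly one bucket."""
    return buckets[bucket_id]

# The write always happens first; reads come after.
plates = shuffle_write([("base", 1), ("sauce", 1), ("base", 2)], 2)
assembled = shuffle_read(plates, 0) + shuffle_read(plates, 1)
print(sorted(assembled))
```

Note that all records for the same key land on the same plate, which is what lets the downstream “assemble” task find everything it needs in one place.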
Client, Spark Driver & Spark Session
In the pizza world, you can consider the client (spark-submit/pyspark/spark-shell) as the final end user who submits an application/program (two or three pizzas) to the Swiggy delivery boy (Driver).
Spark Session/Context is the front-desk guy who takes the orders and coordinates with Dug, Tessy, Buster, and also with the Swiggy guy in the pizza shop.
Dynamic Allocation vs Static Allocation
Let’s assume the pizza shop has a new pricing model, based on resources consumed (hands and heads attached to the executors).
If you had the resource/price-conscious customer who doesn’t mind waiting a bit more for the pizza to arrive, they would let the scheduler gods decide how many executors should be dynamically called in to execute a job. It might take a little longer, because an executor takes some time to walk in, and walks in only when needed.
On the other hand, the customer who doesn’t care about the resources/price and has an upfront idea of how many executors they need would specify the exact number of executors for the pizza job. They may have some executors lying idle during the execution, but so what, they got the job done fast!
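The two customers correspond to two configuration styles. A hedged PySpark sketch (values are illustrative; real clusters using dynamic allocation typically also need an external shuffle service or shuffle tracking enabled):

```python
from pyspark.sql import SparkSession

# The price-conscious customer: let Spark scale executors up and down.
dynamic = (SparkSession.builder
           .appName("patient-customer")
           .config("spark.dynamicAllocation.enabled", "true")
           .config("spark.dynamicAllocation.minExecutors", "1")
           .config("spark.dynamicAllocation.maxExecutors", "10"))

# The customer in a hurry: book a fixed crew of executors up front.
static = (SparkSession.builder
          .appName("hurried-customer")
          .config("spark.dynamicAllocation.enabled", "false")
          .config("spark.executor.instances", "8"))
```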
Spark Master + Cluster Manager
The role of the Cluster Manager (the one in charge of handing out the cluster resources (executors)) can be taken on by different personas, depending on where the worker rooms are located (where the Spark cluster is hosted).
a) Spark Standalone Cluster Manager
This is what we have seen so far. Buster is the owner and in charge of the worker rooms; he is the main in-charge when Tessy needs to run her tasks.
b) Mesos, Kubernetes, or YARN Cluster Manager
Here the role of cluster manager is taken on by another major player in the system — the one who owns the rooms. It is as if you have rented out / outsourced the rooms. Under those circumstances, Buster takes on the role of “Spark Master”, who negotiates with the actual Cluster Manager for the resources required for the pizza job. You can imagine why — the actual cluster manager is now responsible for handing out resources to many other processes in the system. They may be using the worker rooms and executors to make, say, cupcakes and other things!
In the case of Spark running in Standalone mode, the Master process also performs the functions of the Cluster Manager. Effectively, it acts as its own Cluster Manager.
Putting it all together — Who is the Master Chef?
Let us now step back and list down the actors in the pizza shop.
- The customer (client)
- The Swiggy Delivery Boy (Driver/Application)
- The front desk guy (Spark Session)
- Dug the scheduler guy (DAG Scheduler)
- Tessy the task master (Task Scheduler)
- Buster the Master / Cluster Manager (depending on what the Resource Manager is)
- Executor (Chefs who actually do the work)
So who do you think is the grand orchestrator — the Master Chef?
It is the Driver! It is actually a “Make Your Own Pizza” shop!
The Driver is responsible for creating the SparkSession object (the entry point for any Spark application) and for planning an application by creating a DAG consisting of stages and tasks. (When we submit a Spark application, the first process to come up is the Application Master.) The Driver communicates with a Master, which in turn communicates with a Cluster Manager to allocate application runtime resources (containers) on which Executors will run. The Driver also coordinates the running of the stages and tasks defined in the DAG. Key driver activities involved in the scheduling and running of tasks include: keeping track of available resources to execute tasks, and scheduling tasks to run “close” to the data where possible (the concept of data locality).
- One pizza application can have many pizza orders/jobs
- One pizza application will use one exclusive SparkSession/SparkContext/DAG scheduler/TaskScheduler/ Spark Master
- One pizza application will get its own set of dedicated executors, shared across its orders/jobs
- The Cluster Manager and Worker rooms are for all — They are used by all the pizza applications!
- The Master Chef (Driver) talks to all other allotted chefs directly and they exchange stuff (the location of the raw ingredients to the final pizza)
Cluster Mode vs Client Mode (Driver)
This can be compared to the Driver working from home and giving instructions over the phone (client mode) vs. the Driver sitting inside one of the worker rooms of the pizza shop and coordinating the show (cluster mode). If the phone line gets cut, the pizza order is cancelled. If the driver is in the shop itself, it becomes a more efficient process for the driver. (No more of… Hello, are you there? Can you hear me? Your voice is breaking… no network instabilities.)
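In practice, the choice between the two is made when the application is submitted, typically via spark-submit’s --deploy-mode flag rather than from inside the program. The equivalent configuration key is shown below as an illustrative sketch (assuming PySpark; on a real cluster this is normally decided at submit time):

```python
from pyspark.sql import SparkSession

# Illustrative only: "cluster" seats the driver inside a worker room;
# "client" keeps it at home on the submitting machine. In a real
# deployment, pass --deploy-mode to spark-submit instead.
spark = (SparkSession.builder
         .appName("pizza-driver")
         .config("spark.submit.deployMode", "cluster")  # or "client"
         .getOrCreate())
```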
The pizza order is ready!
Apache Spark is a vast topic and there are several knobs out there to tune your large applications and make them run smoothly. If you are a newbie to Spark, you can easily get thrown off by all the jargon. Understanding the internals and the basics will help you take the first step on that journey. After reading this, go and read that Spark staged-execution chapter once again, and hopefully it becomes an easier read. If this has been a good bite, let’s have a show of hands :-) !
“You do not really understand something unless you can explain it to your grandmother.”
― Albert Einstein