Building Pipelines for Serverless Spark

Table Of Contents

Overview

Sample Workflow: Weekly Sales Data Analysis by Region, followed by an overall trend analysis
  • Define the sequencing / upstream-downstream dependencies of tasks
    “Trends in Sales Data should be calculated after all of the regional sales data have been analyzed…”
    OR
    “If America Sales Data takes more time, wait or keep polling until it is done before running the next job…”
  • Schedule or trigger tasks (see the DAG sketch after this list)
    “I want to run this weekly sales data report at 1:00 A.M. every Sunday”
    OR
    “I want to run the analysis as soon as data lands on my S3 bucket, sent by my partner org”
  • Monitor tasks
    “I want to see when task t1 was triggered and how long it took to finish…”
    “I want an email alert if task t2 fails”
  • Automatically retry failed tasks
    “If task t1 failed because of some environmental issue, retry up to 5 times…”
  • Compare historical information for task runs
    “America Sales Data used to run faster about three months ago; the downstream application has slowed down now. Is there a way to see the run information from two months back…?”
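In Airflow terms, most of these requirements map directly onto DAG and task arguments. Here is a minimal sketch of that mapping, under my own assumptions (the dag_id, email address, and callable are placeholders, not part of the workflow above): the cron expression gives the 1:00 A.M. Sunday schedule, and default_args carries the retry count and the failure alert (which also assumes SMTP is configured for your Airflow deployment).

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def _run_regional_analysis(region, **context):
    # Placeholder callable; the real tasks submit and track Spark applications.
    print(f"Analyzing sales data for {region}")

default_args = {
    "retries": 5,                          # retry a failed task up to 5 times
    "retry_delay": timedelta(minutes=10),  # wait between retries
    "email": ["data-team@example.com"],    # placeholder address
    "email_on_failure": True,              # email alert if a task fails
}

with DAG(
    dag_id="weekly_sales_report",
    schedule_interval="0 1 * * 0",         # 1:00 A.M. every Sunday
    start_date=datetime(2022, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    america_sales = PythonOperator(
        task_id="america_sales_data",
        python_callable=_run_regional_analysis,
        op_kwargs={"region": "america"},
    )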

Introduction To Apache Airflow

See the beautiful historical runs by date and time, with a breakdown of tasks for each run

Serverless Spark Submission — How does Airflow fit?

Serverless Pipeline Sample DAG

P.S. This code does not handle the ‘failed’ case, but you get the idea…
Order of execution is from left to right.
You can see the progress of the application with the different color legends.
You can also see the duration and other details per task.
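The sample DAG itself appears in the original post as a screenshot, so here is a minimal sketch of the same submit-then-track pattern under my own assumptions: the two helper functions that actually talk to the serverless Spark service are placeholders (swap in the client calls of whichever service you use), and, like the original, it does not handle the ‘failed’ state.

from datetime import datetime
import time

from airflow import DAG
from airflow.operators.python import PythonOperator

def _submit_to_serverless_spark(job_name):
    # Placeholder: replace with the submit call of your serverless Spark service.
    return "application-0001"

def _get_application_state(application_id):
    # Placeholder: replace with the status call of your serverless Spark service.
    return "success"

def _submit_spark_application(**context):
    application_id = _submit_to_serverless_spark(job_name="sales_pipeline")
    # Push the id so the tracking task can poll it later.
    context["ti"].xcom_push(key="application_id", value=application_id)

def _track_application(**context):
    application_id = context["ti"].xcom_pull(
        key="application_id", task_ids="submit_spark_application"
    )
    # Keep polling until the application reaches a terminal state
    # (no handling of 'failed' here, just like the original).
    while _get_application_state(application_id) != "success":
        time.sleep(30)

with DAG(
    dag_id="serverless_pipeline_sample",
    schedule_interval=None,
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    submit = PythonOperator(
        task_id="submit_spark_application",
        python_callable=_submit_spark_application,
    )
    track = PythonOperator(
        task_id="track_application",
        python_callable=_track_application,
    )

    submit >> track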

Couple of Beginner Tips

A Realistic Workflow — Sales Data Analysis

Tasks Sequencing

  • Let us consider the Sales Data Analysis application, which requires three parallel regional analyses (say, America, Asia, Europe) to happen and then an overall trends analysis on the final data.
  • This is how I would sequence the tasks in my DAG (sketched in code below); it should be fairly intuitive to understand.
  • The outcome of this DAG is the first picture shown at the top of this article.
DAG Sequencing for parallel tasks and join
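A minimal sketch of that fan-out/fan-in shape, using stand-in tasks (the task ids are my own; in the real DAG each region is a submit/track pair of PythonOperators):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # DummyOperator on older Airflow

with DAG(
    dag_id="sales_data_analysis_sequencing",
    schedule_interval=None,
    start_date=datetime(2022, 1, 1),
    catchup=False,
) as dag:
    # Stand-in tasks for the three regional analyses and the final trends analysis.
    analyze_america = EmptyOperator(task_id="analyze_america_sales")
    analyze_asia = EmptyOperator(task_id="analyze_asia_sales")
    analyze_europe = EmptyOperator(task_id="analyze_europe_sales")
    analyze_trends = EmptyOperator(task_id="analyze_overall_trends")

    # Fan-out: the three regional analyses run in parallel.
    # Fan-in: the trends analysis starts only after all of them have completed.
    [analyze_america, analyze_asia, analyze_europe] >> analyze_trends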

Python Operator/Sensor

  • Here is the code for all of the PythonOperators/Tasks (sketched below).
  • Essentially they are all calling the same _submit_spark_application python callable and the same _track_application python callable.
  • What changes is the parameters passed to the callable: in this case I am only passing the region using op_kwargs. In a more realistic DAG, you would perhaps also pass the application name and other input arguments.
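A sketch of how that parameterization could look, assuming the regional operators are generated from a list and only the region is passed through op_kwargs (the task ids and the loop are my own illustration, not the article's exact code; _submit_spark_application and _track_application are the callables described above):

from airflow.operators.python import PythonOperator

regions = ["america", "asia", "europe"]

for region in regions:
    # One submit task and one tracking task per region; only the op_kwargs differ.
    submit_region = PythonOperator(
        task_id=f"submit_{region}_sales",
        python_callable=_submit_spark_application,
        op_kwargs={"region": region},
    )
    track_region = PythonOperator(
        task_id=f"track_{region}_sales",
        python_callable=_track_application,
        op_kwargs={"region": region},
    )
    submit_region >> track_region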

Tracking Application ID for Parallel Submissions

  • A slight modification has been made to the key that is used to push and pull the application id, so that the right application_ids get tracked to completion for each parallel region (see the sketch below).
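The exact key naming is not shown here, so the sketch below simply assumes a region suffix on the XCom key: each submit task pushes its application id under its own key, and the matching tracking task pulls that same key, so parallel submissions never read each other's ids. The calls to the serverless Spark service are again the placeholder helpers from the earlier sketch.

import time

def _submit_spark_application(region, **context):
    # Placeholder submission call for this region's job.
    application_id = _submit_to_serverless_spark(job_name=f"{region}_sales")
    # Region-specific key, so parallel submissions do not collide in XCom.
    context["ti"].xcom_push(key=f"application_id_{region}", value=application_id)

def _track_application(region, **context):
    # Pull the id pushed by this region's submit task only.
    application_id = context["ti"].xcom_pull(
        key=f"application_id_{region}",
        task_ids=f"submit_{region}_sales",
    )
    # Placeholder status call; poll until the application completes.
    while _get_application_state(application_id) != "success":
        time.sleep(30)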

Branching

Side Note

Source Code

References
