Scaling DAGs in Apache Airflow

Chetan Hirapara
5 min read · Jun 24, 2021


Image by EJ Strat on Unsplash

Apache Airflow is a workflow management tool developed by Airbnb in 2014 to programmatically author, schedule and monitor workflows. It has become an integral part of modern data pipelines these days thanks to inbuilt capabilities like monitoring, scaling, extensibility, a shiny UI and more.

In this blog I am going to discuss scaling out DAGs and tasks in Airflow.

To scale up a data pipeline, there are a few settings in the Airflow configuration that need to be modified. Airflow comes with a lot of orchestration power, but it requires a number of parameters to be tuned to get optimal performance out of DAGs and tasks.

These parameters are broadly categorized into 3 parts.

· Environment level settings

· DAG level settings

· Task level settings

Environment level settings

Environment level settings apply to your whole Airflow instance. Changing them means modifying the airflow.cfg file, which typically sits at the root of your Airflow home directory ($AIRFLOW_HOME). This file contains lots of different parameters with default values which you can modify according to your requirements.

Older versions of Airflow had a feature to view these settings in the UI itself, but for security reasons this is disabled in Airflow 2.0. You can open the airflow.cfg file directly instead; a short excerpt of the relevant settings is shown after the list below.

What are these settings?

  1. parallelism: the maximum number of tasks that can run concurrently in a single Airflow environment. Think of this as the maximum number of tasks that can run at a time across all of your DAGs in this environment. If you set it down to 1, it becomes a performance bottleneck.
  2. dag_concurrency: determines how many task instances can be scheduled at once, per DAG. You can think of this as the scheduler allowing at most this many task instances to be scheduled at once for each DAG.
  3. max_active_runs_per_dag: the maximum number of DAG runs that can be active per DAG at any given time, as seen by the Airflow scheduler. If this is set to 1, the scheduler can handle only a single active run per DAG, regardless of parallelism=32 and dag_concurrency=16.
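For reference, this is roughly what the relevant part of the [core] section in airflow.cfg looks like, shown here with the stock Airflow 2.0 defaults (each value can also be overridden with an environment variable such as AIRFLOW__CORE__PARALLELISM):

[core]
# max number of task instances that can run at once across the whole environment
parallelism = 32
# max number of task instances the scheduler will schedule at once, per DAG
dag_concurrency = 16
# max number of active DAG runs per DAG
max_active_runs_per_dag = 16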

Examples

Let’s try to understand these parameters with a few different combinations.

Case 1

parallelism=2

dag_concurrency=2

max_active_runs_per_dag=1

In this case at most 2 tasks can run concurrently in total, and at most 2 task instances can be scheduled at once per DAG.

Case 2

parallelism=4

dag_concurrency=2

max_active_runs_per_dag=1

In this case at most 4 tasks can run concurrently in total, and at most 2 task instances can be scheduled at once per DAG.

Case 3

parallelism=2

dag_concurrency=4

max_active_runs_per_dag=1

In this case at most 2 task instances can run concurrently in total, and at most 4 task instances can be scheduled at once per DAG; but here, because parallelism is set to 2, task2 and task5 cannot run concurrently.
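To make Case 3 concrete, here is a minimal sketch of such a DAG with five independent tasks; the DAG id and task names are made up here, since the original diagram is not reproduced:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="scaling_demo",            # hypothetical DAG id
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # five independent tasks; with dag_concurrency=4 the scheduler can queue up
    # to four of them at once, but parallelism=2 means only two can actually be
    # running at any moment, so e.g. task2 and task5 will not run concurrently
    tasks = [DummyOperator(task_id=f"task{i}") for i in range(1, 6)]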

DAG level settings

Airflow also lets you scale individual DAGs independently of the entire Airflow environment. Some DAGs may need to run at a higher scale than others, and in such cases you can set scaling parameters at the DAG level. There are only 2 settings at the DAG level.

  1. max_active_runs: determines the maximum number of active DAG runs allowed per DAG. If you are backfilling (catchup=True), which requires a large number of DAG runs, you can limit this value to prevent memory issues.

In the example below I have assigned max_active_runs=3, so when I run this DAG it will start 3 DAG runs together.
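The original snippet is not reproduced here, but a minimal sketch of such a DAG (with an assumed DAG id and a single dummy task) could look like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="backfill_demo",           # hypothetical DAG id
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=True,                     # backfill all runs since start_date
    max_active_runs=3,                # at most 3 DAG runs active at the same time
) as dag:
    task1 = DummyOperator(task_id="task1")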

Here is the output of the tree view in the Airflow UI:

Adding 3 DAG runs together

2. concurrency: determines the maximum number of task instances allowed to run concurrently across all active runs of that DAG. If this value is not defined, the dag_concurrency value from the airflow.cfg file is used.

The code below allows a maximum of 5 concurrent tasks across 3 active runs.
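The original code is not shown here either, so this is just a sketch along those lines (DAG id and task ids assumed):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="concurrency_demo",        # hypothetical DAG id
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=True,
    max_active_runs=3,                # up to 3 DAG runs may be active
    concurrency=5,                    # at most 5 task instances running across those runs
) as dag:
    tasks = [DummyOperator(task_id=f"task{i}") for i in range(1, 6)]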

Task level settings

There are primarily 2 settings

  1. pool: This is useful when you have lots of workers and tasks hitting the same resource or API, which tends to get overwhelmed. In that case Airflow pools come to the rescue and protect the resource from failure. Pools can be used to limit the execution parallelism on arbitrary sets of tasks.
  2. task_concurrency: limits the number of times the same task can run concurrently across multiple DAG runs.

In the code below, task1 can run at most 10 times concurrently across multiple DAG runs, using the specified pool.
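Here is a minimal sketch of that setup. The pool name api_pool and the DAG id are assumptions, and the pool itself has to exist already (created under Admin -> Pools in the UI, or with the CLI command: airflow pools set api_pool 5 "shared API slots"):

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy import DummyOperator

with DAG(
    dag_id="task_level_demo",         # hypothetical DAG id
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
) as dag:
    task1 = DummyOperator(
        task_id="task1",
        pool="api_pool",              # assumed pool name; must already exist
        task_concurrency=10,          # task1 may run at most 10 times concurrently across DAG runs
    )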

Conclusion

In this post we discussed several ways to scale out DAG runs and task instances in Airflow.

Do you have any questions? If yes, please reach out to me on LinkedIn (Chetan Hirapara). Happy to chat!

Any feedback would be much appreciated and if you liked what you’ve read please hold the clap button!
