Everything you need to know about Apache Airflow
As a Data Engineer, managing and scheduling data pipelines can be a nightmare. The amount of data we produce every day is truly mind-boggling: roughly 2.5 quintillion bytes of data are created each day at our current pace, and that number keeps growing. To manage this flood of data we need autonomous data pipelines. In this post, I am going to discuss Apache Airflow, a workflow management system originally developed by Airbnb.
What is Apache Airflow?
As per the official site:
Airflow is a platform to programmatically author, schedule and monitor workflows.
Use Airflow to author workflows as Directed Acyclic Graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
When workflows are defined as code, they become more maintainable, version-able, testable, and collaborative.
Apache Airflow is an integral part of the modern data pipeline nowadays. Data Engineers and Data Scientists use it to programmatically orchestrate and schedule data pipelines, and to configure retries and alerts for when a task fails. A single task can use a wide range of operators (bash script, PostgreSQL function, Python function, SSH, email, etc.) or even a Sensor, which waits (polls) for a certain time, a file, a database row, an S3 key, and so on.
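Here is a minimal sketch of what such a DAG can look like in Airflow 2.x. The task logic, schedule, and alert address are placeholders, not a prescription:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def transform(**context):
    # Placeholder: replace with your own transformation logic.
    print("transforming data...")


default_args = {
    "owner": "data-eng",
    "retries": 2,                          # retry a failed task twice
    "retry_delay": timedelta(minutes=5),   # wait 5 minutes between retries
    "email": ["alerts@example.com"],       # placeholder address
    "email_on_failure": True,              # alert when a task finally fails
}

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'pulling raw data'",
    )
    load = PythonOperator(
        task_id="transform_and_load",
        python_callable=transform,
    )
    extract >> load  # run extract before transform_and_load
```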
Now, one of the most important questions that comes to mind is: with so many similar options available for scheduling jobs, which one should I use?
Which scheduler should I use to run a Spark job?
Here is a list of options you can choose from, based on your underlying infrastructure:
- If Spark runs stand-alone and jobs are launched directly via the shell, use a cron job
- If Spark runs on a cluster managed via Mesos, use Chronos
- If Spark runs on an existing Hadoop cluster and your jobs form DAGs, use the Oozie Spark Action Extension
- If Spark runs on a cluster with complex DAGs and the workflow manager itself needs to scale, go for Apache Airflow (see the sketch after this list)
- Other workflow managers like Azkaban and Luigi can also be used.
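For the Airflow route, a hedged sketch using the SparkSubmitOperator from the apache-airflow-providers-apache-spark package might look like the following; the application path, connection id, and Spark settings are placeholder assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_batch_job",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # "spark_default" must be configured as a connection in the Airflow UI/CLI;
    # the application path below is a placeholder.
    submit_job = SparkSubmitOperator(
        task_id="submit_spark_job",
        conn_id="spark_default",
        application="/opt/jobs/aggregate_events.py",
        conf={"spark.executor.memory": "4g"},
    )
```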
Now, if you are motivated enough to go for Airflow, let's first understand Airflow's use cases.
What are the use cases of Airflow?
There are lots of use cases where tasks should be executed in a certain order, once or periodically, without human intervention and at scale:
- Collecting sensor data and moving it to a Data Lake/Data Warehouse
- Moving log data from devices to a Data Lake/Data Warehouse
- Machine Learning pipelines
- Data processing for weather forecasting
- Periodically syncing different databases
- Continuously collecting data from wearable devices and transferring it to a Data Lake/Data Warehouse
- Transferring data after evaluating, at a time interval, whether a criterion/condition is met (a file was received: FileSensor; an API is up: HttpSensor), as in the sketch below
- Feeding data to a Data Studio dashboard with Google BigQuery
Possibilities are endless.
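As an illustration of the sensor-driven cases above, here is a minimal sketch built around FileSensor; the file path, connection id, and timings are placeholder assumptions:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="wait_then_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    # Poke every 30 seconds until the file lands; the path is a placeholder
    # and "fs_default" must point at the right filesystem connection.
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        fs_conn_id="fs_default",
        filepath="/data/incoming/events.csv",
        poke_interval=30,
        timeout=60 * 60,   # give up after an hour
    )
    load_file = BashOperator(
        task_id="load_file",
        bash_command="echo 'loading /data/incoming/events.csv'",
    )
    wait_for_file >> load_file
```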
If you still doubt why you should use Airflow instead of a traditional scheduler, here is an answer.
Why is Airflow better than a traditional scheduler?
- Open source: Airflow is an open-source workflow management platform, initially started by Airbnb in October 2014. You can install it (locally or with Docker) and start developing your DAGs instantly.
- Monitoring: Airflow provides lots of ways to monitor your workflows. You can see the status of your tasks, timelines, task durations, task tries, etc. from the UI. It can also send an email if a task fails or runs longer than expected.
- Lineage: In Airflow you can track the origins of data, data movements, and so on. This is especially useful when multiple tasks read from and write to the same storage.
- Sensors: These are operators that trigger a task once certain criteria or conditions are met. For example, a FileSensor waits for a file to land; through a mechanism called poking it checks at a given interval (poke_interval=30) whether the file exists.
- Extensibility: Airflow ships with integrations for commonly used big data tools like Hive, HDFS, Postgres, S3, Presto, etc., and these base modules are designed to be extended very easily. Users can create custom Operators, Hooks, Executors, and UI plugins (see the sketch after this list).
- Shiny UI: Airflow comes with a beautiful user interface. Users can easily visualize pipeline dependencies and progress, view logs and DAG code, trigger and re-run tasks, manually change task states, and monitor timings for each task/DAG. In addition, users can run SQL queries against registered connections, view the results, and create simple charts on top of them.
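To illustrate the extensibility point, here is a minimal sketch of a custom operator. The GreetOperator class and its behaviour are hypothetical, but the pattern (subclass BaseOperator and implement execute) is the standard one:

```python
from airflow.models.baseoperator import BaseOperator


class GreetOperator(BaseOperator):
    """Hypothetical operator that just logs a greeting."""

    def __init__(self, name: str, **kwargs):
        super().__init__(**kwargs)
        self.name = name

    def execute(self, context):
        # Whatever execute() returns is pushed to XCom for downstream tasks.
        message = f"Hello, {self.name}!"
        self.log.info(message)
        return message


# Inside a DAG definition you would use it like any built-in operator:
# greet = GreetOperator(task_id="greet", name="Airflow")
```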
Before moving further, you should also understand the limitations of Airflow.
What Airflow is not?
- Airflow is not a data processing tool like Spark or Hadoop. If you are planning to do data processing, create separate jobs and schedule them with Airflow.
- It is not a data streaming solution.
- If you need to share a lot of data between tasks, look elsewhere. Data sharing between tasks goes through the metadata database (SQLite, MySQL, Postgres, etc.), which limits it to small payloads (see the sketch after this list).
- Airflow is tied to Python; workflows are defined in Python only.
- Airflow is not an ETL framework.
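A common way to live with the data-sharing limitation is to keep the heavy data in external storage and pass only a small reference (a path or key) between tasks. Here is a hedged TaskFlow-style sketch; the bucket path and task names are made up:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(start_date=datetime(2023, 1, 1), schedule_interval="@daily", catchup=False)
def reference_passing():
    @task
    def write_partition() -> str:
        # Write the heavy data to external storage (S3, HDFS, a table, ...)
        # and return only a small reference; the reference goes through XCom.
        return "s3://my-bucket/events/2023-01-01/part-0.parquet"

    @task
    def process_partition(path: str):
        # Downstream task receives the reference, not the data itself.
        print(f"processing {path}")

    process_partition(write_partition())


reference_passing()
```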
Conclusion
In this blog we have discussed Apache Airflow as a workflow orchestration solution for Data Engineers and Data Scientists. I have tried to cover what Airflow is, when to use it, and its pros and cons.
From this analysis, we can infer that Airflow is a great choice when we want to orchestrate work executed on external systems such as Apache Spark, Hadoop, Druid, cloud services, or external servers (e.g. distributed with Celery queues), or when submitting SQL to high-performance distributed databases such as Snowflake, Exasol or Redshift.
Stay tuned for more on Apache Airflow.
Thank you for reading!
Do you have any questions? If so, please reach out to me on LinkedIn (Chetan Hirapara). Happy to chat!
Any feedback would be much appreciated and if you liked what you’ve read please hold the clap button!