What is Apache Airflow?
Apache Airflow is a platform that will help you programmatically to design, schedule, and monitor big data pipelines, with a rich number of tasks you can execute and link together you can almost design any pipeline you have no matter how it is complicated
In this article, we discover what are the main components of Apache Airflow and how it is working, the incoming articles we will go into more technical details about how to set up your environment and execute your workflows.
What Special about Apache Airflow?
Airflow programmatically through Python applications allows you to defined your workflow dynamically which makes your pipeline so dynamic and gives you the power to adjust your pipeline based on certain conditions or use cases.
Customizable, beside a rich set of operator types (Task Types) available, you can define your own operators (Tasks) and customize them according to your needs.
A scalable platform, Apache Airflow is modular in its nature, so it allows you to scale as you need to meet your performance and workload demands.
Apache Airflow Components
A workflow of tasks in Airflow called DAG (Direct Acyclic Graphs) DAG is a series of tasks connected together that can be executed in sequence or in parallel based on your design for the pipeline. DAG defines the relations and dependencies between tasks, in the below example, Task #2 is depending on Task #1, and for Task #3, and Task #4 no dependency so it can execute in parallel.
DAG flow starts with one or more tasks and ends with one or more tasks too, but loops are not allowed, so, for example, the below DAG is not valid, since we have a loop between Tasks 2,3,4.
The airflow operator represents the actual task and what will be executed in this task, Airflow has many available operators for different tasks and also Airflow community has provided many other operators to cover certain needs, example of operators will be:
- PythonOpretor –> execute Python code
- BashOperator –> execute bash commands
- MySQLOperator –> execute MySQL query
- HiveOperator –> execute Hive Query
- SparkSubmitOperator –> submit Spark Application
For example, in this DAG we will execute bash then execute Hive query and after that, we will execute in parallel MySQL query and send an email to receipt to inform that pipeline is executed successfully.
Sensors are a special type of operators that waits for a certain event to happen to trigger the DAG, examples of sensors will be:
- FileSystem Sensor –> Waits for a file or folder to land in a filesystem.
- HttpSesnor –> Executes an HTTP GET statement and returns False on failure caused by 404 Not Found or response_check returning False.
- HDFS Sensor –> Waits for a file or folder to land in HDFS
In some cases, you have a repeated pattern of tasks, that you use frequently in most of your pipelines, in this case, Task Groups will become handy, as Task groups group tasks visually and represent as a single task
Apache Airflow Architecture
Now let’s discuss the high-level architecture of Apache Airflow and how DAG is executing
This is the UI of Airflow, which can be used to get an overview of the overall health of different Directed Acyclic Graphs (DAG) and also help in visualizing different components and states of each DAG. The Web Server also provides the ability to manage users, roles, and different configurations for the Airflow setup.
Orchestrates various DAGs and their tasks, taking care of their interdependencies, limiting the number of runs of each DAG so that one DAG doesn’t overwhelm the entire system, and makes it easy for users to schedule and run DAGs on Airflow. Scheduler in general manages everything related to the DAG execution, plus other tasks such as SLA, Pool Management, ..etc.
The execution arm of the Apache Airflow which actually executes the Operators (Tasks) in your DAG, we have different kind of executors, such as:
SequentialExecutor: executes the tasks one by one at a time in a sequential way, most suitable for debugging and testing the DAG, for example, in the following DAG the tasks will be executed in order as shown, Task #1, then Task #2, and so on.
LocalExecutor: Launches processes to execute different tasks in parallel, this engine is suitable for executing small workloads in parallel, in this executor all of your tasks running on one machine, and each task executed on a separate Operating System process, SequentialExecutors are LocalExecutors with limited parallelism to 1, but this simplified approach has a disadvantage that if your machine goes down, all your DAG will go down, however you can execute your tasks in parallel which is an advantage in this type of executors.
CeleryExecutor: is based on python celery which is widely used to process asynchronous tasks. Celery is an asynchronous task queue/job queue based on distributed message passing. For CeleryExecutor, one needs to set up a queue (Redis, RabbitMQ, or any other task broker supported by Celery) on which all the celery workers running keep on polling for any new tasks to run. During execution Scheduler will submit a task to the queue and each woker node will pull a task from the queue and execute it and report task status back.
KubernetesExecutor: provides a way to run Airflow tasks on Kubernetes, Kubernetes launch a new pod for each task. While Kubernetes takes care of the pod lifecycle (as Celery took care of task processing) and the Scheduler keeps on polling for task status from Kubernetes. With KubernetesExecutor, each task runs in a new pod within the Kubernetes cluster.
This database stores metadata about DAGs, their runs, and other Airflow configurations like users, roles, and connections. The Web Server shows the DAGs’ states and their runs from the database. The Scheduler also updates this information in this metadata database.
So what Metadata means, it means data about data, in another words in our case here, Metadata database will have information about:
- DAG Information such as last status, description, configurations, ..etc.
- User Roles
- User Permissions
- Connections information used in our tasks for example, for database connection, we will have Host, Port, and User name, and so on.
Web server and scheduler reads the information from Metadata database to present on UI or to trigger, and execute the jobs as configured.
This was an introduction to Apache Airflow and its main components and why its a great platform with great advantages for data engineering pipeline designing and monitoring, in the next articles we will go into more technical details and how to install and setup your first DAG in Apache Airflow.