As more companies migrate to the cloud, they increasingly rely on cloud-native and open-source technologies. One of the core technologies needed in any solution is scheduling and orchestration, for which the leading open-source tool is Apache Airflow. At Maven Wave, we have implemented Airflow for many clients across several different types of implementations. This document lays out key techniques that help your Airflow implementation scale with your business needs and that simplify migration to another type of Airflow implementation if that ever becomes necessary.
Types of Airflow Implementations
As an open-source tool, Airflow code is readily available and can be manually installed and run on almost any server. While this is a straightforward way of initially utilizing it, there are many moving parts to consider in a hardened implementation including the scheduler, workers, backend database, web UI, and security. For this reason, there are multiple companies that have created push-button implementations. These include both cloud-specific managed services, such as Google Cloud Platform (GCP) Composer, and cloud-agnostic Airflow platforms, such as Astronomer. All of these build off of the base Apache Airflow but include extra functionality and ease of use.
Utilizing these push-button implementations is a great initial step in future-proofing your Airflow, since they all simplify upgrades and environment management. In the case of Astronomer, the engineering team manages the core Airflow project releases and builds their own distribution for enterprise customers to ensure responsive SLAs and expert support.
While it is tempting to take advantage of some of the unique aspects of these implementations, doing so can tie you to the provider’s service network. For example, GCP Composer automatically runs tasks on GCP as the service account that Composer was created with. If you start up Composer without specifying one, this is the Compute Engine default service account. If you migrate to a Composer alternative, you would need to create a new service account, set it up as a connection, modify default properties in Airflow, and possibly modify code as well. Similarly, on GCP Composer the local path to the DAG files is /home/airflow/gcs/dags/. We can use this path to point to files in our code, but if we migrate to another implementation we would need to repoint everything. In general, it is better to use explicit connections and relative paths that apply across any implementation, so you can migrate with few or no code changes.
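One way to avoid hardcoding a provider-specific location such as /home/airflow/gcs/dags/ is to resolve supporting files relative to the DAG file itself. The following is a minimal sketch using only the Python standard library. The helper name path_next_to_dag, the file names, and the second folder location are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def path_next_to_dag(dag_file: str, *parts: str) -> Path:
    """Return a path located relative to the folder that holds the DAG file."""
    return Path(dag_file).parent.joinpath(*parts)


# Inside a real DAG file you would pass __file__ as the first argument, so the
# same code works wherever the DAGs folder happens to be mounted.
composer_style = path_next_to_dag(
    "/home/airflow/gcs/dags/orders_dag.py", "sql", "load_orders.sql"
)
other_platform = path_next_to_dag(
    "/usr/local/airflow/dags/orders_dag.py", "sql", "load_orders.sql"
)
```

Because only the DAG file's own location is used as the starting point, moving the DAGs folder to a different platform should not require touching the code that locates the supporting files.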
Best Practices for Airflow Implementations
Once you have determined the implementation that you are going to use for Airflow, the next step is designing your Airflow DAGs. To do this well, it is important to understand how Airflow works and how that can drive our architecture. Within Airflow there are two main processes that we need to understand: parsing of the DAG structure and running of the DAG tasks. As we think about a future-proof architecture, we need to plan for how increasing the number of DAGs, and the number of tasks in each DAG, impacts the Airflow service. In addition, we need to understand which functionality is common across all implementations of Airflow and which is unique to a specific type of implementation. The best practices below take all of these considerations into account. Following them will enable an implementation that can grow with your company and, should it ever be necessary, migrate with it.
Minimize processing in parsing DAG structure
With Airflow being written in Python, it can be tempting to build dynamic and elaborate configuration-driven code to create DAG structures. While this is possible and may reduce development time in the future, it creates extra workload for Airflow, which re-parses DAG files continuously and must evaluate this code each time. In addition, any queries or variable lookups in this part of the code are evaluated every time the code is parsed. In the case of queries, this can easily reach thousands of calls per day per DAG. By minimizing this processing, we can ensure that Airflow will be able to support large numbers of DAGs with many tasks in them.
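To make the "thousands of calls per day" figure concrete, here is a small simulation in plain Python rather than real Airflow code. It assumes a DAG file is re-parsed every 30 seconds, which is an assumed interval since the actual value depends on your scheduler settings. The function expensive_lookup is a hypothetical stand-in for a query or variable lookup.

```python
SECONDS_PER_DAY = 24 * 60 * 60
ASSUMED_PARSE_INTERVAL_SECONDS = 30  # assumption; depends on scheduler settings

lookup_calls = {"count": 0}


def expensive_lookup() -> str:
    """Hypothetical stand-in for a database query or Airflow Variable lookup."""
    lookup_calls["count"] += 1
    return "orders"


def parse_dag_eager() -> dict:
    """Anti-pattern: the lookup runs every time the DAG file is parsed."""
    table = expensive_lookup()
    return {"task_id": "load", "run": lambda: f"loading {table}"}


def parse_dag_deferred() -> dict:
    """Preferred: parsing only wires up the task; the lookup runs at execution."""
    return {"task_id": "load", "run": lambda: f"loading {expensive_lookup()}"}


parses_per_day = SECONDS_PER_DAY // ASSUMED_PARSE_INTERVAL_SECONDS

# Simulate one day of parsing with the lookup at the top level of the file.
for _ in range(parses_per_day):
    parse_dag_eager()
eager_calls = lookup_calls["count"]

# Simulate the same day with the lookup deferred, plus one actual task run.
lookup_calls["count"] = 0
for _ in range(parses_per_day):
    task = parse_dag_deferred()
task["run"]()
deferred_calls = lookup_calls["count"]
```

Under these assumptions the eager version performs one lookup per parse, while the deferred version performs one lookup per task run. In a real DAG, the same idea usually means moving a Variable.get(...) call from the top of the file into the task callable, or referencing the variable through a Jinja template such as {{ var.value.my_key }} (my_key being a placeholder) so that it is resolved only when the task runs.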
Push data processing outside of Airflow workers
Airflow has the ability to run Python programs locally and manipulate data on the worker nodes. With Celery workers, this puts extra pressure on the worker nodes and can cause them to fail if they process large volumes of data and run out of memory. This in turn would cause other tasks running on those workers to be lost. In addition, if we moved our Airflow instance to another provider or cloud, our data would need to be transferred there to be processed.
A better solution is to create some form of ephemeral compute and execute our tasks on it. Some GCP examples of ephemeral compute are GKE clusters and Dataproc or Dataflow workers that are brought up on demand by an Airflow process. With this approach, the Airflow worker nodes only kick off the tasks and monitor their status; they do not process the data themselves. As a result, we can run many more tasks and keep Airflow more stable, even on smaller Airflow implementations.
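The "kick off and monitor" role of the worker can be sketched in plain Python as a submit-and-poll loop. This is an illustration of the pattern only, not the API of any particular service: the function run_remote_job, the state names, and the in-memory stand-in for a remote service are all hypothetical.

```python
import time
from typing import Callable

TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def run_remote_job(
    submit: Callable[[], str],
    get_state: Callable[[str], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Start a job on external compute and wait for it to finish.

    The worker only holds a job ID and a status string; the data itself is
    processed wherever the job runs.
    """
    job_id = submit()
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = get_state(job_id)
        if state in TERMINAL_STATES:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {state} after {timeout_seconds}s")
        time.sleep(poll_seconds)
    if state == "FAILED":
        raise RuntimeError(f"job {job_id} failed")
    return job_id


# Hypothetical in-memory stand-in for a remote service, for demonstration only.
_states = iter(["PENDING", "RUNNING", "SUCCEEDED"])
finished_job = run_remote_job(
    submit=lambda: "job-001",
    get_state=lambda job_id: next(_states),
    poll_seconds=0.0,
)
```

In practice, Airflow's provider operators for services such as Dataproc and Dataflow typically handle this submit-and-wait cycle for you, so a hand-written loop like this is rarely needed. The point of the sketch is that the memory footprint on the worker stays small regardless of how much data the remote job processes.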
Utilize Airflow connections and explicitly define them
When utilizing Airflow to run tasks on IaaS or external managed services, those tasks are run using a connection. Some implementations of Airflow have a default account that can be used for this. For example, GCP Composer utilizes the service account that Composer was initially set up with, which in most cases is the Compute Engine default service account for that project. Without defining a connection, we can run anything on GCP that this service account has access to. This is convenient, but it makes the implementation harder to migrate and can hide which privileges are actually being used by which connection. It is better to explicitly define the connection within Airflow and point each task at a specific connection. This will enable us to easily migrate the implementation and will give easier insight into which service account is being utilized to execute which task.
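One lightweight way to keep connections explicit is to record, in one place, which connection each task uses and to refuse to build a task that has no entry. The sketch below is plain Python and assumes operators that accept a gcp_conn_id argument, as the operators in Airflow's Google provider package generally do. The connection IDs, task IDs, and the helper operator_kwargs are hypothetical.

```python
# Hypothetical connection IDs; each would be created in Airflow beforehand and
# backed by its own narrowly scoped service account.
TASK_CONNECTIONS = {
    "load_orders": "gcp_etl_loader",
    "export_report": "gcp_reporting",
}


def operator_kwargs(task_id: str) -> dict:
    """Build operator arguments that always name a connection explicitly."""
    if task_id not in TASK_CONNECTIONS:
        raise KeyError(f"no explicit connection configured for task '{task_id}'")
    return {"task_id": task_id, "gcp_conn_id": TASK_CONNECTIONS[task_id]}


# These keyword arguments would be unpacked into an operator constructor.
load_kwargs = operator_kwargs("load_orders")
```

With a mapping like this, the question of which service account executes which task can be answered by reading a single dictionary, and a migration only requires recreating the named connections on the new platform.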
Define logic in supporting scripts and not in the core DAG
When we first start creating DAGs, it is tempting to define the logic for each task in the DAG itself. This allows for less code, and when looking at the code through Airflow you can see all the task logic in the main DAG. As we build larger and more numerous DAGs, this becomes cumbersome and difficult to manage across many developers, and it creates large DAG files that need to be parsed. Instead of including task logic in the DAG itself, it is better to spread it across different files, for example by storing SQL scripts separately and referencing them from the DAG. This allows for simpler management in Git and for more tasks to be in a DAG without it becoming overly complicated to read. Additionally, Airflow provides a rich library of built-in operators. It is advised to use these built-in operators over custom code where possible.
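As a minimal sketch of keeping SQL out of the DAG file, the following plain-Python example has the DAG side name a script while the script's text is read only when the task runs. The helper make_sql_task, the file name, and the query are hypothetical, and the temporary folder merely plays the role of a sql/ directory versioned in Git next to the DAG.

```python
import tempfile
from pathlib import Path
from typing import Callable


def make_sql_task(sql_dir: Path, script_name: str) -> Callable[[], str]:
    """Return a task callable that reads its SQL only when the task runs."""

    def run() -> str:
        # A real task would hand this text to a database client or operator.
        return (sql_dir / script_name).read_text(encoding="utf-8")

    return run


# Demonstration only: create a throwaway folder holding one SQL script.
with tempfile.TemporaryDirectory() as tmp:
    sql_dir = Path(tmp)
    (sql_dir / "daily_orders.sql").write_text(
        "SELECT order_id, total FROM orders WHERE order_date = CURRENT_DATE;\n",
        encoding="utf-8",
    )
    # The DAG side only names the script; the logic lives in the .sql file.
    daily_orders_task = make_sql_task(sql_dir, "daily_orders.sql")
    rendered_sql = daily_orders_task()
```

Many of Airflow's SQL operators also accept a path ending in .sql for their sql argument and load that file as a template, which achieves the same separation without a custom helper.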
These steps will help you not only get started with Airflow, but also get started in a way that allows your Airflow implementation to grow and move with your company. If you have any questions or would like help setting up your Airflow implementation, contact us to connect with our experts.