Airflow Event-Based Triggers with Stonebranch
How to enable Apache Airflow to use event-based triggers for real-time data pipelines via the API.
Apache Airflow is a widely used open-source workflow management platform for creating data pipelines. At its core, Airflow helps data engineering teams orchestrate automated processes across a myriad of data tools.
End-users create what Apache calls Directed Acyclic Graphs (DAGs): visual representations of sequential automated tasks, which are then triggered by Airflow’s scheduler.
While there are many benefits to using Airflow, there are also some important gaps that large enterprises typically need to fill. This article will explore the gaps and how to fill them with the Stonebranch Universal Automation Center (UAC).
Event-Based Triggers with Airflow
Since its inception, Airflow has been designed to run time-based, or batch, workflows. However, enterprises recognize the need for real-time information. To achieve a real-time data pipeline, enterprises typically turn to event-based triggers. Starting with Airflow 2, there are a few reliable ways that data teams can add event-based triggers. But each method has limitations. Below are the primary methods to create event-based triggers in Airflow:
- TriggerDagRunOperator: Used when a system-event trigger comes from another DAG within the same Airflow environment.
- Sensors: Used when you want to trigger a workflow from an application outside of Airflow and you're reasonably sure of when the automation needs to happen. A practical example is when you need to process data only after it arrives in an AWS S3 bucket.
- Deferrable Operators: Used when a sensor would be the right fit but the timing of the system event is unknown. Deferrable operators free up the worker slot while they wait, so you don’t have to leave a long-running sensor up all day, or forever, which would increase compute costs.
- Airflow API: Used when the trigger event is truly random. In other words, it’s the most reliable and low-cost method of monitoring system events in third-party applications outside of Airflow. It’s worth noting that in Airflow 2, the REST API is fully supported; in the original Airflow, it was considered experimental.
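To make the last method concrete, here is a minimal sketch of triggering a DAG run through Airflow 2’s stable REST API, using only the Python standard library. The DAG name `process_upload`, the local URL, and the basic-auth credentials are all illustrative placeholders; your deployment may use a different auth backend.

```python
import base64
import json
import urllib.request


def build_trigger_request(base_url: str, dag_id: str, conf: dict,
                          user: str, password: str) -> urllib.request.Request:
    """Build a POST against Airflow 2's stable REST API to start a DAG run."""
    url = f"{base_url}/api/v1/dags/{dag_id}/dagRuns"
    payload = json.dumps({"conf": conf}).encode()
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )

# To actually send it (requires a reachable Airflow webserver):
# req = build_trigger_request("http://localhost:8080", "process_upload",
#                             {"source": "web_form"}, "admin", "admin")
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Any system that can make an HTTP call, be it a Lambda function, a webhook relay, or an external scheduler, can use this endpoint to start a DAG the moment an event occurs.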
A few examples of what you might automate using sensors, deferrable operators, or Airflow’s API include:
- Trigger a DAG when someone fills in a website form
- Trigger a DAG when a data file is dropped into a cloud bucket
- Trigger a DAG when a Kafka or AWS SQS event is received
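The second example above, waiting for a file to land, is the classic sensor use case. As a conceptual illustration of what a sensor does under the hood (this is a plain-Python sketch, not Airflow’s actual FileSensor implementation), a sensor simply "pokes" on an interval until a condition is met or a timeout expires:

```python
import os
import time


def wait_for_file(path: str, poke_interval: float = 5.0,
                  timeout: float = 3600.0) -> bool:
    """Conceptual sensor: poll until a file appears.

    Returns True once the file exists, or False if the timeout
    elapses first. Airflow sensors follow the same poke/timeout
    pattern, occupying a worker slot the whole time they wait.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poke_interval)
    return False
```

Holding a worker slot through every poke is exactly the cost that deferrable operators were designed to avoid: they suspend the task and hand the waiting off to an async trigger process instead.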
Limitations to Event-Based Automation in Airflow
Triggering a DAG based on a system event from a third-party tool remains complex. Each of the above-described methods typically requires a third-party scheduler to send the trigger.
For example, if you’re a developer who wants to trigger a DAG when a file is dropped into an AWS S3 bucket, you may opt to use AWS Lambda to schedule the trigger. In a one-off scenario, this approach will work.
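A sketch of what that Lambda might look like, assuming the S3 bucket is configured to send "object created" notifications to the function. The Airflow URL and DAG name are placeholders, and a real deployment would also attach an auth header:

```python
import json
import urllib.request


def extract_s3_objects(event: dict) -> list:
    """Pull (bucket, key) pairs out of an S3 event notification payload."""
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]


def lambda_handler(event, context):
    """Forward each new S3 object to Airflow's REST API as a DAG run."""
    for bucket, key in extract_s3_objects(event):
        req = urllib.request.Request(
            "https://airflow.example.com/api/v1/dags/process_upload/dagRuns",
            data=json.dumps({"conf": {"bucket": bucket, "key": key}}).encode(),
            method="POST",
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)  # add authentication in a real deployment
    return {"processed": len(event.get("Records", []))}
```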
But what happens when you’re not exclusively using AWS for your data pipeline? Often, you wind up needing a different job scheduler for each data tool used along your pipeline.
For example, let’s say your pipeline runs across AWS, Azure, Informatica, Snowflake, Databricks, and PowerBI. Each of these tools would need its own associated job scheduler. As you can imagine, managing a handful of schedulers in addition to the Airflow scheduler quickly becomes complex.
A few other complexities include:
- If a sensor or deferrable operator does not yet exist, you’ll have to write one from scratch.
- If you use the Airflow API, you have a handful of API configurations to complete and security to worry about.
- Constantly checking for system events requires polling via the API. Most cloud providers charge for API calls. While it’s a minuscule amount per call, it really starts to add up over time.
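The polling volume is easy to underestimate. A small back-of-the-envelope calculation (the poll intervals are illustrative):

```python
def monthly_poll_calls(interval_seconds: int, days: int = 30) -> int:
    """API calls made by one poller checking on a fixed interval."""
    return (24 * 60 * 60 // interval_seconds) * days

# One endpoint polled every 30 seconds makes 86,400 calls a month;
# every 10 seconds, 259,200. Multiply by the number of monitored
# endpoints and environments, and per-call charges add up quickly.
calls_30s = monthly_poll_calls(30)
calls_10s = monthly_poll_calls(10)
```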
All of the above factors add extra lift and create multiple points of failure, with little visibility into what went wrong when something breaks down.
Generally, larger organizations with complex data pipelines will opt for the Airflow API. But that brings us to an important question that we’ll cover below.
How Do I Connect to and Trigger Events with the Airflow API?
The short answer: use an external orchestrator that handles the API plumbing for you. With roots in the workload automation world, the Stonebranch Universal Automation Center (UAC) is what Gartner refers to as a DataOps tool. More specifically, UAC is used for DataOps orchestration. Much like Airflow, UAC includes a scheduler and workflow engine. However, unlike Airflow, UAC is designed with enterprise-grade capabilities that enable building and operating data pipelines as mission-critical systems.
Because UAC is vendor agnostic, it’s designed to work across all the different tools used along the data pipeline. Thus, you are able to avoid vendor lock-in, and orchestrate your entire data pipeline from a single platform.
As it relates to Airflow, enterprises use UAC for one of two scenarios:
- The data teams want to continue using Airflow but need to level up the features and overall management of their data pipelines. UAC has a pre-made integration with Airflow. This integration uses the Airflow API. From the UAC, enterprises can trigger DAGs based on system events. Because UAC is an enterprise-grade solution, there is a library of integrations for data and cloud-focused applications. Essentially, any application you connect to the UAC can be used to trigger a DAG based on a system event. Click here to explore the UAC Apache Airflow integration.
- The data team wants to replace Airflow with an enterprise-grade platform. UAC can also serve as a replacement for Airflow. Oftentimes, data teams will turn to the UAC for enterprise features that include additional observability, security and governance, self-service and UX improvements, 24/7 support, and the need to operationalize the management of data pipelines by their IT Ops team.
In short, you can either integrate Airflow with the UAC or replace it completely.
Sometimes we’ll see our customers take a hybrid approach: they use the UAC to improve how they use Airflow, while building new pipelines directly in the UAC and gradually re-creating their existing Airflow workflows there over time.
Additional Benefits of Leveraging UAC with or without Airflow
Multiple Ways to Create Workflows
Tasks and workflows can be easily created via a visual workflow designer that involves low- or no-code drag-and-drop capabilities.
Optionally, developers can also write Python scripts (similar to Airflow). Plus, those who like to code in IDEs, like Visual Studio Code, may write workflows and leverage Git repositories (or any version-control repository) to apply DevOps (or DataOps) lifecycle processes.
Starting in V7.2, UAC includes Universal Monitors and Universal Triggers out of the box. This important feature allows you to add an always-on monitor to a third-party tool that sends an automatic trigger when a system event takes place. The big benefit here is that you can remove the need to poll an API all day long, thus removing the associated costs.
Additionally, you may trigger workflows with external inbound webhooks. Any application that supports outbound webhooks can generate an event in the UAC. Webhook support makes event-driven automation even easier and more efficient.
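Inbound webhooks are typically authenticated with an HMAC signature over the request body. The exact header name and scheme vary by product; this sketch shows the generic pattern, not UAC’s specific format:

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature a sender attaches to a webhook."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time check that an inbound webhook body matches its signature."""
    return hmac.compare_digest(sign(secret, body), signature)
```

The receiver recomputes the signature from the shared secret and the raw body, and rejects any request where the two values differ, so a forged or tampered webhook can’t trigger a workflow.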
Collaboration and Lifecycle Management
UAC is built for collaboration between data teams, IT Ops, DevOps, and business user teams. Plus, embedded DataOps lifecycle management capabilities enable teams to simulate workflows and promote pipelines between dev, test, and production environments.
Reporting and Forecasting
Forecasts can be viewed from the Universal Controller GUI in two ways:
- A list of future tasks and workflows to be executed
- A calendar showing the tasks and workflows to be executed each day
Additionally, an extensive report generator allows the creation of specific user-based forecast reports that can be shared or displayed via dashboards.
UAC provides extensive monitoring capabilities for the jobs and workflows triggered for execution.
Proactive alerts can be sent to common enterprise tools like Slack, MS Teams, PagerDuty, ServiceNow, JIRA, Zendesk, and WhatsApp, as well as email, SMS, and voice channels.
Plus, UAC can broadcast and execute the same job across different systems at the same time using agent clusters.
Built-in Functionality and Capabilities
- Inbuilt managed file transfer and multi-cloud data transfer capabilities.
- Out-of-the-box version management for tasks, workflows, calendars, scripts, variables, etc.
- Ability to set up high availability (HA) using the Active/Passive cluster node functionality.
- Scalability to run millions of automated tasks daily.
- Connectivity to Microsoft Active Directory SSO tools for user authentication.
- Granular role-based access control (RBAC) for users and groups can be defined using the web-based GUI. Users can also be associated with predefined roles.
Beyond Data Pipeline Orchestration
Many enterprises use the UAC as a stand-alone data pipeline orchestrator. However, UAC is also categorized by Gartner as a service orchestration and automation platform (SOAP). This means the UAC has a broader set of capabilities. In the context of SOAP, the UAC is also used by many organizations for cloud automation, cloud infrastructure orchestration, DevOps automation, and traditional workload automation.
Ultimately, data professionals have adopted Airflow because it’s a solid open-source solution that’s good at what it does. Enterprises use the UAC to either supplement or replace Airflow capabilities. Their goal is to gain extended features required to operate data pipelines as part of a broader, mission-critical business process.
Start Your Automation Initiative Now
Schedule a Live Demo with a Stonebranch Solution Expert