The data pipeline comes with a cornucopia of legacy and emerging terms. The good news is that it is not hard to understand once you get the core concepts down. This article defines and breaks down some of the most relevant keywords, phrases, and difficult-to-understand concepts.
DataOps (data operations) is an emerging discipline that provides the tools, processes, and organizational structures to support an automated, process-oriented methodology used by analytics and data teams. DataOps teams are typically created by applying DevOps methodologies to a centralized team made up of data engineering, integration, security, data quality, and data science roles, with the goal of improving quality and reducing the cycle time of data analytics.
Figure 1: Overview - DataOps Team Definition
Data Pipeline Defined:
A data pipeline follows a workflow of stages or actions (often automated) that move and combine data from various sources to prepare data insights for end-user consumption. The stages within an end-to-end pipeline consist of collecting disparate raw source data, integrating and ingesting data, storing data, computation/analysis of the data, and then delivering insights to the business via methods that include analytics, dashboards, or reports.
Stages within a Data Pipeline:
- Data Source: This is the data created within a source system, which includes applications or platforms. Within a data pipeline, there are multiple source systems. Each source system has a data source that likely takes the shape of a database or data stream.
- Data Integration & Ingestion: Data integration is the process of combining data from different sources into a single, unified view. Integration begins with the ingestion process and includes steps such as cleansing, ETL mapping, and transformation. Across these steps, data is extracted from the sources and then consolidated into a single, cohesive data set.
- Data Storage: This stage represents the "place" where the cohesive data set lives. Data lakes and data warehouses are two common solutions for storing big data, but they are not equivalent technologies. A data lake typically stores raw data whose purpose is not yet defined, while a data warehouse stores data that has already been structured and filtered for a specific use. A good way to remember the difference is to think of a "lake" as the place where all the rivers and streams pour in without being filtered.
- Analysis & Computation: This is where analytics, data science, and machine learning happen. Tools that support data analysis and computation pull raw data from the data lake or data warehouse. New models and insights (from both structured data and streams) are then stored in the data warehouse.
- Delivery: Insights from the data are shared with the business through dashboards, emails, SMS messages, push notifications, and microservices. Machine learning model inferences are also exposed as microservices.
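The five stages above can be sketched end to end in plain Python. This is a minimal illustration, not a real implementation: the two source systems, the in-memory "warehouse" dict, and the average-spend insight are all assumptions made for the example.

```python
def collect():
    # Data Source: raw records from two hypothetical source systems
    crm = [{"id": 1, "name": " Ada ", "spend": "120.50"}]
    web = [{"id": 1, "page_views": 42}]
    return crm, web

def integrate(crm, web):
    # Integration & Ingestion: cleanse, transform, and combine into one view
    views = {r["id"]: r["page_views"] for r in web}
    return [
        {"id": r["id"], "name": r["name"].strip(),
         "spend": float(r["spend"]), "page_views": views.get(r["id"], 0)}
        for r in crm
    ]

def store(records, warehouse):
    # Storage: persist the cohesive data set (a dict stands in for a warehouse)
    warehouse["customers"] = records

def analyze(warehouse):
    # Analysis & Computation: derive an insight from the stored data
    customers = warehouse["customers"]
    return sum(c["spend"] for c in customers) / len(customers)

def deliver(insight):
    # Delivery: share the insight (a report line stands in for a dashboard)
    return f"Average customer spend: {insight:.2f}"

warehouse = {}
crm, web = collect()
store(integrate(crm, web), warehouse)
print(deliver(analyze(warehouse)))
```

In a real pipeline each function would be a separate, orchestrated job against actual systems, but the shape of the flow is the same.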
Figure 2: Overview - Big Data Pipeline Orchestration
Additional Terminology and Solution Types:
Streaming Data represents data that is generated continuously by many data sources. Sources often number in the thousands, all sending data records simultaneously in small (kilobyte-scale) sizes.
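A toy illustration of streaming data, where a Python generator stands in for a large fleet of real devices; the sensor naming and temperature fields are assumptions made for the example.

```python
import itertools
import json
import random

def sensor_stream(sensor_ids):
    """Yield small (sub-kilobyte) JSON records, interleaved across sources."""
    for seq in itertools.count():
        sid = random.choice(sensor_ids)
        yield json.dumps({"sensor": sid, "seq": seq,
                          "temp_c": round(random.uniform(15, 30), 1)})

# Consume the first five records from a simulated fleet of 1,000 sensors
stream = sensor_stream([f"sensor-{n}" for n in range(1000)])
for record in itertools.islice(stream, 5):
    print(record, "->", len(record.encode()), "bytes")
```

Note that the stream is unbounded: consumers take records as they arrive rather than waiting for a complete data set.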
Extract, Transform, Load (ETL) is a data integration approach used to ingest data from multiple sources. First, data is extracted from its source or sources. Then, the data is transformed based on business logic. Finally, the data is loaded, typically into a data warehouse.
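The three ETL steps can be sketched as follows. The CSV source, the currency-conversion business logic, and the list standing in for a warehouse table are all illustrative assumptions.

```python
import csv
import io

raw_csv = "order_id,amount,currency\n1,100,usd\n2,80,eur\n"

# Extract: read records from the source
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: apply business logic (normalize currency, convert to cents)
FX_TO_USD = {"usd": 1.0, "eur": 1.1}  # assumed illustrative rates
transformed = [
    {"order_id": int(r["order_id"]),
     "amount_usd_cents": round(float(r["amount"]) * FX_TO_USD[r["currency"]] * 100)}
    for r in rows
]

# Load: write the shaped records into the warehouse table
warehouse_table = []
warehouse_table.extend(transformed)
print(warehouse_table)
```

The key property of ETL is that transformation happens before loading, so only cleaned, shaped data lands in the warehouse.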
Extract, Load, Transform (ELT) is an alternative to ETL (above) used with data lake implementations, where data may not need to be transformed before it is stored. Instead, raw data is pushed directly into the data lake, which produces faster loading times.
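By contrast with the ETL pattern, an ELT sketch lands raw records first and defers transformation until the data is read. The temporary directory standing in for a data lake and the file layout are assumptions made for the example.

```python
import json
import pathlib
import tempfile

# A temporary directory stands in for a data lake's raw zone
lake = pathlib.Path(tempfile.mkdtemp()) / "raw" / "orders"
lake.mkdir(parents=True)

# Extract + Load: push untransformed records straight into the lake
raw = [{"order_id": "1", "amount": "100"}, {"order_id": "2", "amount": "80"}]
(lake / "batch_001.json").write_text(json.dumps(raw))

# Transform on read, later, once a use case is defined
loaded = json.loads((lake / "batch_001.json").read_text())
orders = [{"order_id": int(r["order_id"]), "amount": float(r["amount"])}
          for r in loaded]
print(orders)
```

Because nothing is reshaped up front, loading is fast and the raw data remains available for future, not-yet-known uses.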
DataOps Enabled: Technology that allows data teams to manage a data pipeline with a DevOps-like approach to support pipelines-as-code. This approach uses standard lifecycle methodologies (Dev/Test/Prod) that include versioning and simulated tests for end-to-end orchestration.
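To illustrate pipelines-as-code, the sketch below defines a pipeline as an ordinary Python structure that can be versioned in Git and promoted through Dev/Test/Prod. The task names and the simple topological runner are illustrative assumptions, not any particular tool's API.

```python
PIPELINE = {  # task -> upstream dependencies
    "ingest": [],
    "store": ["ingest"],
    "analyze": ["store"],
    "deliver": ["analyze"],
}

def run_order(dag):
    """Return tasks in dependency order (a simple topological sort)."""
    done, order = set(), []
    while len(done) < len(dag):
        for task, deps in dag.items():
            if task not in done and all(d in done for d in deps):
                done.add(task)
                order.append(task)
    return order

print(run_order(PIPELINE))
```

Because the pipeline definition is just code, it can be diffed, reviewed, and exercised with simulated test runs before promotion to production.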
Data Pipeline Orchestration: A solution that DataOps teams use to centralize the management and control of end-to-end data pipelines. Data teams leverage integrations connecting the orchestration solution to each data tool they use along the data pipeline. The Data Pipeline Orchestration solution then automates the actions within the data tools required to move data through each stage of the entire pipeline reliably. Benefits of an orchestration approach often include monitoring and reporting, proactive alerts, DataOps enablement, and the real-time movement of data with event-based system triggers.
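An event-based trigger of the kind mentioned above can be sketched as follows. The drop-directory layout and the polling approach are simplifications assumed for illustration; real orchestration solutions typically react to file events, messages, or API calls rather than polling.

```python
import pathlib
import tempfile

# A temporary directory stands in for a monitored file-drop location
drop = pathlib.Path(tempfile.mkdtemp())

def poll_for_events(drop_dir, seen):
    """Return the names of files that arrived since the last poll."""
    current = {p.name for p in drop_dir.iterdir()}
    new = sorted(current - seen)
    seen |= current
    return new

seen = set()
# A source system delivers a new file, which triggers the pipeline
(drop / "orders_2024.csv").write_text("order_id,amount\n1,100\n")
for filename in poll_for_events(drop, seen):
    print(f"Triggering pipeline run for {filename}")
```

The orchestrator's job is to turn such events into reliable, monitored runs of the downstream stages, with alerts when a run fails.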
Service Orchestration and Automation Platform (SOAP): A category coined by Gartner in April 2020, SOAPs evolved from traditional workload automation solutions. Today, SOAPs help I&O teams provide automation as a service to the business. One primary solution built on SOAPs is Data Pipeline Orchestration. SOAPs are well suited to orchestrating the data pipeline because of their graphical workflow designers, ability to integrate with third-party tools, and built-in managed file transfer capabilities. SOAPs became prominent with the general move to the cloud, where organizations required automation to be orchestrated across both on-prem and cloud environments.
The data pipeline is complex. However, orchestrating and automating the flow of data through a pipeline is an entirely attainable objective. With the definitions above, you are one step closer to figuring out how to adapt your strategy. To take another step, check out this article, which summarizes the big-picture view of why enterprises are focused on automating the data pipeline.