Eckerson Group: Using DataOps Methodologies to Orchestrate Your Data Pipelines
Hello, Stonebranch community. Welcome back and thanks for tuning in to the second session of Stonebranch Online. Stonebranch Online is a global IT forum where we talk about all things hybrid IT automation and orchestration for whatever comes next. I'm Nadia Davis, and I will be the moderator for today's event, broadcasting from the Stonebranch North American headquarters in Atlanta, Georgia. All right, let's begin today's session. For those of you who tuned in last time, you remember that we covered the ongoing trend of enterprises transitioning from simply automating their IT to orchestrating it. During this session on using DataOps methodologies to orchestrate your data pipelines, we will talk about another trend which is extremely relevant: data pipelines and the ways to manage them at scale. A recent twenty twenty one survey uncovered that ninety nine percent of firms are investing heavily in their data initiatives in twenty twenty one. Additionally, ninety six percent of firms are investing in improving their reporting. The same survey says that only thirty seven percent of IT teams were able to deliver on their commitments to the business last year, while seeing a thirty percent increase in the projects that they're asked to deliver on this year. So think about it. It's a lot. We get it, and it's hard to keep pace with a growing technology stack at your enterprise. And this is why you're on this webinar. So to unpack this topic further, we partnered up with Eckerson Group, a leading global research and consulting company that talks about all things data, from DataOps methodologies to data strategy, management, architecture, data science, data analytics, anything data. And I'm extremely pleased today to have Kevin Petrie with me on this webinar.
Kevin is VP of Research at Eckerson Group and has been in this industry for the last twenty five years, wearing multiple hats from being an industry analyst to being a writer, a consultant, and a leader of data teams at various companies, including global teams at Attunity, which is now part of Qlik. Kevin, we are delighted to have you. Thank you for joining us and welcome to the program.
Hi, Nadia. Really pleased to be here. Thank you very much. I think it is going to be a great discussion. So, folks, what I thought I would do today is spend about a half hour talking about some of the trends that we at Eckerson Group see happening as we work with different practitioners. I run the research division at Eckerson Group, so we write reports, we do webinars like this and engage with a lot of practitioners in order to help them understand how they can get more value out of data. We also have a consulting group, which I am involved in and which my colleagues lead. And they're actually engaging with those practitioners on strategic projects to help them build data strategies and implement those. And DataOps and data pipelines are a recurring point of pain. They are also a recurring opportunity. And so I am really pleased today to have the opportunity to talk about DataOps, kind of defining what that is in terms of managing and improving the efficiency and effectiveness of data pipelines, but then moving up from that to say data pipelines are not an end in themselves. Ultimately, they are plumbing. You need to orchestrate all the pipes in order to run an effective business. So that is why I am excited to be here today to speak with Stonebranch and some of your community. So why don't we dig in here? If we start from a macro view, what's happening from a business perspective? Data, I think we all understand, is increasingly, pick your metaphor, the lifeblood, the way, the method by which organizations are going to drive competitive advantage.
And so organizations need to modernize their businesses with data. So some of the strategic C level priorities that are getting a lot of attention in twenty twenty one, twenty twenty two, and onward are a digital transformation, converting more and more of your business transactions, your customer interactions into digital form. And as we all know, that went into hyperdrive in a sense with the COVID shock a year or two years ago, and that continues today as organizations need to support remote distributed workers, more and more online engagement forums with customers and so forth. Closely tied to that is data democratization. As you convert more of your business activities into digital form, it's throwing off a lot of data which can yield significant insights into all the factors that are going into these different business decisions. So you want more and more of your business owners, your business managers to make data enabled and data driven decisions. That's what data democratization is all about, putting intuitive data driven tools into the hands of more and more business owners. The third and supporting point is data modernization. And this is critical. This is probably the heart of what we are going to talk today because you need to figure out, all right, we have got decision makers that need to drive data into what they are doing. They have a whole huge supply of data, but how do you make the demand for and the supply of data come together? And you do that by modernizing your architecture. You figure out how to take advantage of new cloud, elastic cloud infrastructure, elastic cloud compute and storage infrastructure, new advanced analytics tools on the cloud in many cases, but still take advantage of all the data that is residing and probably will continue to reside in some ways on premises. So modernizing those architectures is critical. 
Now, if you can get it right, if you can transform your business digitally, democratize data decision making, and modernize the architecture that supports all this, you can enable smarter action. You can automate your operations and you can make increasingly sophisticated analytical decisions. And so that's what we will talk about today. So drilling down a little bit, architectures can get real complex real fast, and data modernization seeks to simplify this. And let's just look a little bit at the pain that is the modern, or I should say the traditional, the status quo today in terms of architectures. Working from the top down, there are applications that are driving business operations. You've got finance, supply chain functions, accounting, HR, and so forth. All these applications increasingly are residing on not just the traditional Oracle or SQL Server databases, but new NoSQL and distributed databases such as MongoDB, such as CockroachDB. And a lot of that stuff is running not just on premises, but in hybrid environments that include one or more clouds, in cloud only environments or multi cloud. Now on the other side here, we've got data science tools and BI tools. The business intelligence discipline has existed for quite some time, but it's getting more and more strategic. So you need to take a look at your operations in terms of regular, accurate reports that take a snapshot of the business and help strategic decision makers figure out how to operate more efficiently. You've also got data science tools, which are taking advantage of machine learning models that effectively teach themselves how to think by identifying patterns in historical data and then using that to predict the future. That's cool, sophisticated stuff. It relies on a lot of data. And so the new breed of analytics platforms brings together traditional data warehouse constructs, SQL Server, Teradata, Oracle, the model that they put into place, and then starts to merge that with data lake functionality.
So if you can combine the best of the data warehouse and the data lake, it's a whole exciting topic I'm happy to get into. But the point here is that you have got these new merged platforms, Vertica, Databricks, and Snowflake, Azure Synapse Analytics are all coming from different places, but they are merging data warehouse constructs onto data lake object storage. And a lot of this is starting to reside increasingly on the cloud. So a lot of platforms out there, and that proliferation in endpoints, in platforms, both for operations analytics, is going to continue. And that's why data pipelines are so critical, because data pipelines are going to start to connect many sources to many targets. It's going to handle ideally many different types of data and start to ensure that you can have data portability because you might be optimizing the workload instead of use cases in twenty twenty one on one platform, on one tool, but you need to add three more sources next year. And that means that certain tools are going to make more sense on a different cloud. So you have to maintain that portability and you have to look at your pipelines as not just sort of a rigid static set of plumbing in your house, but rather a vibrant circulatory system that's going to live and needs to continue to evolve. And data, because this is hard, can create problems. It's the opportunity, but it's also causing pain. And so some of these challenges are that data can be siloed. We all know this. It's a cliche and it's a cliche because it happens and it continues to happen. We continue to see it as a major source of why folks come to us on the consulting side. You've got data that's locked into proprietary formats, proprietary APIs. Maybe it's within a mainframe system, an SAP system. It gets tricky to figure out how to extract that and put it into something more modern and try to consolidate multiple sources onto, say, Snowflake or Azure Synapse Analytics. 
I spent five years with Attunity, a data pipeline vendor, and a huge part of that business was these big companies, Fortune five hundred companies spending a million dollars or more in a given engagement in order to consolidate data from five to twenty sources, ideally onto one cloud analytics platform. In a lot of cases, they ended up putting it onto two or three more. So you had this ongoing problem of perpetuating the silo problem. Bottlenecks. So performance becomes a real challenge because as you start to have increasing data volumes, increasing data varieties and velocities, it becomes hard to enable real time transfer of data and meet the latency and throughput requirements of consumers, whether it's for operations or analytics. Complexity is stitched into everything I've said so far. These environments are complex and you've got a lot of interdependencies, which drives the need for automation. Automation is a big theme of what we'll talk about when it comes to orchestrating pipelines and orchestrating workflows. The more you can automate the handoffs, the more you can automate even the actions that are taken on data, the more effective and efficient you can make your business. So we want to pause here. And Nadia, why don't I let you execute this survey of our audience? What we're presenting here are four very common challenges when it comes to data management, in line with the prior slide: data silos, data reliability and performance issues, complexity of data management, and then the lack of real time data transfer. So Nadia, why don't I hand it to you?
Alright guys, you should already see a pop up on your screen, so pick whichever one applies to what you're seeing in your organization. So what's your greatest challenge? Is it data silos? Is it reliability and performance? Complexity or the real time conversation? You want to see your data in real time.
I'm going to give you all about ten more seconds, let's make it fifteen, I'm seeing answers coming in. Complexity right now is definitely driving this conversation, let's see if this changes when we close out this poll. Alright, I'm closing this out. You guys can see the results in the polls tab, just click on that, and what I'm seeing is that twenty three percent of you cited data silos as your biggest challenge, fifteen percent said data reliability and performance presents an issue, fifty two percent say that complexity is the thing, and real time data or lack thereof is just nine percent. So Kevin, how would you comment on this?
Great, it's setting the table nicely for slides that are coming. So complexity, I am not surprised. The only piece that I might be a little surprised about is the lack of real time data. So what is interesting about real time data is that a lot of organizations don't necessarily need instantaneous data availability. But what they do need to do is stop repeatedly copying unchanged data in this batch transfer process. So real time data transfer is a way to improve efficiency and reduce complexity. But the overall complexity really underscores the need for efficient, effective data pipelines. DataOps is a way to enable that, which we are about to talk about, and then data pipeline orchestration. So it is cool stuff. It is always good to get a reality check from people who are living this. Enter the streaming data pipeline. A lot of pieces here, and I will make a few high level points and then drill in a little bit to underscore the complexity that's involved. You've got operational sources in a lot of cases. They could be relational databases. They could be DB2, z/OS, or iSeries systems that have been there for decades. They could be clickstream data about what your customers are doing on your website or IoT data that's getting captured from equipment, tractors on a farm, machine equipment, all kinds of things.
All this operational data creates an incredible opportunity if you can manage it well. So the goal of the pipeline is to ingest that data, transform it potentially, often that is required, and then deliver it to a target either to enable operations or to enable analytics. So that's the flow. You're going to ingest all this data, transform it, and deliver it. Ingestion can mean one of a few things. You could extract data in batch, which, as I said, is pretty wasteful. You could capture it on a real time basis, for example, using change data capture, or you could stream it. A lot of folks use CDC or change data capture synonymously with streaming. Key point is that you are extracting, you are copying data, you are identifying it and capturing it on a real time basis in real time increments. Maybe that is milliseconds, seconds, maybe minutes, but you are doing those latest increments. And now you are pushing it. You're taking that copy and you might be filtering it by source, by type. You might be pairing up different data, joining tables, for example. You could be reformatting it into something more open and accessible, applying a model that structures the data or cleansing it, taking out duplicates, taking out null values and that kind of thing, and then delivering it, either appending it to an ongoing set of data within your data lake, as an example, or maybe merging it and reconciling different transactions. But that is the flow of the pipeline. It is applying not just to data but also to schemas, columns, tables, things like that, to organize the data, and then the metadata that describes the data itself as well as the schemas. You need to manage pipelines. And so what you need to do here is kind of explore all the data sets you have, configure the pipelines, execute them, monitor them and orchestrate them. We will be talking about orchestration. 
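The ingest, transform, and deliver flow described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular pipeline product works: the `ChangeEvent` shape, the in-memory `change_log` list, and the dictionary used as a target are all hypothetical stand-ins for a real change log and a real analytics platform. It captures only the increments since the last run, cleanses them by dropping nulls and duplicates, and merges the result into the target.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change from a source table (hypothetical shape)."""
    key: str               # primary key of the changed row
    value: Optional[str]   # new value; None stands in for a null field
    sequence: int          # position in the source's change log


def ingest(change_log: List[ChangeEvent], last_seen: int) -> List[ChangeEvent]:
    """Capture only the increments recorded after the last delivered sequence."""
    return [event for event in change_log if event.sequence > last_seen]


def transform(events: Iterable[ChangeEvent]) -> List[ChangeEvent]:
    """Cleanse the increment: drop null values, keep the latest event per key."""
    latest: Dict[str, ChangeEvent] = {}
    for event in events:
        if event.value is None:
            continue  # take out null values
        current = latest.get(event.key)
        if current is None or event.sequence > current.sequence:
            latest[event.key] = event  # take out older duplicates
    return sorted(latest.values(), key=lambda event: event.sequence)


def deliver(target: Dict[str, str], events: Iterable[ChangeEvent]) -> None:
    """Merge (upsert) the cleansed events into the target by key."""
    for event in events:
        target[event.key] = event.value


def run_increment(change_log: List[ChangeEvent],
                  target: Dict[str, str],
                  last_seen: int) -> int:
    """Run one ingest, transform, deliver cycle and return the new high-water mark."""
    captured = ingest(change_log, last_seen)
    deliver(target, transform(captured))
    return max((event.sequence for event in captured), default=last_seen)
```

Calling `run_increment` repeatedly with the returned sequence number means unchanged data is never copied again, which is the efficiency argument made against batch extraction. A batch version would instead re-read the whole source on every run.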
It's all under a complex set of infrastructure, maybe on premise servers, maybe hybrid systems, maybe cloud where you've got elastic cloud compute, elastic cloud storage. You might have multiple cloud providers involved and also on premises links. So there's a lot going on here, and that's kind of the point of the data pipeline. You've got a streaming data pipeline that can make things more efficient by getting rid of batch. But it gets tricky to execute all that in an efficient and effective way. And so why are we doing this? We're doing this for a number of different reasons. Here's a set of overlapping use cases. Data science, which is going to apply advanced algorithms, machine learning models, and the like in order to make sense of data by identifying patterns people might not have identified themselves. It could simply be a one time cloud migration or ongoing moves of data from operational sources on premises to a cloud analytics platform as an example. You've got the need for business intelligence. Data analysts for decades have been building periodic reports on operations. Log analytics is a big use case for streaming data pipelines because you need to look across all the pieces of your on premise application servers, network systems, the cloud version of all that, and figure out how the pieces work together in order to observe what's happening on those systems and optimize it. Embedded analytics is pretty cool and underscores, I think, the value of pipeline orchestration because now you're starting to embed analytics workloads within applications. That's a growing trend. A lot of applications, a lot of software as a service applications might well have little bits of analytics within what they offer. So if you go to schwab dot com, you can start to model your own financial future with your savings for retirement. And that's analytics embedded into an application. 
Security analytics is obviously critical given all that's going on with active threats in our environment. And then IoT maintenance to optimize the efficiency and the effectiveness of all kinds of mechanical equipment. Okay, Nadia, why don't I let you take it over here?
So guys, we have our second poll, which I'm going to open up right now, and the question that we want to ask you, since we're here to talk about DataOps: where are you on the DataOps journey today? Which of the following aspects of DataOps does your organization currently practice? Do you do continuous integration and delivery of data pipelines? Do you orchestrate data pipelines? Some of you may, some of you may not. Do you test data quality and pipeline functionality on a regular basis, or do you do monitoring of pipeline execution and data delivery? So let's give it maybe another ten seconds or so to see how the answers are coming in. And all of you, I'm sure, are familiar with the path that DevOps has taken as a methodology, so DataOps is another rising trend which is kind of following the same curve, and we see some organizations being really in advanced stages of DataOps while others are still catching up. It will be interesting to see how the answers are coming in. All right, so let me go ahead and close this poll. I think we have enough people that answered. And here's how the results are looking, and again you can see that in the Polls tab. Continuous integration and delivery of data pipelines, twenty five percent, one fourth. Twenty eight percent orchestrate data pipelines today. Fifteen percent do data quality testing and pipeline functionality testing, and thirty two percent do monitoring of pipeline execution and data delivery. Kevin, does that align with what you see out there?
Yeah, I think it does.
I think that if we look at the biggest chunk, a third doing monitoring of pipeline execution and data delivery, in a sense that's the low hanging fruit to start with, no question, because you, within a data pipeline tool, should be able to configure alerts for performance and things like that. So that's absolutely critical. And I think that the continuous integration and delivery of data pipelines, that's a good number that you provided. I think it was about a quarter. That's where, and we'll talk about this on the next slide, you're starting to continuously iterate what's happening in your data flows. So that's a good distribution. I think you said fifteen percent for testing of data quality and pipeline functionality. There's definitely progress to be had there because that is a critical piece. You need to make sure that the data you're delivering is timely and accurate. And then we had one third of people orchestrate their data pipeline. So the two thirds are on the right webinar. Exactly. Exactly. Okay, so we've led the witness here by doing the poll before the definition. It's very helpful context because I think we're obviously talking to an educated audience. To your point, two thirds of folks are doing the first two. So that's critical. DataOps, data operations. This is a discipline that comprises tools and techniques. You can view it as a methodology, a way to improve the efficiency and the effectiveness of data pipelines. That's a term I've used a few times on this webinar. The four pieces. So the first one, continuous integration and delivery. We're talking about treating pipelines as a piece of code, similar to what DevOps did, which improves the agility of your software delivery process by continuously testing and iterating and releasing frequent software updates. And that's the heart of DataOps in many ways, continuous integration and delivery.
You need to ensure that your code has strict version control and that it aligns with these different versions, so that what you put into the wild is going to align with what is already there. So CI/CD is absolutely critical. Orchestration is another huge part of DataOps. And this is where we're connecting all the different tools that are involved in configuring, executing, monitoring, and aligning data pipelines and automating those handoffs wherever possible. Orchestration is critical. Testing. You need to check the quality of the data that's being delivered and the quality of the code that's delivering it. So two distinct work streams, both equally critical. And then monitoring, as we talked about before, this is tracking pipeline tools. It's tracking the data delivery and making sure that you're meeting SLAs related to availability, throughput, latency. Monitoring ideally is also going to assist with the data quality checks. And you want to identify and respond to issues in a very timely fashion. So we've done a lot of research on DataOps. I encourage folks to learn more from some of my colleagues, Wayne Eckerson and Joe Hilleary, in terms of the DataOps methodology, looking at what it is, how people are using it, how they should be using it. But it's a critical discipline. And so let's move on to the next slide here and start to abstract from DataOps to talk about data pipeline orchestration in particular. As you saw in the prior slide, orchestration is a piece of DataOps, but you can really use the DataOps methodology to start orchestrating your data pipelines and make it a strategic element of everything that you're doing. As I talked about at the start, data pipelines are effectively a circulatory system, and you don't want to make them efficient and effective as an end in itself, but rather to make the whole body run more effectively. So we offer five steps to achieve this.
You want to centralize control of your pipelines so that you can take a comprehensive look across all the handoffs, all the data flows that are circulating through your body, through your enterprise. You want to automate wherever possible repetitive scripting. There's a lot of SQL command line scripting that's involved in most pipeline configuration. And there are plenty of data engineers who've grown up on SQL scripting and like it, are happy with it. But I think they all recognize that they're overwhelmed. And it makes sense to, at a minimum, start to automate the repetitive stuff and do it through a graphical interface. A lot of tools out there to help with that. Then you can start to standardize and reuse your components because there's no sense in reinventing the wheel every time you add a source or a target to a data flow. Rather, reuse the pipeline you've got and simply change out one of the components. That's true for processors. It's true for transformation jobs. The more you can reuse and standardize, the better. Maintaining an open architecture is critical. You want to maintain open data formats, open APIs, and really make sure that you're not limiting your data portability between all the different components of your architecture. And that can involve running sort of against the grain of where cloud service providers might want to push you. They might want to have you go to a somewhat proprietary data format that could limit your options in the future in order to optimize performance on their particular platform. So look for open table formats like Iceberg. Look for open data formats and ways that you can stay portable and mobile. And then it's critical to integrate your pipeline with production workflows. So what we're finding is that organizations do need to synchronize their data on a real time basis with the decisions that are being made, with the operational actions that are being taken. 
And that often means not just delivering data with a thump and a thunk into a data lake, but rather having the analytics tool, or the application that has analytics inside it, leverage on an ongoing basis the data that's being delivered. So that's what we view as the key steps to data pipeline orchestration. And I'm looking forward to hearing from Scott about how Stonebranch works with customers in this regard. So a lot of different enabling technologies here. Cloud data platforms certainly have pipeline tools and ways to automate what you're doing with the data. They've got a lot of different abilities to sort, consolidate, store, process, and analyze data. Data integration tools. Here, you've got a bunch of different companies. I worked with Attunity, which is part of Qlik. Fivetran just bought HVR. A lot going on in this space. Equalum's got some pretty exciting ways to move data on a real time basis and transform it in flight. But with those data integration tools, ideally, you get one that's not going to be beholden to a particular source or a particular target. So going with those Switzerland approaches like Equalum, like Fivetran, like Qlik is a good way to go. DataOps platforms are from companies like DataOps.live, DataKitchen, and others. They're going to help you apply those four different aspects of the methodology. Observability tools are pretty interesting. Here, you're starting to understand all the logs and traces that show whether your applications and the components of your infrastructure are handing data to each other and working with each other well, and helping automate all those handoffs. Developer platforms are going to help as well. GitHub and others, which will make sure that you have strict version control over the code you're releasing, both for applications and for pipelines. Final point here is that a data pipeline orchestration platform, I think, is critical.
And the reason is that all these different point tools are going to help with a different part of the elephant. But the data pipeline orchestration platform can help you look across the environment and make sure that the full circulatory system is working across your anatomy. And I think it is absolutely critical. Okay, so what does good look like? If you can do this well, if you can manage this mess well, you can have more productive staff in terms of data engineers, IT folks, DevOps folks. That'll have some cost benefits, not just in terms of the people side of the equation, but also the infrastructure, because ideally you're going to start to consolidate, take manual scripting out of the equation, and ensure that your organization can run in a more automated way. And you've got the great persistence here in terms of the cost side of the equation, but you've also got the ability to capitalize on data events more quickly and feed that into analytics decisions, feed that into operations. So that's where data value creation can be one of the benefits that you achieve. And this all gets back to those C level priorities that we talked about at the start. So getting started here. I'll offer a few thoughts and then I will hand over to Nadia and Scott. What I recommend, what we have seen with Eckerson Group, both on the consulting and the research side, is that it makes sense to start with your business requirements. It bears repeating because it is easy to forget. You have got a lot of really smart folks within IT that have got great bells and whistles, but try to start with the business, what they really need, because oftentimes you can over engineer stuff. Another very basic principle is converting batch data pipelines to streaming so that you're capturing incremental data updates and not repeatedly resending unchanged data. Really internalize those DataOps principles, the four that we talked about, so orchestration, CI/CD, automation and so forth.
You really want to internalize that so that you are making your data pipelines more efficient and more effective. You want to identify weak integration points in your environment and start with those. So it makes sense to kind of start with low hanging fruit. Where have you got a bottleneck in your data flows, in the ways your pipelines are integrated with your business? Look for that. Try to fix where there is pain now. And if you do that, you can show a quick win, get funding and support for follow on projects that get bigger and bigger. And, you know, this is really all about connecting data pipelines to one another and automating how they work with the business and trying to streamline how the data flows between these different points in your environment. Thank you, Amanda. So I'm going to hand over. Please go ahead, Nadia.
Thank you. So for all of you online, once you have identified your priorities, once you have established what DataOps methodologies you will be using in your implementation, once you have secured your buy in, where do you go from there? Let's talk about the application part within this equation. So to tee up the next portion of our webinar, let me play a short little video for you guys and then I will hand off to Scott, who will talk about implementing data pipeline orchestration at scale.
Within an enterprise, data takes a long, complex journey prior to being transformed into actionable insights. This data, which businesses depend on, is collected from multiple sources. It's ingested, standardized, integrated, centrally stored for computation and analysis, and finally delivered to end users. Seems simple, right? Not exactly. At each stage, there are multiple data processing tools connected via custom scripts and point to point integrations. So what's the problem? A single script or point integration failure could break down the entire pipeline. Without centralized visibility, identifying issues is hard to do.
Kind of like looking for a needle in a stack of needles. Enter Stonebranch's Universal Automation Center or UAC as a platform designed to orchestrate all IT automation. UAC solves the data pipeline orchestration challenge across any on prem, cloud, or containerized environment. UAC's big data pipeline orchestration solution eliminates the need for custom scripts and point to point integrations and replaces them with highly secure API or agent based integrations that help you centralize control and reliably move data from stage to stage without disruption. With the UAC, you will remove security risks with integration standards and highly secure encryption protocols. Receive proactive alerts when jobs fail. Root cause exactly what's wrong and where to fix it. And be data ops enabled so you can test, verify and deploy workflows with confidence. And speaking of workflows, UAC features a drag and drop workflow designer which is built to help you visually simplify even the most complex pipelines. With built in managed file transfer capabilities and event driven triggers, UAC automates the flow of data in real time, empowering end users with current data for rapid decision making across the business. Stonebranch has an always growing list of prebuilt integrations to applications and platforms. And there's an open source integration hub where you can easily build, customize, or borrow workflows and integrations from other UAC end users. All this to help future proof your big data operations with control and visibility across the entire data pipeline. StoneBranch, real time hybrid IT automation for whatever comes next. Hey, everybody. So thanks for taking the time to join us on this really important session. I'm glad that the video came through so clearly. And, Kevin, thanks so much for sharing your knowledge with us about data ops and data pipelines. 
I'm going to spend some time, not a lot today, talking about the challenges that people run into and then really spending a little bit of time talking about our product and our platform and how it helps. This is a preface to our session on Thursday where we will go into much more detail. We'll have I'm sorry, on next Tuesday, we'll have a discussion where we have a demo about the actual tool. We'll talk a lot about the CICD pipeline and the DevOps approach that you can employ within the tool. But for today, let's spend a little time here. So first off, one of the things I wanted to talk about is how enterprises actually orchestrate in today's environment. So when you're connecting pipelines, typically between these disparate tools, you're connecting via point to point, you're connecting via custom script, and the reality is many aren't connecting at all. There's really no automated process. It's manual movement of data. And so this causes a number of pain points, some of them driven by the technology that exists within the tools you're using, some of it driven because of the lack of visibility you have. But the biggest problem with the automation that typically comes inbuilt is with job schedulers that are built into tools like Informatica or other tools, and don't get me wrong, Informatica is a wonderful tool. They're just not a focused scheduling tool. It will only schedule inside of Informatica. And as such, it won't schedule things in Snowflake. It won't schedule what's happening inside of a source like SAP's database or an SQL database. You know, it will connect to those pieces, but there's not a master sort of scheduler that runs across. And a lot of data people love using open source tools like Airflow to do the scheduling. And Airflow is great. Usually what we're seeing in the market is Airflow is often a favorite scheduler because it's open source and it's used by data teams. 
But the point at which companies need to take the data pipeline that they've sort of stuck together with bubblegum and duct tape and operationalize it to scale, that's when they're coming to an organization like Stone Branch or other service orchestration automation vendors to help them achieve scale and have a tool that you can collaborate with between data ops, data teams, developers, IT ops, you know, have a single solution everybody can use. So because it's difficult to scale with open source, because it's difficult to have multiple schedulers and multiple tools, without a clear orchestration tool to run across your entire data pipeline, you wind up in a scenario where you don't really have a centralized view into the pipeline, which is a problem, right, to keep things running smoothly. It's really difficult to root cause issues when people see the pipeline go down. And what winds up happening, and there's many on the call today that'll probably relate with this, you don't know that the pipe's down until somebody tells you. That somebody could be a CEO trying to, you know, access a dashboard. It could be a customer that's trying to access some sort of reporting in a financial system, but you don't know until it's, you know, verbally being told to you because it's not obvious. And then to find that problem becomes really difficult because you have to pick through all the different applications that you have in the pipeline or data sources to find where it went down. So it's sort of that hunt and peck game. And the last thing I'll mention here is that a lot of pipelines are built by data teams, but they're, again, cobbled together using open source tools or scripts or whatever. And when you have somebody leave, it causes a pretty big problem, right, because they take that knowledge with them. Let's do one poll real quick. This will be the last one from today. Nadia, let me turn it over to you to run the poll. 
All right, guys, so let's see where you stand today as far as how many different sources of data your organization may be ingesting data from. Do you have one? Do you have two to ten? Or is it too many where you have really no idea how many there are? Let's give it maybe, I don't know, twenty more seconds and see what's coming in. I doubt I'm going to have a whole lot of people saying there is just one, we're not in that world anymore, but let's see how it's looking. Alright, five more seconds and I'm closing the poll. Again, you can see the results of this poll in a Polls tab, and surely seventy three percent of you said you have too many to count. Ten percent said that you have two to ten, okay, and nobody said there's only one tool. So Scott, I think this plays pretty much into the narrative Yeah, so let's actually go to the next slide. What we're seeing from our customers is too many to count, right? There are bucket loads of source data applications out there or databases or whatever, and this slide is just a fraction of the tools that we see in the market used along data pipelines. And this one in particular is kind of focused on, let's call it big data or analytic pipelines, but of course there's lots of different pipelines out there. And, you know, Kevin made a really good point earlier that the real shift that we've seen is that these are no longer the single mega applications like an SAP or an Oracle that sort of run the world, everything has become much more diverse, best of breed. And from a source system standpoint, the other major shift is it's now in the cloud, right? So you need to be able to operate in this hybrid environment. And the challenge that presents itself at that point is a lot of the scheduling technology that large enterprises have was built for only on premises automation, meaning that it would automate a mainframe or a distributed server or an on premises application like an SAP on prem. 
And so companies today are really struggling to bridge this gap between their on prem automation approach, which is probably working fine with their existing tools, and the cloud automation that needs to happen in their cloud applications or their cloud service providers like AWS or whatever. And that's where we come in, and I'll talk a little more about that in a minute. But if you think about the stages that we go through, there was a slide that Kevin showed earlier on. It was titled Enter the Streaming Data World. And when Kevin's talking about streaming data, there certainly is streaming data, a lot of that, but I think he's also talking about data running in real time. And to be able to make data run in real time, the approach needs to be an event based approach. So you have to have system triggers, you have to have if this, then that statements, you have to have things where, you know, a file comes in, it changes something, and it moves it on. And that's what you're looking at when you're trying to get to that real time state. And doing that across all of the different applications that exist inside of this chart, and this is just a fraction of the applications that sit up there, is very difficult. It kind of goes back to this concept of you may have job schedulers in each one of these tools that will run a job, but most of them are batch, which prevents you from being able to do things in real time. And then you don't want to run twenty different job schedulers, right? It becomes impossible to keep up with it. So I'm going to take you to the next slide, which gives you an example of probably a simple data pipeline, but this is something that you would build inside of the Universal Automation Center, which is our platform that supports data pipelines, DevOps methodologies, along with a whole bunch of other automation related activities.
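To make the event based approach described above a little more concrete, here is a minimal sketch in plain Python. It is not how the Universal Automation Center works internally; the `Event` and `EventBus` classes, the handler functions, and the event names are all hypothetical, invented only to illustrate "if this happens, then do that" chaining instead of a fixed batch schedule.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """A hypothetical pipeline event, e.g. a file landing in a folder."""
    kind: str
    payload: str


@dataclass
class EventBus:
    """Minimal if-this-then-that dispatcher: each event kind maps to handlers."""
    handlers: Dict[str, List[Callable[["Event", "EventBus"], None]]] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def on(self, kind: str, handler: Callable[["Event", "EventBus"], None]) -> None:
        # Register a handler to run whenever an event of this kind is emitted.
        self.handlers.setdefault(kind, []).append(handler)

    def emit(self, event: Event) -> None:
        # Record the event, then run every handler registered for its kind.
        self.log.append(f"{event.kind}:{event.payload}")
        for handler in self.handlers.get(event.kind, []):
            handler(event, self)


def ingest(event: Event, bus: EventBus) -> None:
    # If a file arrives, then ingest it and signal that the data is staged.
    bus.emit(Event("data_staged", event.payload))


def load_warehouse(event: Event, bus: EventBus) -> None:
    # If data is staged, then load it and signal that the load finished.
    bus.emit(Event("load_finished", event.payload))


bus = EventBus()
bus.on("file_arrived", ingest)
bus.on("data_staged", load_warehouse)

# One incoming file drives the whole chain; nothing waits for a batch window.
bus.emit(Event("file_arrived", "orders.csv"))
print(bus.log)
```

In this sketch each stage reacts to the previous stage's event, so data moves on as soon as it is ready. A real orchestrator would add retries, persistence, and triggers that watch actual systems, which this toy example leaves out.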
When I look at this chart, just to break it up real easily, if you look on the left hand side, the source data, this only has two sources here, but this could be fifty sources, right? And what you're doing is you're bringing that source data into your ETL tool somehow, right? And in this case, that's the data integration ingestion stage. Informatica is used here. Informatica is like, again, a great tool. You see it used lots of different places. And if you're an application to the right or left of it, it will move data and pull in data. It's when you start getting further away from Informatica, multiple hops of applications where you really need to have a layer that is more of an orchestration layer. Now, what I wanted to talk about a little bit here beyond that is the next couple stages. So, you know, Kevin earlier mentioned that, hey, you may have a data warehouse, which is really traditional, but people are also moving more towards data lakes. And they're moving data between data lakes and data warehouses and all kinds of different approaches. Being able to visually build that inside of an application is very helpful for data people. Now data people may also want to do it as code, and the tool that we have allows you to use either a visual GUI interface that would create a workflow just like this inside of a GUI and you can connect all the dots, or you can run it through as code and connect to GitHub and do your data ops methodologies in a way that supports that, right, as code. Now some of the foundational elements that we wind up seeing when we get RFPs for this sort of stuff is they really want to be able to do that low code or no code integration designer. So some it really depends, so data ops is a funny discipline, depending on where you grew up, you know, are we originally a developer, were you originally in data, were you originally in ops? Data ops teams wind up being an amalgamation of different roles. And so some people want the code, right? 
Some people want the no code, low code designer. And so what I'm ultimately saying is look for a tool that has both, and that's what Stone Branch's Universal Automation Center offers. Secondly, you need to be able to integrate and control the applications across your data tool chain. And if I think about the applications just on this chart, whichever tool you use to orchestrate, it needs to be able to integrate with SAP, it needs to be able to integrate with AWS, it needs to be able to integrate with Informatica and Snowflake and all the others here, right? So look for organizations that have applications that can reach in, connect to those tools, and then the orchestration layer of it actually automates what happens in those tools. So it gives you that sort of central command center to be able to control what happens in each of these tools along the data pipeline. So when you can control that, you can build these beautiful workflows, but then it gives you the ability to gain log data, to gain analytics around things so that you can drive observability off of that log data, can drive compliance, governance. But most importantly, to this discussion today is it allows you to create a data ops lifecycle methodology approach. So once you build a data pipeline, this idea of CICD and rapid testing and simulation needs to happen in order to constantly improve it. Also, it needs to happen because guess what? These tools don't stay static. You got to add tools, you got to remove tools. So the Universal Automation Center approach to this is something that we'll go through in a lot more detail on Tuesday's session as we're doing the demo. But in broad strokes, the way we set it up is you wind up having, just like in a development environment, you'll have a development environment, a test environment, and a production environment. And you have within our tool itself the ability to promote between those environments. 
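As a rough illustration of the development, test, and production setup just described, the sketch below models promotion of a workflow definition from one environment to the next, gated by a check. Everything here is an assumption made for illustration: the `WorkflowEnvironments` class, the `has_tasks` check, and the dictionary layout of a workflow are hypothetical and do not reflect the actual Universal Automation Center API.

```python
from typing import Callable, Dict, Optional

# Assumed environment order; promotion always moves one step to the right.
ENVIRONMENTS = ["development", "test", "production"]


class WorkflowEnvironments:
    """Hypothetical holder for the workflow definition deployed in each environment."""

    def __init__(self) -> None:
        self.deployed: Dict[str, Optional[dict]] = {env: None for env in ENVIRONMENTS}

    def save(self, definition: dict) -> None:
        # New or changed workflows always start in development.
        self.deployed["development"] = definition

    def promote(self, source: str, check: Callable[[dict], bool]) -> str:
        """Copy the definition from `source` to the next environment if `check` passes."""
        index = ENVIRONMENTS.index(source)
        if index == len(ENVIRONMENTS) - 1:
            raise ValueError("production is the last environment")
        definition = self.deployed[source]
        if definition is None:
            raise ValueError(f"nothing deployed in {source}")
        if not check(definition):
            raise ValueError(f"checks failed; not promoting from {source}")
        target = ENVIRONMENTS[index + 1]
        self.deployed[target] = dict(definition)
        return target


def has_tasks(definition: dict) -> bool:
    # Stand-in for real automated tests: require at least one task.
    return len(definition.get("tasks", [])) > 0


envs = WorkflowEnvironments()
envs.save({"name": "nightly_load", "tasks": ["extract", "load"]})
print(envs.promote("development", has_tasks))
print(envs.promote("test", has_tasks))
```

The point of the gate is that a workflow only reaches production after passing through test, which is the CI/CD habit the session is describing. In practice the check would be a real test suite or simulation run rather than a one-line function.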
You also have the ability to roll things back and a whole host of other features that help support DevOps. So if you're trying to achieve DevOps as an organization and achieve CICD and all the things that goes with approach related to DevOps, this tool enables that. So in our mindset, these are the foundational elements to help you achieve consistent and reliable data pipelines that you can head off into the weekend and not worry about getting a call for if they break down. Now, because I want to focus on the Q and A, I just want to talk about one customer use case. So we had an organization approach us last year that is one of the largest food and beverage manufacturers in the world. There's a lot of words in this slide, but ultimately the problem they faced, and it's a problem that a lot of people approach us with, is that they were using Azure Data Factory as their orchestrator or their job scheduler. But the problem with that isn't that Data Factory doesn't work, it works great, but it really only works within the Azure environment. So they could automate and orchestrate things as long as they stayed and didn't use tools outside of Azure. The real reason they approached us is because, well, we have all these other tools we want to make part of the pipeline, and we need an orchestrator that can tap into not just the Azure stuff, but the Informatica and the snowflakes of the world. And so what they really wanted to do is identify a platform that could connect all their critical data tools, and they needed it to run as code, they needed to have the GUI piece, they needed DevOps methodologies built into it and the whole kit and caboodle, right? And so again, large global food manufacturer, one of the top ten in the world, doing a ton of data work approaches with this problem. And so the output looks something like this. 
So one of the things that sets us apart from maybe your traditional, I heard Data Kitchen mentioned and DataLive a little while ago by Kevin, and yeah, we would go up against them in these sorts of things. But we have secure managed file transfer capabilities built into our tools. So one of the things that we find people coming to us for is not just our scheduling capabilities, but the fact that managed file transfer is in there, because managed file transfer is required to get the data out of the sources and move the data. So we can help move the data, we can help build integrations between data tools if needed, and ultimately we're going to be connecting into each of the applications that are on this data pipeline to control what happens with those applications as the data is flowing through. And because you're doing that, you get the log data, which gives you observability and you get peace of mind. But going back to this use case real quick, they had multiple data sources. I mean, they're starting small, obviously, don't want to just dive all in, but they have their original data sources were Amazon, Google Cloud Storage, they had some databases that they were pulling from. They used Informatica. They also used Data Factory. They wanted to keep using Data Factory. So in this case, we're not replacing the functionality of Data Factory, we're just reaching in and connecting to Data Factory and we're automating what happens inside of Data Factory. So, you know, one thing that I think needs to be made clear with our application is you're not replacing your data tool chain, you're still using those tools. You're simply adding a tool that connects into them so that you can centrally manage everything. So this is best for somebody who wants to keep the tools they have. They don't want to like try to standardize on all AWS if they're already not. They don't want to try to standardize everything on Azure. 
They want to keep tools they have, they like the tools, and they need to have something to help orchestrate all those tools together. Also, this large food and beverage manufacturer ultimately needed to be able to reach cloud tools and on prem tools. So again, just making the point that this on prem and cloud piece is such an important aspect of what stops people from being able to achieve a data pipeline if they're not using an orchestration tool like ours and others. It's definitely what we keep seeing over and over again from customers and prospects. And ultimately, they got the observability, the visibility, they will improve SLAs, they got the real time monitoring alerts in a way where if something went down, they were sent an alert that let them know something was wrong before they got somebody calling them saying, hey, my dashboard isn't updating, or our customer is complaining that they can't see their data working in a tool. That's it. I think we should dive into the Q and A. Nadia, I'll turn it back over to you and let's go ahead and dive in there. All right, thank you, Scott. Great. I love Q and A on data topics. The data audience is the best. You guys want to know so much, I got a lot of questions that are stacked up. Before I go there, I do want to acknowledge Daphne Bond. Daphne, thank you for all the questions. I want to send you our new Orchestrate the Universe! She was asking questions throughout the entire presentation, I mean she really wants to know more, thank you for doing that. Alright, so as we go into Q and A, I'll ask my backstage team to share the handouts that we have for this session so you guys will see that on your screen. And Kevin, my first question is for you. When you mentioned open data formats, could you elaborate on that? Did you mean JSON and such? Yeah, JSON is a great example. CSV files or Parquet are often favored.
And that is obviously a way that organizations can maintain as much interoperability with other tools, with other storage systems as possible over time. Those are great examples. Yeah, JSON. Thank you. And when we are doing things as code, it all pumps out as JSON or XML. Great. That's exactly right. And that's what our customers are looking for. All right, to that point, Scott, my next question is for you. So, how many data sources can be orchestrated with the Stonebranch platform? It's really unlimited. We have the ability to tap into any data source you have and there's two ways to do it, really. And if you can find a way beyond this, then maybe there is some limit. But basically, we have what we call agents. So if there's not an open API within the tool, and often you find this in mainframes or some distributed servers, we install what we call an agent. Agent, you could call it a bot, you could call it anything, but it's just a little piece of code that helps our tool connect into that tool and run all the automation we need to. Alternatively, every tool out there today is being built with open APIs, and this has created a great way for orchestration tools or really any tools to interconnect with each other. And so just like all the other tools that work with APIs, we just tap into those APIs and we do it. The benefit to help jumpstart people who are heading down this path is that in addition to pre built agents and pre built API connections, we can build integrations really, really quickly. So, I'll give an example. We had somebody come to us, and I know this isn't a source, but more of an output. We had somebody come to us last year and they said, hey, we want to tap into Tableau. And we're like, yeah, we want to too because that's an obvious big tool that's being used across the data pipeline. And so it took us two days to build that integration.
And so I mention that because with most other vendors, you put in a request, it'll take six months or weeks or whatever to build out those integrations. We have something called the Universal Integration Platform, which is a tool that either we can build integrations with or our customers can build integrations with. And I was speaking with a customer last week who was building integrations to tap into IBM's version of Informatica, right? And they built it on their own. They're using it. I think it was called DataStage. And, you know, throw a good developer on it and they can build these things to tap into really anything out there. And I think that pretty much answers another question that we had where somebody was asking if the Universal Automation Center can integrate with a solution from a company named Precisely, and the solution name is Connect CDC, Change Data Capture. I think that's pretty much to your point, if that tool has... I've heard that one come up before, we may have it, at least in proof of concept stage. So reach out. And if we don't, I'm telling you, I mean, days, not months. It's very simple for us to build these things, especially if there's an open API. But, also, if you have the tool, it makes it even easier for us to collaborate. We'll build it directly for you on what your use case is. Alright. So we're at the top of the hour. Whatever questions I did not get to, guys, I promise we will reach out. We'll answer them through email. But what I want to call out is that the next session, which will be on Tuesday at this time, will elaborate on the topic that we essentially opened up today. How do you do that? What are the tools? What are the real things that you can do to start this data pipeline orchestration at scale in your organization? Janjin, thank you so... We'll do a live demo during it, so Moritz will be joining us, he'll walk you through the little picture that I showed and every step that happens there. It'll be very cool.
Sounds good. Thank you very much. Thank you to our audience. Guys, we'll see you on Tuesday, right here. I'm Nadia Davis, and this is Stonebranch Online. Thank you. Thanks a lot.
You’ve probably heard the phrase DataOps. Isn’t it just DevOps, but for data? Not quite.
Look a little closer with Kevin Petrie, VP of Research at Eckerson Group, a leading data analytics-focused research and consulting firm. He offers a better definition of DataOps and explores how it’s driving significant change for leading organizations. He's joined by Scott Davis, VP of Global Marketing at Stonebranch, who introduces how the Universal Automation Center supports data pipeline orchestration and DataOps methodologies.
Key learnings:
- How data teams, developers, and IT Ops work together to apply DataOps methodologies to design, implement, and manage end-to-end data flows
- What DataOps is (and isn’t)
- How enterprises apply DataOps practices to achieve CI/CD in their data pipelines
- Which technologies DataOps teams leverage at different stages of maturity
Duration: 1:00:26