Orchestrate Flow of Data Across Entire Big Data Pipeline
Good day and welcome to the IEEE Computer Society's presentation on Orchestrate Flow of Data Across Entire Big Data Pipeline. This webinar is proudly sponsored by Stonebranch. Now I would like to introduce our speakers. Moritz Roos has a long history of helping companies address their IT automation needs in the most effective and efficient way. At Stonebranch, he leads a team of solutions engineers in Europe. Prior to his current role, Moritz worked as a consultant at PwC, as well as contributing to Stonebranch's product management and IT security teams. Peter Baljet is CTO at Stonebranch. Peter has more than twenty-five years of executive leadership experience in the technology industry, focused in the areas of enterprise software solutions and services. Peter's expertise spans product development, technology strategy, product management, operations, and consulting services. In the past, he has served as Vice President of Cloud Architecture, Engineering, and Operations for Deloitte Consulting and Vice President of Cloud Architecture at SAP Labs. Ladies and gentlemen, Moritz Roos and Peter Baljet.
Thank you, Amir, and good morning and good afternoon to everybody. Thank you for joining us today. I'm going to go ahead and jump right in to our presentation and discussion about data pipelines. What we have today is we're going to baseline ourselves on what a data pipeline really is, talk a little bit about orchestrating a data pipeline, and then I am going to hand it over to Moritz, who is going to go through a demo and talk about some real-life case studies that we have seen. Feel free to submit questions, and at the end we've left some time to get into a Q and A discussion. So, about data pipelines: pipelines have really evolved from the traditional data warehouse that we had in businesses, when data really came from all of our back-end systems. In this world, we basically had a very defined structure. 
We had operational data stores, data warehouses, and star schemas, going from what was called OLTP, online transaction processing, to OLAP. And OLAP was the main way we did data analytics. What's happened in the last ten years is that the volume of data has increased tremendously because of all the connections we have in our world through mobile. We're all surfing the web all day. We have got devices that are streaming data from all aspects of our companies and our lives. So the sheer volume has increased, which has really driven the importance and complexity of data pipelines. I think I was reading somewhere that in 2010 the world generated two zettabytes of data annually. A zettabyte is a one followed by twenty-one zeros of bytes, whereas a gigabyte is nine zeros. And in 2020, we produced about twenty-nine zettabytes, and that's going to increase at twenty to twenty-five percent every year. So understanding data pipelines, as they become more important to our companies, is going to matter as we get this increasing volume of data and try to extract value from that data. So this is our typical data pipeline, and we see it in five stages. Although this is drawn as linear, it doesn't have to be linear. Some of these things can happen at different points and they can happen in parallel. But these are pretty much the distinct areas that a data pipeline consists of. The first one is data sources. Data sources are increasing. Traditionally they've come from our internal company systems, but more and more they come from outside, for example from third-party data sets that are purchased. There's also a lot of streaming data that we get from people accessing websites and different things. Mobile is, of course, a big driver of that also. 
An example of data sources: we had done a dashboard for one of the automotive companies, and they wanted a marketing and sales dashboard. So we had to bring in data sources from inside the company in terms of sales data and marketing spend. We brought in third-party data sets that had sales and marketing data on other companies, their competitors. And we also took in sentiment data that we got from some of the social media feeds to create this dashboard. So the data sources are becoming more and more complex to handle, because all of these things come in at different times. If they don't come in, you have to know about it. If the quality of the data that comes in isn't there, you have to fix it. And as these pipelines become more mission critical, validating that the whole pipeline is working correctly, from the data source on back, is more and more important. The data integration has traditionally been called ETL: Extraction, Transformation, and Load. We see more of an ELT, which is Extraction, Load, and Transform, because the volume of data is increasing. We get new data all the time, and we don't have time to figure out what type of structure we want for it. So we first load it and then we worry about transforming it later. This is where the whole data lake concept came into being, which is: hey, let's just get the data in. We're not sure what we want to do with it, but we will figure that out after we see the data and understand what the value is. Data ultimately has to end up at rest. Traditionally, this has been in our relational databases. But since we deal with a lot of unstructured data now, the way we store data has changed, too. A data lake can take many forms, but we see a lot of people using cloud blob storage. 
The classic example here is AWS S3 storage, but we also have NoSQL databases and streaming databases where you can take unstructured data in, store it, and then later on use different types of tools to pull that data out. The cloud services are also getting very sophisticated: even when you just put your data there, it automatically gets structured and indexed. And you see a lot of these search capabilities being offered on unstructured data. The analyze piece is also evolving very, very quickly with all of the artificial intelligence and machine learning techniques and tools that have been developed here. Open source is playing a large part. There's a plethora of open source tools as well as vendor tools that are very specialized in different data science areas. As an example, we had taken customer support voice data from a company and loaded it into the cloud in AWS S3, used the NLP tools to turn that into text, and then used some of the NLP analytics tools to look at sentiment, but also to look for certain products that maybe were causing more problems, or certain features within a product that were causing problems. So this area is about taking in data and then learning how you can use that data to provide value. On the delivery side, there are a lot of different delivery mechanisms. We use the word presentation here, but it's not meant to be only visual. It's meant to be how you present the data to whoever is going to use it. The traditional piece here has been dashboards. This is typically for people to view, analyze, and make decisions upon. But we're also seeing a lot of the delivery being done through APIs. There are recommendation engines. All of the recommendations you see across websites are based on what they know about the person who is on that website. 
And that is based on the data that they know about your search history, but also data that they might get from some third parties as well. So that's all feeding in there. The other piece here is that a lot of the insurance companies and lending companies have scores that they give people, and those scores are based on a lot of different data sets. We were working with a company that had geographic data sets, crime rates, and socioeconomic data. If you are dealing with insurance companies, it could be more like weather patterns or incidents that happen in a certain geographic area. You will use machine learning models and scoring engines to calculate a value, and typically those values can be queried through an API. So the delivery of the value that you're getting from all of this data coming in can take on many, many forms. If you dive down into each of these areas and look at just what tools are being used, it's really amazing how much has appeared over the past couple of years, from open source, from third-party vendors, and from the cloud service providers. All three of those categories have a tremendous number of tools in each of these different areas, which is really great because it gives data science teams and pipeline developers a lot of different tools in the tool chest to use for their particular use case. The issue that we see with that is that companies will typically start out with a team that starts developing these data pipelines. The team will get to choose what tools they want to use. All the stages have to be connected to each other, and the team will choose the tools and languages they want to connect them with. 
And that is fine if you have one data pipeline or just a few, but as you scale up your data pipelines, because you'll eventually end up with data pipelines for all different parts of your business, it becomes a real challenge to manage that complexity from an operational standpoint. The example I use is letting all the developers on a development team choose what language they want to use. It's not the same, but it's a little bit similar in the data pipeline world, where you want the best-of-breed tool, but you also have to try to minimize the heterogeneous nature across all of your data pipelines. This gets into what I would call pain points around data pipelines. When you have to connect all of these stages, what we see is a lot of teams doing point-to-point integrations. So you may have data stored in Amazon S3, but you may want to use an analytical tool in GCP, the Google Cloud Platform. So you'll do a custom integration from one cloud provider to another, or from your on-premise systems to your cloud provider. And it's something that one person chose. Or you may write custom automation scripts between each of the stages. This is typically what we call glue code. And glue code takes many forms. It's a lot of scripting. You have Ansible, which is very popular, but then you get all of the cloud-native types of automation tools. So again, you can get into a situation where, as your pipelines increase, the number of different tools that you use increases as well. And you typically have to have people that understand each and every technology. So if you are not careful, your development and your operational costs will go up. And we have seen companies run into this challenge. 
And I have managed a data science team where, at some point, we had to take a step back and really decide on the minimal set of tools that we could get away with, because it was getting too expensive not just to develop but also to operationalize. And at the end of the day, pipelines are mission critical. They used to be batch oriented, but they're getting more and more real time as data streams become more and more real time as well, where you get data coming in from social feeds and from call centers and you are constantly trying to update either your data science models or the data that you have in your data stores. So when you operationalize them and you have a number of data pipelines, the team who built it isn't the team who's going to be there on Saturday watching it. So you've got to understand how you're going to operationalize these data pipelines, and not just each individual one: the IT team that has to watch them has to have some common way of going across all of them. You can think of this like teams monitoring back-end applications, especially mission-critical ones like manufacturing systems. Data pipelines are being viewed more and more as needing the same operational capability. And again, IT budgets don't go up, but we get more and more data and more and more data pipelines. So the cost to operate each pipeline has to stay proportionally low as their number increases. So, switching over to big data pipeline orchestration. This is how we at Stonebranch view pipelines and how we solve all the pain points and support all the needs that I just talked about. If you look, again, at these five stages, you can have errors and problems that happen in each of them. For the data sources, you can have data that doesn't come in on time, or doesn't show up at all, or comes in with the wrong dataset. 
In the ETL or ELT stage, there could be a failure when trying to load the data into the data store, or there could be an error when trying to transform it. So you can see that across each of these stages you could have issues, and they could be performance issues also. So how you manage this in a scalable way is the challenge. For orchestration across each of these stages, you have to create a layer, what I like to call a single pane of glass, that treats each of these stages and can connect to all of the different tools and technologies, and yet provides a single view of the status of the whole pipeline. And not just visually, but also being proactive in terms of: hey, this data load was supposed to take five minutes, but it's taken fifteen. It looks like it's getting out of whack and we need to take a look at it. The same goes for any errors that happen across the pipeline: centrally managing and orchestrating all these different components needs this central type of layer. Probably one of the good things that we've seen is that whether you're a cloud vendor, a third party, or an open source project, they pretty much all have a standardized API layer. So what we don't see is a lot of custom integrations where you have to code specifically for one technology. Because you have this API layer, managing all of those different tools and technologies and connecting down to them becomes relatively straightforward. The hard part is connecting them all together and watching them holistically. So having a workflow editor, which Moritz will show, where you can abstract all of those different things into a common view is very, very helpful. When you have this central layer, one of the things that you're able to do is observe the whole data pipeline. 
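As a rough illustration of the proactive check described above (a load expected to take five minutes that has already run for fifteen), here is a minimal Python sketch. The function name `check_runtime`, the `tolerance` multiplier, and the status strings are hypothetical choices for this example and are not taken from any Stonebranch product.

```python
from datetime import timedelta


def check_runtime(expected: timedelta, elapsed: timedelta, tolerance: float = 1.5) -> str:
    """Classify a running step by comparing elapsed time with its expected duration.

    `tolerance` is an illustrative multiplier: the step is reported as an
    overrun only once it has run longer than expected * tolerance.
    """
    if elapsed <= expected:
        return "on_track"
    if elapsed <= expected * tolerance:
        return "slow"
    return "overrun"


# The case from the talk: expected five minutes, already at fifteen.
status = check_runtime(timedelta(minutes=5), timedelta(minutes=15))
print(status)
```

A central layer could run a check like this for every step on a schedule and raise an alert on anything other than `on_track`, rather than waiting for the step to fail outright.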
And what we see a lot of companies wanting to do, as they pull data from different sources, is really keep traceability of where this data goes. So if you have some type of sensitive data, like personal information or health care data, that you're moving through a data pipeline, you want to know where that data came from and where you put it, because it may not be in your on-premise systems. You may have moved it to a cloud provider, and then it may go somewhere specific within that cloud provider. Knowing that history of where the data went, in case you ever need to trace it, is becoming more and more important as data regulations come into play. Security is also key. One of the things that this central layer does is let you abstract all of the secrets, as we call them, from all of the different data sources. So you are not hard coding any of your passwords, certificates, or encryption keys. Those need to be abstracted outside of your data pipeline because you want to rotate them every so often. You want to audit them every so often. And this central layer can provide that capability also. On custom scripts, the goal is to eliminate that glue that I talked about, or at least to have everybody using the same glue, so that you eliminate what I call tribal knowledge. The knowledge isn't just in someone's head, and the operations team understands and can troubleshoot because everything is done in a more standardized way. Then the alerts: for any error that happens across a data pipeline, you can route it through this central management to a particular person who may need to come and take a look. It's akin to an application: when there is some sort of problem with an application, you may need to route it to a particular person who has that knowledge. And then there is really managing the time and the cost associated with operating and building data pipelines. 
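To make the earlier point about not hard coding secrets concrete, here is a minimal Python sketch. It uses environment variables as a stand-in for a central secret store, which is an assumption made for this example; the variable names `WAREHOUSE_USER` and `WAREHOUSE_PASSWORD` and both helper functions are hypothetical.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required credential has not been provisioned."""


def get_secret(name: str, env=os.environ) -> str:
    """Look up a credential by name instead of embedding it in pipeline code."""
    value = env.get(name)
    if not value:
        raise MissingSecretError(f"secret {name!r} is not configured")
    return value


def build_connection_settings(env=os.environ) -> dict:
    # WAREHOUSE_USER and WAREHOUSE_PASSWORD are hypothetical names; the
    # pipeline code only knows the names, never the values.
    return {
        "user": get_secret("WAREHOUSE_USER", env),
        "password": get_secret("WAREHOUSE_PASSWORD", env),
    }
```

Because the pipeline refers to secrets only by name, the values can be rotated or audited in one place without touching any pipeline definition.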
As you standardize more and more aspects of your data pipeline, you are still always going to use specific tools, because depending on the data specifics and what type of value you're trying to get from it, you're never going to come down to a small set of data tools. What's important is that you standardize the integration between these tools and all of the operational aspects, so that when the data science team hands a pipeline off to the operational team, they understand what to do. And down here, these are just some of the third-party tools that we can integrate with. These are the ones that we see most commonly. Again, I would put them into three different categories: the cloud service providers, which are making access to very sophisticated tools cheap and easy; some of the third-party tools, where you see Databricks, Informatica, and Teradata; and then open source. Especially when you get into the data science world, there's really a lot of open source that we see people using. So that is just to baseline us on what a data pipeline is and some of the challenges with it. Now I'm going to hand it over to Moritz, who's going to go through a demo and talk about some of our customer use cases. Moritz?
All right. Thank you very much, Peter. Hello and good morning, good afternoon, everyone. This is Moritz. As Peter said, in the following section of this webinar I am going to demonstrate how such a big data pipeline can be orchestrated, with a live demo using the Universal Automation Center. But before jumping into the live demo, let me quickly explain what UAC, which stands for Universal Automation Center, is all about. UAC is a platform for real-time hybrid IT automation. It can be used for all kinds of modern automation scenarios across your on-prem and cloud environments. And this slide shows some of the core features of the platform. 
Most importantly, it allows event-driven automation. This means it can react to external or internal events to drive the automation process in near real time. These things used to be mostly time driven in the past, which we can also still do, but event based is just much more efficient and state of the art. So you really want to trigger the automation when it should be running and not at a certain static time of day. The platform also allows you to visually design workflows that can span any environment. In the demo later on, you will see an example of such a workflow. Self-service automation is all about bringing automation to so-called citizen automators. Traditionally, these kinds of tools were available only to technical people, right? But nowadays, everyone should be able to be involved and to benefit from automation, because it affects most of the people in an organization. And in the best case, everyone can have an influence, or can be involved, or can even define their own automation workflows that they depend on in their daily business. Infrastructure and service automation means that we are not only automating things that are running on the servers, but also the servers themselves and maybe even networking components. So it's automating the entire architecture and infrastructure. Managing data pipelines, also a core feature, is all about moving files and data securely between environments and applications. For this reason, UAC has a built-in managed file transfer solution. Last but not least, analytics and visibility is about always knowing what is going on in the system. Depending on the user, you might want to see different information about workflows that affect your work or that you have responsibility over, right? Some people are responsible for making sure that all the automation, and the data pipelines in this case, run smoothly. So you want to have all the information that you need to do your job and to react to certain events. 
So this is about the platform itself, but what you can do with this platform is another matter. We tried to summarize the, I would say, hottest use cases or solutions that we see among our customers on this slide here. So we have jobs and workloads. As Stonebranch comes from the traditional job scheduling and batch processing world, this is our bread and butter, so to say. It's all about running scripts and commands on all kinds of servers, which were traditionally, of course, in our customers' data centers and have now moved to the cloud, but jobs and workloads still need to run on those servers. With the cloud also came the need to orchestrate and automate the cloud itself, meaning not only cloud services, but also platforms and infrastructure in the cloud. DevOps automation is all about automating the CI/CD pipeline. If you are a DevOps person, you know that code really runs through a lot of programs and stages until it is deployed to production. This entire tool chain we can also orchestrate. Big data automation and orchestration is what we are talking about today. It's about automating big data applications and moving data through those big data pipelines. Hybrid file transfer is something our customers are really interested in lately. It's about moving data not only within on-premise environments, but also between the new cloud environments and legacy on-premise systems. We still have customers that use a mainframe, which is still very common in the banking and insurance sector, right? But these banks are also innovating a lot in IT, so they also have very innovative platforms in the cloud or using container technology, and they need to get data, for example, from their mainframe to their new applications that run in the cloud, to feed those applications with the right data from the core banking system. This gap is really something that Stonebranch was able to bridge with UAC. Okay. Moving on to the demo. 
So what you're about to see is a data pipeline that we prepared for you. If you remember the slides from earlier, a data pipeline usually has several tools involved in several categories. We have data sources. We also have transformation and ingestion tools. We have storage. And in the delivery category, we have tools like Tableau or Power BI to visualize the data, right? In the demo that follows, we have a couple of these tools involved. But the data that we are moving through our data pipeline is not just any data. We want to make it as close to a real-life scenario as possible. So consider this business scenario. Imagine you are a large office supply manufacturer who operates all over the US, and all the different resellers in each region provide data, maybe once a day, maybe once a month, in the best case as often as possible, right? The business user wants to base some business decisions on this data. The data is collected and is run through the data pipeline to be enhanced and cleansed. In the end, it shows up in a dashboard for the business user. In this case, we use Tableau. For this scenario, all data has been provided except the data from the central region. So the central region data is missing. I'm now going to share my screen so you can see what the business user is seeing. There we go. What you can see here is a dashboard that the business user of this office supplies manufacturer created to compare the sales in each region with the discounts that they have applied to sell those products. As you can see, we have data from the east region, south region, and west region. The data from the central region is missing. The data pipeline that we have prepared can be viewed like this. What you see here is the Universal Automation Center. It's a web-based application, so I'm using my browser to access it. 
And I created a specific dashboard for a specific user, me in this case, who is responsible for this particular data pipeline. What we see here is a summary of all the jobs that are involved in this data pipeline, which are quite a lot. We also see all kinds of alerts and notifications that have been created from this data pipeline. If something has gone wrong, we notify someone, or maybe an approval is necessary. So we see all this information here. We also see SLA violations. Behind each step in this big data pipeline, there's an SLA. We really want to know when something is running too long, when something has started late, or maybe when something has finished early. So we want to see all this information in our dashboard. We even see something we call projected SLA violations. We have ways of projecting and forecasting that something is about to be late or about to run over its time, so we can take action before it actually happens. This is what the operator sees here in this window. In the largest widget on this dashboard, we see all the different job types involved in this data pipeline, and you can see most of them are in waiting status. So this data pipeline is idling. If I go to my workflow, which includes all the different jobs here, I will be presented with the workflow monitor, which shows all the different steps of the data pipeline end to end. Some of them are red, and this is our tool's way of letting us know which tasks in this workflow are critical, that is, which ones are on the critical path. Meaning, if one of those jobs on the critical path runs over, the entire data pipeline will be late. Let's disable this view for a second, and we will see that we are waiting. What are we waiting for? The data pipeline, or any workflow in UAC, can be triggered by events. I said earlier that event-driven automation is a core feature of the platform. 
So there can be different events that we can react to in order to start our data pipeline. For the sake of this demo, we decided to go for SQS, which is an Amazon Web Services application for message queuing, right? So imagine there's a message queue in AWS, and all kinds of applications can push messages to this queue. In our case, we are monitoring a specific queue for a message to arrive which tells us that the data from the central region is available and we can run it through our data pipeline. To simulate this message being sent, I use Postman as a web services call tool to send a message to our AWS queue. As soon as I hit the send button, we will see our data pipeline start. So let's check it out. The request went through. So this hypothetical application created the event that a message is posted to the queue. And we see our monitor has captured this event and has already started with our data extraction. If you remember the high-level data pipeline images that we showed earlier, these two jobs here are our data sources. One is an SAP system, and the other one is a Windows machine with an SQL database on it. What we're doing here, right after the data pipeline is kicked off, is extracting data from SAP and extracting data from this database, which is from the central region. Afterwards, we want to transfer the two files that have now been created on those two servers. We want to transfer them to a central directory so they can be moved through the pipeline. We put a stop on this job here for demo purposes so it doesn't run straight through. It's quite fast, actually. So let's release this particular job, which checks that enough disk space is available on the target directory. Since it has gone to success, we can assume there's enough space to transfer those files. 
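The two ideas in this part of the demo, starting the pipeline when a message arrives and gating the transfer on free disk space, can be sketched in a few lines of Python. This is only an illustration: an in-memory `queue.Queue` stands in for the SQS queue, and the message text and function names are hypothetical.

```python
import queue
import shutil


def has_free_space(directory: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `directory` has enough free space."""
    return shutil.disk_usage(directory).free >= required_bytes


def wait_for_trigger(messages: queue.Queue, expected: str, timeout: float = 1.0) -> bool:
    """Consume messages until `expected` arrives.

    Returns False if the queue stays empty for `timeout` seconds.
    """
    while True:
        try:
            body = messages.get(timeout=timeout)
        except queue.Empty:
            return False
        if body == expected:
            return True


# Simulate the upstream application posting its "data is ready" message.
inbox = queue.Queue()
inbox.put("central-region-data-available")

if wait_for_trigger(inbox, "central-region-data-available") and has_free_space(".", 1):
    print("start extraction")
```

The point of the event-driven pattern is that nothing runs on a fixed clock: the extraction starts as soon as the message is seen, and the disk check fails fast before any file is moved.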
Afterwards, we have a Linux task, because the target machine is a Linux system, to do some grooming on the target directory, meaning we move some files to an archive so the directory is clean for our new files to arrive. Once this cleanup is finished, we have two file transfer jobs. One is using UDM, which is a proprietary Stonebranch file transfer protocol, and the other one is using traditional SFTP. Using UDM, we transfer a file, the extract from our SAP system, from the SAP machine, which is a Windows machine, to our Linux machine. If we look into it, we can see how it is defined. The relevant fields are down here. We use our UDM file transfer protocol. We transfer a file from a Windows machine to a Linux machine. And if we check the output, we will see that the file has been successfully transferred to our target directory. The same applies to the other file transfer, which uses SFTP. In this case, it doesn't use an agent on the source system because it uses a remote file transfer protocol. Once the file transfers to the target directory on Linux are successful, we have included an approval task in our data pipeline. We want someone to approve that the data pipeline has been okay so far and that we can continue with our transformation of the data. As for the ways that we can get this approval, currently we have an integration with Slack and also with Microsoft Teams. So if I move to my Slack application, I will get a notification: hey, there's a new data pipeline running. We need your approval to continue. So right from the application (I could do the same within Microsoft Teams) we can hit the approve button, and on the other side, our data pipeline will continue. So the user doesn't even have to open UAC to move on with the data flow. It can all be done from Slack, for example. The next step is that we move the files from our on-premise system to cloud storage. 
So the SAP extract is uploaded to our Azure storage, and the database extract is uploaded to S3 storage, for example. As soon as this is successful, our Informatica application will be triggered to cleanse the data and to transform it. Informatica has integrations with Azure and S3, so Informatica can grab those files itself. We don't have to push them to Informatica, right? This is the point-to-point integration that Peter talked about. Some of these tools know what is to their left and right, and they can integrate with it. The benefit of such an overall workflow is that we really know what's going on end to end. As we can see, after the Informatica transformation of the data is successful, we have a couple of Snowflake-related jobs. Snowflake is our data warehouse, where Tableau gets its data from to display the dashboard that we saw at the beginning. What we do here is some cleaning on the table in Snowflake, and then we upload data from Azure directly to Snowflake. Again, Snowflake has an integration, and we tell it to actively load the data from Azure storage. After the data is ingested into Snowflake, we refresh our Tableau workbook, because this dashboard, for example, is running against a data source, and this data source fetches data only maybe once a day, right? Because it's a very compute-heavy process, it's not a live connection in this case. So we need to tell Tableau that there's new data in the data source, and we refresh the data source. If we now refresh our dashboard here, we will see the data from the central region has been added and the dashboard now displays the data that our business user is so desperate to see, right? 
So now the business user can use this data, and he even gets an email notification that the data pipeline is successful, that the data is ingested, and that he can check his dashboard again. So now you can see, okay, he's notified in real time. The data is now available. I can check out my Tableau. There are a couple more jobs in this data pipeline that concern a totally different team. Right? So this was concerning the business users. However, the data is also used by a data science team, for example, to train their machine learning algorithms. For this reason, we have a parallel flow here, also utilizing Azure Data Factory to prepare the data, and then we upload it to a compute solution called Databricks, which has a cluster in the background that we check is available. And afterwards, we run a machine learning job to, you know, feed our algorithms with the new data. And afterwards, since this team is not using Tableau, it is using Power BI for whatever reason. Right? It's all good. We want to refresh the Power BI data flow. However, this task has now failed. And if we check back on our dashboard, it has changed a lot. A lot of tasks have finished successfully, but we have a couple of failures here. And we can see there is an alert that has been posted to ServiceNow. So for this particular task, we have created a ServiceNow incident that is posted to the team who is responsible for Databricks. And one second. I was locked out again. So this team has now gotten an incident to check on this Databricks, sorry, on this Power BI, and the incident is assigned to this particular team. And when we check the incident, we even have the output of our job. So the team already knows what's going on, and they can fix the issue. They can notify the operator of this data pipeline, and they can then restart this job. For the sake of this demo, I will just force finish it so we can finish the workflow.
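The behaviour shown in this part of the demo (two branches running off the same upload, one task failing, an incident carrying the job output being assigned to a team) can be sketched in a few lines. This is an illustrative sketch only, not how Universal Automation Center or ServiceNow work internally: `run_workflow`, the task names, the `owners` mapping, and the incident dictionary are hypothetical, and the failure is simulated. Dependency ordering uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies, owners):
    """Run tasks in dependency order and record an incident for each failure.

    tasks:        task name -> zero-argument callable that raises on failure
    dependencies: task name -> set of task names that must succeed first
    owners:       task name -> team that should receive the incident
    """
    status = {}
    incidents = []
    for name in TopologicalSorter(dependencies).static_order():
        predecessors = dependencies.get(name, set())
        if any(status[p] != "success" for p in predecessors):
            status[name] = "skipped"  # an upstream task failed or was skipped
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception as exc:
            status[name] = "failed"
            # Stand-in for an incident: a plain dict, not a ServiceNow API call.
            incidents.append({
                "task": name,
                "assigned_team": owners.get(name, "pipeline-operators"),
                "job_output": str(exc),
            })
    return status, incidents


def succeed():
    return None


def failing_refresh():
    raise RuntimeError("simulated failure for illustration")


if __name__ == "__main__":
    # Hypothetical task names loosely following the demo's two branches.
    dependencies = {
        "informatica_transform": {"upload_to_cloud"},
        "snowflake_load": {"informatica_transform"},
        "tableau_refresh": {"snowflake_load"},
        "data_factory_prepare": {"upload_to_cloud"},
        "databricks_ml_job": {"data_factory_prepare"},
        "power_bi_refresh": {"databricks_ml_job"},
    }
    names = set(dependencies) | {"upload_to_cloud"}
    tasks = {name: succeed for name in names}
    tasks["power_bi_refresh"] = failing_refresh
    status, incidents = run_workflow(tasks, dependencies, {"power_bi_refresh": "bi-team"})
    for name in sorted(status):
        print(f"{name}: {status[name]}")
    print(incidents)
```

Because the two branches only share the upload step, the simulated Power BI failure leaves the Tableau branch untouched, which matches what the dashboard shows in the demo: most tasks succeeded and a failure sits in the data science flow.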
So after everything is successful, we even get an email that everything has run successfully, and we get basically a summary of what has happened in the workflow: if everything ran fine, what were the starting dates of each task, what were the ending timestamps. So everything is fine. If we go back to our dashboard, we see everything has, you know, finished successfully and the dashboard looks fine again. So this was basically the live demo I wanted to show you. So I'm gonna stop sharing my screen because we have a couple more slides to share. Alright. You should be seeing the slides again. And before we finish the session, we wanted to tell you a bit about a customer that we have, where we were able to solve a particular problem they had with their data pipeline. So basically, this customer was undergoing a digital transformation to move all applications and services to the cloud. And so their starting point included having data assets and products still living in multiple on-premise legacy systems. Right? So then, as part of the digital transformation, they built a pretty cutting-edge global cloud platform for their entire global operations. And this platform was for enterprise analytics and data management. It was based on Azure, and they used the Azure hub-and-spoke model. So each region in the world where they operate had its own segregated cloud environment that receives and uses services from the central hub. Since they were on Azure, they used Azure Data Factory to manage most of the data pipeline, which worked and still works great. Right? We don't want to replace Azure Data Factory in this example. However, looking at the full picture, some parts of the data pipeline still reside on premise and also involve other ETL tools such as Informatica and AWS Glue for some areas, since they are not exclusively on Azure. Right?
So like most companies nowadays, this customer was, you know, having a multi-cloud strategy to use the best services out of each cloud platform and not put all their eggs in one basket. So in the end, there were really a couple of blind spots that kept Azure Data Factory from orchestrating the entire data pipeline. Therefore, the goal of this customer was to really orchestrate the full data pipeline end to end by finding a platform that can integrate all of their critical tools. The solution in this case was to use Universal Automation Center as a meta orchestrator for the entire data pipeline. And this is what it looks like on a high level. So we have a very similar picture to what Peter has shown earlier. However, the tools that they use for their particular data pipeline are shown here. So they have Amazon S3 as a data source. They have Google Cloud Storage. They have a couple of ETL tools, such as Informatica and Data Factory. They use BigQuery, Snowflake, and data lakes for data storage, Databricks for analyzing the data, and visualization is done in Power BI. So, as you know, every data pipeline looks different. And these data pipelines are prone to change a lot over time. So the way we tackled this issue was by providing easy integrations to new important tools: either we provide them ourselves to our customers, or we give our customers the opportunity to develop these integrations themselves. And UAC brings, you know, all the tools that are necessary for this. So I said meta orchestrator earlier because we don't really replace Data Factory or Informatica in these orchestration scenarios. They are pretty good at what they're doing. Right? So we simply bridged the gaps in a very diverse pipeline by providing an orchestration layer on top with integrations to all the important tools. Alright. I think we're coming to the end of the slides.
However, to wrap up, I wanted to give you a summary of the things that you should be looking for in a big data pipeline orchestration solution if you have a similar issue or if you're looking for a platform that can do similar things. So I'm not gonna read through all these things because we mentioned them already. However, the most important ones for such a tool are these. First, the capability to automate across the entire hybrid IT environment, meaning on premise and cloud. An event-based automation approach is really, really important for a new tool that you implement in your environment. It also should be DataOps enabled, meaning that integrations to all the critical tools in the data pipeline should be there. If not, you should be able to build them yourself. Another important point is that this tool should be available SaaS based or on premise, because you are probably going through a similar digital transformation where you put all your resources in the cloud, and such an orchestrator simply has to be able to be deployed in the cloud as well. Alright, this is it for my part. I think now we start with the Q and A session. Thanks, Moritz. I appreciate it. What we are going to do is I will read off the questions, and then I will take it or I will hand it off to Moritz so we can both kind of go back and forth on it. Sounds good. So the first question we have is on data sources. You know, and I'll read the question. Being a pharma company working with sensitive data such as genomics, we acquire large data sets from many vendors using an interactive UI versus doing it with CLI, command line interface. With encryption, passphrases sent by separate keys. We also get public domain data sets that require firewall changes and use of custom downloading tools by data providers. How does UAC help address these challenges? Maybe I'll kind of make a couple of comments, and then Moritz, you can kind of comment also. Sure.
You know, we kind of view UAC as a hybrid IT tool, and one of the aspects of that is that not only can we do things centrally, but we also have agent technology. And one of the reasons for our agent technology is that data sources can exist on different systems that may not have modern ways of transferring data. So our agents can sit on many different types of platforms and operating systems and then wait for data. There are different types of trigger mechanisms, and then they can also transfer that data to different systems. That's kind of, you know, from system to system. There's also what we call managed file transfer, which is usually when you're talking about business-to-business file transfers. You know, managed file transfer typically uses secure FTP or FTPS. It basically has a way to configure ways to programmatically send and receive files from defined, you know, endpoints out there. But on the, you know, encryption part, there are a lot of ways that we can, you know, separate keys from the actual data, and then during the pipeline we can trigger that data to be decrypted, and it can go get the key from email or, you know, what's becoming very, very popular is having a secret management tool. All the cloud providers have one. There are also different vendors that provide centrally managed key stores, so we can integrate with those also. Absolutely, good point. And also, I think B2B file transfer is always difficult, because if you are a large organization with many partners that you want to exchange data with, you may also use different kinds of protocols, maybe even ones that are specific to this industry. So definitely UAC can be used as a platform to orchestrate all file transfers using different protocols and different ways of getting the data. Peter mentioned the agent. That is definitely one way.
We have customers that provide their very important vendors with an agent as an endpoint to very securely transfer those files, right? Because we have built-in encryption and also compression, so even large data sets can be transferred securely and fast. But depending on what the business partner wants to use, you may have to find a different solution for this, right? And UAC is definitely a very flexible platform to incorporate all these different ways. Next question: Can you run a simulated test before pushing the workflow into production? Yeah, so probably one of the things that you should think about with your data pipelines is that you should almost look at them as a development process. And what that means is that you have an environment where your teams are developing the pipelines. You may have an environment where you are promoting them to be tested. So, you know, it kind of mimics the actual production environment, where you have the data coming in and everything in the pipeline going through some sort of a test process, validating the data and the models, and then you have a production, you know, environment where you actually run it. And what's really important about that is that everything should be version controlled, so that just like when you're doing software development, any sort of schemas, scripting, you know, it's what we typically call automation as code or infrastructure as code, that that's all versioned and that you have a way to, you know, promote that from environment to, you know, environment. And UAC has that concept where you can take a particular pipeline and then promote it to a different, you know, environment. You define as many environments as you have as part of your pipeline development process. But yeah, that's something to think about when you're developing a pipeline, and you know that it's going to evolve and go through different versions over time as you get more sophisticated.
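The promotion idea in this answer (one versioned pipeline definition, moved unchanged from development to test to production, with only environment-specific values differing) can be sketched as follows. This is a hedged illustration of the automation-as-code principle, not the platform's own promotion feature: `definition_version`, `render`, `promote`, the `${...}` placeholder syntax, and the example definition are all assumptions.

```python
import hashlib
import json
from string import Template


def definition_version(definition: dict) -> str:
    """Derive a short version id from the canonical JSON form of a definition."""
    canonical = json.dumps(definition, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def render(value, variables: dict):
    """Recursively substitute ${name} placeholders in every string value."""
    if isinstance(value, str):
        # substitute() raises KeyError when a variable is missing, so an
        # incomplete environment fails loudly instead of running half-configured.
        return Template(value).substitute(variables)
    if isinstance(value, list):
        return [render(item, variables) for item in value]
    if isinstance(value, dict):
        return {key: render(item, variables) for key, item in value.items()}
    return value


def promote(definition: dict, environment: dict) -> dict:
    """Return the definition rendered for one environment, tagged with its version."""
    rendered = render(definition, environment)
    rendered["version"] = definition_version(definition)
    return rendered


if __name__ == "__main__":
    # Hypothetical definition; the names are illustrative only.
    pipeline = {
        "name": "sales_pipeline",
        "tasks": [{"name": "load", "target": "${warehouse}/sales"}],
    }
    print(promote(pipeline, {"warehouse": "TEST_DB"}))
    print(promote(pipeline, {"warehouse": "PROD_DB"}))
```

Because the version id is computed from the unrendered definition, the test and production copies carry the same id, which makes it easy to confirm that what was tested is what is running.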
Next question: Can you create user-specific environments? For example, I want my architect to have a different view from my finance team. Moritz, do you want to take that one? Sure, sure. That's actually a great question. So the UAC platform is used by a lot of different teams that, you know, do different things and are also only supposed to do certain things. So we can segregate this environment using a technology called business services to basically partition the data. So depending on what group the user is a member of, he sees and can access only certain data. And on top of this, every user and every group can create their own dashboards, for example, to really see in one view only the data relevant to their task at hand. Right? So there is really a lot of flexibility here to give different kinds of users different experiences and make all the information easily available. Okay. Thanks, Moritz. Next question. There's a lot here, so I'm trying to go through them. I'll take this one. There seems to be a lot of scripting also in a solution of this type. What are the strategies to reduce the effort in developing all of the bridges? And there was a similar question that talked about just recommendations on how you kind of corral the number of tools being used. So on the scripting part, you know, a lot of what you do with this centrally managed thing is reuse. And so what you want to do is look for common functions that you have across your data pipelines, and you can create what we call tasks that can be reused across your data pipelines so that you're getting that reuse there. And we also try to provide a UI front end to the scripts.
So a lot of the time these scripts have to have values passed down or shared across your data pipeline, and so those can be abstracted away from the script and managed centrally, which makes these scripts more reusable. The other thing that we do is standardize on the scripting languages so that you don't have, you know, five different scripting languages that you're using, but you're really using a common one with the tool. And then on the proliferation of the tools and technologies you use in your data pipelines, one recommendation we would make is that if you need, for example, a place to store data, just look at your primary places for your unstructured data and for your structured data. You know, believe it or not, there's still a large number of relational databases being used out there, so choose a relational database vendor. This was a problem that I ran into, where our database team got really big because we had to support all of these different databases, and then we kind of brought it back down to just a small set so that we didn't have to maintain this huge set of knowledge. So for certain things, agree on a certain technology or certain service and then use that first unless there is a very good reason to deviate from it. The next question. If the underlying infrastructure stack has many points of failure that require frequent human troubleshooting intervention, does it make sense to adopt UAC or would it simply add overhead? In other words, do you assume relatively robust underlying infrastructure for this to work reasonably well? Moritz, do you want to handle that one? Yes, so I assume this is pointed either towards the infrastructure of the UAC platform itself or towards the infrastructure of all the different applications that we use in our data pipeline.
So if it's towards the UAC infrastructure, yes, the entire platform can be set up highly available, so there's no single point of failure at any layer, you know, that would put such an important application and platform at risk of not running, not automating, not orchestrating. If it was towards the infrastructure of all these big data related applications, absolutely, we can have agents on those servers and in those cloud, you know, environments that are hosting these machines to periodically check if the system is running. And if not, we can, you know, take actions and notify people to check what is going on before actually trying to execute workload on those environments and, you know, running into a failure. Yeah, and I'll just add that one thing it provides for operations teams when they're operating these pipelines is a common way to alert and troubleshoot. And any data pipeline needs to go through a maturity process like any piece of code, where you just kind of do root cause analysis and try to work out, you know, different issues over time so it gets more and more stable. I think we are out of time, so I am going to hand it back to the moderator. But I would like to thank everybody for attending and appreciate all the good questions. All right. Thank you. And as always, I'd like to thank the speakers, Peter and Moritz, for this amazing, informative presentation. Thank you to the audience for participating. We will be available on demand, so if you haven't had a chance to listen today, hopefully you can get a chance to listen at any other time. A big thank you to our sponsor Stonebranch and also the IEEE Society. Thanks everybody for attending. See you next time.
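The pre-flight idea from the final answer (check that a target system is running before executing workload on it, and notify people instead of running into a failure) can be expressed as a small guard. This is a generic sketch under stated assumptions, not agent behaviour documented by the vendor: `run_if_healthy`, its parameters, and the retry policy are hypothetical, and the health check itself is left as a caller-supplied function.

```python
import time


def run_if_healthy(check, workload, notify, attempts=3, delay_seconds=0.0):
    """Run workload only if check() returns True within the allowed attempts.

    check:    zero-argument callable returning True when the target is available
    workload: zero-argument callable to execute once the target is healthy
    notify:   one-argument callable that receives a message when the run is abandoned
    """
    for attempt in range(1, attempts + 1):
        if check():
            return workload()
        if attempt < attempts:
            time.sleep(delay_seconds)  # wait before probing the target again
    notify(f"target system unavailable after {attempts} checks; workload not started")
    return None


if __name__ == "__main__":
    # Illustrative stand-ins: the target becomes healthy on the second probe.
    probes = iter([False, True])
    result = run_if_healthy(
        check=lambda: next(probes),
        workload=lambda: "ml job submitted",
        notify=print,
    )
    print(result)
```

Keeping the check, the workload, and the notification as separate callables means the same guard can sit in front of any step in a pipeline, whatever the actual availability probe turns out to be.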
With the aid of many data management and processing tools, big data flows through multiple on-prem and cloud data storage locations before it’s delivered to business users. As a result, IT teams, including IT Ops, DataOps, and DevOps, are often overwhelmed by the complexity of creating a reliable data pipeline that includes the automation and observability they require.
The answer to this widespread problem is a centralized data pipeline orchestration solution.
In this on-demand recorded webinar, Peter Baljet, Stonebranch Chief Technology Officer, and Moritz Roos, Stonebranch Director of Solution Engineering EMEA, take an in-depth look at how enterprises can centrally manage and orchestrate the automation required to control and maintain the data pipeline for big data in a hybrid environment with an IT automation platform.
Key learnings:
- Discover how to orchestrate data pipelines across a hybrid IT environment (on-prem and cloud)
- Find out how DataOps teams are empowered with event-based triggers for real-time data flow
- Learn how to replace point-to-point integrations and custom scripts with pre-built tool integrations and native managed file transfer functionality
- See examples of reports, dashboards, and proactive alerts designed to help you reliably keep data flowing through your business – with the observability you require
- Explore how to replace clunky legacy approaches to streaming data in a multi-cloud environment
- Watch a demo designed to illustrate the power of what’s possible with Stonebranch’s Universal Automation Center
Note: For additional information, we offer Gartner’s “2024 Market Guide for Service Orchestration and Automation Platforms (SOAP)”* in our resources section. Request free access now.
Duration: 59:10
* Gartner, "Market Guide for Service Orchestration and Automation Platforms," Manjunath Bhat, Daniel Betts, Hassan Ennaciri, Chris Saunderson, 17 April, 2020. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose. This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request. GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally, and is used herein with permission. All rights reserved.