This article was written by Rahul Pathak, vice president of relational database engines at AWS.
Integrating data across an organization can give you a better picture of your customers, streamline your operations, and help teams make better, faster decisions. But integrating data isn’t easy.
Organizations typically gather data from different sources, using a variety of tools and systems such as data ingestion services. That data is often stored in silos, which means it has to be moved into a data lake or data warehouse before analytics, artificial intelligence (AI), or machine learning (ML) workloads can be run. And before the data is ready for analysis, it needs to be combined, cleaned, and normalized, a process known as extract, transform, load (ETL), which can be laborious and error-prone.
At AWS, our goal is to make it easier for organizations to connect to all of their data, and to do it with the speed and agility our customers need. We’ve developed our pioneering approach to a zero-ETL future based on these goals: Break down data silos, make data integration easier, and increase the pace of your data-driven innovation.
The problem with ETL
Combining data from different sources can be like moving a pile of gravel from one place to another: difficult, time-consuming, and often unsatisfying work. First, ETL frequently requires data engineers to write custom code. Then, DevOps engineers or IT administrators have to deploy and manage the infrastructure to make sure the data pipelines scale. And when the data sources change, the data engineers have to manually change their code and deploy it again.
Furthermore, when data engineers run into issues, such as data replication lag, breaking schema updates, and data inconsistency between the sources and destinations, they have to spend time and resources debugging and repairing the data pipelines. While the data is being prepared—a process that can take days—data analysts can’t run interactive analyses or build dashboards, data scientists can’t build ML models or run predictions, and end users, such as supply chain managers, can’t make data-driven decisions.
This lengthy process kills the opportunity for any real-time use cases, such as assigning drivers to routes based on traffic conditions, placing online ads, or providing train status updates to passengers. In these scenarios, the chance to improve customer experiences or address new business prospects can be lost.
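To make the maintenance burden described above concrete, here is a minimal sketch of the kind of hand-written ETL job a data engineer might own. Everything in it is an assumption made for illustration: the source rows, the column layout, and the function names are hypothetical, and SQLite stands in for a real data warehouse.

```python
import sqlite3
from decimal import Decimal

# Hypothetical raw rows standing in for two siloed source systems.
ORDERS_EU = [("A-1", " widget ", "19.90", "EUR"), ("A-2", "gadget", "5.00", "EUR")]
ORDERS_US = [("B-7", "Widget", "21.50", "USD")]


def extract():
    """Extract: pull raw rows from each source (here, in-memory lists)."""
    return ORDERS_EU + ORDERS_US


def transform(rows):
    """Transform: trim and lowercase product names, convert amounts to integer cents."""
    cleaned = []
    for order_id, product, amount, currency in rows:
        cents = int(Decimal(amount) * 100)
        cleaned.append((order_id, product.strip().lower(), cents, currency))
    return cleaned


def load(conn, rows):
    """Load: write the normalized rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id TEXT PRIMARY KEY, product TEXT, amount_cents INTEGER, currency TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run extract, transform, and load in order; return the loaded row count."""
    load(conn, transform(extract()))
    return conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    print(run_pipeline(warehouse))
```

Note how tightly the transform step is coupled to the source layout: if one source starts sending a fifth column, the tuple unpacking in transform raises a ValueError and the pipeline stops until someone changes and redeploys the code. That is the kind of breaking schema update the section describes.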
Getting to value faster
Zero-ETL lets you query data in place through federated queries and automates moving data from source to target, so you don’t have to build and maintain the pipelines yourself. This means you can run analytics on transactional data in near real time, connect to data in software applications, and generate ML predictions from within data stores rather than moving the data to an ML tool, all of which helps you gain business insights faster. You can also query multiple data sources across databases, data warehouses, and data lakes without having to move the data. To accomplish these tasks, we’ve built a variety of zero-ETL integrations between our services to address many different use cases.
For example, let’s say a global manufacturing company with factories in a dozen countries uses a cluster of databases to store order and inventory data in each of those countries. To get a real-time view of all the orders and inventory, the company has to build an individual data pipeline from each of the clusters to a central data warehouse so it can query across the combined data set. To do this, the data integration team has to write code to connect to 12 different clusters and manage and test 12 production pipelines. After the team deploys the code, it has to constantly monitor and scale the pipelines to optimize performance, and when anything changes, it has to make updates in 12 different places. By using the Amazon Aurora zero-ETL integration with Amazon Redshift, the data integration team can eliminate the work of building and managing custom data pipelines.
Another example would be a sales and operations manager who wants to know where the company’s sales team should focus its efforts. Using Amazon AppFlow, a fully managed no-code integration service, a data analyst can ingest sales opportunity records from Salesforce into Amazon Redshift and combine them with data from other sources such as billing systems, ERP, and marketing databases. With data from all these systems available for analysis in one place, the sales manager can keep the sales dashboard up to date and point the team toward the right sales opportunities.
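The idea of querying data in place, mentioned earlier in this section, can be illustrated with a small analogy. The sketch below is not how AWS federated queries are implemented. It uses SQLite’s ATTACH DATABASE statement to join tables that live in two separate database files without copying either one into the other. The file names, table names, and sample rows are all hypothetical.

```python
import os
import sqlite3
import tempfile


def _create(path, ddl, insert_sql, rows):
    """Create one standalone database file and fill it with sample rows."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(ddl)
        conn.executemany(insert_sql, rows)
        conn.commit()
    finally:
        conn.close()


def build_sources(directory):
    """Build two independent stores: one for orders, one for inventory."""
    orders_path = os.path.join(directory, "orders.db")
    inventory_path = os.path.join(directory, "inventory.db")
    _create(
        orders_path,
        "CREATE TABLE orders (sku TEXT, quantity INTEGER)",
        "INSERT INTO orders VALUES (?, ?)",
        [("widget", 3), ("gadget", 5), ("widget", 4)],
    )
    _create(
        inventory_path,
        "CREATE TABLE stock (sku TEXT PRIMARY KEY, on_hand INTEGER)",
        "INSERT INTO stock VALUES (?, ?)",
        [("widget", 10), ("gadget", 2)],
    )
    return orders_path, inventory_path


def demand_vs_stock(orders_path, inventory_path):
    """Join across both stores in a single query; no rows are copied between them."""
    conn = sqlite3.connect(orders_path)
    try:
        # Make the second database visible under the schema name "inv".
        conn.execute("ATTACH DATABASE ? AS inv", (inventory_path,))
        return conn.execute(
            """
            SELECT o.sku, SUM(o.quantity) AS ordered, i.on_hand
            FROM orders AS o
            JOIN inv.stock AS i ON i.sku = o.sku
            GROUP BY o.sku, i.on_hand
            ORDER BY o.sku
            """
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(demand_vs_stock(*build_sources(workdir)))
```

The point of the analogy is the shape of the work: the join is expressed once, against the data where it already lives, instead of being preceded by a copy step that someone has to write, schedule, and monitor.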
Case study: Magellan Rx Management
In one real-world use case, Magellan Rx Management (now part of Prime Therapeutics) has used data and analytics to deliver clinical solutions that improve patient care, optimize costs, and improve outcomes. The company develops and delivers these analytics via its MRx Predict solution, which uses a variety of data, including pharmacy and medical claims and census data, to optimize predictive model development and deployment and to maximize predictive accuracy.
Before Magellan Rx Management began using Redshift ML, its data scientists arrived at a prediction by going through a series of steps using various tools. They had to identify the appropriate ML algorithms in Amazon SageMaker or use Amazon SageMaker Autopilot, export the data from the data warehouse, and prepare the training data to work with these models. When the model was deployed, the scientists went through various iterations with new data to make predictions (also known as inference). This involved moving data back and forth between Amazon Redshift and SageMaker through a series of manual steps.
With Redshift ML, the company’s analysts can classify new drugs to market by creating and using ML models with minimal effort. The efficiency gained through leveraging Redshift ML to support this process has improved productivity, optimized resources, and generated a high degree of predictive accuracy.
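As a rough illustration of what creating and using a model from inside the warehouse can look like, the sketch below builds two SQL statements as strings: a Redshift ML CREATE MODEL statement and a query that calls the resulting prediction function. It follows the general documented shape of CREATE MODEL (FROM, TARGET, FUNCTION, IAM_ROLE, SETTINGS with S3_BUCKET), but it is not Magellan Rx Management’s actual setup. The table, column, function, role, and bucket names are hypothetical placeholders, and the helper functions are invented for this example, so check the Amazon Redshift documentation before relying on the exact syntax.

```python
import re

# Accept only simple SQL identifiers (optionally schema-qualified), since
# they are interpolated directly into the statement text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def _ident(name):
    """Return name unchanged if it is a plain identifier, else raise ValueError."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"not a simple SQL identifier: {name!r}")
    return name


def create_model_sql(model, table, features, target, function, iam_role_arn, s3_bucket):
    """Build a CREATE MODEL statement that trains on features plus the target column."""
    columns = ", ".join(_ident(c) for c in list(features) + [target])
    return (
        f"CREATE MODEL {_ident(model)}\n"
        f"FROM (SELECT {columns} FROM {_ident(table)})\n"
        f"TARGET {_ident(target)}\n"
        f"FUNCTION {_ident(function)}\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"SETTINGS (S3_BUCKET '{s3_bucket}');"
    )


def prediction_sql(function, table, features):
    """Build a query that runs inference by calling the model's SQL function."""
    args = ", ".join(_ident(c) for c in features)
    return f"SELECT {args}, {_ident(function)}({args}) AS predicted FROM {_ident(table)};"


if __name__ == "__main__":
    # All names below are hypothetical placeholders.
    print(
        create_model_sql(
            model="drug_class_model",
            table="drug_training",
            features=["feature_a", "feature_b"],
            target="drug_class",
            function="predict_drug_class",
            iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftMLRole",
            s3_bucket="example-bucket",
        )
    )
    print(prediction_sql("predict_drug_class", "new_drugs", ["feature_a", "feature_b"]))
```

The contrast with the earlier workflow is that both training and inference are expressed as SQL against tables already in the warehouse, so there is no manual export and re-import step between the warehouse and a separate ML tool.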
Integrated services bring us closer to zero-ETL
Our mission is to make it easy for customers to get the most value from their data, and integrated services are key to this process. That’s why we’re building towards a zero-ETL future, today. With data engineers free to focus on creating value from the data, organizations can accelerate their use of data to streamline operations and drive business growth. Learn more about AWS’s zero-ETL future and how you can unlock the power of all your data.