Amazon Web Services is once more tapping hardware to optimize the performance of its Amazon Redshift data warehouse. This week, it is announcing general availability of the Advanced Query Accelerator (AQUA) for Amazon Redshift. This follows the private preview, which we covered in detail when it was announced at the 2019 re:Invent conference just over a year ago.
To recap, AQUA sits between the compute and storage of an Amazon Redshift RA3 cluster; RA3 instances are the compute instances designed for separating compute from storage. In contrast to the traditional nodes used for Amazon Redshift (typically DC2, which offer their own locally attached SSD storage), RA3 nodes use local SSDs strictly for caching. Instead, they typically draw upon data sitting in Amazon Redshift Managed Storage (RMS) on S3, and can scan it in massively parallel fashion.
Prior to the advent of AQUA, this option was targeted at cooler data that could be used near-line, and analytics that touch only a fraction of the total data. Unlike traditional Redshift, the RA3 option provided a bit of elasticity in that compute nodes could be turned off when not used. And since the original announcement, AWS has expanded the RA3 line with a smaller node that is a fraction of the cost of the existing larger ones.
AQUA features additional custom hardware that includes Amazon’s Nitro hypervisor system along with FPGAs; Nitro offloads compute-intensive functions such as encryption and compression, while the FPGA runs customized analytic routines such as scans, filtering, and aggregation. This is designed to reduce the time for offloading and processing data from RMS.
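The benefit of running scans, filtering, and aggregation close to storage can be illustrated with a minimal Python sketch. It is not AQUA's implementation: the `Row` type, the `STORAGE` list, and both scan functions are hypothetical stand-ins, and the only point is to compare how many values cross the storage/compute boundary in each approach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    region: str
    amount: int


# Hypothetical stand-in for rows held in managed storage.
STORAGE = [
    Row("us", 10), Row("eu", 20), Row("us", 30), Row("ap", 40), Row("us", 50),
]


def scan_without_pushdown(region):
    """Ship every row to the compute layer, then filter and sum there."""
    shipped = list(STORAGE)  # all rows cross the storage/compute boundary
    total = sum(r.amount for r in shipped if r.region == region)
    return total, len(shipped)


def scan_with_pushdown(region):
    """Filter and aggregate next to storage; ship only the partial result."""
    partial = sum(r.amount for r in STORAGE if r.region == region)
    return partial, 1  # a single aggregated value crosses the boundary


total_a, moved_a = scan_without_pushdown("us")
total_b, moved_b = scan_with_pushdown("us")
print(total_a == total_b, moved_a, moved_b)  # True 5 1
```

Both paths return the same answer; the difference is that the second moves one value instead of five rows, which is the kind of saving the accelerator layer is aiming for at much larger scale.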
When AWS unveiled the RA3 instance for Redshift, it was targeted at workloads with “near-line” data: data that is occasionally used in analytics but does not need to be kept continually online. AQUA in turn added a high-performance option based on the notion that large volumes of data sitting in managed storage are being used on an operational, rather than occasional, basis. The end result is that it can process the portion of this data that fits into cache at high speed, and larger datasets at lower speed, though these still benefit from massively parallel scans courtesy of the FPGAs.
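The split between data served from cache and data fetched from managed storage can be pictured as a read-through cache. The sketch below is illustrative only: `ReadThroughCache`, the `managed_storage` dictionary, and the least-recently-used eviction policy are all assumptions made for the example, not a description of how RA3 or AQUA actually manage their caches.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Small LRU cache in front of a slower backing store (hypothetical)."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing = backing_store
        self.blocks = OrderedDict()  # block_id -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            # Fast path: block is cached; mark it as most recently used.
            self.blocks.move_to_end(block_id)
            self.hits += 1
            return self.blocks[block_id]
        # Slow path: fetch from the backing store and cache the block.
        self.misses += 1
        data = self.backing[block_id]
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return data


# Hypothetical backing store standing in for managed storage.
managed_storage = {i: f"block-{i}" for i in range(10)}
cache = ReadThroughCache(capacity=3, backing_store=managed_storage)
for block_id in [0, 1, 0, 2, 0, 1]:
    cache.read(block_id)
print(cache.hits, cache.misses)  # 3 3
```

A working set that fits in the cache is served from the fast path after the first read of each block; anything larger keeps falling through to the slower store.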
Amazon Redshift RA3, with or without AQUA, works against data that is stored in specialized form on S3. Termed “managed storage,” this is data in S3 with optimizations based on factors such as data block temperature, data block age, and workload patterns, along with the ability to autoscale storage on demand. This should not be mistaken for a data lake, as AQUA is not designed to work against raw, unoptimized data stored in S3 that would otherwise be used by AWS services such as Amazon EMR, Athena, or Redshift Spectrum.
AQUA represents a generational shift for Redshift. When Redshift was introduced nearly a decade ago, it capitalized on the columnar, massively parallel (“shared nothing”) architectures that were the state of the art for data warehouses back in the early 2000s. Such platforms provided scale and performance improvements over conventional data warehouses using row-based architectures because queries could skip non-relevant columns and more readily compress data.
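The two columnar advantages mentioned above, skipping non-relevant columns and compressing data more readily, can be shown with a small Python sketch. The table contents and the `run_length_encode` helper are made up for illustration, and run-length encoding is just one simple compression scheme chosen for the example, not a statement about the encodings Redshift uses.

```python
from itertools import groupby

# Hypothetical sales table in row-oriented form: (region, product, amount).
rows = [
    ("us", "widget", 10),
    ("us", "widget", 30),
    ("us", "gadget", 50),
    ("eu", "gadget", 20),
]

# Column-oriented layout: one list per column.
columns = {
    "region": [r[0] for r in rows],
    "product": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# A query such as SUM(amount) reads only the "amount" column;
# the "region" and "product" columns are never touched.
total = sum(columns["amount"])

# Values within one column tend to repeat, so they compress well.
encoded_region = run_length_encode(columns["region"])

print(total)           # 110
print(encoded_region)  # [('us', 3), ('eu', 1)]
```

In a row-based layout the same sum would have to read every field of every row, and the mixed values within a row offer fewer repeats to compress.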
But in the interim, cloud-native architectures that separated storage from compute emerged, pioneered by Google BigQuery and Snowflake. The advantage of cloud-native architectures was that they were elastic; because storage was separated from compute, compute nodes could be turned off when not used, and more readily ramped up when needed. Such architectures have become the de facto standard for cloud data warehousing, and over the past few years, Redshift has added options such as Redshift Spectrum and Redshift Federated Query that provide some degree of elasticity.
With AQUA, Redshift comes full circle; it now has an elastic option that doesn’t sacrifice performance, at least for cooler data. Here’s what it doesn’t do: it doesn’t replace conventional Redshift instances, which are designed for “hot” data, and it doesn’t replace the data lake (or “data lake house”) that works against raw, unoptimized data sitting in S3. Instead, AQUA extends the data warehouse with an economical pricing model designed for working against occasionally used, near-line data.
But here’s our wish list going forward. Obviously, keep existing Amazon Redshift for customers demanding predictable latencies against core operational data. But we would also like to see AWS introduce a serverless version of Redshift built around AQUA for customers with highly variable analytic workloads, just as it already offers serverless options for databases like Amazon Aurora. Yes, it already offers a serverless option for query compilation; let’s now take this one all the way.