Real-time data processing is hot. Pioneers like Netflix have been doing it for years and reaping the benefits. Big on Data has been onto this for years, too. Now the rest of the world seems to be catching up.
The streaming analytics market (which, depending on definitions, may be just one segment of real-time data processing) is projected to grow from $15.4 billion in 2021 to $50.1 billion in 2026, at a Compound Annual Growth Rate (CAGR) of 26.5% during the forecast period, according to Markets and Markets.
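As a quick sanity check on those figures, the growth rate implied by the two endpoints can be recomputed with the standard CAGR formula. This is a minimal sketch; the function names are ours, and the only inputs are the numbers quoted above.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.25 means 25% per year)."""
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding `rate` once a year for `years` years."""
    return start_value * (1 + rate) ** years


# Figures quoted in the forecast: $15.4B in 2021, $50.1B in 2026 (five years).
implied_rate = cagr(15.4, 50.1, 5)
projected_2026 = project(15.4, 0.265, 5)

print(f"Implied CAGR: {implied_rate:.1%}")
print(f"2026 value at a 26.5% CAGR: ${projected_2026:.1f}B")
```

Both results land within rounding distance of the published numbers, so the forecast is internally consistent.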
Today, Redpanda Data (formerly Vectorized) announced it has raised $50M in Series B funding, led by GV with participation from Lightspeed Venture Partners (LSVP) and Haystack VC. Released in early 2021, Redpanda is touted as a modern streaming platform that gives developers a simpler, faster, more reliable, and unified record system for real-time and historical enterprise data.
We caught up with Redpanda founder and CEO Alex Gallego to discuss the platform’s origins and key premise, as well as business fundamentals and roadmap.
Natural evolution
One thing to know about the real-time data processing market is that there is a sort of de facto standard there: Apache Kafka. We have followed Kafka and Confluent, the company that commercializes it, since 2017. ZDNet’s own Tony Baer and Andrew Brust have been keeping up, with Baer summarizing the evolution of Kafka and Confluent in April 2021, when Confluent confidentially filed for an IPO.
In 2019, over 90% of people responding to a Confluent survey deemed Kafka mission-critical to their data infrastructure, and queries on Stack Overflow grew over 50% during the year. As successful as Confluent may be and as widely adopted as Kafka may be, however, the fact remains: Kafka’s foundations were laid in 2008.
As real-time data processing gains adoption, the stakes are getting higher, and the requirements are getting more demanding. Gallego had been working in stream processing for about 13 years before he started work on the Redpanda engine. In 2016, he sold Concord, another company in the real-time data processing space, to Akamai.
Redpanda started as “the natural evolution” of what Gallego thought streaming should be like. His motivation was to understand the gap between what the hardware could do and what the software could do:
“I literally connected edge computers with the cable back to back just to make sure there was nothing in between these two computers. And I just wanted to measure and understand: what is the fundamental evolution of hardware, and did software actually take advantage of modern hardware?” said Gallego.
His findings suggested that existing solutions, built for decade-old hardware, were oriented towards addressing what was the fundamental limitation of the hardware at the time: spinning disks. The new limitation, he found, is actually CPU coordination.
Sometimes you really get to reinvent the wheel when the road changes, is how Gallego summarized his findings. In 2017, he shared his findings publicly, and in 2019, he started working on Redpanda. Originally Redpanda was a platform for experts by experts, Gallego said: “It was designed for people that were like me: streaming experts that wanted something more with the storage”.
Gallego is not alone in pointing out shortcomings in Kafka. About 40% of Redpanda customers are streaming engine experts, Gallego said. Crucially, the choice to maintain compatibility with the Kafka API and the entire Kafka ecosystem was made early on. The Redpanda storage engine was written before embarking on building a company.
Redpanda was initially closed source. In late 2020, it was made source available under the Business Source License (BSL), a choice inspired by CockroachDB. In 2021, Gallego said, Redpanda started with hundreds of customers. By the middle of the year, they were in the thousands, and they ended the year with hundreds of thousands of Redpanda clusters.
The Ring Zero of real-time data processing
Besides experts, Redpanda has also attracted people who had never heard about streaming before, Gallego noted. At the same time, he feels credit is due to Kafka, as well as Pulsar, RabbitMQ, and the entire family of streaming systems that came before Redpanda.
The Kafka broker was a fundamental piece in building the data streaming infrastructure, Gallego acknowledged. The most powerful thing Kafka did was create an ecosystem. The fact that Kafka connects transparently to platforms ranging from Spark Streaming, Flink and Materialize to MongoDB and ClickHouse means that Redpanda does, too.
No hero migration stories, no code changes, just some configuration change, and it all works, is the promise. That definitely sounds compelling for everyone in Kafka’s large installed base. Redpanda has released a benchmark comparing its platform to Kafka to back the claims of superior performance.
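To make the "just some configuration change" claim concrete, here is a minimal sketch of what such a switch could look like from a client's point of view. `bootstrap.servers`, `client.id` and `acks` are standard Kafka client settings; the hostnames and the client name are placeholders we made up, and an actual migration may involve more than this.

```python
# Hypothetical client settings; hostnames and the client id are placeholders.
KAFKA_CONFIG = {
    "bootstrap.servers": "kafka-broker.example.internal:9092",
    "client.id": "orders-service",
    "acks": "all",
}

# The same application pointed at a Kafka API-compatible cluster:
# only the broker address differs, the rest of the settings are reused.
REDPANDA_CONFIG = {
    **KAFKA_CONFIG,
    "bootstrap.servers": "redpanda-broker.example.internal:9092",
}


def changed_keys(old: dict, new: dict) -> list:
    """Return the sorted setting names whose values differ between two configs."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


print(changed_keys(KAFKA_CONFIG, REDPANDA_CONFIG))
```

The application code that produces and consumes messages is untouched in this sketch, which is the point of keeping wire-level compatibility with the Kafka API.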
Redpanda’s brownfield and greenfield use cases span fintech, gaming and adtech companies, electric car manufacturers, the largest CDN in the world, some of the largest banks, as well as the likes of Alpaca and Snapchat.
A feature that sets Redpanda apart, and one Gallego believes helped onboard new users to streaming, is that it ships as a single binary file, with no external dependencies whatsoever. But there is more. For starters, Redpanda is implemented in C++. This is a story we’ve seen before: ScyllaDB vs. Cassandra comes to mind.
The main premise of Redpanda is a simple, fast, reliable engine with Kafka compatibility. But Gallego chose to emphasize something else: unified, meaning unified access to data. That, Gallego said, allows developers to build a new category of applications they couldn’t build before:
“For a developer, having unlimited data retention means that they don’t have to worry about disaster recovery, and they now have a backup. They don’t have to worry a priori about which other databases or downstream systems they need to materialize. They simply push their data into Redpanda, and we’re transparently here, and it’s relatively cost-effective to store even petabytes of data”.
What Redpanda is focusing on, as per Gallego, is what he called “Ring Zero”: having a streaming system as the source of truth, which is not a solved problem, but one that Redpanda is tackling head-on. However, we should also note that there are some parts of the streaming puzzle that users won’t find in Redpanda, namely complex processing or a SQL interface.
Gallego breaks stream processing down into complex stream processing and simple transformations. Simple transformations, such as masking private and sensitive information, can be done more efficiently with Redpanda, Gallego claimed. That’s because the transformation is done inside Redpanda instead of sending the data to an external engine like Flink or Spark.
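To give a sense of what a "simple transformation" means here, the sketch below masks sensitive fields in a single JSON event. It is plain Python for illustration only and does not use Redpanda's transform API; the field names and the masking rule are assumptions of ours.

```python
import json

# Hypothetical names of fields that must not leave the platform in clear text.
SENSITIVE_FIELDS = {"email", "card_number"}


def mask_value(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with asterisks."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def mask_record(raw: bytes) -> bytes:
    """Decode a JSON event, mask its sensitive string fields, and re-encode it."""
    event = json.loads(raw)
    for field_name in event.keys() & SENSITIVE_FIELDS:
        if isinstance(event[field_name], str):
            event[field_name] = mask_value(event[field_name])
    return json.dumps(event).encode("utf-8")


record = b'{"user": "ana", "card_number": "4111111111111111", "amount": 42}'
print(mask_record(record).decode("utf-8"))
```

A per-record function like this needs no state and no joins, which is why it can plausibly run next to the data rather than in a separate processing cluster.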
Going forward
As for complex stream processing, whether it’s SQL or something else, Redpanda relies on a partner ecosystem. Gallego believes having companies that are focused on specific layers yields a better product. This principle also extends to how Redpanda approaches real-time machine learning.
While Gallego believes that real-time machine learning is on the rise, he does not see Redpanda fitting into this storyline on the machine learning algorithms part. The TensorFlows and SparkMLs of the world have that covered, he concedes. What Redpanda brings to the table is a scalable backpressure valve that allows the machine learning algorithm to replay.
Fraud detection is a typical example of real-time machine learning. In a scenario where bias is detected in a credit score application, you would need to go back and reprocess the entire history, and this is where Redpanda shines, Gallego said:
“Using Redpanda means that you don’t have to change your application to be able to reprocess the entire history of all of your events that led to that decision. What that’s really creating is a new engine of record that allows the machine learning algorithms to reprocess the data, have access controls, have backpressure spill to disk in case that you get a ton of load”.
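The replay idea can be illustrated with a toy append-only log held in memory. This is a conceptual sketch, not Redpanda's storage engine or API: the `EventLog` class, the `replay` helper and the two scoring rules are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Tuple


@dataclass
class EventLog:
    """Toy append-only log: events are only ever appended, never modified."""
    _events: List[dict] = field(default_factory=list)

    def append(self, event: dict) -> int:
        """Store an event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset: int = 0) -> Iterator[Tuple[int, dict]]:
        """Yield (offset, event) pairs starting at `offset`."""
        for i in range(offset, len(self._events)):
            yield i, self._events[i]


def replay(log: EventLog, model: Callable[[dict], bool], offset: int = 0) -> List[bool]:
    """Re-run a scoring function over the retained history, oldest event first."""
    return [model(event) for _, event in log.read_from(offset)]


log = EventLog()
for amount in (120, 9500, 40):
    log.append({"amount": amount})


def old_model(event: dict) -> bool:
    return event["amount"] > 10_000  # original, hypothetical fraud rule


def new_model(event: dict) -> bool:
    return event["amount"] > 5_000  # corrected, hypothetical fraud rule


print(replay(log, old_model))
print(replay(log, new_model))
```

Because the history is retained and addressed by offset, swapping in a corrected model is a matter of reading from offset 0 again; the code that produced the events does not change.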
As far as the future of real-time data processing goes, Gallego thinks of Kafka and its API as a historical artefact — in a positive way. Developers bought into the ecosystem, and they built millions of lines of code, but the future is a different API, Gallego thinks:
“I think the future is serverless. I think the future is a less heavyweight protocol than the Kafka protocol. I think that Redpanda is a company that can give people both A and B. A is compatibility with this hugely rich ecosystem that is always going to be important, and B is because we’re more tied to the market evolution from batch to real-time.
Today it happens to be that Kafka API is the best way that we could do that. But I think it will be a different API in the future, and it’ll be a new API that is really designed for the way modern applications are being built. That’s how I see the story arc for Redpanda”.
That sounds like an approach that tries to marry pragmatism with vision. The extent to which Redpanda can grow its brownfield and greenfield user base remains to be seen. However, adoption signs seem encouraging, and the vote of confidence from investors helps.
With its latest capital infusion, Redpanda has raised $76M to date and plans to grow its global engineering and go-to-market teams as customer adoption accelerates. The company started 2021 with just under 20 employees and ended the year with 60.