Big on Data bro Andrew Brust’s recent post on the spring cleaning of Hadoop projects evidently touched a nerve, given the readership numbers that went off the charts. By now, the Apache Hadoop family of projects is no longer the epicenter of “Big Data” the way it was a decade ago, and in fact, postmortems on the death of Hadoop have been floating around so long that they sound more like the latest incarnation of the old tagline that Francisco Franco is still dead.
And if you want to rub it in further, just look at job postings. A recent survey (shown below), posted last month by Terence Shin and compiled by scraping more than 15,000 data scientist job listings, shows demand for Hadoop skills seriously declining, along with demand for C++, Hive, and several legacy proprietary languages. By the way, Spark and Java are also on the same list. Would the results have been any different if the same question were asked about data engineers?
Resolved: “Hadoop” is considered so 2014. And as for Big Data, well, the world has moved on there as well. Big Data was so labelled because, at the time, combing through multiple terabytes or petabytes was exceptional, as was the ability to extend analytics to nonrelational data. Today, multimodel databases have grown common, while most relational data warehouses have added the capability to parse JSON data and superimpose graph data views. The ability to query data in cloud storage directly and/or federate queries from data warehouses has also become commonplace. No wonder we now just call it “Data.”
As Andrew pointed out, the spring cleaning was about clearing out the cobwebs. Contrary to conventional wisdom, Hadoop is not dead. A number of core projects from the Hadoop ecosystem continue to live on in the Cloudera Data Platform, a product that is very much alive. We just don’t call it Hadoop anymore because what’s survived is the packaged platform that, prior to CDP, didn’t exist. The zoo animals are now safely caged.
What’s dead is the idea of assembling your own cluster from a half dozen or more discrete open source projects. Now that there are alternatives (and we’re not just talking CDP), why waste time manually installing and integrating Apache MapReduce, Hive, Ranger, or Atlas? Prepackaged platforms have been the norm in the database world for at least the last 30 years; when you bought Oracle, you didn’t have to install the query optimizer and storage engines separately.
This being the 2020s, your organization is more likely than not planning to implement cloud services for new projects rather than install packaged software. While the drive to cloud was originally about cost shifting, today it’s more likely to be about operational simplification and agility under a common control plane.
Today, there are multiple paths to analyzing data of the kind once defined by the Three Vs. You can readily access data residing in cloud object storage, the de facto data lake. You can do so through ad hoc query, using a service like Amazon Athena; through the federated query capabilities that are now checkbox items for most cloud data warehousing services; or by running Spark against the data, using a dedicated service like Databricks or a cloud data warehousing service like Azure Synapse Analytics. Recognizing that the boundaries between data warehouses and data lakes are blurring, many are now adopting the fuzzy term “data lakehouse” for platforms that either consolidate access across data warehouse and data lake, or turn the data lake into an 80% version of the data warehouse.
And we haven’t even broached the topic of AI and machine learning. Like Hadoop in its early days, machine learning and broader AI were originally the exclusive domain of data scientists (with help from data engineers). Today, data scientists have an abundance of tools and frameworks for managing the lifecycle of the models that they create, and they no longer have the field to themselves. For instance, AutoML services have brought the building of ML models within reach of citizen data scientists, while cloud data warehouses are increasingly adding their own prepackaged ML models that can be triggered through SQL commands.
In all this, it’s easy to forget that barely a decade ago, all of this hardly seemed possible. Google’s pioneering research set the gears in motion. With Google File System, the Internet giant devised an append-only file system that broke through the limitations of conventional storage networks by taking advantage of cheap disk. With MapReduce, Google cracked the code on attaining almost linear scalability, also on commodity hardware, an accomplishment that had been elusive under the scale-up SMP architectures that, at the time, were the prevalent path to scale.
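For readers who never wrote a MapReduce job, here is a minimal single-process sketch of the idea in Python, using the classic word count. It is not Google’s or Hadoop’s implementation; the function names and the two sample “splits” are our own illustrative assumptions. The point is that each map call touches only its own split and each reduce call touches only its own key, which is what lets a real framework spread the work across many cheap machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Emit (word, 1) pairs for one input split; needs no other split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key; each key can be reduced independently."""
    return key, sum(values)


def word_count(splits):
    # In a real cluster the map calls would run in parallel on separate nodes;
    # here they run one after another in a single process.
    mapped = chain.from_iterable(map_phase(split) for split in splits)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two splits that could live on different machines.
splits = [
    ["big data is big"],
    ["data is data"],
]
counts = word_count(splits)
print(counts)
```

Adding more splits adds more independent map tasks rather than a bigger single task, which is the intuition behind scaling out instead of scaling up.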
Google published the papers, which was a good thing for Doug Cutting and Mike Cafarella, who at the time were working on a search engine project that could index at least a billion pages, and who saw a path through open source that could dramatically reduce the cost of implementing such a system. The rest of the community then picked up where Cutting and Cafarella left off; for instance, Facebook developed Hive to provide a SQL-like query language for combing through diverse, petabyte-scale sets of data.
It’s easy to forget today, with the adoption numbers of classic Hadoop projects dropping, that the discoveries of the Hadoop project led to a virtuous cycle where innovation devoured its young. When Hadoop emerged, data got so voluminous that we had to bring compute to the data. With the emergence of cloud-native architectures, made possible by cheap bandwidth and lots of it, and more tiers of storage, compute and data got separated again. It’s not that either approach was right or wrong, but instead, that each was right for the time in which it was used. That’s the cyclic nature of tech innovation.
By blasting through the limits of scale-out processing, lessons learned from Hadoop helped spawn a cycle where lots of old assumptions, such as GPUs being used strictly for graphics processing, fell by the wayside. Hadoop’s legacy is not only the virtuous cycle of innovation that it spawned, but the fact that it got enterprises over their fear of dealing with data, and lots of it.