Last week, dotData, a company focused on automated feature engineering (AutoFE) and automated machine learning (AutoML), announced the integration of its AutoFE technology with the Databricks platform. Feature engineering is one of the toughest parts of building a machine learning model because it requires both technical and domain knowledge to determine which columns in the source data are most relevant to the model’s predictions. DotData’s new integration enables Databricks users, including those without advanced data science expertise, to design richer ML model features, tackle more challenging AI use cases and enhance model accuracy.
Also read: Data-driven 2021: Predictions for a busy year in data, analytics and AI.
What is Automated Feature Engineering?
ZDNet spoke with Ryohei Fujimaki, PhD, dotData’s CEO and founder, who explained that the company’s AutoFE technology discovers patterns in the source data to find statistically important features that can improve model accuracy. These features augment the domain-relevant ones that data scientists might discover intuitively. Specifically, dotData’s Python library, dotData Py, is now Databricks-compatible and can be installed via pip on the Databricks platform. This replaces tedious conventional feature engineering work that must otherwise be carried out manually in code against Spark, pandas, or Dask dataframes.
DotData’s tech uses algorithms that can discover multimodal patterns in the data to find the columns (i.e., features) that have the most impact on predictions. In addition, AutoFE can transform a set of normalized relational tables into a single “feature table” that can be used as the data set with which to train the optimal machine learning model. Temporal, geo-locational, and text data are also supported, as is integration with object storage and file systems like Amazon S3, Azure Data Lake Storage (ADLS) and the Hadoop Distributed File System (HDFS), as well as traditional data warehouses.
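To make the “feature table” idea concrete, here is a minimal sketch of the manual work that AutoFE is meant to automate: flattening a one-to-many relationship between two normalized tables into one row of features per entity. It does not use dotData’s library or reflect its API. The table contents, column names and the build_feature_table helper are all hypothetical, and plain Python stands in for Spark, pandas or Dask to keep the example self-contained.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical normalized tables: one row per customer,
# zero or more rows per customer in the transactions table.
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "business"},
    {"customer_id": 3, "segment": "retail"},
]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 100.0},
]


def build_feature_table(customers, transactions):
    """Flatten the one-to-many relationship into one feature row per customer."""
    # Group transaction amounts by the foreign key.
    amounts = defaultdict(list)
    for tx in transactions:
        amounts[tx["customer_id"]].append(tx["amount"])

    feature_table = []
    for customer in customers:
        values = amounts.get(customer["customer_id"], [])
        feature_table.append({
            **customer,
            # Hand-picked aggregate features; customers with no
            # transactions get zeros rather than missing values.
            "tx_count": len(values),
            "tx_total": sum(values) if values else 0.0,
            "tx_mean": mean(values) if values else 0.0,
            "tx_max": max(values) if values else 0.0,
        })
    return feature_table


feature_table = build_feature_table(customers, transactions)
for row in feature_table:
    print(row)
```

In this sketch, a person chose the four aggregates by hand. With more tables, time windows and column types, the number of candidate features grows quickly, which is the search that an automated approach takes over.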
The New Collaboration
Explainability features, such as auto-generated feature explanations and feature blueprints, are available to help citizen data scientists and data scientists alike understand what each feature is and how it is relevant. The dotData/Databricks integration uses the power of both platforms for quick prototyping of use cases and for improving model accuracy by finding the optimal features faster. For example, users can govern dotData’s AI features by using Databricks’ new Feature Store (a centralized repository of features), and ML experiments can be managed by using Databricks’ implementation of MLflow. At a lower level, dotData’s AutoFE technology uses the Databricks File System (DBFS) and the Databricks Runtime (an optimized version of Apache Spark) to speed execution.
These particular integrations are primarily aimed at experienced data scientists who use Python, notebooks and various machine learning libraries such as PyTorch, XGBoost, TensorFlow, and scikit-learn. DotData’s AutoFE supports data scientists as they explore different types of feature hypotheses. It focuses on conventional business data use cases rather than deep learning use cases (i.e., dotData does not mine image, video, or unstructured data). Its benefit to users comes from advanced computation rather than from contextual knowledge of the domain.
What Does the Future Hold?
Integrating domain knowledge into the model building process has always been a challenge. AutoFE tackles this challenge by augmenting domain features with more statistical ones. Compared to manual feature engineering, AutoFE analyzes more data in a shorter period of time in order to find the most relevant features. DotData is a pioneer in this space. Through this new integration with Databricks, users of both platforms can now benefit by finding and generating relevant features and optimizing model accuracy. Chances are, some form of AutoFE will become more mainstream in the future, finding its way into numerous AutoML platforms.
Esin Alpturk contributed to the reporting in this post.