MLOps: How We Got Here & How Databricks Can Help

The democratization of machine learning (ML) and the explosive growth of opportunities for models have created or exacerbated classes of risk that prevent companies from realizing ROI on their data and ML initiatives. ML model risks span people, processes and technology. In our experience, while most organizations have made significant progress over the past few years, we still see preventable technology risks in ML that stem from friction and from the difficulty of reproducing data scientists’ experiments. These include:

  • Input reproducibility: The data scientist is not able to trace or reproduce the data set that was used for training the model. Causes include a lack of upstream data lineage, as well as manual copying of data.
  • Methodology reproducibility: The experiment can’t be reproduced easily by another individual or put in production. This could be because of lack of access, low quality of the training code or the inability to reproduce the training environment (libraries and/or tooling).
  • Scoring reproducibility: The team has no easy mechanism to establish which model made a given prediction. Similarly, metrics are pulled and analyzed manually, a high-friction process that leads to a slow, reactive response to problems.
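
The three risks above can be made concrete with a small sketch. It uses only the Python standard library and is not tied to any particular platform; the names `fingerprint_rows`, `capture_environment`, `RunRecord` and `predict_with_lineage` are hypothetical, chosen for illustration. The idea, under those assumptions, is to hash the training data, record the environment, and stamp every prediction with the model version that produced it.

```python
import hashlib
import json
import platform
from dataclasses import dataclass
from importlib import metadata


def fingerprint_rows(rows):
    """Input reproducibility: return a SHA-256 digest of the training rows."""
    digest = hashlib.sha256()
    for row in rows:
        # Sort keys so that equal rows always serialize (and hash) identically.
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def capture_environment(packages=()):
    """Methodology reproducibility: record interpreter and library versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly instead of failing the whole run.
            versions[name] = None
    return {"python": platform.python_version(), "packages": versions}


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record tying a model version to its data and environment."""
    model_version: str
    data_fingerprint: str
    environment: dict


def predict_with_lineage(record, model, features):
    """Scoring reproducibility: attach lineage to every prediction."""
    return {
        "prediction": model(features),
        "model_version": record.model_version,
        "data_fingerprint": record.data_fingerprint,
    }


# Toy usage: a "model" that doubles x, trained on two illustrative rows.
rows = [{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": 4.0}]
record = RunRecord(
    model_version="demo-0.1",
    data_fingerprint=fingerprint_rows(rows),
    environment=capture_environment(),
)
result = predict_with_lineage(record, lambda f: 2.0 * f["x"], {"x": 3.0})
print(result["prediction"], result["model_version"])
```

In practice an experiment tracker and a governed data layer would take over this bookkeeping; the sketch only shows what information needs to be captured for a prediction to be traceable back to its model, data and environment.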

Due to these challenges and a variety of additional factors, most ML projects do not make it to production. We’ve found that simply productizing ML models is not enough without taking the broader context into account. Delivering data science artifacts into a production environment is much easier than ensuring the end-to-end product is a value-generating business asset. While the holistic approach to AI products is important, certain technical impediments have been holding back progress as well.

More recently, there have been significant advances in the ML software ecosystem, easing the risks mentioned above. However, the breakneck rate at which new tools and methodologies appear, along with the effervescence around MLOps, can often have the opposite of the intended effect: fragmentation of best practices rather than predictable, uniform methodologies, as each team chooses its preferred stack; lack of portability across domains, which limits cross-functional initiatives; and accumulation of design, data and technical debt around the model ecosystem.

Ultimately, the original goal of speeding up development and deployment of ML products ends up being diluted by migration and orchestration of multiple tools, each requiring time and effort to adopt effectively.

To address some of these challenges, MLOps entered the market as a socio-technical manifestation that recognizes the importance of multidisciplinary collaboration between business, technology, people and processes, which is crucial for the success of advanced data products. The primary goal of practicing MLOps is to apply agile, DevOps, DataOps and ModelOps principles to the relevant decomposed system components to solve these challenges. Furthermore, MLOps can help conceptualize and define AI products, as it brings together data, AI/ML models, infrastructure, and responsible AI governance and principles to deliver measurable business value.

In the “data and ML platform as a product” mindset, we can’t neglect developer experience (DX) as a key factor in sustainable adoption. By treating data scientists and analysts as customers of the ML platform, we put more importance on ease of use, quality of tool/platform documentation, clarification of access control and discoverability. A comprehensive ML and data platform like Databricks provides a powerful cloud-agnostic set of components to expedite the development of an advanced analytical platform with the vision of MLOps in mind.

Databricks provides unified API and UI access to the essential components of a data and ML platform. Some of these features include:

  • A Delta Lakehouse, which features Delta Lake as the storage layer, Apache Spark for both batch and stream processing and Unity Catalog for data governance. Out of the box, this provides traceability of data assets, version control through change data feed and easy-to-provision compute units for processing or experimentation.
  • Integrated MLflow, which supports the model development lifecycle with experiment tracking, a model registry and serverless model serving.
  • Databricks Workflows and Delta Live Tables, which orchestrate data processing, ML and other analytical data pipelines. These are configurable by the user and deployable via IaC tools, such as Terraform or dbx.
  • Feature Store, which simplifies complex and time-consuming ML feature engineering and management.
  • Integrations with popular version control systems and developer tools, which allow easy versioning of experiments directly through the Databricks interface.
  • Comprehensive APIs and tools like Databricks Connect, which allow developers to use Databricks compute, code and data from any tool.

A data and ML platform will not solve all issues related to ML development and deployment. Yet, by taking a product mindset for the platform and considering the journey of our users, from onboarding to peak usage, we can alleviate a class of technological friction that hinders or ruins the ROI on analytical initiatives. Developing and scaling next-gen data platforms while properly governing AI and ML models is key for companies that are committed to driving more insightful, data-driven decisions. Explore our Databricks partner page for more insights.