DataOps vs. DevOps
DataOps is gaining traction in the market, formalizing the practices of modern data management at large companies, much as DevOps did when enterprises adopted it. The engineering framework that DevOps created provides a strong foundation for DataOps. Just as companies need DevOps to provide a high-quality, consistent framework for feature development, data-driven enterprises need a high-quality, consistent framework for rapid data engineering and analytics development.
The key DevOps concepts adopted by DataOps include:
- Agile Development
- Focus on Delivering Business Value
- Continuous Integration & Continuous Delivery (CI/CD)
- Automated Testing & Code Promotion
- Reuse & Automation
Despite many similarities, there are several differences between DataOps and DevOps to note:
The Human Factor
One key difference between DataOps and DevOps relates to the needs and preferences of stakeholders. DevOps was created for software developers: engineers who love coding and embrace technology. DataOps users are often the opposite. They are data scientists or analysts who are focused on building and deploying models and visualizations, and they are typically not as technically savvy as engineers.
The DataOps lifecycle shares the iterative properties of DevOps, but an important difference is that DataOps consists of two active and intersecting pipelines: the data pipeline and the analytics development process. The data pipeline takes raw data sources as input and, through a series of orchestrated steps, produces an input for analytics. DataOps automates orchestration and monitors the quality of data flowing into analytics. Analytics development is the process by which new analytical ideas are introduced. It conceptually resembles a DevOps development process, but because DataOps must manage both pipelines at once, it is often more challenging than DevOps.
Application code does not need to be orchestrated in the DevOps process, but orchestration is required in both the data pipeline and the analytics development process. Analytics development orchestration occurs in conjunction with testing and prior to the deployment of new analytics. Orchestration of the data factory is the second orchestration in the DataOps process; it drives and monitors the dataflow. This coordination of pipelines is typically not present in application development and DevOps processes.
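As a rough illustration of what data factory orchestration can look like, the sketch below runs named steps in order and records the row count after each one so the dataflow can be monitored. The step names, the `amount` field, and the `run_pipeline` helper are all hypothetical and exist only for this example; real deployments typically rely on a dedicated orchestration tool.

```python
from typing import Callable

Record = dict
Step = Callable[[list[Record]], list[Record]]


def run_pipeline(
    records: list[Record], steps: list[tuple[str, Step]]
) -> tuple[list[Record], list[tuple[str, int]]]:
    """Run each named step in order and log the row count after it."""
    data = list(records)
    log = []
    for name, step in steps:
        data = step(data)
        log.append((name, len(data)))  # simple monitoring signal per step
    return data, log


def drop_missing_amount(rows: list[Record]) -> list[Record]:
    # Hypothetical cleaning step: discard rows without an amount.
    return [r for r in rows if r.get("amount") is not None]


def add_amount_cents(rows: list[Record]) -> list[Record]:
    # Hypothetical transformation step: derive a new column.
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in rows]


raw = [
    {"id": 1, "amount": 12.5},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 3.0},
]
steps = [
    ("drop_missing_amount", drop_missing_amount),
    ("add_amount_cents", add_amount_cents),
]
output, log = run_pipeline(raw, steps)
for name, count in log:
    print(f"{name}: {count} rows")
```

The per-step log is the piece that stands in for monitoring: a sudden change in row counts between runs is often the first hint that something upstream has shifted.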
Unlike in DevOps, testing in DataOps occurs both in the data pipeline and the analytics development process. In the former, the tests monitor data values flowing through the data factory to catch anomalies or flag data values outside of the norm. In the analytics development process, the tests validate new analytics before they are deployed. In addition, these tests are usually embedded into a data quality framework to continually monitor the data pipelines in production.
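The data pipeline tests described above can be sketched in a few lines. This is a minimal example under stated assumptions, not a full data quality framework: the function names, the fixed range bounds, and the z-score threshold of 3 are illustrative choices, and the baseline is assumed to be a trusted sample of historical values.

```python
from statistics import mean, stdev


def flag_out_of_range(values: list[float], low: float, high: float) -> list[int]:
    """Return the indexes of values outside the allowed [low, high] range."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


def flag_outliers(
    values: list[float], baseline: list[float], z_threshold: float = 3.0
) -> list[int]:
    """Return the indexes of values far from the baseline mean.

    Distance is measured in baseline standard deviations (a z-score).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A constant baseline: anything different counts as an anomaly.
        return [i for i, v in enumerate(values) if v != mu]
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]


baseline = [100, 102, 98, 101, 99]  # hypothetical historical values
batch = [100, 97, 250, -5]          # hypothetical incoming values

range_failures = flag_out_of_range(batch, low=0, high=1000)
outliers = flag_outliers(batch, baseline)
print("out of range:", range_failures)
print("outliers:", outliers)
```

The two checks catch different problems: the range test flags values that are never valid, while the baseline comparison flags values that are valid in principle but outside of the norm for this data.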
Test Data Management
Test data management is often one of the first challenges in DataOps; in most DevOps environments, it is an afterthought. To accelerate analytics development, DataOps must automate the creation of development environments with the needed data, software, hardware and libraries, so innovation keeps pace with agile iterations.
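One small piece of that automation is producing a repeatable data subset for each development environment. The sketch below is only an illustration of that idea: `build_dev_dataset`, the seed value, and the sample rows are hypothetical, and provisioning software, hardware and libraries is out of scope here.

```python
import random


def build_dev_dataset(rows: list[dict], sample_size: int, seed: int = 0) -> list[dict]:
    """Draw a reproducible random sample of rows for a development environment.

    Using a fixed seed means every developer who runs this against the same
    source rows gets the same subset.
    """
    rng = random.Random(seed)  # local generator, so global random state is untouched
    size = min(sample_size, len(rows))
    return [dict(row) for row in rng.sample(rows, size)]  # copies protect the source rows


# Hypothetical source data standing in for a production table.
source_rows = [{"id": i, "region": "EU" if i % 2 else "US"} for i in range(1000)]

dev_rows = build_dev_dataset(source_rows, sample_size=50, seed=42)
print(len(dev_rows), "rows provisioned for development")
```

Returning copies rather than the original row objects keeps experiments in the development environment from modifying the source data by accident.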
Unlike DevOps, the tools required to support DataOps are in their infancy. For example, testing automation plays a major role in DevOps, but most DataOps practitioners have had to build or modify testing automation tools to adequately test data and analytics pipelines, as well as analytics solutions.
Exploratory Environment Management
Sandbox creation in software development is typically straightforward: The engineer usually receives several scripts from teammates and can configure a sandbox in a day or two. Exploratory environments in data analytics are often more challenging from a tools and data perspective. Data teams collectively tend to use many more tools than typical software development teams. Without the centralization that is characteristic of most software development teams, data teams tend to naturally diverge with different tools and data islands scattered across the enterprise.
To summarize, the concepts of DevOps for software engineering serve as the foundation for DataOps. However, DataOps requires some additional considerations to be effective for operating data and analytical products. While software artifacts tend to follow a deploy, refactor and replace/redeploy process, data products are live and typically evolve rather than being replaced. To get started with DataOps, businesses should take a cue from DevOps by implementing agile practices to deliver high-quality, unified data at scale.