In this special guest feature, Petr Travkin, Solution Architect in the data and analytics practice at EPAM Systems, introduces the concept of Data Debt, which can be measured as the cost associated with mismanaging data plus the amount of money required to fix the underlying data problems. EPAM Systems is a leading global provider of product development, digital platform engineering, and digital product design services. Petr has multi-year, cross-industry field expertise in helping companies develop and implement enterprise data strategies and architectures. He is an evangelist of the DataOps approach, adept at data governance, and always eager to connect and exchange ideas about building data-driven processes and organizations.
Many data engineering and analytics teams are too busy to stop and think about how and why they work the way they do. Organizations are thrilled to apply advanced analytics to business areas but are rarely as enthusiastic about using it to improve their own workflows. As a result, waste accumulates across data engineering, data science and analytics efforts, producing what’s known as “Data Debt.”
Data Debt is the cost associated with mismanaging data, plus the amount of money required to fix the data problem. Companies can look at all their current data challenges and produce a rough estimate of the cost it would take to fix them. Data Debt can serve as a strong argument when discussing the importance of revamping outdated processes and policies. Until the debt is paid down, an organization will always spend more maintaining its data landscape than it would by gradually investing in reducing the debt. Implementing data governance and DataOps principles is an effective way to pay down Data Debt and avoid accumulating it in the future.
Since Data Debt can take various forms, let’s break down some of the most common pitfalls that organizations fall into and how DataOps can help address the challenges companies might be facing:
Excessive, Wasteful Processes
Duplicating data, storing data in multiple areas across the organization, failing to reproduce work due to a lack of configuration management or using a complex algorithm instead of a simpler option all result in wasted time and effort. Companies may operate with inefficient, obsolete workflows without realizing the long-term consequences. Avoiding data silos, regularly reviewing processes to adapt to change, and orchestrating data, schemas and tools can address these challenges.
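As an illustrative sketch of one of these checks, duplicated datasets scattered across storage locations can be flagged by hashing file contents (the file names and contents below are hypothetical stand-ins for real datasets):

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def find_duplicate_datasets(paths):
    """Group file paths by content hash; any group with more than one
    entry holds byte-identical copies of the same dataset."""
    groups = defaultdict(list)
    for path in paths:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        groups[h.hexdigest()].append(path)
    return [g for g in groups.values() if len(g) > 1]

# Demo with throwaway files standing in for datasets in different locations.
tmp = tempfile.mkdtemp()
for name, content in [("sales_a.csv", b"id,amount\n1,10\n"),
                      ("sales_copy.csv", b"id,amount\n1,10\n"),
                      ("orders.csv", b"id,qty\n1,2\n")]:
    with open(os.path.join(tmp, name), "wb") as f:
        f.write(content)

dupes = find_duplicate_datasets(
    [os.path.join(tmp, n) for n in os.listdir(tmp)])
print(dupes)  # the two identical "sales" files are flagged as one group
```

A periodic job like this, run against the data lake, turns silent duplication into a visible, reviewable report.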
Wait, Wait, Wait Some More
Waiting for access to systems or data, or for people to be assigned to projects, leads to delays and waste. A foundational aspect of efficient analytics is avoiding the repetition of previous work. Analytics pipelines should be built with the capability to automatically detect abnormalities and issues in code, configuration and data, with continuous feedback to data teams so errors are not repeated.
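A minimal sketch of such automated detection, assuming a simple schema of column names and expected types (the batch and schema here are hypothetical): instead of silently passing bad records along, a pipeline stage reports every issue back to the team.

```python
def validate_batch(rows, schema):
    """Check each row against an expected schema and return a list of
    human-readable issues for feedback to the data team."""
    issues = []
    for i, row in enumerate(rows):
        for col, col_type in schema.items():
            if col not in row:
                issues.append(f"row {i}: missing column '{col}'")
            elif not isinstance(row[col], col_type):
                issues.append(f"row {i}: '{col}' is not {col_type.__name__}")
    return issues

batch = [{"order_id": 1, "amount": 9.99},
         {"order_id": 2},                        # missing amount
         {"order_id": "3", "amount": 5.00}]      # order_id has wrong type
issues = validate_batch(batch, {"order_id": int, "amount": float})
for msg in issues:
    print(msg)
```

In practice the same idea scales up via dedicated validation frameworks, but the principle is the one described above: every batch is checked automatically and the findings flow back to the team.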
What is the Actual Problem?
Establishing a correct problem definition is surprisingly hard in analytics, especially in data science. Solving the wrong problem, usually due to poor communication and misaligned expectations, is defective work. Data and code can both have bugs, leading to wasted effort in finding and fixing problems. Writing good code on top of bad data is merely a case of garbage in, garbage out. Software engineering, lean and DataOps practices can help by uncovering defects in the data as early as possible. Implement mistake-proofing tests so poor-quality data does not enter data pipelines. Stop and fix the problem, then add a new test so the error cannot recur.
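The mistake-proofing idea can be sketched as a quality gate that halts the pipeline the moment bad data appears, naming the failing check so the team can stop and fix it (the checks and records below are hypothetical):

```python
def quality_gate(rows, checks):
    """Run named checks against incoming rows and raise before any bad
    data enters the pipeline; the error names the failing check."""
    for name, check in checks.items():
        bad = [r for r in rows if not check(r)]
        if bad:
            raise ValueError(
                f"quality check '{name}' failed for {len(bad)} row(s)")
    return rows

checks = {
    "amount_positive": lambda r: r.get("amount", 0) > 0,
    "customer_present": lambda r: bool(r.get("customer_id")),
}

clean = quality_gate([{"customer_id": "c1", "amount": 12.5}], checks)

try:
    quality_gate([{"customer_id": "", "amount": -3}], checks)
except ValueError as e:
    print("pipeline stopped:", e)
```

Each time a new defect slips through, the fix is accompanied by a new entry in `checks`, so the same error cannot recur unnoticed.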
Work that Never Makes it to Production
Work that is only partially done hinders an organization’s ability to make decisions and deliver an effective customer experience. When teams fail to consider interpretability or to explain solutions clearly to stakeholders, implementation is delayed or cancelled unnecessarily. The biggest waste is work that never makes it to production. To mitigate this, gather feedback as frequently as you can. Share knowledge, simplify communication and provide feedback at every stage of the data analytics lifecycle. Whenever an easier solution presents itself, it is likely a superior one.
Multitasking & Loss of Knowledge due to Handoffs
All data-related disciplines are sophisticated and require deep focus to solve problems. Multitasking generally doesn’t work: it imposes high switching costs, wastes time and hinders the early completion of work. Additionally, handing off work between employees can result in knowledge being lost along the way. To address these challenges, refrain from assigning data team members to multiple ongoing projects unless they are leads or architects whose role emphasizes coordination, and try to maintain the same team composition until the end of delivery.
Wasting Expensive Data Talent
People are often the most expensive and valuable resource in any data-related process, so using their talent effectively should be a priority. Good data analysts, data scientists and data engineers are hard to find and expensive to hire, and their skills are often wasted. To leverage their skills fully, keep key data team members informed about the enterprise IT landscape so they can anticipate risks and technology changes related to new data flows. Implement data management practices that improve quality and availability, so time isn’t wasted locating and cleansing data.
Often, businesses focus on what activities will happen, but place little emphasis on what activities cannot happen due to a lack of proper data processes and policies. Recap existing issues with data quality, reporting, content management and compliance, quantifying them both in terms of actual losses and cost of ownership, and in terms of missed future opportunities.
Poor Data Management
Companies lose a lot of money due to low-quality data; reducing these losses usually serves as the primary justification for, and metric of, a functional data governance program. Exceptionally sensitive data can create risk for businesses through misuse, poor quality or compliance issues, which can lead to hefty fines. Storage costs for production environments, as well as development, testing and user sandboxes, can rapidly increase when data is not promptly archived or deleted. Businesses need to measure the current costs and risks associated with poor data quality to ensure these costs aren’t quietly adding to their Data Debt.
To measure your Data Debt, consider the cost of benefits received or lost. Here are some areas that will help you start to assess the amount of accumulated Data Debt:
- Processes: improve cycle time, lower cost or improve quality
- Competitive Advantage: gain competitive intelligence and create differentiators
- Product Development: identify a new product or feature
- Intellectual Capital: embed knowledge into products and services
- Human Resources: enable employees to do better work
- Risk Management: reduce various types of risk (financial, data or legal compliance)
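As a starting point, these areas can be rolled up into a back-of-the-envelope estimate: for each area, pair the recurring annual cost of the problem with a one-time remediation cost, then compare the two (all areas and figures below are hypothetical illustrations, not benchmarks):

```python
def data_debt_summary(items):
    """items maps each debt area to a tuple of
    (annual cost of living with the problem, one-time cost to fix it)."""
    annual = sum(cost for cost, _ in items.values())
    fix = sum(fix_cost for _, fix_cost in items.values())
    return {"annual_cost": annual,
            "fix_cost": fix,
            "payback_years": round(fix / annual, 2)}

items = {
    "Processes (slow cycle time)":     (120_000, 80_000),
    "Risk Management (compliance)":     (60_000, 50_000),
    "Human Resources (manual rework)":  (90_000, 40_000),
}
summary = data_debt_summary(items)
print(summary)
# With these sample figures, paying down the debt recoups its cost
# in well under a year.
```

Even a rough table like this reframes the conversation: the question is no longer whether fixing data problems is worth funding, but how quickly the investment pays for itself.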
If a business makes decisions without considering the impact on its data, it will incur future costs dealing with inconsistency, errors and redundancy. As organizations become more data-driven, it’s important to measure the value of data by introducing the concept of Data Debt and implementing DataOps principles to help repay it. Estimating your Data Debt can reveal the costs associated with ineffective processes and highlight the value of a data governance program.