A Practitioner's Guide to the Data Catalog (Part 1)
Traditional data management practices are being challenged at enterprises everywhere. Faced with an overwhelming volume of data, changing data regulations and pressure to develop innovative, data-driven solutions, companies need more effective ways to ensure their enterprise data is easily discoverable, accessible, interoperable and reusable.
Enter the data catalog, which major research firms have cited as an essential part of enterprise data management and data governance. Even as the data governance and data catalog technology market matures and becomes saturated, its global size is projected to grow 168% over the next six years.
With all this buzz, can the data catalog solve your challenges and help your organization? Let’s dive in.
Solving Challenges with the Data Catalog
The data catalog helps solve critical business challenges around data. Without a solution in place, you risk the following:
Challenge and Risk of Not Addressing
- Data is hard to find and identify within the enterprise. Risk: slow data product development.
- Data origin, residence and ownership are unclear. Risk: extensive communication overhead.
- Context about data is missing and data quality is undefined. Risk: bad data intake.
- Data governance tasks are time-consuming and repetitive. Risk: overtime for data management teams.
- Data curation is a manual process that does not scale. Risk: delays in data availability.
- Data usage is not tracked within the organization.
- Access to data is limited to avoid legal conflicts due to PII. Risk: delays in product development and value generation.
- High manual effort is required for sensitive data and PII. Risk: delays in data availability.
By providing a central point for data discovery, clear definitions, data context and ownership, the data catalog can ensure data transparency, improve collaboration across departments and automate data curation tasks to enable safe scaling. Data catalogs can automatically detect and classify sensitive data and check that data conforms with policies and rules for better data privacy, security and governance. You can enhance decision making through cross-organizational data usage and also make your data governance processes more operationally efficient.
In addition to these enterprise benefits, the data catalog can help your individual business users:
- Quickly find the data they need with minimal training and technical skills
- Understand complex data, because data models are stored and unified across the organization with standard field types, classifications and descriptions
- Keep all business term definitions, KPI calculation rules and data descriptions in one place
- Ensure secure data usage with automatic PII detection, tagging and data flow lineage from the source to the data asset
- Separate metadata access between user groups at a granular level, according to company standards
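To make the field descriptions and automatic PII tagging in the list above more concrete, here is a minimal Python sketch of what a catalog field entry might look like. Everything in it is an assumption made for illustration: the `CatalogField` class, the `tag_pii` function and the two regex patterns are hypothetical and not taken from any specific catalog product. Commercial tools use far broader detection than two patterns.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration only; real PII detection covers far more.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


@dataclass
class CatalogField:
    """One documented field of a data asset (hypothetical structure)."""
    name: str
    field_type: str
    description: str
    tags: set[str] = field(default_factory=set)


def tag_pii(catalog_field: CatalogField, sample_values: list[str]) -> CatalogField:
    """Add a 'pii:<kind>' tag when any sample value matches a pattern."""
    for kind, pattern in PII_PATTERNS.items():
        if any(pattern.search(value) for value in sample_values):
            catalog_field.tags.add(f"pii:{kind}")
    return catalog_field


# Made-up sample data: one field with an email address, one with plain amounts.
contact = tag_pii(
    CatalogField("contact", "string", "Primary customer contact"),
    ["jane.doe@example.com", "n/a"],
)
order_total = tag_pii(
    CatalogField("order_total", "decimal", "Order value in EUR"),
    ["19.99", "250.00"],
)
```

In this sketch, `contact` ends up tagged `pii:email` while `order_total` stays untagged, which is the kind of signal a catalog could use to restrict access or flag a field for review.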
Exploring Different Types of Data Catalogs
Now that we’ve highlighted the key benefits of a data catalog, let’s break down the different kinds of data catalogs your organization should evaluate.
As with any transformation project, there are three major components to consider along your journey: people, process and technology. So, where should you start? The answer depends on your business goals, your company culture and how fast you need to move. Some companies choose to launch an enterprise program and start with people (organizational structures, ownership, etc.) and processes (policies, standard operating procedures, etc.). Others start with a small, enthusiastic data management group and launch a data democratization initiative that promotes offensive data governance through a data catalog implementation. Each approach has its own advantages and disadvantages, but the good news is that today's technology market offers a tool for most data governance goals, types of user community and required adoption speeds.
There are four main categories of data catalogs that exist in the market today:
- Stand-alone solutions offer key and additional data cataloging components within a single tool. Commercial and open source offerings are available. Examples include Alation, Atlan, data.world, Zeenea, Amundsen and DataHub.
- Platform solutions offer important data cataloging functions with modules that provide additional capabilities like data quality, data privacy or even master data management. Examples include Ataccama, Collibra, IBM, Informatica, Precisely and Talend.
- Cloud-native data catalogs provide key components that are mostly limited to the cloud service provider's environment. They focus mainly on use cases such as orchestration and ETL processes. Examples include AWS Glue, Azure Purview and Google Data Catalog (part of Dataplex).
- Tool-specific data catalogs (or add-ons) support a specific tool, providing the necessary components as well as additional cataloging features related to that tool's purpose. An example from business intelligence is Tableau Catalog.
Interestingly, we’re seeing solutions on the market start to shift between categories. One example is Databricks Unity Catalog, which has been gaining traction. While it was initially considered a tool-specific data catalog, recent developments bring it closer to a cloud-native or even stand-alone solution.
There is a fifth category worth mentioning: a data services catalog for agile software development or data engineering teams. This type of catalog not only provides metadata about the various types of products available, but can also provide connection points (like a Kafka topic or an API), so it can serve as a developer portal. A good example is backstage.io, created by Spotify. Note, though, that this category of tools does not cover pure data governance solutions.
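As a rough illustration of how a data services catalog pairs metadata with connection points, consider the following Python sketch. It is not the Backstage format or API. The `ConnectionPoint`, `DataService` and `ServiceCatalog` names, along with the topic name and URLs, are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConnectionPoint:
    kind: str     # e.g. "kafka_topic" or "rest_api" (labels assumed for this sketch)
    address: str  # topic name or base URL (made-up values below)


@dataclass
class DataService:
    """Metadata about one data product plus the ways to connect to it."""
    name: str
    owner: str
    description: str
    connection_points: list[ConnectionPoint] = field(default_factory=list)


class ServiceCatalog:
    """In-memory registry that developers could query like a portal."""

    def __init__(self) -> None:
        self._services: dict[str, DataService] = {}

    def register(self, service: DataService) -> None:
        self._services[service.name] = service

    def find_by_kind(self, kind: str) -> list[str]:
        """Return sorted names of services exposing a connection point of this kind."""
        return sorted(
            service.name
            for service in self._services.values()
            if any(cp.kind == kind for cp in service.connection_points)
        )


catalog = ServiceCatalog()
catalog.register(DataService(
    "orders", "team-checkout", "Order events and lookups",
    [ConnectionPoint("kafka_topic", "orders.v1"),
     ConnectionPoint("rest_api", "https://example.com/api/orders")],
))
catalog.register(DataService(
    "customers", "team-crm", "Customer master data",
    [ConnectionPoint("rest_api", "https://example.com/api/customers")],
))
```

A developer looking for streaming sources could then call `catalog.find_by_kind("kafka_topic")` and get back only the `orders` service, together with its owner and description for follow-up questions.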
To determine which type of data catalog best suits your business, you should first evaluate your organization’s data maturity level. Our next blog post in this series will cover this topic, as well as how to successfully implement a data catalog.