A Practitioner's Guide to the Data Catalog (Part 2)

Petr Travkin

Director, Data Analytics Consulting, EPAM

DATE

Oct 12, 2023

In Part 1 of our series, we discussed the business challenges that the data catalog can help solve and the tools available on the market today. Next, we’ll discuss how data catalog maturity has evolved and the best way to implement a solution for your unique business.

How the Data Catalog has Evolved

When you begin to evaluate what type of data catalog is optimal for your business, it’s important to understand how the data catalog has evolved and what capabilities are required to move along the maturity scale. We break it down into four levels and identify the key functionality required for each level:

Level 1: Technical metadata hub is a metadata registry for data available in the data platform with ad-hoc curation based on crowdsourcing enabled by advanced users. It mostly ingests metadata from various on-premises and cloud data sources, with ad-hoc data modeling, and is used by data analysts to find data for building advanced analytics applications. Sometimes this hub can be a good starting point to enable data democratization, especially in agile environments that follow the “from chaos to structure” implementation approach that we will discuss below. The following capabilities are required for this level of maturity:

Data inventory allows you to register data sources, organize and describe data by ingesting and curating business, technical and operational metadata. This capability includes data source connectivity, data sampling, business glossary, data dictionary, metadata management and data lineage.
Data assessment evaluates data for fitness for use. This includes data profiling, measuring data risk via classification, PII detection and tracking data usage to understand how popular datasets are or to perform audits.
Data discovery enables users to locate the data assets they need through a “Google-like” search, exploration and recommendation. This capability is key to the successful adoption of a data catalog and ensuring sustainable growth with your user community.
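
To make the three Level 1 capabilities more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular catalog product works: the `DatasetEntry` structure, the two regular expressions in `PII_PATTERNS`, the 80% match threshold and the term-count search scoring are all hypothetical simplifications.

```python
import re
from dataclasses import dataclass, field

# Hypothetical, deliberately simple patterns; real PII detection needs far more care.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s()-]{6,}\d$"),
}


@dataclass
class DatasetEntry:
    """Data inventory: one registered dataset with basic metadata."""
    name: str
    description: str
    columns: dict[str, list[str]]  # column name -> sampled values
    pii_columns: dict[str, str] = field(default_factory=dict)


def detect_pii(entry: DatasetEntry, threshold: float = 0.8) -> None:
    """Data assessment: flag a column when most sampled values match a pattern."""
    for column, samples in entry.columns.items():
        if not samples:
            continue
        for label, pattern in PII_PATTERNS.items():
            hits = sum(1 for value in samples if pattern.match(value))
            if hits / len(samples) >= threshold:
                entry.pii_columns[column] = label
                break


def search(catalog: list[DatasetEntry], query: str) -> list[str]:
    """Data discovery: rank datasets by how often query terms appear in metadata."""
    terms = query.lower().split()
    scored = []
    for entry in catalog:
        text = " ".join([entry.name, entry.description, *entry.columns]).lower()
        score = sum(text.count(term) for term in terms)
        if score:
            scored.append((score, entry.name))
    # Highest score first; ties broken alphabetically for stable output.
    return [name for _score, name in sorted(scored, key=lambda s: (-s[0], s[1]))]


catalog = [
    DatasetEntry(
        name="customers",
        description="Customer master data with contact details",
        columns={
            "customer_id": ["1001", "1002", "1003"],
            "email": ["a@example.com", "b@example.org", "c@example.net"],
        },
    ),
    DatasetEntry(
        name="orders",
        description="Order lines per customer",
        columns={"order_id": ["1", "2"], "amount": ["19.99", "5.00"]},
    ),
]
for entry in catalog:
    detect_pii(entry)

print(catalog[0].pii_columns)              # {'email': 'email'}
print(search(catalog, "customer email"))   # ['customers', 'orders']
```

In practice, a catalog tool would populate such entries through its data source connectors and use a proper search index; the point here is only how inventory, assessment and discovery build on the same metadata.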

Level 2: Curated data inventory is a data registry with foundational governance capabilities, data classification and user collaboration. Metadata can be retrieved from various places, including other data catalogs (e.g., cloud-native ones). Since data is more structured, data development teams can leverage it for data search and for understanding context. Data lineage becomes more important and should be provided up to the level of analytics applications. The following capabilities are required for this level of maturity:

Data collaboration enables communication and metadata crowdsourcing via tagging, rating, reviewing, sharing and messaging. This is important to facilitate data curation.
AI automation and assistance facilitates data curation by supporting users and taking over manual tasks, enabling data catalogs to scale. Most of the capabilities can be supported by AI, such as data ingestion, data labelling, classification and search.

Level 3: Data governance platform is a catalog that is integrated with data governance processes and where tasks are automated. It serves as a single point for data onboarding, assessment and metrics. Since data is curated and governed, it can be used in business applications consumed by business users. The following capabilities are required for this level of maturity:

Data governance enables data curation activities by defining roles and responsibilities, rules (such as fullness of asset curation), policies (such as data retention or archiving), task automation and standardization via workflows (for example, to change asset metadata or request access to a dataset), and manual or automated tagging, including sensitive data definition.
Adoption tracking and audit allows you to monitor and measure data catalog performance, analyze user behavior to track changes and log users’ activity to analyze tool adoption progress.
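
As an illustration of a governance rule and of adoption tracking, the sketch below scores how fully an asset is curated and summarizes a user activity log. The list of required metadata fields, the 0.75 minimum and the `(user, action, asset)` log format are assumptions made for this example; your own governance rules would define them.

```python
from collections import Counter

# Hypothetical set of metadata fields a governance rule might require.
REQUIRED_FIELDS = ("description", "owner", "classification", "glossary_terms")


def curation_completeness(asset: dict) -> float:
    """Share of required metadata fields that are filled in (0.0 to 1.0)."""
    filled = sum(1 for name in REQUIRED_FIELDS if asset.get(name))
    return filled / len(REQUIRED_FIELDS)


def assets_needing_curation(assets: list[dict], minimum: float = 0.75) -> list[str]:
    """Names of assets that fall below the assumed completeness minimum."""
    return [a["name"] for a in assets if curation_completeness(a) < minimum]


def adoption_summary(activity_log: list[tuple[str, str, str]]) -> dict:
    """Count distinct users and events per action from (user, action, asset) tuples."""
    users = {user for user, _action, _asset in activity_log}
    actions = Counter(action for _user, action, _asset in activity_log)
    return {"active_users": len(users), "events_by_action": dict(actions)}


assets = [
    {"name": "customers", "description": "Customer master data", "owner": "ann",
     "classification": "confidential", "glossary_terms": ["Customer"]},
    {"name": "orders", "description": "Order lines", "owner": "",
     "classification": "", "glossary_terms": ["Order"]},
]
activity_log = [
    ("ann", "search", "customers"),
    ("ann", "view", "customers"),
    ("bob", "search", "orders"),
]

print(assets_needing_curation(assets))
print(adoption_summary(activity_log))
```

A check like this could feed a curation workflow that assigns under-documented assets to their data stewards.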

Level 4: Enterprise data marketplace is a single point of data discovery and access in the enterprise for all categories of data users. The data marketplace can be either internal only or span across multiple external data consumers and providers (thus API integration with external systems is required). The following capabilities are required for this level of maturity:

Augmented discovery augments search and discovery through machine learning functionality (so that search results are ranked according to user behavior).
Data subscription offers users an experience like an online shop. Users can select data assets and store them in a shopping cart or subscribe to them. During checkout, data access is requested and then granted, based on access rights and data license conditions, by users outside the data catalog.
Data delivery allows users to download data directly from the data catalog or access the data via API for a limited period of time (similar to a subscription model).
Intelligent data similarity automatically identifies similar data (through the use of semantic graphs) to provide more suitable recommendations.
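
The following sketch illustrates two of these ideas in a much simplified form. Instead of machine learning, it boosts a text match score by a logarithm of past usage, and instead of semantic graphs, it compares datasets by the Jaccard overlap of their column names. Both are stand-in techniques chosen for brevity, and all names and numbers are hypothetical.

```python
import math


def rerank(results: list[tuple[str, float]], usage_counts: dict[str, int]) -> list[str]:
    """Re-rank (name, text_score) results so frequently used datasets move up."""
    def boosted(result: tuple[str, float]) -> float:
        name, text_score = result
        # log1p keeps very popular datasets from drowning out relevance entirely.
        return text_score * (1 + math.log1p(usage_counts.get(name, 0)))
    ranked = sorted(results, key=lambda r: (-boosted(r), r[0]))
    return [name for name, _score in ranked]


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two sets: size of intersection divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_datasets(target: str, schemas: dict[str, set[str]], top_n: int = 3):
    """Datasets whose column names overlap with the target, most similar first."""
    scores = [
        (name, jaccard(schemas[target], columns))
        for name, columns in schemas.items()
        if name != target
    ]
    scores = [(name, score) for name, score in scores if score > 0]
    return sorted(scores, key=lambda s: (-s[1], s[0]))[:top_n]


results = [("sales_raw", 2.0), ("sales_curated", 1.5)]
usage_counts = {"sales_raw": 0, "sales_curated": 50}
schemas = {
    "customers": {"id", "email", "country"},
    "leads": {"id", "email", "city"},
    "orders": {"order_id", "amount"},
}

print(rerank(results, usage_counts))
print(similar_datasets("customers", schemas))
```

Here the curated dataset outranks the raw one because users open it far more often, even though its text match score is lower.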

Implementing Your Data Catalog

Now that you have determined the capabilities required for your data catalog, it’s time to discuss implementation. The data catalog can be implemented at different stages along your data governance journey, so we’ll walk through the advantages and risks of each one that we’ve observed in working with our clients.

Iterative Governed Approach

Based on data sources and data domains with planned governance enhancements, this approach starts with creating an awareness plan and prioritizing data domains and key roles available at the beginning of implementation. It enables fast and safe business user onboarding, thus maximizing business value.

What to consider:

High upfront planning and alignment effort required
Minimum viable training should be provided to key roles
Data catalog tool should be carefully selected based on detailed requirements
Limited collaboration at the beginning with more centralized control
Open source or cloud data catalogs might work here, but most likely will require a lot of customizations

When it might not work:

Agile end user community of advanced data professionals might not need a highly governed data catalog upfront and can curate via crowdsourcing and organic stewardship efforts
Open source or cloud data catalogs with limited capabilities and unfriendly UI

From Chaos to Structure

This approach brings all the metadata in at once and lets users collaborate and curate the information, which can help reveal duplicate datasets and provide a comprehensive picture of the initial state of your data quality. This also means that data governance evolves gradually. If you have an agile end user community that doesn’t need to be highly governed and can curate the data themselves, this might be a worthwhile approach.

What to consider:

Training should be provided to all advanced catalog users
Licensing and usage costs should be carefully considered, as some data catalog tools charge by the number of datasets profiled and the volume of metadata loaded
Some open source or cloud data catalogs can be a good start with minimum upfront investment

When it might not work:

Highly regulated data environment with sensitive data
Governance-first approach to data management

Mixed Approach

With different parts of the catalog following different approaches and permissions to restrict access, this approach works when you have a mixed skill level among your user community and prioritized data domains. It’s possible to start adding business value immediately for some of the domains and grow other domains organically through crowdsourced curation. Some key roles should be available from the beginning, while others emerge organically. Advanced users are not limited to highly curated datasets.

What to consider:

High user access security set-up effort
Minimum viable training should be provided to all catalog users
Focus on flexible and robust security models when selecting the data catalog tool
Depends heavily on the type of data governance operating model (centralized vs federated)

When it might not work:

Open source or cloud data catalogs with limited security capabilities
Centralized data governance operating model with limited representation within data domains

Putting Our Learnings into Practice

The best approach to take depends on many different variables including your data governance strategy, business goals, company culture, DataOps practices and user community. Regardless of which approach you choose, the following steps should be taken to enable a successful data catalog implementation and adoption:

Assess your needs and goals, map them to the data catalog capabilities and create an enablement plan
Review your data processes and technology landscape to define required integrations and customizations
Evaluate your data governance model or create one to enable successful data catalog adoption and operational efficiency
Create a thorough implementation plan, including an MVP phase, to ensure smooth execution and streamline value generation

As you begin or continue on your journey, make sure you understand who your data domain champions and data stewards will be and whether they have time to support your initiative. Having these advocates in your organization, alongside more technical elements of the design — such as business glossary terms, data classification requirements, security and metadata access, and initial workflows — will be key to success.
