The data catalog products of today are completely different beasts 

Don’t think of the data catalog as “just” a technology; think of it as a lens: a panoramic aperture you can use to zoom in to get a microscopic – or zoom out to get a macroscopic – view of your business.

“What you’re trying to do is really create a meaningful lens through which you look at all the information in your organization,” explained analyst Eric Kavanagh, host of DM Radio, a weekly data management-themed radio show, during a recent episode that focused on the value of data catalogs.

The data catalog is useful for several reasons, Kavanagh noted. First, it simplifies the discovery of data and assists with the production of knowledge: individuals can explore data sources manually, search using natural-language terms, or leverage the catalog’s guided self-service features to discover, classify, and share data. Second, the data catalog can quickly orient newcomers to the organization and its data assets. Finally, he said, the data catalog automates the discovery of new knowledge – the role, he explained, of its so-called “knowledge graph” component.

“There are lots of major companies that use knowledge graphs,” Kavanagh told listeners.

Reinventing the data catalog

This brings us to a fact I probably should’ve led with: data catalog products have changed an enormous amount since they first appeared! A person who comes back to data catalogs after a sabbatical of 2-3 years will discover the new data catalogs are not at all like the first-gen products. 

Those products brought a self-service user experience (UX) to data exploration and discovery, exposing more or (usually) less powerful tools for creating, manipulating, and managing metadata.

The problem was that the first-gen data catalogs were less adroit at managing data, at least if you needed to enforce policies for governing and reusing it. The new data catalogs are different beasts: they’ve been reconceived as platforms for managing and governing data. They incorporate technologies for “virtualizing” data – which is kind of like data federation, albeit with support for non-relational data, too – as well as for automatically establishing relationships between data (the knowledge graph again). Improved support for managing data lineage supplies another critical missing link.
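To make the virtualization idea concrete, here is a minimal sketch of a “virtual” query that spans a relational source and a non-relational one. Everything in it – the table, the JSON orders, the join logic – is invented for illustration; this isn’t any vendor’s API, just the federation pattern in miniature.

```python
# A toy "virtual" query surface over two sources: a relational table (SQLite)
# and a non-relational source (JSON documents). All names and values are
# invented for this sketch.
import json
import sqlite3

# Relational source: an in-memory customer table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (cust_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO cust VALUES (?, ?)", [(1, "Jane Doe"), (2, "John Roe")])

# Non-relational source: orders as JSON documents (a stand-in for a document store or API).
orders = json.loads(
    '[{"order_id": 101, "cust_id": 1, "total": 42.50},'
    ' {"order_id": 102, "cust_id": 1, "total": 19.99}]'
)

def orders_for_customer(name: str) -> list:
    """Resolve one logical request across both sources, as a federated query would."""
    row = conn.execute("SELECT cust_id FROM cust WHERE name = ?", (name,)).fetchone()
    return [o for o in orders if row and o["cust_id"] == row[0]]

print(orders_for_customer("Jane Doe"))
# -> both of Jane Doe's orders, even though customers and orders live in
#    different systems with different data models
```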

For the data catalog products of today, support for lineage is – or should be – table stakes.

Discovering a world of meaning

One of the most important (and also neatest!) changes to the modern data catalog is the introduction of knowledge-graphing capabilities. From the outside, the knowledge graph acts like a mysterious, wonder-working technology, a kind of modern reincarnation of the original Mechanical Turk. Not only does it discover connections between data elements, it draws factual conclusions about these connections. In other words, it has the power to create new knowledge, new facts, about the business.

Writing in another venue, my colleague Stephen Swoyer described the knowledge graph as “where the magic happens” in a data fabric or data catalog. I’ll quote him at length, because the topic itself is complicated and he does an adequate job explaining what a knowledge graph is and why it’s useful:

“The knowledge graph identifies and establishes relations between the entities it discovers across different data models. At a formal level, the knowledge graph attempts to ‘fit’ its discoveries into an evolving ontology. [That is, a formal model of the entities in a domain and the relationships among them.] In this way, it generates a schema of interrelated entities, both abstract (‘customer’) and concrete (‘Jane Doe’), groups them into domains, and, if applicable, establishes relations across domains.

“So, for example, the knowledge graph determines that ‘CSTMR’ and ‘CUST’ are identical to ‘CUSTOMER,’ or that a group of numbers formatted in a certain way … relates to the entity ‘SSN,’ or that this SSN correlates with this CUSTOMER. It is one thing to achieve something like this in a single database with a unified data model; it is quite another to link entities across different data models: for example, ‘CUSTOMER’ in a SaaS sales and marketing app = ‘CUST’ in an on-premises sales data mart = ‘SSN’ in an HR database = ‘EMPLOYEE Jane Doe who has this SSN is also a CUSTOMER.’ This last is completely new knowledge.”
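To ground Swoyer’s example, here is a toy sketch of the kind of linkage he describes: schema-level synonyms (CSTMR, CUST, CUSTOMER) mapped to one entity, instance-level facts expressed as triples, and a new fact inferred by joining them. The sources, columns, and values are all made up for illustration – this shows the inference pattern, not any product’s implementation.

```python
# A toy knowledge-graph inference: facts from different sources are expressed as
# (subject, predicate, object) triples, and linking them yields a conclusion no
# single source contained. All names, columns, and values are invented.

# Schema-level discovery: source columns resolved to a shared entity
# (not used in the inference below; shown to illustrate the schema-level half).
column_to_entity = {
    ("saas_crm", "CUSTOMER"): "Customer",
    ("sales_mart", "CUST"): "Customer",
    ("legacy_erp", "CSTMR"): "Customer",
    ("hr_db", "SSN"): "SSN",
}

# Instance-level facts discovered in the sources.
triples = [
    ("sales_mart:CUST#8841", "has_ssn", "123-45-6789"),  # a customer record keyed by SSN
    ("hr_db:Jane Doe", "is_a", "Employee"),
    ("hr_db:Jane Doe", "has_ssn", "123-45-6789"),
]

# Inference: an employee whose SSN also keys a customer record is also a customer.
customer_ssns = {o for s, p, o in triples if p == "has_ssn" and s.startswith("sales_mart:")}
new_facts = [
    (s, "is_a", "Customer")
    for s, p, o in triples
    if s.startswith("hr_db:") and p == "has_ssn" and o in customer_ssns
]
print(new_facts)  # [('hr_db:Jane Doe', 'is_a', 'Customer')] -- "completely new knowledge"
```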

One of Kavanagh’s guests, Juan Sequeda, elaborated on this function in his own comments, explaining that much of the “magic” of the knowledge graph is due to good old-fashioned metadata discovery. 

“A knowledge graph enables all these things [to be] completely connected. So, in simple terms, a knowledge graph is … integrating data and metadata,” Sequeda, a principal scientist with data catalog vendor Data.World, told DM Radio listeners. “And when you start thinking about your metadata, everything that you’re cataloging [generates metadata] …. You’ve got tables, you’ve got columns, you’ve got dashboards, you’ve got business terminology, you’ve got people. Then you have dashboards [that] are derived from tables that are derived from queries. All of this stuff” begins with metadata.
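Sequeda’s point – tables, columns, dashboards, people, and the “derived from” relationships between them – is easy to picture as a small graph. The sketch below is a made-up example of that idea: catalog assets as nodes, lineage and ownership as edges, and a simple walk that answers “what does this dashboard ultimately depend on?” None of the asset names come from a real catalog.

```python
# Metadata as a graph: catalog assets are nodes, and relationships such as
# "derived from" or "owned by" are edges. Assets and edges here are invented.
from collections import deque

# Each entry reads: asset -> the assets it is derived from or associated with.
metadata_graph = {
    "dashboard:weekly_revenue": ["query:revenue_rollup", "person:jane.doe"],
    "query:revenue_rollup": ["table:orders", "table:customers"],
    "table:orders": [],
    "table:customers": [],
    "person:jane.doe": [],
}

def upstream_of(asset: str) -> list:
    """Walk the graph breadth-first to find everything an asset depends on."""
    seen, queue = [], deque(metadata_graph.get(asset, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(metadata_graph.get(node, []))
    return seen

print(upstream_of("dashboard:weekly_revenue"))
# ['query:revenue_rollup', 'person:jane.doe', 'table:orders', 'table:customers']
```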

One data management catalog to rule them all

By now, most large organizations already have a metadata catalog. In fact, most have more than one. It isn’t a stretch to say the average large organization is home to a Babel of catalogs, with individual departments, business units, and other groups maintaining their own.

Sometimes these catalogs take the form of branded metadata catalog technologies; more frequently, they’re tied to business intelligence (BI) tools (as semantic layers), to databases, and to other sources. Remember, the metadata catalog is a recent innovation of the self-service revolution; before there were catalogs, there were semantic layers, data dictionaries, business glossaries, and other types of metadata repositories. One challenge for any large organization is to unify the contents of all of its disparate metadata catalogs. To put it in theological terms, this is analogous to restoring the single language that people of all nations and tribes spoke prior to the catastrophe at Babel.

Except that creating and maintaining a catalog-of-catalogs – a single, unified view of all enterprise data – is actually doable, Sequeda said.

“Another trend that we’re seeing is that now, every department early on started with their own little catalog, but then you see you start having silos of metadata that need to get connected,” Sequeda told listeners. This is not so simple in the real world, however, he argued, because most catalogs deal strictly in metadata: they expect to discover, work with, and (if necessary) create new metadata, but not to broker access to the data itself. This limits how individuals can acquire data and what they can do with it once they discover it.

He compared it to buying shoes on Amazon and being instructed to drive somewhere to pick them up.

“You want to be able to find the data you have requested, to go through whatever governance process [you need] to go through [as part of that]. And then you want to be able to go access that data right there at that moment, right?” he told listeners. “And for that … your catalog now starts to become [a tool] not just [for] metadata management, but [for] doing data management, too.”

Revitalizing the data catalog: a new emphasis on management and governance 

Between them, Kavanagh and his panelists sketched out a high-level take on a complex problem.

But each participant seemed willing to confront this problem in all of its complexity, without blinking.

On the subject of data catalogs, panelists were in agreement: the most useful catalogs must incorporate a data management-like feature set, with data virtualization (sometimes still called “data federation”) capabilities augmenting traditional data discovery and new-fangled knowledge-graphing features. (Again, Stephen Swoyer describes this shift in his own take on data fabric technologies.)

In other words, even if the existence of disparate metadata catalogs and resources does amount to an information Babel, all of these metadata layers scattered across the average large organization are kind of like low-hanging fruit. After all, an organization could just use another metadata catalog to discover, classify, and unify these disparate assets, right? But this would just perpetuate the problem!

The high-hanging, hard-to-reach fruit is the enterprise data landscape in its entirety, particularly the thar-be-dragons regions that have yet to be trawled and cataloged by human or machine agents.

Thanks to the cloud, these thar-be-dragons regions are continuously expanding, not shrinking, the DM Radio panel concluded. “There’s different definitions legitimately throughout the enterprise, and with your partners and with your suppliers, you need to first know what they are, and then figure out where the data is and how to collaborate,” said Rick Sherman, managing partner at Athena IT Solutions.

Sherman said the industry as a whole seemed stuck on this problem for quite some time. Part of this had to do with inertia: organizations kept using old-school BI tools and semantic layers, and the easiest way to get at the data in those semantic layers was via the old-school BI tools themselves.

The good news, he hypothesized, is that we’ve mostly moved through and beyond this era of inertia.

“Now [organizations] need to figure out where the data is and collaborate on the data. I think that’s the other exciting thing about the data catalogs – the ability to expand our collaboration of what data [there] is, where it is, what it means, what it should mean,” he said, noting that (as we’ve discussed) data catalog technologies have evolved to support these use cases. On top of this, Sherman said, organizations are also more realistic about how they conceive of and practice data governance.

“A lot of the previous efforts with data governance … were a boil-the-ocean kind of deal,” he said. 

The starting point now is “let’s start off with what the business needs to run the business, what the metrics are, what the KPIs are, and work that way,” Sherman explained, referring to a hypothetical SAP system with 40,000 tables: “The other 39,999 tables probably don’t matter. Let’s get to the core data that we need [to do an] analysis of inventory management. Let’s get to the things that are [connected to] business value and do it that way. I think that’s the other thing that business catalogs got.”

One problem was that organizations just expected to do too much with legacy data dictionaries and glossaries. The first-gen data catalogs helped reset this expectation; the next-gen data catalogs – which, again, amount to data fabric-like products augmented with data virtualization and knowledge-graphing capabilities – give organizations most of the tools they need to manage and govern data.

Another problem was that organizations just didn’t have the human subject-matter expertise to help inexpert users make sense of data. The first-gen data catalogs helped with this problem, too: by making it easier for non-experts to discover and access data, they put pressure on organizations to scale their subject-matter expertise. The new, next-gen data catalogs are poised to benefit from this.

“One of the limiting factors before … was [that] the subject-matter experts or the data gurus on the IT side, there’s only a handful of them. And anytime anybody needed anything, that’s who they went to,” he told listeners. “We needed to get much more scale beyond that. So, the data catalog [gives us] a way to share that institutional knowledge and spread it out because you can’t have one guru, one subject-matter expert in a financial group or marketing group be the limiting factor.”

About Vitaly Chernobyl

Vitaly Chernobyl is a technologist with more than 40 years of experience. Born in Moscow in 1969 to Ukrainian academics, Chernobyl solved his first differential equation when he was 7. By the early 1990s, Chernobyl, then 20, along with his oldest brother, Semyon, had settled in New Rochelle, NY. During this period, he authored a series of now-classic Usenet threads that explored the design of Intel’s then-new i860 RISC microprocessor. In addition to dozens of technical papers, he is the co-author, with Pavel Chichikov, of Eleven Ecstatic Discourses: On Programming Intel’s Revolutionary i860.