The Importance of Discovery in Metadata - Xentity

But…What Is Metadata?

The point of metadata is to describe other data. It is literally the “data within the data”. Metadata is the title and description of an object, its tags and categories, who created it and when, who last modified it and when, who can access or update it, etc. Metadata has a few key purposes. It could be for provenance, meaning where something came from and where it can be found. Metadata could also be for cataloging and for describing standards. But the one purpose that’s most often forgotten is discovery.

To understand why discovery is an issue, consider this example. One of the best scenes in Raiders of the Lost Ark is the very end. Indiana Jones finds the ark, the government takes it away from the Nazis, and it’s put in a giant warehouse. The camera zooms out, and the audience sees that the ark has been shoved in with a lot of other things we can only assume are great discoveries and objects of value. It’s been catalogued, but it’s never going to be discovered again, which might’ve been what Spielberg and Lucas were going for. That’s become a very typical approach to metadata: just shoving it into a “warehouse” and forgetting about it instead of focusing on managing its supply and, by extension, its accessibility.

So What?

What’s currently happening in the data science field is that metadata is becoming a tool for discovery (context). The supply and management of metadata is extremely important, and consequently data discovery is critical. At Xentity we’ve been proselytizing this type of metadata-focused architecture since 2001. It is about getting data discovered: allowing secondary users to search and find out if that data is good for their scientific application, their engineering application, or their business application. Unfortunately, a lot of people are still far behind, stuck in supply chain management of data and focused only on improving it. They do not worry about the Raiders of the Lost Ark problem. They store their data, and we never see it again.

What they need in order to use metadata successfully is discovery. In metadata, this is the knowledge that someone wants to discover, which goes back to Knowledge-Information-Data (KID) principles and having a database to allow processing. We don’t know in advance how someone is going to ask a question of the data. As such, we use what is known as linked data to make data discoverable. Having metadata built into linked data allows you to ask questions such as “What’s the population of Los Angeles County?”, “How many counties are there in California?”, and “How many counties are there in the United States?”

Each of those questions would involve different joins in a relational database. This is what discovery is, and it is time we start taking it a bit more seriously, because in a world where the quantity of data is rapidly growing, context is key to keeping that metadata supply well managed.
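To make the contrast with relational joins concrete, here is a minimal sketch of linked data as subject-predicate-object triples in plain Python. Everything in it is an assumption made for illustration: the names (`TRIPLES`, `match`, `is_within`, `counties_in`, `population_of`) are hypothetical, the graph is a toy with only three counties, and the population value is a placeholder, not a real statistic. A real deployment would more likely use an RDF store and a query language such as SPARQL.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, object]

# Toy graph. The population value is a placeholder, not a real statistic.
TRIPLES: Set[Triple] = {
    ("LosAngelesCounty", "type", "County"),
    ("LosAngelesCounty", "containedIn", "California"),
    ("LosAngelesCounty", "population", 1000),
    ("OrangeCounty", "type", "County"),
    ("OrangeCounty", "containedIn", "California"),
    ("CookCounty", "type", "County"),
    ("CookCounty", "containedIn", "Illinois"),
    ("California", "type", "State"),
    ("California", "containedIn", "UnitedStates"),
    ("Illinois", "type", "State"),
    ("Illinois", "containedIn", "UnitedStates"),
}


def match(s=None, p=None, o=None) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for t in TRIPLES:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def is_within(place: str, region: str) -> bool:
    """Follow containedIn links upward until the region is found or links run out."""
    for _, _, parent in match(s=place, p="containedIn"):
        if parent == region or is_within(str(parent), region):
            return True
    return False


def counties_in(region: str) -> Set[str]:
    """All subjects typed as County that sit somewhere inside the region."""
    return {s for s, _, _ in match(p="type", o="County") if is_within(s, region)}


def population_of(place: str) -> Optional[object]:
    """Return the population value attached to a place, if any."""
    for _, _, value in match(s=place, p="population"):
        return value
    return None


print(population_of("LosAngelesCounty"))   # placeholder value from the toy graph
print(len(counties_in("California")))      # counties in California (toy graph)
print(len(counties_in("UnitedStates")))    # counties in the United States (toy graph)
```

All three questions are answered by the same pattern-matching function. Nobody had to anticipate them with a dedicated table layout or a hand-written join, which is the point of the sketch.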

Now That We Know What Metadata Is, What Is The Current State?

Recently, we have been seeing new semantic technology, which is growing in usage much like the keyword metadata efforts 20 years ago. Semantic technology is a set of methods and tools that provide advanced means for categorizing and processing data, and for discovering relationships within varied data sets. About 5 years ago, an organization called schema.org came together to define an ontological language for linked data, which is how you describe all those relationships in a database. They also came up with one for discovery, so you can ask questions such as “What’s the population of Los Angeles County?” from platforms such as Google or anyone else participating in the schema.org ontology.

There are several different ways we can approach metadata discovery. We take feature data and register it in two or more semantic vocabularies, such as the schema.org ontology, to support discovery. Imagine we registered a lot of geographic features such as roads, rivers, railroad tracks, and water bodies. Once those datasets are semantically registered, users can learn things such as how many people (drivers, swimmers, etc.) there are in a given county.
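As a rough illustration of what registering a feature against schema.org could look like, the sketch below builds JSON-LD descriptions in Python. The helper `feature_to_jsonld` and the feature and county names are hypothetical. The schema.org terms used (`Place`, `RiverBodyOfWater`, `LakeBodyOfWater`, `AdministrativeArea`, `containedInPlace`, `name`) are existing schema.org vocabulary, but choosing them for these features is an assumption for the example, not a description of how Xentity registers data.

```python
import json


def feature_to_jsonld(name: str, schema_type: str, county: str) -> dict:
    """Describe one geographic feature as a schema.org JSON-LD object."""
    return {
        "@context": "https://schema.org",
        "@type": schema_type,
        "name": name,
        # Link the feature to the county that contains it.
        "containedInPlace": {"@type": "AdministrativeArea", "name": county},
    }


# Hypothetical features; a generic Place stands in where no narrower type is assumed.
features = [
    ("Example River", "RiverBodyOfWater"),
    ("Example Lake", "LakeBodyOfWater"),
    ("Example Road", "Place"),
]

docs = [feature_to_jsonld(name, kind, "Example County") for name, kind in features]
print(json.dumps(docs[0], indent=2))
```

Because each feature carries an explicit `containedInPlace` link, a consumer that understands the vocabulary can relate the feature to its county without knowing anything about the source dataset's internal layout.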

You can semantically register anything, but the cool thing about semantics is this: it’s a crowd-built, collaborative schema that allows users to add relationships or features based on their ontological relationship. Furthermore, you can decide which vocabulary within schema.org you want to recognize. For example, if you don’t want to recognize some genre of language because it just doesn’t fit your domain, you’re not going to let that respond to your query.
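One simple way to picture choosing which vocabulary to recognize is an allow-list applied before statements are allowed to answer a query. The sketch below assumes that approach; the function `recognized`, the sample statements, and the `geo_vocabulary` set are all hypothetical and only meant to show the idea.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def recognized(triples: Iterable[Triple], allowed_terms: Set[str]) -> List[Triple]:
    """Keep only statements whose predicate is in the recognized vocabulary."""
    return [t for t in triples if t[1] in allowed_terms]


# Hypothetical statements about one feature.
statements = [
    ("Example River", "name", "Example River"),
    ("Example River", "containedInPlace", "Example County"),
    ("Example River", "genre", "folk song"),  # a term outside a geographic domain
]

# The terms this (assumed) geographic domain chooses to recognize.
geo_vocabulary = {"name", "containedInPlace"}

results = recognized(statements, geo_vocabulary)
for subject, predicate, value in results:
    print(subject, predicate, value)
```

Statements that use terms outside the allow-list stay in the data, but they never respond to a query in this domain.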

So, Where Do We Go From Here?

Semantic languages are going to grow for discovery purposes through the use of sites like schema.org. First, you’re seeing it grow as more people participate. Next, it will improve web search discovery of data. In doing so, it makes asking two-dimensional questions much faster. For example, how many of X and Y are there, or where are A and B?

Linked data gives you a better chance to connect data to the broader picture and the broader context. Most discoveries really happen when you cross-pollinate specializations in data. Imagine offering linked data peppered with more and more data. It has all these different relationships in the context of the natural language that humans have been developing over 250,000 years. We keep bringing more data into the context of this language and into the context of each other’s languages. And in doing so, we improve AI, machine learning, and deep neural networks.

Artificial intelligence thrives on good training data, just like a child thrives on repetition, or like how a superstar is only amazing after 20,000 hours of practice. Artificial intelligence is based on confidence intervals and seeing patterns. As it sees more data, those confidence levels increase and it can rely on those patterns more and more. Basically, the more we use linked data, the smarter AI becomes in assisting us. And the more we get into linked data, the more we can feed into artificial intelligence. This in turn feeds into our KID concept by providing more data, which provides more information, which grants more knowledge.

Where Does Xentity Fit In All This?

Xentity can help people with linked data training. For starters, we can tell you right now: stop trying to build the perfect business before you open your doors. Do less sooner, get exposure first, and then grow. Treat your business like an app: constantly update it and improve it. Funny enough, scientists struggle with how people use data incorrectly. That is exactly why it is important to have a system where, instead of getting everything right the first time, we get some things right and then improve upon them. Collect data, discover what went right and what went wrong, and then improve on what went wrong. We can help with that process.