You Don’t Need a Year of Data Cleansing


by Igor Zhuk, Chief Data Scientist

There’s a nearly universal belief in the business world that you can’t use AI without first going through a long, painful process of data cleansing and normalization.

You hear this not only from business leaders, but also from technologists. IBM’s Arvind Krishna recently made headlines by admitting that IBM clients have cancelled or scaled back AI projects because the upfront data work is so hard: “And so you run out of patience along the way, because you spend your first year just collecting and cleansing the data. And you say: ‘Hey, wait a moment, where’s the AI? I’m not getting the benefit.’ And you kind of bail on it.” Krishna also said that 80% of an AI project is data preparation. The same is true, by the way, of advanced business analytics systems built on rigid database structures, which require enormous data management effort to maintain, change, or scale.

But wait a minute. One of the most commonly used definitions of AI is “human-like intelligence,” and humans are very good at understanding garbled and noisy data. You’ve probably seen some variation of this internet meme, which shows how the human mind can perceive patterns in “messy” data:

The human brain is also able to recognize patterns in incomplete data.  Look at this partial image:
You can clearly see it is a car. If I ask you what type of car it is, you might say that it’s probably a sedan but could be a limo. And your guess would be affected by the picture’s context: If the background is the streets of New York City, it’s more likely to be a limo than if the background is a rural area.
Thus, our brain can answer different questions about the same partial data with varying levels of certainty. This enables us to make decisions quickly in the real world of incomplete information given a contextual background.

So, does AI really require a year of data preparation?

The answer is profound: No.

r4’s XEM engine can typically ingest all of a company’s data, combine it with external data about the market, and begin delivering insights and recommendations within 4-6 weeks. And clients typically see results for their first use case within 3-4 months.

Here’s how it’s possible:

All business applications, including AI, used to be built on a data model that governed a specific problem. Such a data model typically resides in a data warehouse, which can be located either on-premises or in the cloud. As the name implies, a data warehouse is a place where all of the company’s data can be stored. Each item of data is tagged and organized, like the shelves in a physical warehouse of products. This tagging and organization is known as the data model (i.e., the physical and logical schemas). Every time a new source of data is to be stored in the warehouse, the data model must be updated so that all the data continues to be understood. This approach results in a powerful storehouse of data, in which the meaning of every piece of data is understood, along with its relationship to every other piece of data. This is called normalization of data.

But as powerful as a data warehouse can be in putting your data to work, it suffers from the problem that Arvind Krishna described above: It can take a very long time to organize, cleanse and normalize all of that data. Worse, every time you want to add new data, you must reorganize the data model. That’s why building up the data for a software or AI project can take a year or even more. As a result, this process is very rigid and typically prevents IT/database professionals from promptly addressing business people’s needs in a fast-moving market environment.

In reaction to the difficulties of setting up and maintaining a data warehouse, an alternative method for enterprise data storage has emerged in recent years with the introduction of cloud infrastructure: a data lake. Unlike a data warehouse, there is little setup required for a data lake. You can pour in new data as easily as adding water to an actual lake.

But there’s a catch. The reason it is so easy to add data to a data lake is that there is no attempt to apply a data model at all. The data is just tossed into the lake as-is. In other words, the burden of understanding the data is still there — it simply shifts from the builders of the data store to the users of the data store. In addition, because you can’t understand the data easily in a data lake, it is very hard to use AI algorithms on that data.
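To make the catch concrete, here is a minimal sketch in Python of what "the burden shifts to the users" can look like. The record shapes and field names (`cust_email`, `amt`, `total_cents`) are invented for illustration and do not come from any real system. Two sources were poured into the lake as-is, so every consumer has to rediscover what each shape means at read time.

```python
import json

# Hypothetical raw records, stored exactly as each source produced them.
raw_lake = [
    '{"cust_email": "a@example.com", "amt": "19.99"}',
    '{"customer": {"email": "b@example.com"}, "total_cents": 2500}',
]

def read_purchase(raw):
    """Interpret one raw record as (email, amount in dollars).

    The reader, not the data store, carries the knowledge of each shape.
    """
    rec = json.loads(raw)
    if "cust_email" in rec:
        # Source A: flat record, amount is a string in dollars.
        return rec["cust_email"], float(rec["amt"])
    if "customer" in rec:
        # Source B: nested record, amount is an integer in cents.
        return rec["customer"]["email"], rec["total_cents"] / 100
    raise ValueError("unknown record shape")

purchases = [read_purchase(r) for r in raw_lake]
```

Every new source adds another branch to every reader like this one, which is the same modeling work the warehouse required, only repeated by each user of the data.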

Clearly, a new approach is needed. At r4, we created a model based on a framework of knowledge about the business ecosystem in which enterprises operate, rather than a data model supporting a particular point solution. This “meta-model” is fed by both internal and external data sources connected to it via data feeds. We call it the r4 AI Market Model. It runs all the time, continuously pumping data in and keeping the model up to date. This way, you don’t have the data warehouse setup problem, because the model already exists, and you don’t have the data warehouse maintenance problem, because the meta-model doesn’t need to change with each new data source. And because the model is understood, you avoid the data lake problem of not knowing how to use the data.

To create this meta-model, we take a different approach than the data warehouse. Instead of building a data model based on the data that one might have for a specific business problem, we built the model based on the real-life elements (entities) in the business ecosystem that are represented by the data: people, places, and things. Any market in the world can be described in terms of the interactions among these entities. People are clients, customers, employees, etc. Places are physical or virtual locations where business is conducted, e.g. stores, warehouses, and websites. Things are the products and services sold by any business. Every existing data model already captures some data around people, places, and things, but these models are so specific that they are brittle, breaking every time a new data source is added that doesn’t fit the mold. In contrast, our model is general enough that it does not need to be changed with every new data source, and it stores both the data and its relationships so that it can be understood by anyone who needs to use it.
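The idea of a model organized around people, places, and things can be illustrated with a small Python sketch. This is not r4's implementation; the class names (`Entity`, `MarketGraph`), the relation labels, and the sample records are all assumptions made up for this example. The point it tries to show is that new sources, even incomplete ones, attach to entities that already exist instead of forcing a schema change.

```python
from dataclasses import dataclass, field

# The three entity kinds named in the text above.
KINDS = {"person", "place", "thing"}

@dataclass
class Entity:
    kind: str
    key: str
    attrs: dict = field(default_factory=dict)

class MarketGraph:
    """Hypothetical store of entities and the relationships between them."""

    def __init__(self):
        self.entities = {}   # (kind, key) -> Entity
        self.relations = []  # ((kind, key), label, (kind, key))

    def upsert(self, kind, key, **attrs):
        """Create the entity if needed, then merge in whatever attributes arrived."""
        if kind not in KINDS:
            raise ValueError(f"unknown entity kind: {kind}")
        entity = self.entities.setdefault((kind, key), Entity(kind, key))
        entity.attrs.update(attrs)
        return entity

    def relate(self, source, label, target):
        """Record an interaction between two entities."""
        self.relations.append(
            ((source.kind, source.key), label, (target.kind, target.key))
        )

graph = MarketGraph()

# Source 1 (illustrative CRM export): knows about a customer.
alice = graph.upsert("person", "alice@example.com", segment="loyalty")

# Source 2 (illustrative point-of-sale feed): incomplete, no price yet.
store = graph.upsert("place", "store-17", city="Boston")
item = graph.upsert("thing", "sku-123")
graph.relate(alice, "bought", item)
graph.relate(item, "sold_at", store)

# Source 3 arrives later and fills the gap; the structure is unchanged.
graph.upsert("thing", "sku-123", price=4.99)
```

In this sketch the attribute dictionary absorbs whatever each source happens to provide, so a missing field is simply absent until another feed supplies it.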

The inspiration for this approach is actually rooted in the “messy” document with the garbled letters shown above. The reason people can read it is that they recognize the pattern of actual words, even when the letters are messy. Similarly, our meta-model, which we call the AI marketplace ontology, recognizes patterns of people, places, and things (the semantic building blocks of all enterprise data) even when the data is messy and/or incomplete.

We have heard many times that our ability to ingest and understand data quickly would be a valuable product, even if that were all r4 could do. But it isn’t. The true business value is not in storing data, no matter how cleverly. The value is unlocked when you do something with the data. That’s where the AI comes in.

What could your company do with AI if it only took you a few weeks to wrangle your data instead of a year? We’d like to answer that question with you.