All Models are Wrong


by Paul Signorelli, Chief Solution Architect

Data scientists live by the aphorism “All models are wrong… but some are useful.” This truism, often attributed to twentieth-century statistician George Box, is something companies all over the world are learning as their statistical models have become more wrong than useful in the wake of COVID-19.

The reason is quite obvious. Traditional statistical models – whether they are simple regression models or sophisticated deep learning models – rely upon historical patterns repeating themselves to be accurate. The more sophisticated the algorithm, the more sophisticated the pattern it can learn. Either way, if the learned patterns do not reflect the “new” reality, the models will make poor predictions. Until the new patterns become the norm, which could take years, these models will operate with a high degree of error. This risk of error and inaccuracy has made for a challenging new reality in the data science community.
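To make the point concrete, here is a minimal sketch using only synthetic data. The weekly "demand" series, the trend values, and the helper names (`fit_line`, `simulate`, `mean_abs_error`) are all assumptions made for illustration, not anything from a real client. A simple regression is fit on a year of history, then scored on a period that follows the old pattern and on one where the pattern has abruptly changed.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_abs_error(model, xs, ys):
    """Average absolute gap between the model's predictions and the observations."""
    intercept, slope = model
    return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)) / len(xs)


def simulate(weeks, base, trend, rng):
    """Hypothetical weekly demand: a linear trend plus a little Gaussian noise."""
    return [base + trend * week + rng.gauss(0, 2.0) for week in weeks]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# One year of history in which the pattern is stable.
history_weeks = list(range(52))
history_demand = simulate(history_weeks, base=100.0, trend=1.5, rng=rng)
model = fit_line(history_weeks, history_demand)

# The next twelve weeks, once under the old pattern and once after a shift.
future_weeks = list(range(52, 64))
same_pattern = simulate(future_weeks, base=100.0, trend=1.5, rng=rng)
new_pattern = simulate(future_weeks, base=120.0, trend=-0.5, rng=rng)

error_normal = mean_abs_error(model, future_weeks, same_pattern)
error_shifted = mean_abs_error(model, future_weeks, new_pattern)

print(f"error when history repeats:    {error_normal:.1f}")
print(f"error after the pattern shift: {error_shifted:.1f}")
```

The model itself is unchanged between the two scores. Only the world it is asked to predict has changed, which is why its error grows so sharply in the second case.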

This new reality is something we have always understood at r4, as our approach to solving data science problems has never relied solely on this traditional statistical approach. From the beginning, we have centered our AI solution around domain-based models that operate in conjunction with purely statistical models.

A domain-based model is an abstraction that describes selected aspects of a domain and is used to solve problems related to that domain. In our case, these domains are the market structures of r4 clients. Software engineers have used domain models for decades to capture the rules and relationships in data. An AI domain-based model, such as the type we use at r4, uses AI and data to learn those relationships.

The domain model graphs the digital relationships among the people, places, and things in our customers' markets and provides predictive models with information that can supervise them, yielding better results.
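As a rough illustration of what graphing people, places, and things can look like in code, the sketch below builds a tiny typed graph and derives context from it that a predictive model could consume. It is a hypothetical toy, not r4's implementation: the `DomainGraph` class, its methods, and the example entities are all invented for this example.

```python
from collections import defaultdict


class DomainGraph:
    """Toy domain model: typed entities joined by named relationships."""

    def __init__(self):
        self.kinds = {}                 # entity name -> "person" | "place" | "thing"
        self.edges = defaultdict(list)  # entity name -> [(relation, other entity)]

    def add_entity(self, name, kind):
        self.kinds[name] = kind

    def relate(self, source, relation, target):
        """Record a relationship in both directions so either end can be queried."""
        for name in (source, target):
            if name not in self.kinds:
                raise KeyError(f"unknown entity: {name}")
        self.edges[source].append((relation, target))
        self.edges[target].append((relation, source))

    def neighbors(self, name, kind=None):
        """Entities directly related to `name`, optionally filtered by kind."""
        return [
            other
            for _, other in self.edges.get(name, [])
            if kind is None or self.kinds[other] == kind
        ]

    def context_features(self, name):
        """Count related entities by kind, as simple inputs for a predictive model."""
        counts = {}
        for other in self.neighbors(name):
            kind = self.kinds[other]
            counts[kind] = counts.get(kind, 0) + 1
        return counts


# Hypothetical entities, invented for the example.
graph = DomainGraph()
graph.add_entity("alice", "person")
graph.add_entity("store_12", "place")
graph.add_entity("espresso_beans", "thing")

graph.relate("alice", "shops_at", "store_12")
graph.relate("store_12", "stocks", "espresso_beans")
graph.relate("alice", "buys", "espresso_beans")

print(graph.neighbors("alice", kind="thing"))
print(graph.context_features("store_12"))
```

The relationships are stored explicitly and named in the vocabulary of the market, so the context handed to a statistical model can be traced back to entities a business user recognizes.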

One of the biggest advantages of a domain-based approach is its ability to use the vocabulary of the domain. As a result, it is much better at uncovering causal relationships in the data and at making results understandable to the user, two challenges for purely statistical models.

We see many advantages in our domain-based models, and quantifying those benefits would highlight their historical worth. In the age of COVID-19, however, data scientists must train statistical models on data that is both scarcer and noisier. The ability of domain-based models to adapt to this data has become a valuable tool in the middle of a pandemic and is helping our customers cope with unforeseen data shifts.