Causal versus correlative models
I was recently watching a debate between a professor and a journalist on the upcoming US presidential elections. The professor made some predictions based on a model he had developed. The journalist asked whether it was a causal model. The professor replied that it was a correlative and predictive model that has never failed in the past, but it was not a causal model. The journalist asked whether his predictions would still hold under such-and-such a scenario, to which the professor replied that he wouldn't respond to silly hypotheticals. But he stuck to his model predictions.
This nicely illustrates the problems of training set bias and model applicability domain, namely the question of when a model can be trusted to make prospective predictions. As long as the new data are similar to the training data, they fall within the applicability domain of the model, and one can have confidence in the model predictions: statistically, their accuracy can be expected to be similar to that of previous predictions or the reported test set predictions. But as soon as one wanders substantially outside the applicability domain of the original model, all bets are off. I think this is something that many who develop and use generative models seem to have lost sight of. Generative models are designed to produce novel data, and they are often used to generate data very different from the training data. This puts the generated data well outside the applicability domain of the original model. Generating new data is relatively easy; generating data that are useful for specific applications is a different matter altogether.
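To make the idea of an applicability domain concrete, here is a minimal sketch of one simple, distance-based check: a new data point counts as "in domain" if it lies about as close to some training point as the training points lie to each other. This is only one of several possible definitions, not a standard method. The function names, the toy two-dimensional data, and the 0.95 quantile are all assumptions made for illustration.

```python
import math


def nearest_distance(point, training_points):
    """Euclidean distance from `point` to its closest training point."""
    return min(math.dist(point, t) for t in training_points)


def domain_threshold(training_points, quantile=0.95):
    """Distance cutoff derived from the training set itself.

    For each training point, take the distance to its nearest *other*
    training point, then return the chosen quantile of those distances.
    The 0.95 default is an arbitrary illustrative choice.
    """
    gaps = sorted(
        min(math.dist(p, q) for j, q in enumerate(training_points) if j != i)
        for i, p in enumerate(training_points)
    )
    index = min(len(gaps) - 1, int(quantile * len(gaps)))
    return gaps[index]


def in_domain(point, training_points, threshold):
    """True if `point` is no farther from the training set than the cutoff."""
    return nearest_distance(point, training_points) <= threshold


# Hypothetical toy training set: four corners of a unit square plus its centre.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
threshold = domain_threshold(training)

near = (0.6, 0.4)  # close to the training data
far = (5.0, 5.0)   # far from anything the model has seen

print(in_domain(near, training, threshold))
print(in_domain(far, training, threshold))
```

A check like this says nothing about whether a prediction is right. It only flags when a new point is so unlike the training data that past performance is no longer a guide.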
The solution is, of course, to keep testing the model's predictions on the newly generated data against the "ground truth", and to develop new models that expand the domain of applicability as you go along. But this requires more work than just lazily generating new data with a generative model and hoping for the best.
The situation is different if we have a causal model. By a causal model, we mean one in which we understand the underlying processes. We can therefore recognize when conditions have changed such that the original model may no longer be applicable, and we understand what must be done to extend the model to new domains.
Originally posted on LinkedIn