Are causal inference and prediction that different?

Feb. 16, 2019

Economists discussing machine learning, such as Athey and Mullainathan and Spiess, make much of the supposed difference that, while most machine learning work focuses on prediction, in economics it is causal inference rather than prediction which is more important.

But what really is the fundamental difference between causal inference and prediction?

Let us try to define both tasks.

Prediction

For the prediction task we are given training data on an outcome variable $Y$ and a set of predictor variables $X$ in the form of observed $(X_i, Y_i)$ pairs. Presumably there is a functional relationship $Y_i = f(X_i)$, but we don't know what $f$ is and our task is to make use of the training data to predict $f(X)$ for given values of $X$.
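As a concrete illustration (my own, with an invented $f$ and an arbitrary choice of learner), here is a minimal sketch of the prediction task in Python:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# A functional relationship Y = f(X), unknown to the learner.
def f(x):
    return np.sin(3 * x) + 0.5 * x

# Training data: observed (X_i, Y_i) pairs.
X_train = rng.uniform(0, 3, size=(200, 1))
y_train = f(X_train).ravel()

# Fit a prediction function h approximating f from the pairs alone.
h = RandomForestRegressor(n_estimators=100, random_state=0)
h.fit(X_train, y_train)

# Predict f(X) at values of X not seen in the training set.
X_new = np.array([[0.7], [1.9], [2.6]])
print(h.predict(X_new))   # estimates of f(X) at unseen points
print(f(X_new).ravel())   # true values, for comparison
```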

Of course the task is interesting only if the value of $X$ for which we have to make a prediction is not among the values $X_i$ seen in the data set, for otherwise we would just output the known $Y_i$ (ignore for the time being the resource constraints in storing the entire training database and searching it for a known $X_i$).

For us to have some degree of success in the prediction task, $f$ needs to have some regularity, in the sense that the values of $Y$ at one $X$ should yield some information about the values of $Y$ at another $X$, and our prediction algorithm should be able to capture that regularity: the prediction function $h$ that the algorithm outputs should in some sense be close to the true $f$. This means we must begin our work with some knowledge of the general nature of $f$. This is the content of the no free lunch theorem: averaged over all possible $f$s, every prediction algorithm performs equally well (which is to say, equally badly) on unseen points, so we cannot work with a hypothesis set that includes all possible $f$s and must instead restrict attention to some smaller class of functions.

Often we work with a statistical version of the prediction problem where we posit that $Y_i = f(X_i, Z_i)$, where $Z_i$ is an unobserved quantity which we model as a random variable with some probability distribution. When called upon to make a prediction we are given only $X$. Not knowing $Z$, we cannot predict the exact $Y$, so we require our prediction to be good in some statistical sense, such as having a small mean squared error.
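A sketch of the statistical version, under the simplifying assumption (mine, not forced by the setup) that $Z$ enters additively: with $Z$ unobserved, even the best predictor is left with the variance of $Z$ as irreducible error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Y_i = f(X_i, Z_i), here additively: Y_i = 2*X_i + 1 + Z_i,
# with Z_i an unobserved noise term, Z_i ~ N(0, 0.5^2).
n = 1000
X = rng.uniform(0, 1, size=(n, 1))
Z = rng.normal(0, 0.5, size=n)
y = 2 * X.ravel() + 1 + Z

h = LinearRegression().fit(X, y)

# At prediction time we see only X, not Z, so exact prediction of Y
# is impossible; we settle for a small mean squared error instead.
X_test = rng.uniform(0, 1, size=(n, 1))
y_test = 2 * X_test.ravel() + 1 + rng.normal(0, 0.5, size=n)
mse = np.mean((h.predict(X_test) - y_test) ** 2)
print(mse)  # close to 0.25, the variance of Z: the irreducible error
```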

In fact, statistics can enter also into the deterministic prediction task when evaluating the performance of a prediction rule or algorithm. An algorithm may make good predictions for some values of $X$ and not-so-good predictions for other values of $X$. We can put a probability measure on the domain of $X$ and use a statistical average-case performance metric.
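For instance (with a hypothetical $f$, rule, and measure, chosen just for illustration), note that different measures on the same domain can rate the same rule quite differently:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):   # the true (deterministic) relationship
    return np.sin(3 * x)

def h(x):   # a fixed prediction rule: good near 0, worse far from it
    return 3 * x - 4.5 * x**3   # Taylor-style approximation of sin(3x)

# Put a probability measure on the domain of X (here Uniform(0, 1))
# and report the average-case squared error under that measure.
X = rng.uniform(0, 1, size=100_000)
print(np.mean((h(X) - f(X)) ** 2))

# A different measure on the same domain gives a different verdict.
X = rng.beta(5, 1, size=100_000)    # mass concentrated near 1
print(np.mean((h(X) - f(X)) ** 2))  # much larger
```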

Causal Inference

One way to model the causal inference task is in terms of Rubin's counterfactual model. There is a binary treatment $T_i$. Each agent has an outcome without treatment and an outcome with treatment, $Y_{i0}$ and $Y_{i1}$ respectively. Our task becomes to predict $Y_{i0} - Y_{i1}$. To help us in this we have data on some individuals, but the fundamental difficulty in causal inference is that for any individual we can measure only one of $Y_{i0}$ and $Y_{i1}$. We may have auxiliary data on the individual in the form of a vector $X_i$ of covariates.
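A toy simulation (all numbers invented) makes the fundamental difficulty visible: the simulation generates both potential outcomes for every individual, but the analyst's data set contains only one of them per individual.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10

# The simulation knows BOTH potential outcomes for each individual...
Y0 = rng.normal(0, 1, size=n)      # outcome without treatment
Y1 = Y0 + 2.0                      # outcome with treatment
T = rng.integers(0, 2, size=n)     # binary treatment assignment

# ...but the analyst observes only one of them per individual:
Y_obs = np.where(T == 1, Y1, Y0)

# Y_{i0} - Y_{i1} is the quantity we want, yet it is never observed.
print(Y0 - Y1)     # visible only because this is a simulation
print(T, Y_obs)    # all the data the analyst actually gets
```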

But this is just a prediction task! We have $Y_{it} = f(X_i, T_i, Z_i)$, where $Z_i$ is a potential random noise term. Why can't we treat this as a simple prediction problem?

In treating this as a prediction problem we would once more have to make assumptions to get around the no free lunch theorem. In fact, where the causal inference literature differs from the prediction literature is in the assumptions that are generally made.

Problems:

Selection bias. $Z_i$ may be correlated with $T_i$. In that case a hypothesis that minimizes error over the observed data, such as the difference between the mean outcome of the treated and the mean outcome of the untreated, would be a biased estimate of the treatment effect in the population as a whole. This is a problem with the estimation method.

One of the assumptions we may make is unconfoundedness: that $T_i$ is independent of $Z_i$ conditional on $X_i$. This is a factorization-type condition which leads us to the correct estimation method to use. Such conditions are generally not available in the prediction context. The sketch below illustrates both the bias and how conditioning on the covariates removes it.
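Here is that sketch, a toy simulation (numbers invented) of both points: a covariate drives both treatment assignment and the outcome, so the naive difference of means is biased, but unconfoundedness holds given the covariate, and stratifying on it recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
tau = 2.0                                # true treatment effect

# A binary covariate X that drives BOTH treatment and outcome.
X = rng.integers(0, 2, size=n)
T = (rng.uniform(size=n) < np.where(X == 1, 0.8, 0.2)).astype(int)
Y = tau * T + 3.0 * X + rng.normal(0, 1, size=n)

# Selection bias: mean(treated) minus mean(untreated) mixes the
# treatment effect with the fact that treated units have higher X.
naive = Y[T == 1].mean() - Y[T == 0].mean()
print(naive)       # about 3.8, noticeably above 2.0

# Unconfoundedness holds here given X, so within-stratum differences,
# averaged over the distribution of X, recover tau.
adjusted = sum(
    (Y[(T == 1) & (X == x)].mean() - Y[(T == 0) & (X == x)].mean())
    * np.mean(X == x)
    for x in (0, 1)
)
print(adjusted)    # close to 2.0
```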

Both prediction and causal inference ask a counterfactual question: what will be the value of an outcome variable at an unobserved point in the domain, where in the causal inference case the domain includes a component for the treatment state. In both cases the task can succeed only if the mapping from domain to outcomes is in some sense regular enough to allow unseen cases to be inferred from seen ones.

Given this similarity in formal structure, the practice of causal inference differs from garden-variety prediction in essentially two ways. First, in causal settings we privilege accuracy in predicting treatment effects over other functions of the outcome variables. Second, the assumptions about the regularity of the function being estimated take the very specific form of “factorization” conditions: A is independent of B given C.

The first difference can presumably be folded into standard prediction methods by choosing the correct performance metric. What about the second? My sense is that the identification assumptions being made in the current economics literature are too strong, and we are going to see a causal inference crash when a new generation of work finds the current assumptions “incredible”. So instead of trying to shoehorn machine learning into the framework of current causal inference practice, as Athey and Mullainathan seem to be trying to do, it may be better to step back and think of a more data-driven approach to causal inference that would give more robust, even if weaker, conclusions. Of course that is easier said than done, otherwise I would be on MakeMyTrip looking for good Delhi → Stockholm → Delhi options.

[There seems to be lots of recent machine learning research that tries to exploit the connections between prediction and causal inference, but I see no connection to older practice in economics and other social sciences. Keywords: transfer learning, domain adaptation, covariate shift.]