Causal inference bake off (Kaggle style!)

Iyar Lin
10 min read · May 20, 2019

Intro

For a version with fully rendered LaTeX expressions, see my original blog post here.

In my last few posts I’ve tried answering high-level questions such as “What is Causal inference?”, “How is it different than ML?” and “When should I use it?”.

In this post we finally get our hands dirty with a Kaggle-style Causal Inference algorithm bake-off! In this competition I’ll pit some well-known ML algorithms against a few specialized Causal Inference (CI) algorithms and find out who’s hot and who’s not!

Causal Inference objectives and the need for specialized algorithms

ATE: Average Treatment Effect

So far we’ve learned that in order to estimate the causal dependence of Y on X we need to use a set of features Z_B that satisfies the “backdoor criterion”. We can then use the g-computation formula:

(1) 𝔼(Y|do(x)) = ∑_{z_B} f(x, z_B) P(z_B)

where Y is our target variable, do(x) is the action of setting the treatment X to a value x (see my previous post for more details on the do operator) and f(x, z_B) = 𝔼(Y|x, z_B) is some predictor function we can obtain using regular ML algorithms.

One might ask: if we can obtain f(x, z_B) using regular ML algorithms, why the need for specialized CI algorithms?

The answer is that our objective in CI is different from our objective in classic ML: in ML we seek to predict the value of Y itself given that we observed X take the value x, i.e. 𝔼(Y|x), while in CI we try to estimate the difference in the expected value of Y across different assignment values x of X. In the CI literature this quantity is termed the “Average Treatment Effect”, or ATE for short. In the binary treatment case (where X ∈ {0,1}) it’s defined as:

ATE:=𝔼(Y|do(1))−𝔼(Y|do(0))

and in the general (not necessarily binary treatment) case it’s defined as:

ATE(x):=∂𝔼(Y|do(x))/∂x

To see why accurate estimation of 𝔼(Y|x) doesn’t necessarily mean accurate estimation of the ATE (and thus different objectives might require different algorithms) let’s look at an example:

Imagine we work for a company in the marketing industry. The treatment in this example is an automatic AI bidding robot we’ve recently developed as a value-added service for our campaign management platform. In order to demonstrate that it works, we sold it as a trial version for a few months, at the end of which we recorded, for all our customers, whether they took the trial (“treated”) or not (“untreated”), their average campaign ROI and their company size (“small” or “large”):

The first line reads: “The average ROI for campaigns run by small companies that didn’t use our robot (untreated) is 1%. Those campaigns make up a proportion of 0.1 of all campaigns run on our platform.”

In order to use the g-computation formula (equation 1 above) we need to identify the correct adjustment set Z_B.

After talking to some of our customers we learn that large companies usually employ large teams of analysts to optimize their campaigns, resulting in higher ROI compared with smaller companies. Having those large teams also means they have a lower tendency to take our trial compared with small companies.

We thus assume the following DAG:

Applying the “backdoor criterion” to the DAG above, we arrive at the insight that company size is a confounding factor and thus Z_B = company size. In our case, equation 1 reads:

𝔼(ROI|do(treatment)) = ∑_{company size ∈ {small, large}} 𝔼(ROI|treatment, company size) · P(company size)

We thus arrive at the quantities:

𝔼(ROI|do(treated))=2%⋅(0.4+0.1)+5%⋅(0.1+0.4)=3.5%

and

𝔼(ROI|do(untreated))=1%⋅(0.4+0.1)+5%⋅(0.1+0.4)=3%

and finally

ATE=𝔼(ROI|do(treated))−𝔼(ROI|do(untreated))=0.5%

Meaning the treatment (our robot) increases ROI by 0.5% on average.
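To make the calculation above concrete, here is a minimal Python sketch of the g-computation, plugging in the ROI values and campaign proportions from the example table (the variable names are mine; the numbers are exactly those used in the calculations above):

```python
# E(ROI | treatment, company size) and campaign proportions from the example table
roi = {
    ("treated", "small"): 0.02, ("treated", "large"): 0.05,
    ("untreated", "small"): 0.01, ("untreated", "large"): 0.05,
}
# P(company size): sum of the campaign proportions over both treatment groups
p_size = {"small": 0.1 + 0.4, "large": 0.4 + 0.1}

def g_formula(treatment):
    # Equation 1: E(ROI | do(treatment)) = sum_size E(ROI | treatment, size) * P(size)
    return sum(roi[(treatment, size)] * p for size, p in p_size.items())

ate = g_formula("treated") - g_formula("untreated")
print(f"E(ROI|do(treated))   = {g_formula('treated'):.1%}")   # 3.5%
print(f"E(ROI|do(untreated)) = {g_formula('untreated'):.1%}")  # 3.0%
print(f"ATE = {ate:.1%}")                                      # 0.5%
```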

Next, let’s assume our data is noisy, so the raw average ROI in each cell isn’t a good enough estimate of the expected ROI. We instead try to estimate it using 2 models. Below we can see the model predictions (with absolute errors in parentheses):

We can see that model 1 is more accurate than model 2 in every row.

If we were to use model 1 to estimate the ATE we’d get:

𝔼(ROI|do(treated))_{model1}=1.5%⋅(0.4+0.1)+4.5%⋅(0.1+0.4)=3%

and

𝔼(ROI|do(untreated))_{model1}=1.5%⋅(0.4+0.1)+5.5%⋅(0.1+0.4)=3.5%

ATE_{model1}=−0.5%

Meaning we’d estimate that our product decreases ROI by 0.5%! Our estimate is wrong not only in magnitude but also in sign, meaning we can’t use it to market our product.

If we were to use model 2 however we’d get:

𝔼(ROI|do(treated))_{model2}=1%⋅(0.4+0.1)+4%⋅(0.1+0.4)=2.5%

and

𝔼(ROI|do(untreated))_{model2}=0%⋅(0.4+0.1)+4%⋅(0.1+0.4)=2%

and finally

ATE_{model2}=0.5%

Arriving at the correct ATE estimate! So even though model 2 is less accurate than model 1 in estimating 𝔼(ROI), it’s better at estimating the ATE.
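To see the reversal concretely, here is a short sketch that plugs both models’ predictions (the values used in the calculations above) into the same g-computation; model 1, despite being closer to the true expected ROI in every cell, flips the sign of the ATE:

```python
# Model predictions of E(ROI | treatment, company size), from the table above
preds = {
    "model 1": {("treated", "small"): 0.015, ("treated", "large"): 0.045,
                ("untreated", "small"): 0.015, ("untreated", "large"): 0.055},
    "model 2": {("treated", "small"): 0.010, ("treated", "large"): 0.040,
                ("untreated", "small"): 0.000, ("untreated", "large"): 0.040},
}
p_size = {"small": 0.5, "large": 0.5}  # P(company size)

for name, f in preds.items():
    # g-computation with the model's predictions in place of the observed averages
    ate_hat = sum((f[("treated", s)] - f[("untreated", s)]) * p for s, p in p_size.items())
    print(f"{name}: estimated ATE = {ate_hat:+.1%}")  # model 1: -0.5%, model 2: +0.5%
```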

CATE: Conditional Average Treatment Effect

Looking at the table above we see that while our product increases ROI by 0.5% on average, it increases ROI by 1% for campaigns run by small companies while not improving those run by large ones at all. We’d thus be well advised to market our product to small companies.

In the CI literature the treatment effect conditioned on some other features z (such as company size) is fittingly termed the “Conditional Average Treatment Effect” (CATE). In cases where the features conditioned on identify individuals uniquely (e.g. when at least one of them is continuous) it is also termed the “Individual Treatment Effect”, a highly sought-after quantity in personalized medicine, for example. For the binary treatment case the CATE is defined as:

CATE(z):=𝔼(Y|do(1),z)−𝔼(Y|do(0),z)

In the general (not necessarily binary treatment) case it’s defined as:

CATE(x,z):=∂𝔼(Y|do(X=x),Z=z)/∂x

Looking again at the model predictions above we can compare the actual vs predicted CATE for both models (with absolute error in parentheses):

Again, we can see that while model 1 was more accurate than model 2 in predicting 𝔼(Y), it entirely misses the true CATE while model 2 estimates it perfectly.
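The same numbers make the CATE comparison explicit. Below is a minimal sketch (true values from the example table, predictions from the model table above):

```python
# (treated, untreated) E(ROI) per company size: truth and the two models' predictions
truth   = {"small": (0.020, 0.010), "large": (0.050, 0.050)}
model_1 = {"small": (0.015, 0.015), "large": (0.045, 0.055)}
model_2 = {"small": (0.010, 0.000), "large": (0.040, 0.040)}

for size in ("small", "large"):
    true_cate = truth[size][0] - truth[size][1]
    m1_cate = model_1[size][0] - model_1[size][1]
    m2_cate = model_2[size][0] - model_2[size][1]
    print(f"{size}: true CATE {true_cate:+.1%}, model 1 {m1_cate:+.1%}, model 2 {m2_cate:+.1%}")
# small: true CATE +1.0%, model 1 +0.0%, model 2 +1.0%
# large: true CATE +0.0%, model 1 -1.0%, model 2 +0.0%
```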

To summarize: specialized CI algorithms might be necessary because different objectives might require different tools.

An example with simulated data

The example above might seem a bit ad hoc (which is true), but it is motivated by plausible real-world scenarios. I’ll demonstrate by simulating a small (100 observations) dataset from the example problem presented above and fitting a simple decision tree to it. Below is the resulting tree:

We can see that the tree completely ignored the treatment! To see why that happened let’s take a look at the dataset distribution:

We can see in the graph above 2 dataset features that could potentially throw off regular ML algorithms when it comes to CI tasks (a toy simulation along these lines is sketched after the list):

  1. The variability due to the treatment is very small compared with that of other features in the dataset leading to an underestimate of the treatment effect (In the graph the variability due to company size is much higher than that of the treatment, which is why the decision tree disregarded the treatment).
  2. The distribution of features among the treatment groups is highly skewed (in the graph we can see the treated units make up the vast majority in the small companies and a small minority in the large companies, making the comparison within each company size unreliable and thus estimating the CATE very hard).
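Below is a rough, self-contained sketch of such a simulation. It is not the exact data generating code behind the figures above (the code that produced the post is linked in the closing section); the effect sizes, noise level and tree settings are assumptions chosen to mimic the two features just described:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
n = 100

# Company size (1 = large) and a skewed treatment assignment:
# small companies mostly take the trial, large companies mostly don't.
large = rng.binomial(1, 0.5, n)
treated = rng.binomial(1, np.where(large == 1, 0.2, 0.8))

# ROI: being a large company adds ~4%, the treatment adds 1% for small companies only,
# and the noise is of the same order as the treatment effect.
roi = 0.01 + 0.04 * large + 0.01 * treated * (1 - large) + rng.normal(0, 0.01, n)

tree = DecisionTreeRegressor(max_depth=2, min_samples_leaf=20)
tree.fit(np.column_stack([treated, large]), roi)
print(export_text(tree, feature_names=["treated", "large_company"]))
# The dominant split is on company size; with noise this large the small treatment
# effect is often not picked up at all (exact splits vary with the random seed).
```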

In the showdown below we’ll see if more powerful ML algorithms can still hold their own against algorithms designed specifically for CI tasks.

The competition setup

It’s now time to set up the problem for our competition!

In this competition I’ll use a semi-synthetic dataset generated for the “Atlantic Causal Inference Conference” Data Analysis Challenge. The “real data” part comprises the feature set Z, which contains 58 measurements taken from the Infant Health and Development Program. These include features such as the mother’s age, endocrine condition, the child’s bilirubin, etc.

Of the full feature set Z, only a subset of 8 features constitutes the correct adjustment set Z_B, while the rest are nuisance features (meaning they affect neither the treatment nor the target variable). We assume the correct adjustment set Z_B is unknown to the data scientist and thus the full feature set Z is fed to the model. This adds another layer of difficulty for our competitors to overcome.

The target variable Y (continuous) and the treatment variable X (binary) are both simulated according to one of 12 Data Generating Processes (DGPs). The DGPs differ along the following 2 traits:

Measurement error/residual noise. One of:

  1. IID
  2. Group Correlated
  3. Heteroskedastic
  4. Non-additive (Non linear)

Estimation difficulty. One of:

  1. Easy
  2. Medium
  3. Hard

Estimation difficulty relates to 3 factors which can be either low (0) or high (1):

  1. Magnitude: the magnitude of the treatment effect
  2. Noise: signal to noise ratio
  3. Confounding: The strength of confounding (how different is the distribution of Z between treatment and control)

Below is a table showing the settings for those factors across the different difficulty scenarios:

Full details about the data generation process can be found here.

From every DGP I simulate M = 20 datasets and measure an algorithm f’s performance using 2 measurements:

The first measurement looks at how well the ATE is estimated across all M = 20 simulations:

RMSE_{ATE} = sqrt( (1/M) ∑_{m=1}^{M} (ATE − \hat{ATE}_m)² )

We measure RMSE_ATE once across all 20 simulations.

The second is a type of “explained variability” or R-squared for CATE:

R²_{CATE} = 1 − var(CATE(z) − \hat{CATE}(z)) / var(CATE(z))

We measure R²_{CATE} once for every simulation m.
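For reference, here is a minimal sketch of how these two metrics could be computed from simulation output. The array and function names are mine; this is not the benchmarking code from the original post:

```python
import numpy as np

def rmse_ate(ate_true, ate_hats):
    # RMSE of the ATE estimates across the M simulated datasets (one value per algorithm/DGP)
    ate_hats = np.asarray(ate_hats, dtype=float)
    return np.sqrt(np.mean((ate_true - ate_hats) ** 2))

def r2_cate(cate_true, cate_hat):
    # "Explained variability" of the CATE within a single simulated dataset
    cate_true = np.asarray(cate_true, dtype=float)
    cate_hat = np.asarray(cate_hat, dtype=float)
    return 1.0 - np.var(cate_true - cate_hat) / np.var(cate_true)

# Toy usage with made-up numbers:
print(rmse_ate(0.5, [0.40, 0.55, 0.60]))          # ATE estimates from M = 3 datasets
print(r2_cate([1.0, 0.0, 1.0], [0.8, 0.1, 0.9]))  # per-unit CATEs in one dataset
```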

Wait, why use a synthetic dataset instead of a real one like in ML?

In ML we can estimate our model f’s out-of-sample error using samples {y_i, f(x_i)} (e.g. (1/n) ∑_{i=1}^{n} (y_i − f(x_i))²). In CI however it’s not that simple. When estimating the CATE, z_i often identifies a unit i uniquely (e.g. if at least one of the features in Z is continuous). Since a unit was either treated or untreated, we only observe either {y_i, x_i = 1, z_i} or {y_i, x_i = 0, z_i}. So unlike in ML, we can’t benchmark our model by comparing observed individual treatment effects (unit i’s outcome when treated minus its outcome when untreated) against the model’s estimates f(1, z_i) − f(0, z_i), because we never observe both outcomes for the same unit.

The situation I just alluded to is described many times in the CI literature, in a problem setup commonly termed “counterfactual inference”.

In the case of the ATE the problem is compounded by the fact that it’s a population parameter, which means that even if we knew the true ATE, a given dataset would yield only a single estimate to benchmark against.

For these reasons we need to use a synthetic/semi-synthetic dataset where we can simulate both {y_i,x_i=1,z_i} and {y_i,x_i=0,z_i}.

The algorithms

And now, let’s present our competitors!

  1. ER: Elastic Net Regression. In this implementation shrinkage is not applied to X to prevent the algorithm from setting the coefficient for X to 0. It also includes pairwise interaction terms between the treatment X and all the features in Z to enable CATE estimation.
  2. RF: Random Forest. In this implementation X is always added to the subset of features randomly selected in each tree node. Using default hyper-parameters.
  3. BART: Bayesian Additive Regression Trees. This algorithm has been demonstrated to have good performance in Causal Inference tasks. Using default parameters.
  4. CF: Causal Forest. A form of generalized Random Forests geared towards Causal Inference tasks. See also this manual. Using default parameters.
  5. BARTC: Bayesian Additive Regression Trees — Causal version. This implementation uses TMLE doubly robust estimation.

In this competition I compare out-of-the-box algorithms. For this reason XGBoost and neural nets are not among the competitors, as they require a lot of hyper-parameter / architecture fine-tuning. I also left out all methods that rely solely on modeling the assignment mechanism (e.g. propensity score re-weighting), as they are mainly geared towards estimation of the ATE.
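To give a feel for how the ML entrants produce causal estimates at all, here is a rough sketch of the generic recipe: fit f(x, z) on the full feature set, then predict every unit’s outcome with the treatment toggled to 1 and to 0 and take differences. This is not the exact implementation used in the competition (the code that produced the post is linked in the closing section); the estimator choice, function name and column names are assumptions:

```python
from sklearn.ensemble import RandomForestRegressor

def ml_treatment_effects(df, treatment="x", target="y"):
    # df: a pandas DataFrame holding the treatment, the target and all features in Z.
    # Fit f(x, z) on all features, then estimate effects by toggling the treatment.
    features = [c for c in df.columns if c != target]
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(df[features], df[target])

    # Predicted outcome for every unit under treatment and under control
    y1_hat = model.predict(df[features].assign(**{treatment: 1}))
    y0_hat = model.predict(df[features].assign(**{treatment: 0}))

    cate_hat = y1_hat - y0_hat   # per-unit CATE estimates
    ate_hat = cate_hat.mean()    # ATE estimate, averaging over the empirical P(z)
    return ate_hat, cate_hat
```

The CI-oriented entrants instead target the treatment effect more directly (for example BARTC’s TMLE doubly robust estimation mentioned above), which is exactly the kind of difference the toy example in the first section suggests might pay off.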

And now, without further ado, the results!

ATE

Below I plot the estimated ATE box-plots along with the true ATE (dashed line).

We can see that the error distribution doesn’t change the picture much. For that reason, the RMSE_ATE figures in the table below are averaged over all error types:

The next thing we notice is that for the easy case all algorithms nail the ATE, with the exception of RF, while for the harder cases they all undershoot by a wide margin, with the exception of BARTC, which comes pretty close.

CATE

Below I plot R²_{CATE} box-plots along with a red line at 0. Note that while in ML we’d usually think of R² = 0 as the baseline equivalent to “guessing” (since if we guess y_i = \bar{y} ∀i we get R² = 0), that isn’t the case in CI. Since \bar{CATE} = ATE, the CI equivalent of guessing y_i = \bar{y} ∀i is guessing CATE(z) = ATE ∀z, and as we saw above, estimating the ATE isn’t always straightforward.

Below I report the resulting average R²_{CATE}, averaging over all M = 20 datasets and all error types:

We can see that the causal-inference-oriented algorithms all fare better than the regular ML ones. We can further see that BART and BARTC do best, yet all algorithms struggle in the hard case.

Final conclusion

It would seem that BART-based algorithms are the best suited for CI tasks among the competing algorithms. This is not entirely unexpected, as it has been reported in the past that BART does well on CI tasks. It’s also worth mentioning that the author of the package implementing BARTC is a member of the group that put together the dataset used in this simulation study.

Think you know an algorithm that can pinpoint the CATE even in the hard case? Mention it in the comments, or feel free to use the code that produced this post to add it to the competition.
