“Correlation is not causation”. So what is?

Intro

You can also see the original blog post here.

A typical data science task

We’ll simulate a dataset using a set of equations (also called structural equations):
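The original equations are not reproduced here, so below is a minimal R sketch of the kind of structural equations described. The coefficients are my own illustrative assumptions, chosen only so that competition confounds marketing and sales, and the effect of marketing on sales flows through visits with a product of 0.3 × 0.5 = 0.15:

    # Hypothetical structural equations (illustrative coefficients only;
    # the implied total effect of marketing on sales is 0.3 * 0.5 = 0.15)
    set.seed(42)
    n <- 10000

    competition <- rnorm(n)                                  # market competition
    marketing   <- 0.5 * competition + rnorm(n)              # we spend more when competition is fierce
    visits      <- 0.3 * marketing + rnorm(n)                # marketing drives site visits
    sales       <- 0.5 * visits - 2 * competition + rnorm(n) # visits lift sales, competition hurts them

    sim_data <- data.frame(competition, marketing, visits, sales)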

All data presented in graphs or used to fit models below is simulated from the above equations.

Below are the first few rows of the dataset:
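(The table itself is not reproduced here; with the hypothetical sim_data sketched above it would be produced by:)

    head(sim_data)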

Our goal is to estimate the effect of raising marketing spend on sales, which from the set of equations above is 0.15, obtained using product decomposition:
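Under the sketch above, this means multiplying the coefficients along the single directed path from marketing to sales:

    effect(marketing -> sales) = beta(marketing -> visits) * beta(visits -> sales)
                               = 0.3 * 0.5
                               = 0.15

(The coefficients 0.3 and 0.5 are my illustrative choices; only their product, 0.15, comes from the original text.)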

Common analysis approaches

First approach: plot the bivariate relationship
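The original figure is not reproduced here; a sketch of code that would draw it, assuming the hypothetical sim_data from above and the “ggplot2” package:

    library(ggplot2)

    # scatter plot of sales against marketing spend with a linear trend line
    ggplot(sim_data, aes(x = marketing, y = sales)) +
      geom_point(alpha = 0.2) +
      geom_smooth(method = "lm") +
      labs(x = "marketing spend", y = "sales")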

We can see that the relationship seen in the graph is actually the opposite of what we’d expected! It looks like increasing marketing actually decreases sales. Indeed, not only is correlation not causation; at times it can even point in the direction opposite to the true causal effect.

Fitting a simple linear model:
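A sketch of the implied call, using the hypothetical sim_data from above:

    fit_simple <- lm(sales ~ marketing, data = sim_data)
    coef(fit_simple)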

This would yield the following coefficients (note: r denotes a regression coefficient, whereas beta denotes a true parameter in the structural equations):

This confirms that we get a vastly different effect from the one we were looking for (0.15).

Second approach: use an ML model with all available features

Running the regression with all available features:
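A sketch, assuming “all available features” means visits and competition alongside marketing:

    fit_all <- lm(sales ~ marketing + visits + competition, data = sim_data)
    coef(fit_all)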

We get the following coefficients:

Now it looks like marketing spend has almost no effect at all! Since we simulated the data from a set of linear equations, we know that using more sophisticated models (e.g. XGBoost, GAMs) can’t produce better results (I encourage the skeptical reader to try this out by re-running the Rmd script used to produce this report).

Maybe we should consider the relationships between features too…

We also notice that marketing probably affects visits to our site, and those visits in turn affect sales.

We can visualize these feature inter-dependencies with a directed acyclic graph (DAG):
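The original DAG figure is not reproduced here; below is a sketch of one graph consistent with the relationships described (competition confounds marketing and sales, and visits mediates the effect of marketing on sales), using the “dagitty” package:

    library(dagitty)

    # one DAG consistent with the relationships described in the text
    g <- dagitty("dag {
      competition -> marketing
      competition -> sales
      marketing -> visits
      visits -> sales
    }")
    plot(graphLayout(g))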

So it would make sense to account for the confounding competition by adding it to our regression. Adding visits to our model, however, somehow “blocks” or “absorbs” the effect of marketing on sales (visits is a mediator: it lies on the causal path from marketing to sales, so conditioning on it removes the very effect we’re after), so we should omit it from our model.

Fitting the model that adjusts for competition but omits visits:
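A sketch, again using the hypothetical sim_data:

    fit_adjusted <- lm(sales ~ marketing + competition, data = sim_data)
    coef(fit_adjusted)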

The fit yields the coefficients below:

Now we finally got the right effect estimate!

The way we got there was a bit shaky though. We came up with general concepts such as “confounding” and “blocking” of features. Trying to apply these to datasets consisting of tens of variables with complicated relationships would probably prove tremendously hard.

So now what? Causal inference!

In causal inference, the set of covariates we condition on (here competition, but not visits) is termed the “adjustment set”. Given a model DAG we can utilise various algorithms that rely on rules very much like the “confounding” and “blocking” notions mentioned above to find a correct adjustment set.
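For example, with the DAG g sketched earlier, the “dagitty” package should identify competition as the minimal adjustment set for the effect of marketing on sales:

    # valid adjustment sets for the total effect of marketing on sales,
    # computed from the DAG g defined in the sketch above
    adjustmentSets(g, exposure = "marketing", outcome = "sales")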

Backdoor criterion

Consider, for example, the DAG below, where we’re interested in finding the effect of x5 on x10:

Using the backdoor criterion (implemented in the R package “dagitty”) we can find the correct adjustment set:
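The DAG from the original post is not reproduced here, so the sketch below uses a made-up graph over x1, …, x10 purely to illustrate the call:

    library(dagitty)

    # a hypothetical DAG (not the one from the original post)
    g2 <- dagitty("dag {
      x1 -> x5
      x1 -> x10
      x2 -> x5
      x3 -> x10
      x5 -> x7
      x7 -> x10
    }")

    # covariates satisfying the backdoor criterion for the effect of x5 on x10
    adjustmentSets(g2, exposure = "x5", outcome = "x10")

For this made-up graph the only backdoor path from x5 to x10 runs through x1, so the minimal adjustment set is { x1 }.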

How to obtain model DAGs?

  • Use domain knowledge
  • Given a few candidate model DAGs one can perform statistical tests to compare their fit to the data at hand
  • Use search algorithms (e.g. those implemented in the R “mgm” or “bnlearn” packages); see the sketch after this list
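As a minimal illustration of the last option, a hill-climbing structure search with “bnlearn” on the simulated data (a sketch; search algorithms can only recover structure up to what the data identify, and without interventions some edge directions may come out wrong):

    library(bnlearn)

    # hill-climbing search for a DAG structure over the simulated data
    learned_dag <- hc(sim_data)
    modelstring(learned_dag)  # print the learned structure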

I’ll touch upon this subject in more depth in a future post.

Further reading

For a more in-depth introduction to causal inference and the DAG machinery, I’d recommend getting Pearl’s short book: Causal Inference in Statistics: A Primer

For the full script reproducing this post, see the Rmd script

You can also check out my blog post and join the discussion

I consider myself an applied researcher. When mining data for value I prefer going the shortest route but once in a while I dig up things worth sharing.