So here’s a mind blower: In some cases having more samples can actually reduce model performance. Don’t believe it? Neither did I! Read on to see how I demonstrate that phenomenon using a simulation study.
In a recent post on my personal blog I discussed a scalable sparse linear regression model I developed at work. One of its interesting properties is that it’s an interpolating model, meaning it has zero training error. This is because it’s overparameterised and thus can fit the training data perfectly.
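To make the zero-training-error point concrete, here is a minimal sketch in Python. It is not the sparse model from the post (which isn’t shown here); it is plain least squares with more features than samples, solved with the minimum-norm formula w = Xᵀ(XXᵀ)⁻¹y. The toy data, the helper names (`solve`, `min_norm_fit`, `predict`) and the choice of the minimum-norm solution are all assumptions made for illustration.

```python
import random


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Work on an augmented copy so the inputs are not mutated.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def min_norm_fit(X, y):
    """Minimum-norm least squares weights for a wide X (more columns than rows)."""
    n, p = len(X), len(X[0])
    # Gram matrix X X^T (n x n); invertible when the rows of X are linearly independent.
    gram = [[sum(X[i][k] * X[j][k] for k in range(p)) for j in range(n)] for i in range(n)]
    alpha = solve(gram, y)
    # w = X^T alpha
    return [sum(X[i][k] * alpha[i] for i in range(n)) for k in range(p)]


def predict(X, w):
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]


random.seed(0)
n_samples, n_features = 5, 20  # overparameterised: 20 weights, 5 observations
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [random.gauss(0, 1) for _ in range(n_samples)]  # pure noise targets

w = min_norm_fit(X, y)
train_mse = sum((yh - yi) ** 2 for yh, yi in zip(predict(X, w), y)) / n_samples
print(f"training MSE: {train_mse:.2e}")  # zero up to floating point error
```

Even though the targets here are pure noise, the fitted weights reproduce them on the training set, which is what “interpolating” means.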
While zero training error is usually associated with overfitting…
If you think there’s a typo in the subtitle think JLO instead (:
We all know Python’s popularity among DS practitioners has soared over the past few years, signalling both aspiring DS on the one hand and organisations on the other to favour Python over R in a snowballing dynamic.
One popular way of demonstrating the rise of Python is to plot the fraction of questions asked on Stack Overflow with the tag “pandas”, compared with “dplyr”:
R has many great tools for data wrangling. Two of those are the dplyr and data.table packages. While dplyr has very flexible and intuitive syntax, data.table can be orders of magnitude faster in some scenarios.
One of those scenarios is performing operations over a very large number of groups. This can happen, for example, when working with CRM data, where each row describes a touch point or transaction and one is interested in calculating the number of rows per customer, the monetary value of all transactions per customer, etc.
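As a rough illustration of the per-customer aggregation described above, here is a minimal Python sketch using only the standard library. The post itself is about dplyr and data.table in R; the customer IDs, amounts and variable names below are made up for the example.

```python
from collections import defaultdict

# Hypothetical CRM rows: one (customer_id, transaction_amount) pair per touch point.
transactions = [
    ("c1", 10.0),
    ("c2", 5.0),
    ("c1", 2.5),
    ("c3", 7.0),
    ("c2", 1.0),
]

n_rows = defaultdict(int)         # number of rows per customer
total_value = defaultdict(float)  # monetary value of all transactions per customer

# Single pass over the rows, updating both per-group aggregates.
for customer_id, amount in transactions:
    n_rows[customer_id] += 1
    total_value[customer_id] += amount

for customer_id in sorted(n_rows):
    print(customer_id, n_rows[customer_id], total_value[customer_id])
```

With millions of distinct customers the number of groups grows with the data, which is the regime where the choice of grouping engine starts to matter.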
Recently RStudio released version 1.0.0 of the dtplyr package, which provides a data.table…
Have you tried using the fread function from the data.table package in R? I think multiple benchmarks show it to be way faster than pandas at reading CSV files.
Originally published at https://iyarlin.github.io on April 20, 2020.
It is not often that I find myself thinking “man, I wish we had in R that cool python library!” (maybe because I don’t do much NLP/DL). That is however the case with the dowhy library which “provides a unified interface for causal inference methods and automatically tests many assumptions, thus making inference accessible to non-experts”.
Luckily enough though, the awesome folks at RStudio have written the reticulate package just for that sort of occasion: it “provides a comprehensive set of tools for interoperability between Python and R”.
In this post I’ll…
Originally published at https://iyarlin.github.io on January 21, 2020.
We’ve seen in a previous post that one of the main differences between classic ML and Causal Inference is the additional step of using the correct adjustment set for the predictor features.
In order to find the correct adjustment set we need a DAG that represents the relationships between all features relevant to our problem.
One way of obtaining the DAG would be consulting domain experts. That however makes the process less accessible to wide audiences and more manual in nature. …
Originally published at https://iyarlin.github.io on October 17, 2019.
I was really struggling to find a header pic for this post when I came across the one above, titled “Dag scoring and selection”, and since it’s sort of the topic of this post I decided to use it!
In my second post I stressed how important it is to use the correct adjustment set when trying to estimate a causal relationship between some treatment and exposure variables. The key to using the correct adjustment set is correctly identifying the underlying DAG generating a dataset.
A few approaches I can think…
Originally published at https://iyarlin.github.io on July 23, 2019.
I’ve discussed in several blog posts how Causal Inference involves making inferences about unobserved quantities and distributions (e.g. we never observe \(Y|do(x)\)). That means we can’t benchmark different algorithms on Causal Inference tasks (e.g. \(ATE/CATE\) estimation) the same way we do in ML, because we don’t have any ground truth to benchmark against.
This is yet another article about the never-ending “R vs Python” squabble.
The tl;dr version is:
OK, so “what is data science” is probably the title of even more posts than “R vs Python”. I’m going to circumvent that discussion by clarifying that throughout this post when I talk about data science I refer to generating value from data using machine learning, statistics and visualizations. Stuff like database setup and maintenance falls within the scope of data engineering…
For a version with fully rendered LaTeX expressions see my original blog post here.
In this post we finally get our hands dirty with a Kaggle-style Causal Inference algorithms bake-off! In this competition I’ll pit some well-known ML algorithms against a few specialized Causal Inference (CI) algorithms and find out who’s hot and who’s not!
So far we’ve learned that in order to estimate the causal dependence of…
I consider myself an applied researcher. When mining data for value I prefer going the shortest route but once in a while I dig up things worth sharing.