When you start discussing with people in machine learning, you quickly hear something like *“forget your econometric models, your GLMs, I can easily find a machine learning ‘model’ that can beat yours”. *I am usually very sceptical, especially when I hear “*easily*” or “*always*“. I have no problem about the fact that I use old econometric models, but I had the feeling that things aren’t that easy. I can understand that we might have problems when we do have a lot of features (I am still working on that, I’ll get back to this point soon), but **I have the feeling that I can still capture interactions, and non-linearities with standard econometric models as well as any machine learning algorithm**.

Just to illustrate, consider the following ‘*model*‘

where is (just to illustrate)

> n <- 5000 > rtf <- function(x1, x2) { sin(x1+x2)/(x1+x2) } > xgrid <- seq(1,6,length=31) > ygrid <- seq(1,6,length=31) > zgrid <- outer(xgrid,ygrid,rtf) > persp(xgrid,ygrid,zgrid,theta=30, phi=30, + col="green", ticktype="detailed",shade=TRUE)

Here

> df <- data.frame(x1=(runif(n, min=1, max=6)), + x2=(runif(n, min=1, max=6))) > df$m=rtf(df$x1, df$x2) > df$y=df$m+rnorm(n,sd=.07)

To visualise those observations on the graph above, use

> m=persp(xgrid,ygrid,zgrid*0+100,theta=30, + phi=30, col="white") > pt=trans3d(df$x1,df$x2,df$y,m) > points(pt$x,pt$y,cex=.7)

or, if we play with the color (blue on top, when is large, and red on the bottom, when is small)

How could we estimate , based on that sample ? What alternative to regression models can we get ? (that is more or less what I plan to discuss today).

I am an econometrician (that’s what I usually claim). What about a linear regression?

> reg=lm(y~x1+x2,data=df) > p=function(x1,x2) predict(reg,newdata=data.frame(x1=x1,x2=x2)) > xgrid=seq(1,6,length=31) > ygrid=seq(1,6,length=31) > zgrid=outer(xgrid,ygrid,p) > persp(xgrid,ygrid,zgrid,theta=30, phi=30, + col="green",shade=TRUE)

It is not great, let us be honest. But we shouldn’t be suprised. It is a nonlinear model, and we try to approximate that nonlinear function by some linear one. We should expect that nonlinearity cannot be captured be a linear regression. Especially when the model is not monotonous. Why not an additive model?

> library(splines) > reg=lm(y~bs(x1)+bs(x2),data=df) > p=function(x1,x2) predict(reg,newdata=data.frame(x1=x1,x2=x2))

It is slightly better. I said ‘slightly’. We need something else, here. Why not try a multivariate adaptive regression splines model (so called MARS)? It is claimed that those models “*can be seen as an extension of linear models that automatically models non-linearities and interactions between variables*“. That what we’re looking for, right?

> library(earth) > reg=earth(y~x1+x2,data=df) > p=function(x1,x2) predict(reg,newdata=data.frame(x1=x1,x2=x2))

We’re still flat, here. Why not use some bivariate splines, directly?

> library(mgcv) > reg=gam(y~s(x1,x2),data=df) > p=function(x1,x2) predict(reg,newdata=data.frame(x1=x1,x2=x2))

Here we’re good! Aren’t we? So with econometric models, it is possible to capture nonlinearities and interractions.

Now, what about alternatives? What are those machine learning algorithm everyone is talking about? A popular one is the regression tree,

> library(rpart) > reg=rpart(y~x1+x2,data=df,method="anova") > p=function(x1,x2) predict(reg,newdata=data.frame(x1=x1,x2=x2),type="vector") > zgrid=outer(xgrid,ygrid,p) > persp(xgrid,ygrid,zgrid,theta=30, phi=30, + col="green",shade=TRUE)

It is not that bad. Of course, it is simplistic, because with a tree, predictions are constant on rectangular areas. But still.

Using bagging, or random forests, we should get something nice

> library(randomForest) > reg=randomForest(y~x1+x2,data=df) > p=function(x1,x2) as.numeric(predict(reg,newdata= data.frame(x1=x1,x2=x2),type="response")) > zgrid=outer(xgrid,ygrid,p) > persp(xgrid,ygrid,zgrid,theta=30, phi=30, + col="green",shade=TRUE)

Not bad. What else can be consider? -nearest neighbours can be also an interesting alternative.

> k=10 > p=function(x1,x2){ + d=(df$x1-x1)^2+(df$x2-x2)^2 + return(mean(df$y[which(rank(d)<=k)])) + } > zgrid=outer(xgrid,ygrid,Vectorize(p)) > persp(xgrid,ygrid,zgrid,theta=30, phi=30, + col="green",shade=TRUE)

Here it is working well, too. Last, but not least, what about boosting techniques? To be honest, it still sounds like magic to me. I can get the heuristics about it (the fact that a set of weak learners can create a strong learner, somehow). I think I also understand the theoretical foundations of gradient boosting techniques – but to be honest, it took me some time, since a gradient descent is usually based on Taylor’s expansion

but here, we have something like

where the part on the right is interpreted as a gradient (but in a much larger space, since it is a functional space). Weak learners – here decision stumps – are used to approximate that complicated function. Why not…

Anyway, I think I understand what this boosting idea is, but so far, I have not been able to write my own function (where I loop, and where I grow a tree on the residuals, and then create new residuals, and another tree, etc). I am still working on it, but if anyone is willing to share some R codes (DIY boosting regression tree), I am interested.

If we use some R functions, we can get

> library(dismo) > reg=gbm.step(data=df, gbm.x = 1:2, gbm.y = 4, + family = "gaussian", tree.complexity = 5, + learning.rate = 0.01, bag.fraction = 0.5) GBM STEP - version 2.9 Performing cross-validation optimisation of a boosted regression tree model for y and using a family of gaussian Using 5000 observations and 2 predictors creating 10 initial models of 50 trees folds are unstratified total mean deviance = 0.0182 tolerance is fixed at 0 ntrees resid. dev. 50 0.0136 now adding trees... 100 0.0109 150 0.0093 200 0.0082 250 0.0075 300 0.0069 350 0.0065

(etc)

1800 0.005 1850 0.005 1900 0.005 1950 0.005 fitting final gbm model with a fixed number of 1450 trees for y mean total deviance = 0.018 mean residual deviance = 0.003 estimated cv deviance = 0.005 ; se = 0 training data correlation = 0.924 cv correlation = 0.852 ; se = 0.013 elapsed time - 0.09 minutes

To visualize the output, use

> p=function(x1,x2) predict(reg,newdata=data.frame(x1=x1,x2=x2), + n.trees=1400) > zgrid=outer(xgrid,ygrid,p) > persp(xgrid,ygrid,zgrid,theta=30, phi=30, + col="green",shade=TRUE)

Nice isn’t it? So clearly, there are alternatives to standard regression models. But econometric models aren’t that bad, when we take into account nonlinearities. And so far, I still feel more confortable with my good old econometric models (but I am still learning about those alternatives).

Apologies, it is Nov-15-2018 and I just saw this.

As one of the “machine learning people”, I agree with your skepticism. Seriously, how much probability theory could any analyst know when they promise results using words such as “always”?

Which model is the “best”? It depends on the problem and the needs of the audience. Some of those “wide” models (random forests, gradient boosted whatever, etc.) can fit data very well, but they produce bulky, impenetrable results. In some fields, such as credit scoring, regulators (and customers and their lawyers) expect a simple explanation of why people have been denied credit. Even deploying those fat models can be computationally burdensome. The wide model which won the Netflix Prize was too fat to deploy.

Further, some people forget the virtue of simpler parametric models: When applicable, they make more efficient use of the available data budget. Often, they extrapolate better, too.

Interesting Post. Albeit the situation is of course a bit artificial. 😉 In many situations RandomForest or SVM would do better.

To compare more different Machine Learning Algorithms, the “mlr” package is very useful, it merges a lot packages of Regression, Classification and Clusteralgorithms, which are implemented in R.

Fantastic work and beautifully explained, excellent code.

Very nice blog-post! When reading this post, I wondered what would be a good approach to decide which model to use, when you have multiple predictors (let’s say, 5 to 10). If I understand your figures right, these only work with two predictors (due to the limitation to three dimensions)?

In short: is there a visual decision aid, whether I should use a linear model or a non-linear-model or a logistic regression (when I have x1 to x5)?

yes, it works fine with 2,

in higher dimension, it will be more difficult to fit a spline

I will try to find some work to work on that !

Great post, Arthur.

I have some questions,

#1, In this line

m=persp(xgrid,ygrid,zgrid*0+100,theta=30, phi=30, col=”white”)

Error in persp.default(xgrid, ygrid, zgrid * 0 + 100, theta = 30, phi = 30, : invalid ‘z’ limits

I change zgrid*0 + 100 to zgrid, the code run ok

#2, how you produce the graph with colors

Thank you so much

m=persp(xgrid,ygrid,zgrid*0+100,theta=30, phi=30, col=”white”,zlim=c(-.5,.5),

ticktype=”detailed”,xlab=””,ylab=””,zlab=””)

pt=trans3d(df$x1,df$x2,df$y,m)

v_col=c(rgb(1,0,0,(20:0)/20),rgb(0,0,1,(0:20)/20))

points(pt$x,pt$y,cex=.7,col=v_col[round(df$y*40)+20])

It works. Thank you so much, Author.

I learn a lot from your post

nice post. I’ve always felt that if you needed to leave the linear world for something better, you should at least test your assumptions and look mote closely at the interactions, which are often artificial and optimized

@Arthur

Good point…it was a poor wording choice on my behalf. I’ve also done bootsraped regression and relative variable importance is really valuable in those cases also.

Great post! It would be interesting to see when randomly hold out some observations, how well these different models predict that holdout sample.

yes, that would be the second step (I was discussing that issue on twitter yesterday, actually)

Great post, Arthur! My question is: what about ease of interpretation? Whereas your GAM models seems to achieve results that are as good as those obtained with randomforests, how can we interpret them? Social scientists are usually more worried about being able to interpret coefficients and estimations than having high predictive accuracy. Is there any way to solve that, or when it comes to interpretation both GAM and randomforests are more or less the same (black boxes)?

Thanks! 🙂

@Danillo, most ML techniques have a way to assess variable importance. It’s very different from traditional regression coefficients, but they can be utilized to assess which predictors most influenced the variance in the response variable.

I have to admit that I like this idea of variable importance…. But actually, using bootstrap and regression, it is also possible…. I do not see why it should be called ML techniques…. Using bootstrap is natural for me, as an econometrician (who like nonparametric statistics)

Thanks a lot for your replies, @Arthur and @Abraham. Do you guys have any particular suggestion on variable importance? I’m interested in the topic but know nothing about it. Thanks again!

There are a lot of variable importance measures. You can find a good overview here:

http://berndbischl.github.io/mlr/tutorial/html/filter_methods/index.html

Mutual Information (information.gain in FSelector package) is a very common one.