Motivation

There are dozens of machine learning algorithms out there. It is impossible to learn all their mechanics; however, many algorithms sprout from the most established ones, e.g. ordinary least squares, gradient boosting, support vector machines, tree-based algorithms and neural networks. At STATWORX we discuss algorithms daily to evaluate their benefits for a specific project. In any case, understanding these core algorithms is key to understanding most machine learning algorithms in the literature.

Why bother writing from scratch?

While I like reading machine learning research papers, the maths is sometimes hard to follow. That is why I like implementing the algorithms in R by myself. Of course, this means digging through the maths and the algorithms as well. However, you can challenge your understanding of the algorithm directly.

In my last blog posts, I introduced two machine learning algorithms in 150 lines of R Code. You can find the other blog posts about coding gradient boosted machines and regression trees from scratch on our blog or in the readme on my GitHub. This blog post is about the random forest, which is probably the most prominent machine learning algorithm. You have probably noticed the asterisk (*) in the title. These things usually suggest that something is off. Like seeing a price for a cell phone plan in a TV commercial and learning in the fine print that it only applies if you have successfully climbed Mount Everest and got three giraffes on your yacht. And yes, your suspicion is justified; unfortunately, the 100 lines of code only apply if we do not count the code of the regression tree algorithm, which is essential for a random forest. That is why I would strongly recommend reading the blog post about regression trees, if you are not familiar with the regression tree algorithm.

Understanding ML with simple and accessible code

In this series we try to produce very accessible code, i.e. it won't deliver state-of-the-art performance. It is instead designed to be generic and easily readable.

Admittedly, there are tons of great articles out there which explain random forests theoretically, accompanied by a hands-on example. That is not the objective of this blog post. If you are interested in a hands-on tutorial with all the necessary theory, I strongly recommend this tutorial. The objective of this blog post is to establish the theory of the algorithm by writing simple R code. The only thing you need to know, besides the fundamentals of a regression tree, is our objective: We want to estimate our real-valued target (y) with a set of real-valued features (X).

Reduce dimensionality with Feature Importance

Fortunately, we do not have to cover too much maths in this tutorial, since that part is already covered in the regression tree tutorial. However, there is one part which I have added to the code since the last blog post. Regression trees, and hence random forests, as opposed to a standard OLS regression, can neglect unimportant features in the fitting process. This is one significant advantage of tree-based algorithms and something which should be covered in our basic algorithm. You can find this enhancement in the new reg_tree_imp.R script on GitHub. We use this function to sprout trees in our forest later on.

Before we jump into the random forest code, I would like to touch very briefly on how we can compute feature importance in a regression tree. Surely, there are tons of ways to calculate feature importance; the following approach is, however, quite intuitive and straightforward.

Evaluating the goodness of a split

A regression tree splits the data by choosing the feature which minimizes a certain criterion, e.g. the squared error of our prediction. Of course, it is possible that some features will never be chosen for a split, which makes calculating their importance very easy. However, how can we compute importance for the chosen features? A first shot could be to count the number of splits for each feature and divide it by the total number of splits. This measure is simple and intuitive, but it cannot quantify how impactful the splits were. That is something we can accomplish with a simple but more sophisticated metric: a weighted goodness of fit. We start by defining our goodness of fit for each node, for instance the mean squared error, which is defined as:

MSE_{node} = \frac{1}{n_{node}} \sum_{i = 1}^{n_{node}} (y_i - \bar{y}_{node})^2

This metric describes the average squared error we make when we estimate our target y_i with our predictor, the average target value in the current node \bar{y}_{node}. Now we can measure the improvement gained by splitting the data on the chosen feature, by comparing the goodness of fit of the parent node with the goodness of fit of its children nodes. Essentially, this is the same step we performed to evaluate the best splitting feature for the current node.

Weighting the impact of a split

Splits at the top of the tree are more impactful, as more data reaches the nodes at this stage of the tree. That's why it makes sense to put more weight on earlier splits by taking into account the number of observations which reached this node.

w_{node} = \frac{n_{node}}{n}

This weight describes the number of observations in the current node relative to the total number of observations. Combining the results above, we can derive the weighted improvement of a splitting feature p in a single node as:

Quantifying improvements by splitting the data in a regression tree

Improvement^{[p]}_{node} = w_{node}MSE_{node} - (w_{left}MSE_{left} + w_{right}MSE_{right})

This weighted improvement is calculated at every node which was split on the respective splitting feature p. To make the improvement easier to interpret, we sum up the improvements of each feature in our tree and normalize them by the overall improvement in our tree.

Quantifying the importance of a feature in a regression tree

Importance^{[p]} = \frac{\sum_{node} Improvement^{[p]}_{node}}{\sum_{q} \sum_{node} Improvement^{[q]}_{node}}

This is the final feature importance measure used within the regression tree algorithm. Again, you can follow these steps within the code of the regression tree. I have tagged all variables, functions and column names involved in the feature importance calculation with imp_* or IMP_*, which should make it a little easier to follow.
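To make the bookkeeping concrete, here is a minimal sketch of the calculation in R. All object names below (node_y, mse, w, imp_nodes) are illustrative and not taken from reg_tree_imp.R:

node_y <- c(2, 3, 10, 11)            # target values reaching the parent node
left   <- node_y[1:2]                # left child after splitting on feature x1
right  <- node_y[3:4]                # right child

mse <- function(y) mean((y - mean(y))^2)
w   <- function(y, n_total) length(y) / n_total

# weighted improvement of this single split (n = 4 observations in total)
improvement_x1 <- w(node_y, 4) * mse(node_y) -
  (w(left, 4) * mse(left) + w(right, 4) * mse(right))

# one row per split; summing per feature and normalizing yields the importance
imp_nodes <- data.frame(FEATURES    = c("x1", "x2"),
                        IMPROVEMENT = c(improvement_x1, 0.5))
imp <- aggregate(IMPROVEMENT ~ FEATURES, FUN = sum, data = imp_nodes)
imp$IMPORTANCE <- imp$IMPROVEMENT / sum(imp$IMPROVEMENT)
imp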

The Random Forest Algorithm

All right, enough with the regression tree and importance – we are interested in the forest in this blog post. The objective of a random forest is to combine many regression or decision trees. Such a combination of single trees is referred to as an ensemble technique. The idea behind this technique is very simple but yet very powerful.

Building a regression tree orchestra

In a symphonic orchestra, different groups of instruments are combined to form an ensemble, which creates more powerful and diverse harmonies. Essentially, it is the same in machine learning, because every regression tree we sprout in a random forest gets the chance to explore the data from a different angle. Our regression tree orchestra thus has different views on the data, which makes the combination very powerful and diverse, as opposed to a single regression tree.

Simplicity of a random forest

If you are not familiar with the mechanics of the algorithm, you probably think that the code gets very complicated and hard to follow. Well, to me the amazing part of this algorithm is how simple and yet effective it is. The coding part is not as challenging as you might think. Like in the other blog posts, we take a look at the whole code first and then go through it bit by bit.

The algorithm at one glimpse

#' reg_rf
#' Fits a random forest with continuous scaled features and target 
#' variable (regression)
#'
#' @param formula an object of class formula
#' @param n_trees an integer specifying the number of trees to sprout
#' @param feature_frac a numeric value within [0,1] which
#'                     specifies the percentage of total features to be used in
#'                     each regression tree
#' @param data a data.frame or matrix
#'
#' @importFrom plyr raply
#' @return
#' @export
#'
#' @examples # Complete runthrough see: www.github.com/andrebleier/cheapml
reg_rf <- function(formula, n_trees, feature_frac, data) {

  # source the regression tree function
  source("algorithms/reg_tree_imp.R")

  # load plyr
  require(plyr)

  # define function to sprout a single tree
  sprout_tree <- function(formula, feature_frac, data) {
    # extract features
    features <- all.vars(formula)[-1]

    # extract target
    target <- all.vars(formula)[1]

    # bag the data
    # - randomly sample the data with replacement (duplicates are possible)
    train <-
      data[sample(1:nrow(data), size = nrow(data), replace = TRUE), ]

    # randomly sample features
    # - only fit the regression tree with feature_frac * 100 % of the features
    features_sample <- sample(features,
                              size = ceiling(length(features) * feature_frac),
                              replace = FALSE)

    # create new formula
    formula_new <-
      as.formula(paste0(target, " ~ ", paste0(features_sample,
                                              collapse =  " + ")))

    # fit the regression tree
    tree <- reg_tree_imp(formula = formula_new,
                         data = train,
                         minsize = ceiling(nrow(train) * 0.1))

    # save the fit and the importance
    return(list(tree$fit, tree$importance))
  }

  # apply the sprout_tree function n_trees times with plyr::raply
  # - track the progress with a progress bar
  trees <- plyr::raply(
    n_trees,
    sprout_tree(
      formula = formula,
      feature_frac = feature_frac,
      data = data
    ),
    .progress = "text"
  )

  # extract fit
  fits <- do.call("cbind", trees[, 1])

  # calculate the final fit as a mean of all regression trees
  rf_fit <- apply(fits, MARGIN = 1, mean)

  # extract the feature importance
  imp_full <- do.call("rbind", trees[, 2])

  # build the mean feature importance between all trees
  imp <- aggregate(IMPORTANCE ~ FEATURES, FUN = mean, imp_full)

  # build the ratio for interpretation purposes
  imp$IMPORTANCE <- imp$IMPORTANCE / sum(imp$IMPORTANCE)

  # export
  return(list(fit = rf_fit,
              importance = imp[order(imp$IMPORTANCE, decreasing = TRUE), ]))
}

As you have probably noticed, our algorithm can be roughly divided into two parts: first, the function sprout_tree(), and second, some lines of code which call this function and process its output. Let us now work through the code chunk by chunk.

# source the regression tree function
  source("algorithms/reg_tree_imp.R")

  # load plyr
  require(plyr)

  # define function to sprout a single tree
  sprout_tree <- function(formula, feature_frac, data) {
    # extract features
    features <- all.vars(formula)[-1]

    # extract target
    target <- all.vars(formula)[1]

    # bag the data
    # - randomly sample the data with replacement (duplicates are possible)
    train <-
      data[sample(1:nrow(data), size = nrow(data), replace = TRUE), ]

    # randomly sample features
    # - only fit the regression tree with feature_frac * 100 % of the features
    features_sample <- sample(features,
                              size = ceiling(length(features) * feature_frac),
                              replace = FALSE)

    # create new formula
    formula_new <-
      as.formula(paste0(target, " ~ ", paste0(features_sample,
                                              collapse =  " + ")))

    # fit the regression tree
    tree <- reg_tree_imp(formula = formula_new,
                         data = train,
                         minsize = ceiling(nrow(train) * 0.1))

    # save the fit and the importance
    return(list(tree$fit, tree$importance))
  }

Sprouting a single regression tree in a forest

The first part of the code is the sprout_tree() function, which is just a wrapper for the regression tree function reg_tree_imp(), which we source as the first action of our code. Then we extract our target and the features from the formula object.

Creating different angles by bagging the data

Afterwards, we bag the data, which means we randomly sample our data with replacement. Remember when I said every tree will look at our data from a different angle? Well, this is the part where we create the angles. Random sampling with replacement effectively generates weights on our observations: a specific observation could be repeated ten times in the data set of one tree, while the next tree might lose this observation completely. Furthermore, there is another way of creating different angles in our trees: feature sampling.
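Before we move on to feature sampling, here is a tiny illustration of what bagging does to the row indices. This snippet is not part of the original code; it only shows that some observations are drawn several times while others are left out:

set.seed(42)
idx <- sample(1:10, size = 10, replace = TRUE)
table(idx)   # counts greater than 1 mark duplicated rows; missing indices were dropped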

Solving collinearity issues between trees in a forest

From our complete feature set X we randomly sample feature_frac * 100 percent of the features to reduce the dimensionality. With feature sampling we can a) compute faster and b) capture angles on our data from supposedly weaker features as we decrease the value of feature_frac. Suppose we have some degree of multicollinearity between our features. It might happen that the regression tree selects only one specific feature of such a group if we use every feature in every tree.

However, supposedly weaker features with less improvement could bear new valuable information for the model, but are not granted the chance. This is something you can achieve by lowering the dimensionality with the argument feature_frac. If the objective of your analysis is feature selection, e.g. feature importance, you might want to set this parameter to 80%-100%, as you will get a more clear-cut selection. Well, the rest of the function is fitting the regression tree and exporting the fitted values as well as the importance.

Sprouting a forest

In the next code chunk we start applying the sprout_tree() function n_trees times with the help of plyr::raply(). This function repeatedly applies a function call with the same arguments and combines the results. Remember, we do not need to change anything in the sprout_tree() function, since the angles are created randomly every time we call the function.

# apply the sprout_tree function n_trees times with plyr::raply
  # - track the progress with a progress bar
  trees <- plyr::raply(
    n_trees,
    sprout_tree(
      formula = formula,
      feature_frac = feature_frac,
      data = data
    ),
    .progress = "text"
  )

  # extract fit
  fits <- do.call("cbind", trees[, 1])

  # calculate the final fit as a mean of all regression trees
  rf_fit <- apply(fits, MARGIN = 1, mean)

  # extract the feature importance
  imp_full <- do.call("rbind", trees[, 2])

  # build the mean feature importance between all trees
  imp <- aggregate(IMPORTANCE ~ FEATURES, FUN = mean, imp_full)

  # build the ratio for interpretation purposes
  imp$IMPORTANCE <- imp$IMPORTANCE / sum(imp$IMPORTANCE)

  # export
  return(list(fit = rf_fit,
              importance = imp[order(imp$IMPORTANCE, decreasing = TRUE), ]))

Afterwards, we combine the single regression tree fits into a data frame. By calculating the row means we take the average fitted value of every regression tree in our forest as the final fit. Our last action is to calculate the feature importance of our ensemble. That is the mean feature importance of a feature over all trees, normalized by the overall mean importance of all variables.

Applying the algorithm

Let us apply the function to see whether the fit is indeed better compared to a single regression tree. Additionally, we can check out the feature importance. I have created a little example on GitHub, which you can check out. First, we simulate data with the Xy package. In this simulation, five linear variables were used to create our target y. To make it a little spicier, we add five irrelevant variables, which were created in the simulation as well. The challenge now, of course, is whether the algorithm will use any irrelevant feature or whether it can perfectly identify the important features. Our formula is:

eq <- y ~ LIN_1 + LIN_2 + LIN_3 + LIN_4 + LIN_5 + NOISE_1 + NOISE_2 + NOISE_3 + NOISE_4 + NOISE_5
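To make the runthrough concrete, here is a hedged sketch of how the calls could look. The object sim_data is an assumed name for the simulated data set (it is not taken from the GitHub example), the parameter values are illustrative, and it is assumed that reg_rf() and reg_tree_imp() are available in the session with the working directory set to the cheapml repository, so that reg_rf() can source the tree script and reg_tree_imp() returns the fitted values in $fit, as used inside sprout_tree():

# fit the random forest and a single regression tree on the simulated data
mod_rf   <- reg_rf(formula = eq, n_trees = 50, feature_frac = 0.8, data = sim_data)
mod_tree <- reg_tree_imp(formula = eq, data = sim_data,
                         minsize = ceiling(nrow(sim_data) * 0.1))

# compare the in-sample fit of both models with the mean squared error
mse <- function(y, yhat) mean((y - yhat)^2)
mse(sim_data$y, mod_rf$fit)
mse(sim_data$y, mod_tree$fit)

# inspect which features the forest deemed important
mod_rf$importance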

The power of the forest

Neither the random forest nor the regression tree selected any unnecessary features. However, the regression tree was only split on the two most important variables, whereas the random forest selected all five relevant features.

variable importance

This does not mean that the regression tree is not able to find the right answer. It depends on the minimum observation size (minsize) of the tree. Surely the regression tree would eventually find all important features if we lower the minimum size. The random forest, however, found all five essential features with the same minimum size.

Learning strengths and weaknesses

Of course, this was only one example. I would strongly recommend playing around with the functions and examples by yourself, because only then will you get a feel for the algorithms and where they shine or fail. Feel free to clone the GitHub repository and play around with the examples. Simulation is always nice because you get a clear-cut answer; however, applying the algorithms to familiar real-world data might be beneficial to you as well.

Motivation

There are dozens of machine learning algorithms out there. It is impossible to learn all their mechanics; however, many algorithms sprout from the most established ones, e.g. ordinary least squares, gradient boosting, support vector machines, tree-based algorithms and neural networks. At STATWORX we discuss algorithms daily to evaluate their usefulness for a specific project. In any case, understanding these core algorithms is key to understanding most machine learning algorithms in the literature.

While I like reading machine learning research papers, the maths is sometimes hard to follow. That is why I like implementing the algorithms in R by myself. Of course, this means digging through the maths and the algorithms as well. However, you can challenge your understanding of the algorithm directly.

In my two subsequent blog posts I will introduce two machine learning algorithms in 150 lines of R Code. This blog post will be about regression trees, which are the foundation of most tree-based algorithms. You can find the other blog post about coding gradient boosted machines from scratch on our blog. The algorithms will cover all core mechanics, while being very generic. You can find all code on my GitHub.

Gathering all puzzle pieces

regression tree

Surely, there are tons of great articles out there which explain regression trees theoretically, accompanied by a hands-on example. This is not the objective of this blog post. If you are interested in a hands-on tutorial with all necessary theory, I strongly recommend this tutorial. The objective of this blog post is to establish the theory of the algorithm by writing simple R code. You do not need any prior knowledge of the algorithm to follow. The only thing you need to know is our objective: We want to estimate our real-valued target (y) with a set of real-valued features (X).

Most probably you are already familiar with decision trees, a machine learning algorithm to solve classification tasks. As the name itself states, regression trees solve regression problems, i.e. estimation of continuously scaled targets. These kinds of trees are the key part of every tree-based method, since the way you grow a tree is largely the same. The parts which differ among implementations are mostly the splitting rules. In this tutorial we will program a very simple, but generic implementation of a regression tree.

Fortunately, we do not have to cover much maths in this tutorial, because the algorithm itself is rather a technical than a mathematical challenge. With that said, the technical path I have chosen might not be the most efficient one, but I tried to trade off efficiency against simplicity.

Anyway, as most of you might know, decision or regression trees are rule-based approaches, meaning we are trying to split the data into partitions conditional on our feature space. The data partitioning is done with the help of a splitting criterion. There is no common ground on how to do those splits; rather, there are multiple different splitting criteria with different pros and cons. We will focus on a rather simple criterion in this tutorial. Bear with me, here comes some maths.

SSE = \sum_{i \in S_1} (y_i - \bar{y}_1)^2 + \sum_{i \in S_2} (y_i - \bar{y}_2)^2

Ok, so what does this state? This is the sum of squared errors determined in two different subsets (S_1 and S_2). As the name suggests, that is something we want to minimize. In fact, it is the sum of squared distances between the targets and their mean within each data subset. In every node of our regression tree we calculate the SSE for every potential split of every feature to figure out the best split we can achieve.

Let us have a look at the R Code:

# This is the splitting criterion we minimize (SSE [Sum Of Squared Errors]):
# SSE = sum{i in S_1} (y_i - bar(y)_1)^2 + sum{i in S_2} (y_i - bar(y)_2)^2
sse_var <- function(x, y) {
  splits <- sort(unique(x))
  sse <- c()
  for (i in seq_along(splits)) {
    sp <- splits[i]
    sse[i] <- sum((y[x < sp] - mean(y[x < sp]))^2) +
              sum((y[x >= sp] - mean(y[x >= sp]))^2) 
  }
  split_at <- splits[which.min(sse)]
  return(c(sse = min(sse), split = split_at))
}

The function takes two inputs: our numeric feature x and our real-valued target y. We then go ahead and calculate the SSE for every unique value of x. This means we calculate the SSE for every possible data subset we could obtain conditional on the feature. Often we want to cover more than one feature in our problem, which means that we have to run this function for every feature. As a result, the best splitting rule has the lowest SSE among all possible splits of all features. Once we have determined the best splitting rule, we can split our data into two subsets according to our criterion, which is nothing else than x < split_at and x >= split_at. We call these two subsets children, and they can again be split into further subsets.
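Here is a quick illustration with toy data (not from the original post) of what sse_var() returns:

set.seed(42)
x <- runif(100, 0, 10)
y <- ifelse(x < 5, 2, 8) + rnorm(100, sd = 0.5)
sse_var(x, y)
# returns the minimum SSE together with the threshold (close to 5) at which to split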

Let us spend a few more words on the SSE though, because it reveals our estimator. In this implementation our estimator in the leaf is simply the average value of our target within this data subset. This is the simplest version of a regression tree. However, with some additional work you can apply more sophisticated models, e.g. an ordinary least squares fit.
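As a hedged sketch of that idea (not part of the original reg_tree() code), a leaf estimator could be swapped in like this; leaf_data and leaf_formula are hypothetical names for the data subset and the model of a single leaf:

# fit a small OLS model inside a leaf instead of using the plain mean
leaf_estimator <- function(leaf_data, leaf_formula) {
  fit <- lm(leaf_formula, data = leaf_data)
  predict(fit, newdata = leaf_data)
}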

The Algorithm

Enough with the talking, let's get to the juice. In the following you will see the algorithm in all of its beauty. Afterwards we will break the algorithm down into easy-to-digest code chunks.

#' reg_tree
#' Fits a simple regression tree with SSE splitting criterion. The estimator function
#' is the mean.
#' 
#' @param formula an object of class formula
#' @param data a data.frame or matrix
#' @param minsize a numeric value indicating the minimum size of observations
#'                in a leaf
#'
#' @return \itemize{
#' \item tree - the tree object containing all splitting rules and observations
#' \item fit - our fitted values, i.e. the mean of y within each leaf
#' \item formula - the underlying formula
#' \item data - the underlying data
#' }
#' @export
#'
#' @examples # Complete runthrough see: www.github.com/andrebleier/cheapml
reg_tree <- function(formula, data, minsize) {
  
  # coerce to data.frame
  data <- as.data.frame(data)
  
  # handle formula
  formula <- terms.formula(formula)
  
  # get the design matrix
  X <- model.matrix(formula, data)
  
  # extract target
  y <- data[, as.character(formula)[2]]
  
  # initialize while loop
  do_splits <- TRUE
  
  # create output data.frame with splitting rules and observations
  tree_info <- data.frame(NODE = 1, NOBS = nrow(data), FILTER = NA,
                          TERMINAL = "SPLIT",
                          stringsAsFactors = FALSE)
  
  # keep splitting until there are only leafs left
  while(do_splits) {
    
    # which parents have to be splitted
    to_calculate <- which(tree_info$TERMINAL == "SPLIT")

    for (j in to_calculate) {

      # handle root node
      if (!is.na(tree_info[j, "FILTER"])) {
        # subset data according to the filter
        this_data <- subset(data, eval(parse(text = tree_info[j, "FILTER"])))
        # get the design matrix
        X <- model.matrix(formula, this_data)
      } else {
        this_data <- data
      }

      # estimate splitting criteria with the target values of the current node
      splitting <- apply(X,  MARGIN = 2, FUN = sse_var,
                         y = this_data[, as.character(formula)[2]])

      # get the min SSE
      tmp_splitter <- which.min(splitting[1,])

      # define maxnode
      mn <- max(tree_info$NODE)
      
      # paste filter rules
      tmp_filter <- c(paste(names(tmp_splitter), ">=", 
                            splitting[2,tmp_splitter]),
                      paste(names(tmp_splitter), "<", 
                            splitting[2,tmp_splitter]))
      
      # Error handling! check if the splitting rule has already been invoked
      split_here  <- !sapply(tmp_filter,
                             FUN = function(x,y) any(grepl(x, x = y)),
                             y = tree_info$FILTER)

      # append the splitting rules
      if (!is.na(tree_info[j, "FILTER"])) {
        tmp_filter  <- paste(tree_info[j, "FILTER"],
                             tmp_filter, sep = " & ")
      }

      # get the number of observations in current node
      tmp_nobs <- sapply(tmp_filter,
                         FUN = function(i, x) {
                           nrow(subset(x = x, subset = eval(parse(text = i))))
                         },
                         x = this_data)

      # insufficient minsize for split
      if (any(tmp_nobs <= minsize)) {
        split_here <- rep(FALSE, 2)
      }

      # create children data frame
      children <- data.frame(NODE = c(mn+1, mn+2),
                             NOBS = tmp_nobs,
                             FILTER = tmp_filter,
                             TERMINAL = rep("SPLIT", 2),
                             row.names = NULL)[split_here,]

      # overwrite state of current node
      tree_info[j, "TERMINAL"] <- ifelse(all(!split_here), "LEAF", "PARENT")

      # bind everything
      tree_info <- rbind(tree_info, children)

      # check if there are any open splits left
      do_splits <- !all(tree_info$TERMINAL != "SPLIT")
    } # end for
  } # end while
  
  # calculate fitted values
  leafs <- tree_info[tree_info$TERMINAL == "LEAF", ]
  fitted <- c()
  for (i in seq_len(nrow(leafs))) {
    # extract index
    ind <- as.numeric(rownames(subset(data, eval(parse(text = leafs[i, "FILTER"])))))
    # estimator is the mean y value of the leaf
    fitted[ind] <- mean(y[ind])
  }

  # return everything
  return(list(tree = tree_info, fit = fitted, formula = formula, data = data))
}

Uff, that was a lot of scrolling. Ok, so what do we have here? At first glance we see two loops, a while loop and a for loop, which are essential to the algorithm. But let us start bit by bit. Let's have a look at the first code chunk.

  # coerce to data.frame
  data <- as.data.frame(data)

  # handle formula
  formula <- terms.formula(formula)

  # get the design matrix
  X <- model.matrix(formula, data)

  # extract target
  y <- data[, as.character(formula)[2]]

  # initialize while loop
  do_splits <- TRUE

  # create output data.frame with splitting rules and observations
  tree_info <- data.frame(NODE = 1, NOBS = nrow(data), FILTER = NA,
                          TERMINAL = "SPLIT",
                          stringsAsFactors = FALSE)

Essentially, this is everything before the while loop. Here, we see some input handling in the beginning and the extraction of our design matrix X and our target y. All our features are within this design matrix. do_splits is the condition for our while loop, which we will cover in a bit. The data frame tree_info is the key storage element in our algorithm, because it will contain every piece of information we need about our tree. The essential piece of this object is the FILTER column. This column saves the paths (filters) we have to apply to our data to get to a leaf (a terminal node) in our regression tree. We initiate this data.frame with NODE = 1, since at the beginning we are in the root node of our tree with the whole data set at our disposal. Furthermore, there is a column called TERMINAL, which controls the state of the node. We have three different states: SPLIT, LEAF and PARENT. When we describe a node with the SPLIT state, we mark it for a potential split. The state PARENT indicates that we have already split this node. Lastly, the state LEAF marks terminal nodes of our regression tree.

Reaching the treetop

When do we reach such a terminal node? In many implementations there is a minimum size parameter, where we determine valid splits by the number of observations in the children. If the children have fewer data points than the minimum size, the split is invalid and will not be done. Imagine not having this parameter: we could end up with leafs covering only one observation. Another termination rule applies if the lowest SSE belongs to a split we have already invoked within this branch. A split at such a point would be nonsense, since we would end up with the same subset over and over again.

Basically, this is everything there is to the algorithm. The while and for loop just ensure that we estimate and create every node in our tree_info. Thus, you can perceive tree_info as a sort of job list, since we create new jobs within this data frame. Let us walk through the actual R code.

  # keep splitting until there are only leafs left
  while(do_splits) {

    # which parents have to be splitted
    to_calculate <- which(tree_info$TERMINAL == "SPLIT")
    
    for (j in to_calculate) {
      
      # handle root node
      if (!is.na(tree_info[j, "FILTER"])) {
        # subset data according to the filter
        this_data <- subset(data, eval(parse(text = tree_info[j, "FILTER"])))
        # get the design matrix
        X <- model.matrix(formula, this_data)
      } else {
        this_data <- data
      }
      
      # estimate splitting criteria with the target values of the current node
      splitting <- apply(X,  MARGIN = 2, FUN = sse_var,
                         y = this_data[, as.character(formula)[2]])
      
      # get the min SSE
      tmp_splitter <- which.min(splitting[1,])
      
      # define maxnode
      mn <- max(tree_info$NODE)

      # paste filter rules
      tmp_filter <- c(paste(names(tmp_splitter), ">=",
                            splitting[2,tmp_splitter]),
                      paste(names(tmp_splitter), "<",
                            splitting[2,tmp_splitter]))

      # Error handling! check if the splitting rule has already been invoked
      split_here  <- !sapply(tmp_filter,
                             FUN = function(x,y) any(grepl(x, x = y)),
                             y = tree_info$FILTER)
      
      # append the splitting rules
      if (!is.na(tree_info[j, "FILTER"])) {
        tmp_filter  <- paste(tree_info[j, "FILTER"], 
                             tmp_filter, sep = " & ")
      } 
      
      # get the number of observations in current node
      tmp_nobs <- sapply(tmp_filter,
                         FUN = function(i, x) {
                           nrow(subset(x = x, subset = eval(parse(text = i))))
                         },
                         x = this_data)  
      
      # insufficient minsize for split
      if (any(tmp_nobs <= minsize)) {
        split_here <- rep(FALSE, 2)
      }
      
      # create children data frame
      children <- data.frame(NODE = c(mn+1, mn+2),
                             NOBS = tmp_nobs,
                             FILTER = tmp_filter,
                             TERMINAL = rep("SPLIT", 2),
                             row.names = NULL)[split_here,]
      
      # overwrite state of current node
      tree_info[j, "TERMINAL"] <- ifelse(all(!split_here), "LEAF", "PARENT")
       
      # bind everything
      tree_info <- rbind(tree_info, children)
      
      # check if there are any open splits left
      do_splits <- !all(tree_info$TERMINAL != "SPLIT")
    } # end for
  } # end while

The while loop covers our tree depth and the for loop covers all splits within this certain depth. Within the while loop we seek every row of our tree_info which we still have to estimate, i.e. all nodes in the "SPLIT" state. In the first iteration that would be the first row, our root node. The for loop iterates over all these potential splitting nodes. The first if condition ensures that we filter our data according to the tree depth. Of course, there is no filter in the root node, which is why we take the data as is. But imagine calculating possible splits for a parent in depth level two. The filter would look similar to this one: "feature_1 > 5.33 & feature_2 <= 3.22".

How do we apply splitting?

Afterwards we seek the minimum SSE by applying the sse_var function to every feature. Note that in this version we can only handle numeric features. Once we have found the best splitting variable, here named tmp_splitter, we build the according filter rule in the object tmp_filter. We still have to check whether this is a valid split, i.e. we have not invoked this split for this branch yet and we have sufficient observations in our children. Our indicator split_here rules whether we split our node. Well, that is about it. In the last passage of the loop we prepare the output for this node. That means handling the state of the calculated node and adding the children's information to our job list tree_info. After the for loop we have to check whether our tree is fully grown. The variable do_splits checks whether there are any nodes left we have to calculate. We terminate the calculation if there are no "SPLIT" nodes left in our tree_info.

Extracting our estimates

  # calculate fitted values
  leafs <- tree_info[tree_info$TERMINAL == "LEAF", ]
  fitted <- c()
  for (i in seq_len(nrow(leafs))) {
    # extract index
    ind <- as.numeric(rownames(subset(data, eval(parse(
        					   text = leafs[i, "FILTER"])))))
    # estimator is the mean y value of the leaf
    fitted[ind] <- mean(y[ind])
  }

At the end of our calculation we have a filter rule for every leaf in our tree. With the help of these filters we can easily calculate the fitted values by simply applying the filter to our data and computing our fit, i.e. the mean of our target in this leaf. I am sure by now you can think of a way to implement more sophisticated estimators, which I will leave up to you.
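The same filter logic can be reused to predict new observations. Here is a hedged sketch; predict_reg_tree() is a hypothetical helper, not part of the original code, and it assumes a model object returned by reg_tree() and a new_data set containing the same feature names:

predict_reg_tree <- function(mod, new_data) {
  # keep only the terminal nodes and their filter rules
  leafs  <- mod$tree[mod$tree$TERMINAL == "LEAF", ]
  pred   <- rep(NA_real_, nrow(new_data))
  target <- as.character(mod$formula)[2]
  for (i in seq_len(nrow(leafs))) {
    # rows of the new data that fall into this leaf
    idx <- with(new_data, eval(parse(text = leafs[i, "FILTER"])))
    # rows of the training data that fall into this leaf
    train_idx <- with(mod$data, eval(parse(text = leafs[i, "FILTER"])))
    # the leaf estimator is the training mean of the target in this leaf
    pred[idx] <- mean(mod$data[train_idx, target])
  }
  pred
}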

Well, that's a regression tree with a minimum size restriction. I have created a little runthrough with data from my simulation package on my GitHub, which you can check out to try everything on your own. Make sure to check out my other blog post about coding gradient boosted machines from scratch.
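As a hedged usage sketch (the simulated data below is illustrative and not the data set from the GitHub runthrough), fitting the tree could look like this, assuming reg_tree() and sse_var() have been sourced:

set.seed(1)
df <- data.frame(x1 = runif(500, 0, 10), x2 = runif(500, 0, 10))
df$y <- ifelse(df$x1 < 5, 10, 20) + 0.5 * df$x2 + rnorm(500)

# grow the tree with a minimum leaf size of 50 observations
mod <- reg_tree(formula = y ~ x1 + x2, data = df, minsize = 50)

# inspect the splitting rules and the in-sample mean squared error
mod$tree
mean((df$y - mod$fit)^2)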

Motivation

There are dozens of machine learning algorithms out there. It is impossible to learn all their mechanics; however, many algorithms sprout from the most established ones, e.g. ordinary least squares, gradient boosting, support vector machines, tree-based algorithms and neural networks. At STATWORX we discuss algorithms daily to evaluate their usefulness for a specific project or problem. In any case, understanding these core algorithms is key to understanding most machine learning algorithms in the literature.

While I like reading machine learning research papers, the maths is sometimes hard to follow. That is why I am a fan of implementing the algorithms in R by myself. Of course, this means digging through the maths as well. However, you can challenge your understanding of the algorithm directly.

In my two subsequent blog posts I will introduce two machine learning algorithms in under 150 lines of R Code. The algorithms will cover all core mechanics, while being very generic. You can find all code on my GitHub.

gradient boosting machine

This blog post will introduce you to gradient boosted machines. Surely, there are tons of great articles out there which explain gradient boosting theoretically, accompanied by a hands-on example. This is not the objective of this blog post. If you are interested in the theory and the evolution of the algorithm, I would strongly recommend reading the paper of Bühlmann and Hothorn (2007).(1) The objective of this blog post is to establish the theory of the algorithm by writing simple R code. You do not need any prior knowledge of the algorithm to follow.

Gathering all puzzle pieces

Gradient boosting is a very special machine learning algorithm, because it is a vehicle for machine learning algorithms rather than a machine learning algorithm itself. That is because you can incorporate any machine learning algorithm within gradient boosting. I admit that sounds quite confusing, but it will be clear by the end of this post.

Anyway, what do we need to boost those gradients? Well, we need to define our problem (a little maths is necessary 🙂 ).

We have the following problem: \textbf{y} = \textbf{X}\boldsymbol{\theta} + \textbf{u}

This is just a simple regression equation, where we state that we want to map our features \textbf{X} onto a real-valued target \textbf{y} with estimator \boldsymbol{\theta} and an error term (residuals) \textbf{u}. This is nothing out of the ordinary; every machine learning algorithm starts here or somewhere close to this equation. However, unlike in other machine learning algorithms, specifying a loss function is mandatory.

# this is our loss function
# y - the target
# yhat - the fitted values
loss <- function(y, yhat) return(mean(1/2*(y-yhat)^2))

Our objective is to find the minimum of this loss function, which is an adaptation of the mean squared error. The algorithm exploits the fact that at a minimum there is no slope in our loss function. The gradient of our function, here the negative partial derivative of the loss with respect to yhat, describes this slope. So we can actually reformulate our objective to searching for a gradient of zero, or close to zero, to ultimately fulfill our goal of minimizing the loss.

# the derivative of our loss function is the gradient
# y - the target
# yhat - the fitted values
gradient <- function(y, yhat) {return(y - yhat)}

Now comes the interesting part of the algorithm. In our case the gradient coincides with the residuals u = y - yhat. Remember, we want the gradient to be zero or close to zero. Or put differently, we want the state y = yhat. The algorithm exploits this fact by simply iteratively fitting (boosting) the residuals (the gradient) instead of y itself. This means we update our gradient in every iteration to improve our fit in general. This step describes our movement towards the minimum of our loss function. We will clarify this observation with a little example once we see the theory in action.
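As a small illustration of this idea, and assuming the loss() and gradient() functions from above are in scope, the following hedged sketch shows that with nu = 1 the first boosting iteration is exactly an OLS fit and the updated gradient equals the OLS residuals:

set.seed(1)
X <- cbind(1, matrix(rnorm(100 * 2), ncol = 2))
y <- as.vector(X %*% c(2, -1, 3) + rnorm(100))

fit  <- 0                            # yhat.init
u    <- gradient(y = y, yhat = fit)  # first gradient: u = y - 0 = y
ols1 <- lm.fit(x = X, y = u)         # base procedure of iteration one
fit  <- fit + 1 * fitted(ols1)       # update the fit with nu = 1
u    <- gradient(y = y, yhat = fit)  # updated gradient
all.equal(as.vector(u), as.vector(resid(ols1)))  # TRUE: the new gradient is the OLS residuals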

The algorithm

Let us first have a look at the algorithm as a whole.

#' grad_boost
#' Fit a boosted linear model
#'
#' @param formula an object of class formula
#' @param data a data.frame or matrix
#' @param nu a numeric within the range of [0,1], the learning rate
#' @param stop a numeric value determining the total boosting iterations
#' @param loss.fun a convex loss function which is continuously differentiable
#' @param grad.fun a function which computes the gradient of the loss function
#' @param yhat.init a numeric value determining the starting point
#'
#' @return \itemize{
#' \item theta - this is our estimator
#' \item u - this is the last gradient value
#' \item fit - our fitted values, i.e. X %*% theta
#' \item loss - the tracked loss per boosting iteration
#' \item formula - the underlying formula
#' \item data - the underlying data
#' }
#' @export
#'
#' @examples # Complete runthrough see: www.github.com/andrebleier/cheapml
grad_boost <- function(formula, data, nu = 0.01, stop, 
                       grad.fun, loss.fun, yhat.init = 0) {
  
  # coerce to data.frame
  data <- as.data.frame(data)
  
  # handle formula
  formula <- terms.formula(formula)
  
  # get the design matrix
  X <- model.matrix(formula, data)
    
  # extract target
  y <- data[, as.character(formula)[2]]

  # initial fit
  fit <- yhat.init
  
  # initialize the gradient with yhat.init
  u <- grad.fun(y = y, yhat = fit)
  
  # initial working parameter (our theta)
  # this is just an empty body
  theta <- rep(0, ncol(X))
  
  # track loss
  loss <- c()
  
  # boost from 1 to stop
  for (i in 1:stop) {
    
    # estimate working parameter (via OLS)
    # theta_i = (X'X)^-1 X'y
    # This is the (base procedure) where you can literally place 
    # any optimization algorithm
    base_prod <- lm.fit(x = X, y = u)
    theta_i <- coef(base_prod)
  
    # update our theta
    theta <- theta + nu*as.vector(theta_i)
    
    # new fitted values
    fit <- fit + nu * fitted(base_prod)
    
    # update gradient
    u <- grad.fun(y = y, yhat = fit)
    
    # update loss 
    loss <- append(loss, loss.fun(y = y, yhat = fit))
  }  
  
  # naming the estimator
  names(theta) <- colnames(X)
  
  # define return
  return(list(theta = theta, u = u, fit = fit, loss = loss, 
              formula = formula, data = data))
}

All the magic happens within the for loop, which is coded in less than 30 lines of code. Everything else is just input/output handling and so on.

Let us breakdown what is happening.

# coerce to data.frame
  data <- as.data.frame(data)
  
  # handle formula
  formula <- terms.formula(formula)
  
  # get the design matrix
  X <- model.matrix(formula, data)
    
  # extract target
  y <- data[, as.character(formula)[2]]

  # initial fit
  fit <- yhat.init
  
  # initialize the gradient with yhat.init
  u <- grad.fun(y = y, yhat = fit)
  
  # initial working parameter (our theta)
  # this is just an empty body
  theta <- rep(0, ncol(X))
  
  # track loss
  loss <- c()

The first part is just input handling, but we can see that we have to initialize the gradient, by starting with a best guess fit yhat.init, e.g. zero. After initializing, we need some containers to store our loss and our estimator theta.

The next part is all about our optimization, where we iteratively approach the minimum of our prespecified loss function.

# boost from 1 to stop
for (i in 1:stop) {
    
    # estimate working parameter (via OLS)
    # theta_i = (X'X)^-1 X'y
    # This is the (base procedure) where you can literally place 
    # any optimization algorithm
    base_prod <- lm.fit(x = X, y = u)
    theta_i <- coef(base_prod)
  
    # update our theta
    theta <- theta + nu*as.vector(theta_i)
    
    # new fitted values
    fit <- fit + nu * fitted(base_prod)
    
    # update gradient
    u <- grad.fun(y = y, yhat = fit)
    
    # update loss 
    loss <- append(loss, loss.fun(y = y, yhat = fit))
}

Unveiling the magic

Alright, let's get to the gravy. We will go through the first iteration step by step. The first thing we do is estimate the gradient with a simple regression. Yes, a simple regression, which you unfortunately have to accept here for a moment. Remember how we initialized the gradient. In the first iteration we are doing ordinary least squares with our features X and our target y, because the gradient coincides with our target, i.e. u = y - 0. What happens next is the last missing puzzle piece of gradient boosting. We simply update our estimator theta, the fitted values fit and our gradient u. But what does this accomplish? Suppose the multiplier nu is equal to one; then the following is quite straightforward. We are stating that our new fit – we initialized it with zero – is the plain ordinary least squares fit and our estimator theta is the OLS estimate as well. After updating the gradient we observe that the gradient in iteration two is actually the residuals from the OLS fit of iteration one.

Let us recap that in some code real quick. In the first iteration we were actually fitting y, since u = y - 0, and in the second iteration we are fitting u = resid(base_prod), since u = y - fitted(base_prod). Thus, if we repeat this often enough, u should converge to zero, since our fit will improve every time. But why should it improve every time? The algorithm allows us to fit the gradient – which again are the residuals – at every iteration with completely new estimates theta_i. Naturally, this guess will be good enough to let the gradient converge towards zero or, put differently, to minimize the distance between our target y and our fit yhat. And yes, this means that with gradient boosting we can theoretically reach the equilibrium of y = yhat, which we desire. But should we, though? Remember, we could theoretically reach that state in every problem we estimate with gradient boosting. Well, the important point to note here is that we can only do this for the data we have at hand. Chances are very high that when observing new data our gradient boosted machine is garbage, because our estimator theta does not cover the underlying effect of our features on the target, but rather a combination which minimizes our loss function on the data we fed it. That is why we have to control the boosting iterations. We have to terminate the iterative fitting at the right time to get a theta which explains the effect we are interested in rather than just minimizing the loss function. There are so-called early-stopping techniques which terminate the learning once an out-of-sample risk starts to increase, to avoid this phenomenon called overfitting the data.

Anyway, let us come back to the parameter nu. This is the shrinkage (learning rate) parameter. It controls how big the steps towards the minimum of our loss function are. Suppose the same scenario as above with nu = 0.1; that would mean we would not get the full OLS fit, but only 10% of it. As a result, we would need more iterations to converge to the minimum; however, empirically, smaller learning rates have been found to yield better predictive accuracy.(2)

At the beginning I stated that gradient boosting is rather a vehicle for machine learning algorithms than a machine learning algorithm itself. Essentially, this is because instead of lm.fit we could have used any other optimization function here. Most gradient boosted machines out there use tree-based base learners, e.g. xgboost. This makes the gradient boosted machine a very unique machine learning algorithm.
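Here is a hedged usage sketch, assuming grad_boost(), loss() and gradient() from above are in scope; the simulated data below is illustrative, not the runthrough data from GitHub:

set.seed(1)
df <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
df$y <- 3 * df$x1 - 2 * df$x2 + rnorm(500)

mod <- grad_boost(formula = y ~ x1 + x2, data = df, nu = 0.1, stop = 300,
                  grad.fun = gradient, loss.fun = loss)

# with enough iterations the boosted estimator approaches the OLS solution
round(mod$theta, 2)
coef(lm(y ~ x1 + x2, data = df))

# the tracked training loss should decrease with every iteration
plot(mod$loss, type = "l", xlab = "boosting iteration", ylab = "loss")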

I have created a little runthrough with data from my simulation function on my GitHub, which you can check out to try everything on your own step by step. Make sure to check out my other blog post about coding regression trees from scratch.

References

  1. Bühlmann, P., & Hothorn, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22(4), 477-505
  2. Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of statistics, 1189-1232.

In a recent project, I have developed a gradient boosting algorithm to estimate price elasticities. Surely, it is necessary to validate whether the functionalities of the algorithm are working as intended. I started using nonlinear time series data from another blog post about lag selection as a validation basis. Unfortunately, at that time I did not wrap the simulation code into a time series simulation function. Thus, I wondered whether there are any simulation packages or functions for supervised learning problems. My colleagues were confronted with a similar problem and have used the simulation functions developed by Max Kuhn in his supervised learning framework \texttt{caret}(1). I started using LPH07_01() to validate whether features of the underlying data generating process were really selected. While the function does its job as expected, I realized that I needed more degrees of freedom in my simulation. One of the crucial settings I wanted to manipulate is the degree of nonlinearity and the associated functional shape, since the boosting algorithm uses P-spline(2) base learners. At this point, I decided to code my own simulation function and, yep, here it is. The function is called Xy() and you can get it from my GitHub account.

The main functionalities of my simulation are salvaged from the \texttt{caret} counterpart. For instance, one can add irrelevant features and multicollinearity within \boldsymbol{X}. However, Xy() additionally allows collinearity patterns between relevant and irrelevant features. Nonlinearity is completely adjustable by manipulating the nlfun argument. Additionally, an interaction depth can be specified via the interaction parameter. Altering this argument does, however, not imply that \boldsymbol{y} will be completely generated by interactions. Chance will decide for each true feature whether an interaction should be formed. For instance, if the user decides in favor of an interaction depth of – say – two, it is likely that \boldsymbol{y} will be generated by some bivariate interaction terms.

In this example, we can see that the process generating \boldsymbol{y} is composed of main and interaction effects of degree two.

# Simulating regression data with 15 true and 5 irrelevant features
reg_sim = Xy(n = 1000,   
             linvars = 5,    
             nlinvars = 10,   
             noisevars = 5,   
             nlfun = function(x) {x^2},  
             interaction = 2,  
             sig = c(1,4),   
             cor = c(0.1,0.3),  
             weights = c(-5,5),  
             cov.mat = NULL,  
             sig.e = 4,  
             noise.coll = FALSE, 
             plot = TRUE)  

# Get the underlying process of y 
reg_sim[["dgp"]]  

"y =  3.74NLIN_1 - 4.48NLIN_2 * -3.54NLIN_3 - 2.39NLIN_3 * -2.27LIN_4
    - 1.45NLIN_4 + 2.67NLIN_5 * -4.17LIN_3  
    - 3.63NLIN_6 - 2.39NLIN_3 * 4.66NLIN_7 + 0.77NLIN_8  
    - 0.42NLIN_9 * 1.93LIN_2 - 2.27NLIN_9 * 0.29NLIN_10 - 1.49LIN_1  
    + 1.39LIN_2 + 4.43NLIN_10 * 3.6LIN_3 - 3.26NLIN_6 * 3.51LIN_4  
    + 4.47LIN_1 * 4.38LIN_5 + e ~ N(0,4)"

I liked the idea of a user-specified interval from which the variances of the true features are sampled, since this opens the possibility to play around with different variations. The covariance of \boldsymbol{X} as well as the weights of \boldsymbol{X} are also sampled from a user-specified interval. If you have a specific covariance structure in mind, you can define it by using the cov.mat argument.

The function returns a list with the simulated data, a string describing the composition of \boldsymbol{y} and if applicable a plot of the true effects.

Here are the true effects of our example:

The function comes with a \texttt{roxygen2}(3) body which should – hopefully – clarify the tasks of each argument.

There are several things I want to add in upcoming versions. On top of my list are classification tasks, binary or categorical features in general, data with autocorrelative patterns and a user-specified formula describing the composition of \boldsymbol{y}.

I hope you enjoy this simulation tool as much as I do. As mentioned above, I did not actively search CRAN for simulation packages or functions, so if you know a good package or function you can drop me an e-mail. Also, feel free to contact me with input, ideas or some fun puppy pictures.

References

  1. Kuhn, Max. "caret package." Journal of Statistical Software 28.5 (2008): 1-26.
  2. Eilers, Paul H. C., and Brian D. Marx. "Flexible smoothing with B-splines and penalties." Statistical Science (1996): 89-102.
  3. Wickham, Hadley, Peter Danenberg, and Manuel Eugster. "roxygen2: In-source documentation for R." R package version 3.0 (2013).
