In my last blog post, I have elaborated on the Bagging algorithm and showed its prediction performance via simulation. Here, I want to go into the details on how to simulate the bias and variance of a nonparametric regression fitting method using `R`

. These kinds of questions arise here at STATWORX when developing, for example, new machine learning algorithms or testing established ones which shall generalize well to new unseen data.

## Decomposing the mean squared error

We consider the basic regression setup where we observe a real-valued sample and our aim is to predict an outcome based on some predictor variables („covariates“) via a regression fitting method. Usually, we can measure our target only with noise ,

To measure our prediction accuracy, we will use the mean squared error (MSE) which can be decomposed as follows:

is our prediction obtained by the regression method at hand.

From the above expression, we observe that the MSE consists of two parts:

- : measures the (squared) difference between the true underlying process and the mean of our predictions, i.e.
- : measures the variation of our predictions around its mean, i.e.

In general, it will not be possible to minimize both expressions as they are competing with each other. This is what is called the bias-variance tradeoff. More complex models (e.g. higher order polynomials in polynomial regression) will result in low bias while yielding high variance as we fit characteristic features of the data that are not necessary to predict the true outcome (and vice versa, cf. Figure 1).

## Monte Carlo Setup & Simulation Code

To illustrate this, we consider a simple toy model.

where , and

As a fitting procedure, we will use a cubic smoothing spline (`smooth.spline`

). For the purpose of this blog post, we only need to know that a smoothing spline divides our univariate predictor space into intervals and fits each one to a cubic polynomial to approximate our target. The complexity of our smoothing spline can be controlled via the degrees of freedom (function argument `df`

). If you are interested in the (impressive) mathematical details of smoothing splines, check out this script by Ryan Tibshirani.

Figure 1 shows the bias-variance tradeoff from above. For relatively low degrees of freedom we obtain a model (red line) which is too simple and does not approximate the true data generating process well (black dashed line) – Note: for we obtain the least-squares fit.

Here, the bias will be relatively large while the variance will remain low. On the other hand, choosing relatively high degrees of freedoms leads to a model which overfits the data (green line). In this case, we clearly observe that the model is fitting characteristic features of the data which are not relevant for approximating the true data generating process.

To make this tradeoff more rigorous, we explicitly plot the bias and variance. For this, we conduct a Monte Carlo simulation. As a side note, to run the code snippets below, you only need the `stats`

module which is contained in the standard `R`

module scope.

For validation purposes, we use a training and test dataset (`*_test`

and `*_train`

, respectively). On the training set, we construct the algorithm (obtain coefficient estimates, etc.) while on the test set, we make our predictions.

```
# set a random seed to make the results reproducible
set.seed(123)
n_sim <- 200
n_df <- 40
n_sample <- 100
# setup containers to store the results
prediction_matrix <- matrix(NA, nrow = n_sim, ncol = n_sample)
mse_temp <- matrix(NA, nrow = n_sim, ncol = n_df)
results <- matrix(NA, nrow = 3, ncol = n_df)
# Train data -----
x_train <- runif(n_sample, -0.5, 0.5)
f_train <- 0.8*x_train+sin(6*x_train)
epsilon_train <- replicate(n_sim, rnorm(n_sample, 0, sqrt(2)))
y_train <- replicate(n_sim,f_train) + epsilon_train
# Test data -----
x_test <- runif(n_sample, -0.5, 0.5)
f_test <- 0.8*x_test+sin(6*x_test)
```

The bias-variance tradeoff can be modelled in `R`

using two `for`

-loops. The outer one will control the complexity of the smoothing splines (counter: `df_iter`

). The Monte Carlo Simulation with 200 iterations (`n_sim`

) to obtain the prediction matrix for the variance and bias is run in the inner loop.

```
# outer for-loop
for (df_iter in seq(n_df)){
# inner for-loop
for (mc_iter in seq(n_sim)){
cspline <- smooth.spline(x_train, y_train[, mc_iter], df=df_iter+1)
cspline_predict <- predict(cspline, x_test)
prediction_matrix[mc_iter, 1:n_sample] <- cspline_predicty - f_test)^2)
}
var_matrix <- apply(prediction_matrix, 2, FUN = var)
bias_matrix <- apply(prediction_matrix, 2, FUN = mean)
squared_bias <- (bias_matrix - f_test)^2
results[1, df_iter] <- mean(var_matrix)
results[2, df_iter] <- mean(squared_bias)
}
results[3,1:n_df] <- apply(mse_temp, 2, FUN = mean)
```

To model and from the above MSE-equation, we have to approximate those theoretical (population) terms by means of a Monte Carlo simulation (inner `for`

-loop). We run (`n_sim`

) Monte Carlo iterations and save the predictions obtained by the `smooth.spline`

-object in a prediction matrix. To approximate at each test sample point , by , we take the average of each column. denotes the prediction of the algorithm obtained at some sample point in iteration . Similar considerations can be made to obtain an approximation for .

## Bias-variance tradeoff as a function of the degrees of freedom

Figure 2 shows the simulated bias-variance tradeoff (as a function of the degrees of freedom). We clearly observe the complexity considerations of Figure 1. Here, the bias is quickly decreasing to zero while the variance exhibits linear increments with increasing degrees of freedoms. Hence, for higher degrees of freedom, the MSE is driven mostly by the variance.

Comparing this Figure with Figure 1, we note that for , the bias contributes substantially more than the variance to the MSE. If we increase the degrees of freedom to the bias tends to zero, characteristic features of the data are fitted and the MSE consists mostly of the variance.

With small modifications, you can use this code to explore the bias-variance tradeoff of other regression fitting and also Machine Learning methods such as Boosting or Random Forest. I leave it to you to find out which hyperparameters induce the bias-variance tradeoff in these algorithms.

You can find the full code on my Github Account. If you spot any mistakes or if you are interested in discussing applied statistic and econometrics topics, feel free to contact me.

# References

The simulation setup in this blog post follows closely the one from Buhlmann.

- Buhlmann, P. and Yu, B. (2003). Boosting with the L2-loss: Regression and classification. J. Amer. Statist.

Assoc. 98, 324–339.

In this blog we will explore the Bagging algorithm and a computational more efficient variant thereof, Subagging. With minor modifications these algorithms are also known as Random Forest and are widely applied here at STATWORX, in industry and academia.

Almost all statistical prediction and learning problems encounter a bias-variance tradeoff. This is particularly pronounce for so-called unstable predictors. While yielding low biased estimates due to flexible adaption to the data, those kind of predictors react very sensitive to small changes in the underlying dataset and have hence high variance. A common example are Regression Tree predictors.

Bagging bypasses this tradeoff by reducing the variance of the unstable predictor, while leaving its bias mostly unaffected.

## Method

In particular, Bagging uses repeated bootstrap sampling to construct multiple versions of the same prediction model (e.g. Regression Trees) and averages over the resulting predictions.

Let’s see how Bagging works in detail:

- Construct a bootstrap sample (with replacement) of the original i.i.d. data at hand .
- Fit a Regression Tree to the bootstrap sample – we will denote the tree predictor by .
- Repeat Steps one and two many times and calculate .

OK – so let us take a glimpse into the construction phase: We draw in total different bootstrap samples simultaneously from the original data. Then to each of those samples a tree is fitted and the (in-sample) fitted values are averaged in Step 3 yielding the Bagged predictor.

The variance-reduction happens in Step 3. To see this, consider the following toy example.

Let be i.i.d. random variables with and and let . Easy re-formulations show that

We observe that indeed the variance of the mean is weakly smaller than for the individual random variables while the sample mean is unbiased.

It is widely discussed in the literature why Bagging works and it remains an open research question. Bühlmann and Yu (2002) propose a subsampling variant of Bagging, called Subagging, which is more traceable from a theoretical point of view.

In particular, Bühlmann and Yu (2002) replace the bootstrap procedure of Bagging by subsampling without replacement. Essentially, we are only changing Step 1 of our Bagging algorithm by randomly drawing times without replacement from our original data with and get hence a subset of size . With this variant at hand, it is possible to state upper bounds for the variance and mean squared error of the predictor given an appropriate choice of the subsample size .

## Simulation Set-Up

As the theory is a little bit cumbersome and involves knowledge in real analysis, we simulate the main findings of Bühlmann and Yu (2002).

Let’s compare the mean-squared prediction error (MSPE) of the Regression Tree, Bagged and Subagged predictor and illustrate the theory part a little bit more.

In order to do this, we consider the following model

where is the regression function, is the design matrix generated from a uniform distribution and is the error term ().

For the true data-generating process (DGP), we consider the following model which is quite frequently used in the machine learning literature and termed „Friedman #1“-model:

where is the -th column of the design matrix (for ).

As you can see, this model is highly non-linear – Regression Tree models shall therefore be appropriate to approximate our DGP.

To evaluate the prediction performance of Bagging and Subagging predictors we conduct a Monte Carlo simulation in Python.

We first import the relevant packages.

```
<span class="hljs-keyword"><span class="hljs-keyword">import</span></span> numpy <span class="hljs-keyword"><span class="hljs-keyword">as</span></span> np
<span class="hljs-keyword"><span class="hljs-keyword">import</span></span> sklearn.model_selection
<span class="hljs-keyword"><span class="hljs-keyword">import</span></span> sklearn.ensemble
<span class="hljs-keyword"><span class="hljs-keyword">import</span></span> simulation_class
<span class="hljs-keyword"><span class="hljs-keyword">import</span></span> math
<span class="hljs-keyword"><span class="hljs-keyword">from</span></span> sklearn.metrics <span class="hljs-keyword"><span class="hljs-keyword">import</span></span> mean_squared_error
<span class="hljs-keyword"><span class="hljs-keyword">from</span></span> sklearn.tree <span class="hljs-keyword"><span class="hljs-keyword">import</span></span> DecisionTreeRegressor
```

The module `simulation_class`

is a user-specified class that we will not discuss in this blog post but in a subsequent one.

Further, we specify the simulation set-up:

```
<span class="hljs-comment"><span class="hljs-comment"># Number of regressors</span></span>
n_reg = <span class="hljs-number"><span class="hljs-number">10</span></span>
<span class="hljs-comment"><span class="hljs-comment"># Observations</span></span>
n_obs = <span class="hljs-number"><span class="hljs-number">500</span></span>
<span class="hljs-comment"><span class="hljs-comment"># Simulation runs</span></span>
n_sim = <span class="hljs-number"><span class="hljs-number">50</span></span>
<span class="hljs-comment"><span class="hljs-comment"># Number of trees, i.e. number of bootstrap samples (Step 1)</span></span>
n_tree = <span class="hljs-number"><span class="hljs-number">50</span></span>
<span class="hljs-comment"><span class="hljs-comment"># Error Variance</span></span>
sigma = <span class="hljs-number"><span class="hljs-number">1</span></span>
<span class="hljs-comment"><span class="hljs-comment"># Grid for subsample size</span></span>
start_grid = <span class="hljs-number"><span class="hljs-number">0.1</span></span>
end_grid = <span class="hljs-number"><span class="hljs-number">1</span></span>
n_grid = <span class="hljs-number"><span class="hljs-number">100</span></span>
grid_range = np.linspace(start_grid, end_grid, num = n_grid)
```

Below we will explain in more detail for what we need the grid specification.

To store our simulation results we set up containers.

```
<span class="hljs-comment"><span class="hljs-comment"># Container Set-up</span></span>
mse_temp_bagging = np.empty(shape = (n_obs, n_sim))
mse_temp_subagging = np.empty(shape = (n_obs, n_sim))
y_predict_bagging = np.empty(shape = (n_obs, n_sim))
y_predict_subagging = np.empty(shape = (n_obs, n_sim))
mse_decomp = np.empty(shape = (len(grid_range),<span class="hljs-number"><span class="hljs-number">2</span></span>))
```

With this initialization at hand, we generate the test and train data by the `simulation_class module`

.

```
<span class="hljs-comment"><span class="hljs-comment">#Creation of Simulation-Data</span></span>
train_setup = simulation_class.simulation(n_reg = n_reg,
n_obs = n_obs,
n_sim = n_sim,
sigma = sigma,
random_seed_design = <span class="hljs-number"><span class="hljs-number">0</span></span>,
random_seed_noise = <span class="hljs-number"><span class="hljs-number">1</span></span>)
test_setup = simulation_class.simulation(n_reg = n_reg,
n_obs = n_obs,
n_sim = n_sim,
sigma = sigma,
random_seed_design = <span class="hljs-number"><span class="hljs-number">2</span></span>,
random_seed_noise = <span class="hljs-number"><span class="hljs-number">3</span></span>)
f_train = train_setup.friedman_model()
X_train, y_train = train_setup.error_term(f_train)
f_test = test_setup.friedman_model()
X_test, y_test = test_setup.error_term(f_test)
```

As we have generated the data for our „Friedman #1“-model we are now able to simulate the mean squared error of the Bagged predictor and Subagged predictor. In Python, both algorithms are implemented via the `BaggingRegressor`

method of the `sklearn.ensemble`

package. Observe that for the Subagged predictor we need to specify the parameter `max_samples`

in the `BaggingRegressor`

. This ensures that we can draw a subsample size with subsample fraction from the original data. Indeed, for the subsample fraction we have already specified the grid above by the variable `grid_range`

.

```
<span class="hljs-comment"><span class="hljs-comment">#Subagging-Simulation</span></span>
<span class="hljs-keyword"><span class="hljs-keyword">for</span></span> index, a <span class="hljs-keyword"><span class="hljs-keyword">in</span></span> enumerate(grid_range):
<span class="hljs-keyword"><span class="hljs-keyword">for</span></span> i <span class="hljs-keyword"><span class="hljs-keyword">in</span></span> range(<span class="hljs-number"><span class="hljs-number">0</span></span>, n_sim):
<span class="hljs-comment"><span class="hljs-comment"># bagged estimator</span></span>
bagging = sklearn.ensemble.BaggingRegressor(
bootstrap = <span class="hljs-keyword"><span class="hljs-keyword">True</span></span>,
n_estimators = <span class="hljs-number"><span class="hljs-number">50</span></span>)
y_predict_bagging[:,i] = bagging.fit(
X_train,
y_train[:,i]).predict(X_test)
mse_temp_bagging[:,i] = mean_squared_error(
y_test[:,i],
y_predict_bagging[:,i])
<span class="hljs-comment"><span class="hljs-comment"># subagged estimator</span></span>
subagging = sklearn.ensemble.BaggingRegressor(
max_samples = math.ceil(a*n_obs),
bootstrap = <span class="hljs-keyword"><span class="hljs-keyword">False</span></span>,
n_estimators = <span class="hljs-number"><span class="hljs-number">50</span></span>)
y_predict_subagging[:,i] = subagging.fit(
X_train,
y_train[:,i]).predict(X_test)
mse_temp_subagging[:,i] = mean_squared_error(
y_test[:,i],
y_predict_subagging[:,i])
mse_decomp[index, <span class="hljs-number"><span class="hljs-number">1</span></span>] = np.mean(mse_temp_bagging)
mse_decomp[index, <span class="hljs-number"><span class="hljs-number">2</span></span>] = np.mean(mse_temp_subagging)
```

On my GitHub-Account you can find additional code which also calculates the simulated bias and variance for the fully grown tree and the Bagged tree.

## Results

The results of our above simulation can be found in Figure 1.

Let us first compare the performance in terms of MSPE of the Regression Tree and the Bagged predictor. Table 1 shows us that Bagging drastically reduces the MSPE by decreasing the variance while almost not affecting the bias. (Recall – the mean squared prediction error is just the sum of the squared bias of the estimate, variance of the estimate and the variance of the error term (not reported).)

*Table 1: Performance of fully grown tree and Bagged Predictor*

Predictor | Tree (fully grown) | Bagged Tree |
---|---|---|

3.47 | 2.94 | |

6.13 | 0.35 | |

10.61 | 4.26 |

Figure 1 displays the MSPE as a function of the subsample fraction for the Bagged and Subagged predictor (our above code). Together with Figure 1 and Table 1, we make several observations:

- We see that both the Bagged and Subagged predictor outperform a single tree (in terms of MSPE).
- For a subsampling fraction of approximately 0.5, Subagging achieves nearly the same prediction performance as Bagging while coming at a lower computational cost.

## References

- Breiman, L.: Bagging predictors. Machine Learning,
**24**, 123–140 (1996). - Bühlmann, P., Yu, B.: Analyzing bagging. Annals of Statistics
**30**, 927–961 (2002).