Management Summary

In modern companies, information in text form can be found in many places in day-to-day business. Depending on the business context, this can involve invoices, emails, customer input (such as reviews or inquiries), product descriptions, explanations, FAQs, and applications. Until recently, these information sources were reserved mainly for human beings, as the understanding of a text is a technologically challenging problem for machines.
Due to recent achievements in deep learning, several different NLP (“Natural Language Processing”) tasks can now be solved with outstanding quality.
In this article, five practical examples show how NLP applications solve a variety of business problems and how they increased efficiency and fostered innovation in their respective fields of application.

Introduction

Natural Language Processing (NLP) is undoubtedly an area that has received special attention in the Big Data environment in the recent past. The interest in the topic, as measured by Google, has more than doubled in the last three years. This shows that innovative NLP technologies have long since ceased to be an issue only for big players such as Apple, Google, or Amazon. Instead, a general democratization of the technology can be observed. One of the reasons for this is that, according to an IBM estimate, about 80% of the world's information is not available in structured databases but as unstructured, natural-language text. NLP will play a key role in the future when it comes to making this information usable. Thus, the successful use of NLP technologies will become one of the success factors for digitization in companies.

To give you an idea of the possibilities NLP opens up in the business context today, I will present five practical use cases and explain the solutions behind them in the following.

What is NLP? – A Short Overview

Although NLP had already occupied linguists and computer scientists as a research topic in the 1950s, it led a barely visible existence on the application side throughout the 20th century.

The main reason for this was the limited availability of suitable training data. Although the availability of unstructured data, in the form of texts, has generally increased exponentially, especially with the rise of the Internet, there was still a lack of data suitable for model training. This can be explained by the fact that early NLP models mostly had to be trained under supervision (so-called supervised learning). However, supervised learning requires that training data be provided with a dedicated target variable. This means that, for example, in the case of text classification, the text corpus must be manually annotated by humans before the model training.

This changed at the end of the 2010s when a new generation of artificial neural network models led to a paradigm shift. These so-called Language Models are (pre-)trained on huge text corpora by companies such as Facebook and Google: individual words in the texts are randomly masked and predicted in the course of training. This is so-called self-supervised learning, which no longer requires a separate target variable. In the course of this training, the models learn a contextual understanding of texts.

The advantage of this approach is that the same model can be fine-tuned for various downstream tasks (e.g., text classification, sentiment analysis, named entity recognition) using the contextual understanding it has learned. This process is called transfer learning. In practice, these pre-trained models can be downloaded, so that only the fine-tuning for the specific application needs to be done with additional data. Consequently, high-performance NLP applications can now be developed with little development effort.

To learn more about Language Models (especially the so-called Transformer models like BERT, RoBERTa, etc.) as well as trends and obstacles in the field of NLP, please read the article on NLP trends by our colleague Dominique Lade. [https://www.statworx.com/de/blog/neue-trends-im-natural-language-processing-wie-nlp-massentauglich-wird/]

The 5 Use Cases

Text Classification in the Recruitment Process

A medical research institute wants to make its recruitment process of study participants more efficient.

For testing a new drug, different, interdependent requirements are placed on the persons in question (e.g., age, general health status, presence/absence of previous illnesses, medications, genetic dispositions, etc.). Checking all these requirements is very time-consuming: usually, it takes about one hour per potential study participant to view and assess the relevant information. The main reason for this is that the clinical notes contain patient information that goes beyond structured data such as laboratory values and medication: unstructured information in text form can also be found in the medical reports, physician’s letters, and discharge reports. Evaluating the latter in particular requires a lot of reading time and is therefore very time-consuming. To speed up the process, the research institute is developing a machine learning model that pre-selects promising candidates. The experts then only have to validate the proposed group of people.

The NLP Solution

From a methodological point of view, this problem is a so-called text classification. Based on a text, a prediction is made for a previously defined target variable. To train the model, it is necessary – as usual in supervised learning – to annotate the data, in this case the medical documents, with the target variable. Since a classification problem has to be solved here (suitable or unsuitable study participant), the experts manually assess the suitability for the study for some persons in the pool. If a person is suitable, they are marked with a one (= positive case), otherwise with a zero (= negative case). Based on these training examples, the model can then learn the relationship between the persons’ medical documents and their suitability.

To cope with the complexity of the problem, a correspondingly complex model called ClinicalBERT is used. This is a language model based on BERT (Bidirectional Encoder Representations from Transformers) that was additionally trained on a data set of clinical texts. ClinicalBERT can thus generate so-called representations of all medical documentation for each person. In the last step, the neural network of ClinicalBERT is completed by a task-specific component, in this case a binary classification: for each person, a probability of suitability should be output. Through a corresponding linear layer, the high-dimensional text representation is finally transformed into a single number, the suitability probability. Using gradient descent, the model then learns the suitability probabilities from the training examples.
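The institute's actual implementation is not shown in this article. As a purely illustrative sketch of the fine-tuning pattern described above, the following Python snippet uses the Hugging Face transformers library; the checkpoint name, example notes, and labels are assumptions for illustration, not the institute's data or code.

# Illustrative sketch: fine-tuning a BERT-style clinical language model for
# binary suitability classification with Hugging Face transformers.
# Checkpoint name, texts, and labels are assumed for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "emilyalsentzer/Bio_ClinicalBERT"  # assumed ClinicalBERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical annotated examples: clinical notes with 1 = suitable, 0 = not suitable
texts = ["Patient, 45, no prior cardiac conditions, no regular medication ...",
         "Patient, 72, on anticoagulants, history of stroke ..."]
labels = torch.tensor([1, 0])

# Tokenize and run a single optimisation step (a real project would use a Trainer,
# a proper train/validation split, and several epochs)
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()

# Suitability probability for a new document via the task-specific linear layer
with torch.no_grad():
    new_batch = tokenizer(["New clinical note ..."], return_tensors="pt")
    prob_suitable = torch.softmax(model(**new_batch).logits, dim=-1)[0, 1].item()
print(f"Suitability probability: {prob_suitable:.2f}")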

Further Application Scenarios of Text Classification:

Text classification often takes place in the form of sentiment analysis. This involves classifying texts into predefined sentiment categories (e.g., negative/positive). This information is particularly important in the financial world or for social media monitoring. Text classification can also be used in various contexts where it is vital to sort documents according to their type (e.g., invoices, letters, reminders).

Named Entity Recognition for Usability Improvement of a News Page

A publishing house offers its readers a large number of articles on various topics on a news page. As part of its optimization measures, the publisher would like to implement a better recommender system so that for each article, further suitable (complementary or similar) articles are suggested. Also, the search function on the landing page is to be improved so that readers can quickly find the articles they are looking for.

To create a good data basis for these purposes, the publisher decided to use Named Entity Recognition (NER) to assign automated tags to the texts, improving both the recommender system and the search function. After successful implementation, significantly more suggested articles are clicked on, and the search function has become much more convenient. As a result, the readers spend substantially more time on the page.

The NLP Solution

To solve the problem, one must first understand how NER works:

NER is about assigning words or entire phrases to content categories. For example, “Peter” can be identified as a person, “Frankfurt am Main” is a place, and “24.12.2020” is a time specification. There are also much more complicated cases. For this purpose, compare the following pairs of sentences:

  1. In the past, Adam didn’t know how to parallel park. (park = from the verb “to park”)
  2. Yesterday I took my dog for a walk in the park. (park = open green area)

It is perfectly evident to humans that the word “park” has a different meaning in each of the two sentences. However, this seemingly simple distinction is anything but trivial for the computer. An entity recognition model could characterize the two sentences as follows:

  1. “[In the past] (time reference), [Adam] (person) didn’t know how to parallel [park] (verb).”
  2. “[Yesterday] (time reference), [I] (person) took my dog for a walk in the [park] (location).”

In the past, rule-based algorithms would have been used to solve the above NER problem, but here too, the machine learning approach is gaining ground:

The present multiclass classification problem of entity determination is again addressed using the BERT model. For this, the model is fine-tuned on an annotated data set in which the entities are manually identified. The most comprehensive publicly accessible database in the English language is the Groningen Meaning Bank (GMB). After successful training, the model can correctly determine previously unknown words from their sentence context. Thus, the model recognizes that prepositions like “in, at, after…” are often followed by a location, but more complex contexts are also used to determine the entity.
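As a hedged illustration of how such a fine-tuned model is applied in practice, the following sketch uses the Hugging Face transformers pipeline; the checkpoint name is an assumption, and any token-classification model trained on an annotated NER corpus could be substituted.

# Illustrative sketch: applying a BERT-based NER model via the transformers pipeline.
from transformers import pipeline

ner = pipeline("ner",
               model="dslim/bert-base-NER",      # assumed publicly available checkpoint
               aggregation_strategy="simple")    # merge word pieces into whole entities

text = "Yesterday Adam took his dog for a walk in a park in Frankfurt am Main."
for entity in ner(text):
    print(entity["word"], "->", entity["entity_group"], round(entity["score"], 2))
# e.g. Adam -> PER, Frankfurt am Main -> LOC (exact output depends on the model)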

Further Application Scenarios of NER:

NER is a classic information retrieval task and is central to many other NLP applications, such as chatbots and question-answer systems. NER is also often used for text cataloging, where the type of a text is determined based on the entities recognized in it.

A Chatbot for a Long-Distance Bus Company

A long-distance bus company would like to increase its accessibility and expand its communication channels with the customer. In addition to its homepage and app, the company wants to offer customers a third channel, namely a WhatsApp chatbot. The goal is to perform specific actions in the conversation with the chatbot, such as searching for, booking, and canceling trips. In addition, the chatbot is intended to provide a reliable way of informing passengers about delays.

With the introduction of the chatbot, not only can existing passengers be reached more quickly, but contact can also be established with new customers who have not yet installed the app.

The NLP Solution

Depending on the requirements that are placed on the chatbot, you can choose between different chatbot architectures.

Over the years, four main chatbot paradigms have been tried: In the first generation, the inquiry was scanned for known patterns, and prefabricated answers adapted to the pattern were returned (“pattern matching”). More sophisticated is so-called “grounding”, in which information extracted from knowledge bases (e.g., Wikipedia) is organized in a network using Named Entity Recognition (see above). Such a network has the advantage that not only registered knowledge can be retrieved, but unregistered knowledge can also be inferred from the network structure. In “searching”, question-answer pairs from the conversation history (or from previously registered logs) are used directly to find a suitable answer. Dynamically generating suitable answers with machine learning models (“generative models”) is the fourth and most sophisticated paradigm.

For a chatbot with clearly definable competencies like this one, the best way forward for the company is to use an existing framework such as Google Dialogflow. This is a platform for configuring chatbots that combines elements of all the previously mentioned chatbot paradigms. For this purpose, parameters such as intents, entities, and actions are defined.

An intent (“user intention”) is, for example, the timetable information. By giving different example phrases (“How do I get from … to … from … to …”, “When is the next bus from … to …”) to a language model, the chatbot can assign even unseen input to the correct intent (see text classification).

Furthermore, different travel locations and times are defined as entities. If the chatbot now captures an intent with matching entities (see NER), an action, in this case a database query, can be triggered. Finally, an intent response with the relevant information is returned, adapted to all the information the user has provided in the chat history (“stateful”).
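The company would configure this behaviour inside Dialogflow rather than write it by hand. Purely to illustrate the mechanics described above, here is a minimal hand-rolled sketch (not Dialogflow code) of how a recognised intent and its extracted entities can trigger an action such as a timetable lookup; all names and values are hypothetical.

# Minimal sketch (not Dialogflow): dispatching a recognised intent with its entities
# to an action such as a database query. All names and values are hypothetical.

def lookup_timetable(origin: str, destination: str, date: str) -> str:
    # placeholder for the real database query
    return f"The next bus from {origin} to {destination} on {date} leaves at 14:30."

def book_trip(origin: str, destination: str, date: str) -> str:
    return f"Your trip from {origin} to {destination} on {date} has been booked."

# mapping from intent name to the action that should be executed
ACTIONS = {
    "timetable_info": lookup_timetable,
    "book_trip": book_trip,
}

def handle_message(intent: str, entities: dict) -> str:
    """Trigger the action for the detected intent once all required entities are present."""
    required = {"origin", "destination", "date"}
    missing = required - entities.keys()
    if missing:
        return f"Could you tell me the {', '.join(sorted(missing))}?"
    return ACTIONS[intent](**entities)

print(handle_message("timetable_info",
                     {"origin": "Frankfurt", "destination": "Berlin", "date": "2020-12-24"}))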

Further Application Scenarios of Chatbots:

There are many possible applications in customer service; depending on the complexity of the scenario, they range from the automatic preparation (e.g., sorting) of customer requests to the complete handling of a customer interaction.

A Question-Answering System as a Voice Assistant for Technical Questions About the Automobile

An automobile manufacturer discovers that many of its customers do not get along well with the manuals that come with the cars. Often, finding the relevant information takes too long, or it is not found at all. Therefore, it was decided to offer a Voice Assistant to provide precise answers to technical questions in addition to the static manual. In the future, drivers will be able to speak comfortably with their center console when they want to service their vehicle or request technical information.

The NLP Solution

Question-answer systems have been around for decades, as they sit at the forefront of artificial intelligence research. A question-answer system that would always find a correct answer, taking into account all available data, could also be called “General AI”. A significant difficulty on the way to General AI is that the area the system needs to know about is unlimited. In contrast, question-answer systems provide good results when the area is delimited, as is the case with the automotive assistant. In general, the more specific the area, the better the results that can be expected.

For the implementation of the question-answer system, two data types from the manual are used: structured data, such as technical specifications of the components and key figures of the model, and unstructured data, such as instructions for action. All data is transformed into question-answer form in a preparatory step using other NLP techniques (classification, NER). This data is transferred to a version of BERT that has already been pre-trained on a large question-answer data set (“SQuAD”). The model is thus able to answer questions that have already been fed into the system and provide educated guesses for unseen questions.
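As an illustration of what such an extractive question-answer component can look like in code, the following sketch uses the Hugging Face transformers pipeline with a BERT checkpoint fine-tuned on SQuAD; the checkpoint name and the manual excerpt are assumptions, not the manufacturer's system.

# Illustrative sketch: extractive question answering with a SQuAD-fine-tuned BERT model.
from transformers import pipeline

qa = pipeline("question-answering",
              model="bert-large-uncased-whole-word-masking-finetuned-squad")  # assumed checkpoint

manual_excerpt = (
    "The tire pressure for the standard 17-inch wheels should be 2.5 bar at the "
    "front axle and 2.7 bar at the rear axle when the vehicle is fully loaded."
)

answer = qa(question="What is the correct tire pressure at the rear axle?",
            context=manual_excerpt)
print(answer["answer"], f"(score: {answer['score']:.2f})")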

Further Application Scenarios of Question-Answer Systems:

With the help of question-answer systems, company-internal search engines can be extended by functionalities. In e-commerce, answers to factual questions can be given automatically based on article descriptions and reviews.

Automatic Text Summaries (Text Generation) of Damage Descriptions for a Property Insurer

An insurance company wants to increase the efficiency of its claims settlement department. It has been noticed that some customer claims lead to internal conflicts of responsibility. The reason for this is simple: customers usually describe their claims over several pages, and considerable experience is needed to judge whether or not a case needs to be processed by the department. Thus, it often happens that a damage description must be read thoroughly only to realize that the damage itself does not need to be processed at all. A system that generates automated summaries is now intended to remedy this situation. As a result of the implementation, the claims handlers can make responsibility decisions much faster.

The NLP solution

One can differentiate between two approaches to the text summarization problem: In extraction, the most important sentences are identified in the input text and, in the simplest case, used verbatim as the summary. In abstraction, a model transforms the text into a newly generated summary text. The second approach is much more complex, since paraphrasing, generalization, or the inclusion of further knowledge is possible here. It therefore has a higher potential to generate meaningful summaries but is also more error-prone. Modern text summarization algorithms use the second approach or a combination of both methods.

A so-called sequence-to-sequence model is used to solve the insurance use case; it maps one word sequence (the damage description) to another word sequence (the summary). Traditionally, this is a recurrent neural network (RNN) trained on pairs of texts and summaries. The training process models the probability of the next word given the previous words (and, additionally, an “inner state” of the model). The model thus effectively writes the summary “from left to right” by successively predicting the next word. An alternative approach is to have the input encoded numerically by the language model BERT and to have a GPT decoder autoregressively generate the summary based on this numerical representation. In both cases, model parameters can be used to adjust how long the summary should be.
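To make the abstractive variant tangible, the following sketch runs a generic pretrained encoder-decoder summariser via the Hugging Face pipeline; in the insurance use case, a model would instead be fine-tuned on damage-description/summary pairs, and both the checkpoint and the example text here are assumptions.

# Illustrative sketch: abstractive summarisation with a pretrained encoder-decoder model.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")  # assumed checkpoint

damage_description = (
    "On the night of 14 March a heavy storm damaged the roof of our warehouse. "
    "Several tiles came loose and rainwater entered the storage area, soaking "
    "around 40 boxes of packaging material. The goods themselves were stored on "
    "pallets and were not affected."
)

# max_length / min_length illustrate how the length of the summary can be adjusted
summary = summarizer(damage_description, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])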

Further Application Scenarios of Text Generation:

Such a scenario is conceivable in many places: Automated report writing, text generation based on retail sales data analysis, electronic medical record summaries, or textual weather forecasts from weather data are possible applications. Text generation is also used in other NLP use cases such as chatbots and Q&A systems.

Outlook

These five application examples of text classification, chatbots, question-answer systems, NER, and text summaries show that there are many processes in all kinds of companies that can be optimized with NLP solutions.

NLP is not only an exciting field of research but also a technology whose applicability in the business environment is continually growing.

In the future, NLP will not only become a foundation of a data-driven corporate culture; through direct application, it already holds considerable innovation potential that is worth investing in.

At STATWORX, we already have years of experience in the development of customized NLP solutions. Here are two of our case studies on NLP: Social Media Recruiting with NLP & Supplier Recommendation Tool. We are happy to provide you with individual advice on this and many other topics.

Did you ever want to make your machine learning model available to other people, but didn’t know how? Or maybe you just heard about the term API, and want to know what’s behind it? Then this post is for you! Here at STATWORX, we use and write APIs daily. For this article, I wrote down how you can build your own API for a machine learning model that you create and the meaning of some of the most important concepts like REST. After reading this short article, you will know how to make requests to your API within a Python program. So have fun reading and learning!

What is an API?

API is short for Application Programming Interface. It allows users to interact with the underlying functionality of some written code by accessing the interface. There is a multitude of APIs, and chances are good that you have already heard about the type of API we are going to talk about in this blog post: the web API. This specific type of API allows users to interact with functionality over the internet. In this example, we are building an API that will provide predictions through our trained machine learning model. In a real-world setting, this kind of API could be embedded in some type of application, where a user enters new data and receives a prediction in return. APIs are very flexible and easy to maintain, making them a handy tool in the daily work of a Data Scientist or Data Engineer. An example of a publicly available machine learning API is Time Door. It provides Time Series tools that you can integrate into your applications. APIs can also be used to make data available, not only machine learning models.
[Figure: API illustration]

And what is REST?

Representational State Transfer (or REST) is an approach that entails a specific style of communication through web services. When using some of the REST best practices to implement an API, we call that API a “REST API”. There are other approaches to web communication, too (such as the Simple Object Access Protocol: SOAP), but REST generally runs on less bandwidth, making it preferable to serve your machine learning models. In a REST API, the four most important types of requests are:
  • GET
  • PUT
  • POST
  • DELETE
For our little machine learning application, we will mostly focus on the POST method, since it is very versatile, and many clients cannot send a request body with a GET request. It’s important to mention that APIs are stateless. This means that they don’t save the inputs you give during an API call, so they don’t preserve the state. That’s significant because it allows multiple users and applications to use the API at the same time, without one user request interfering with another.

The Model

For this How-To-article, I decided to serve a machine learning model trained on the famous iris dataset. If you don’t know the dataset, you can check it out here. When making predictions, we will have four input parameters: sepal length, sepal width, petal length, and finally, petal width. Those will help to decide which type of iris flower the input is. For this example I used the scikit-learn implementation of a simple KNN (K-nearest neighbor) algorithm to predict the type of iris:
# model.py
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import joblib
import numpy as np


def train(X,y):

    # train test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

    knn = KNeighborsClassifier(n_neighbors=1)

    # fit the model
    knn.fit(X_train, y_train)
    preds = knn.predict(X_test)
    acc = accuracy_score(y_test, preds)
    print(f'Successfully trained model with an accuracy of {acc:.2f}')

    return knn

if __name__ == '__main__':

    iris_data = datasets.load_iris()
    X = iris_data['data']
    y = iris_data['target']

    labels = {0 : 'iris-setosa',
              1 : 'iris-versicolor',
              2 : 'iris-virginica'}

    # rename integer labels to actual flower names
    y = np.vectorize(labels.__getitem__)(y)

    mdl = train(X,y)

    # serialize model
    joblib.dump(mdl, 'iris.mdl')
As you can see, I trained the model with 70% of the data and then validated it with the remaining 30% out-of-sample test data. After the model training, I serialize the model with the joblib library. Joblib is basically an alternative to pickle that efficiently persists scikit-learn estimators containing large numpy arrays (such as the KNN model, which stores all the training data). After the file is saved as a joblib file (the file ending is not important, by the way, so don’t be confused that some people call it .model or .joblib), it can be loaded again later in our application.

The API with Python and Flask

To build an API from our trained model, we will be using the popular web development packages Flask and Flask-RESTful. Further, we import joblib to load our model and numpy to handle the input and output data. In a new script, namely app.py, we can now set up an instance of a Flask app and an API and load the trained model (this requires saving the model in the same directory as the script):
from flask import Flask
from flask_restful import Api, Resource, reqparse
import joblib
import numpy as np

APP = Flask(__name__)
API = Api(APP)

IRIS_MODEL = joblib.load('iris.mdl')
The second step is to create a class that is responsible for our prediction. This class will be a child class of the Flask-RESTful class Resource. This lets our class inherit the respective class methods and allows Flask to do the work behind your API without needing to implement everything ourselves. In this class, we can also define the methods (REST requests) that we talked about before. So now we implement a Predict class with the .post() method we talked about earlier. The POST method allows the user to send a body along with the request. Usually, we want the body to be in JSON format. Since this body is not delivered directly in the URL, but as text, we have to parse this text and fetch the arguments. The flask_restful package offers the RequestParser class for that. We simply add all the arguments we expect to find in the JSON input with the .add_argument() method and parse them into a dictionary. We then convert it into an array and return the prediction of our model as JSON.
class Predict(Resource):

    @staticmethod
    def post():
        parser = reqparse.RequestParser()
        parser.add_argument('petal_length')
        parser.add_argument('petal_width')
        parser.add_argument('sepal_length')
        parser.add_argument('sepal_width')

        args = parser.parse_args()  # creates dict

        X_new = np.fromiter(args.values(), dtype=float)  # convert input to array

        out = {'Prediction': IRIS_MODEL.predict([X_new])[0]}

        return out, 200
You might be wondering what the 200 is that we are returning at the end: for APIs, HTTP status codes are returned with every response. You are probably familiar with the famous 404 - page not found code. 200 just means that the request has been received and processed successfully. You basically let the user know that everything went according to plan. In the end, you just have to add the Predict class as a resource to the API and write the main function:
API.add_resource(Predict, '/predict')

if __name__ == '__main__':
    APP.run(debug=True, port='1080')
The '/predict' you see in the .add_resource() call is the so-called API endpoint. Through this endpoint, users of your API will be able to reach it and send (in this case) POST requests. If you don’t define a port, port 5000 will be the default. You can see the whole code for the app again here:
# app.py
from flask import Flask
from flask_restful import Api, Resource, reqparse
import joblib
import numpy as np

APP = Flask(__name__)
API = Api(APP)

IRIS_MODEL = joblib.load('iris.mdl')


class Predict(Resource):

    @staticmethod
    def post():
        parser = reqparse.RequestParser()
        parser.add_argument('petal_length')
        parser.add_argument('petal_width')
        parser.add_argument('sepal_length')
        parser.add_argument('sepal_width')

        args = parser.parse_args()  # creates dict

        X_new = np.fromiter(args.values(), dtype=float)  # convert input to array

        out = {'Prediction': IRIS_MODEL.predict([X_new])[0]}

        return out, 200


API.add_resource(Predict, '/predict')

if __name__ == '__main__':
    APP.run(debug=True, port='1080')

Run the API

Now it’s time to run and test our API! To run the app, simply open a terminal in the same directory as your app.py script and run this command:
python app.py
You should now get a notification that the API is running on your localhost on the port you defined. There are several ways of accessing the API once it is running. For debugging and testing purposes, I usually use tools like Postman. We can also access the API from within a Python application, just like another user might want to do to use your model in their code. We use the requests module, first defining the URL to access and the body to send along with our HTTP request:
import requests

url = 'http://127.0.0.1:1080/predict'  # localhost and the defined port + endpoint
body = {
    "petal_length": 2,
    "sepal_length": 2,
    "petal_width": 0.5,
    "sepal_width": 3
}
response = requests.post(url, data=body)
response.json()
The output should look something like this:
Out[1]: {'Prediction': 'iris-versicolor'}
That’s how easy it is to include an API call in your Python code! Please note that this API is just running on your localhost. You would have to deploy the API to a live server (e.g., on AWS) for others to access it.

Conclusion

In this blog article, you got a brief overview of how to build a REST API to serve your machine learning model with a web interface. Further, you now understand how to integrate simple API requests into your Python code. For the next step, maybe try securing your APIs? If you are interested in learning how to build an API with R, you should check out this post. I hope that this gave you a solid introduction to the concept and that you will be building your own APIs immediately. Happy coding!  

Because You Are Interested In Data Science, You Are Interested In This Blog Post

If you love streaming movies and TV series online as much as we do here at STATWORX, you’ve probably stumbled upon recommendations like “Customers who viewed this item also viewed…” or “Because you have seen …, you like …”. Amazon, Netflix, HBO, Disney+, etc. all recommend their products and movies based on your previous user behavior – but how do these companies know what their customers like? The answer is collaborative filtering. In this blog post, I will first explain how collaborative filtering works. Secondly, I’m going to show you how to develop your own small movie recommender with the R package recommenderlab and provide it in a Shiny application.

Different Approaches

There are several approaches to generating recommendations. In user-based collaborative filtering (UBCF), the users are the focus of the recommendation system. For a new proposal, the similarities between the new user and the existing users are calculated first. Afterward, either the n most similar users or all users with a similarity above a specified threshold are consulted. The average ratings of the products are computed over these users and, if necessary, weighted according to their similarity. The x highest-rated products are then displayed to the new user as suggestions. In item-based collaborative filtering (IBCF), however, the focus is on the products. For every pair of products, the similarity between them is calculated in terms of their ratings. For each product, the k most similar products are identified, and for each user, the products that best match their previous purchases are suggested. These and other collaborative filtering methods are implemented in the recommenderlab package:
  • ALS_realRatingMatrix: Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm.
  • ALS_implicit_realRatingMatrix: Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm.
  • IBCF_realRatingMatrix: Recommender based on item-based collaborative filtering.
  • LIBMF_realRatingMatrix: Matrix factorization with LIBMF via package recosystem.
  • POPULAR_realRatingMatrix: Recommender based on item popularity.
  • RANDOM_realRatingMatrix: Produce random recommendations (real ratings).
  • RERECOMMEND_realRatingMatrix: Re-recommends highly-rated items (real ratings).
  • SVD_realRatingMatrix: Recommender based on SVD approximation with column-mean imputation.
  • SVDF_realRatingMatrix: Recommender based on Funk SVD with gradient descend.
  • UBCF_realRatingMatrix: Recommender based on user-based collaborative filtering.
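To make the user-based approach described above a bit more concrete, the UBCF prediction for user u and item i is commonly written as a similarity-weighted average over the ratings of the n most similar users N(u); this is a generic textbook formulation, not necessarily the exact variant implemented in recommenderlab:

    \[\hat{r}_{u,i} = \frac{\sum_{v \in N(u)} \mathrm{sim}(u, v) \cdot r_{v,i}}{\sum_{v \in N(u)} \mathrm{sim}(u, v)}\]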

Developing your own Movie Recommender

Dataset

To create our recommender, we use the data from movielens. These are film ratings from 0.5 (= bad) to 5 (= good) for over 9000 films from more than 600 users. The movieId is a unique mapping variable to merge the different datasets.
head(movie_data)
  movieId                              title                                      genres
1       1                   Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2       2                     Jumanji (1995)                  Adventure|Children|Fantasy
3       3            Grumpier Old Men (1995)                              Comedy|Romance
4       4           Waiting to Exhale (1995)                        Comedy|Drama|Romance
5       5 Father of the Bride Part II (1995)                                      Comedy
6       6                        Heat (1995)                       Action|Crime|Thriller
head(ratings_data)
  userId movieId rating timestamp
1      1       1      4 964982703
2      1       3      4 964981247
3      1       6      4 964982224
4      1      47      5 964983815
5      1      50      5 964982931
6      1      70      3 964982400
To better understand the film ratings, we display the number of times each rating value occurs and the average rating per film. We see that in most cases, there is no rating by a user at all. Furthermore, the average ratings contain many “round” values. These come from movies that have only a few individual ratings, so the average score is determined by individual users.
# rating_vector
0         0.5    1      1.5    2      2.5   3      3.5    4       4.5   5
5830804   1370   2811   1791   7551   5550  20047  13136  26818   8551  13211
[Figure: Average movie ratings]
In order not to let individual users influence the movie ratings too much, the movies are reduced to those that have at least 50 ratings.
[Figure: Average movie ratings (films with at least 50 ratings)]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.208   3.444   3.748   3.665   3.944   4.429
Under the assumption that the ratings of users who regularly give their opinion are more precise, we also only consider users who have given at least 50 ratings. For the films filtered above, we receive the following average ratings per user:
[Figure: Average ratings per user (relevant users)]
You can see that the distribution of the average ratings is left-skewed, which means that many users tend to give rather good ratings. To compensate for this skewness, we normalize the data.
ratings_movies_norm <- normalize(ratings_movies)
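Assuming recommenderlab's default centering normalization, each user's ratings are shifted by that user's mean rating (a z-score variant would additionally divide by the user's standard deviation):

    \[r'_{u,i} = r_{u,i} - \bar{r}_u\]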

Model Training and Evaluation

To train our recommender and subsequently evaluate it, we carry out a 10-fold cross-validation. We train both an IBCF and a UBCF recommender, each with cosine similarity and with Pearson correlation as the similarity measure. A random recommendation is used as a benchmark. To evaluate how many recommendations can be given, different numbers are tested via the vector n_recommendations.
eval_sets <- evaluationScheme(data = ratings_movies_norm,
                              method = "cross-validation",
                              k = 10,
                              given = 5,
                              goodRating = 0)

models_to_evaluate <- list(
  `IBCF Cosine` = list(name = "IBCF", 
                       param = list(method = "cosine")),
  `IBCF Pearson` = list(name = "IBCF", 
                        param = list(method = "pearson")),
  `UBCF Cosine` = list(name = "UBCF",
                       param = list(method = "cosine")),
  `UBCF Pearson` = list(name = "UBCF",
                        param = list(method = "pearson")),
  `Random Suggestion` = list(name = "RANDOM", param = NULL)
)

n_recommendations <- c(1, 5, seq(10, 100, 10))

list_results <- evaluate(x = eval_sets, 
                         method = models_to_evaluate, 
                         n = n_recommendations)
We then have the results displayed graphically for analysis.
[Figure: Comparison of the different models]
We see that the best performing model is built by using UBCF and the Pearson correlation as a similarity measure. The model consistently achieves the highest true positive rate for the various false-positive rates and thus delivers the most relevant recommendations. Furthermore, we want to maximize the recall, which is also guaranteed at every level by the UBCF Pearson model. Since the n most similar users (parameter nn) are used to calculate the recommendations, we will examine the results of the model for different numbers of users.
vector_nn <- c(5, 10, 20, 30, 40)

models_to_evaluate <- lapply(vector_nn, function(nn){
  list(name = "UBCF",
       param = list(method = "pearson", nn = nn))  # use the current number of neighbours
})
names(models_to_evaluate) <- paste0("UBCF with ", vector_nn, " users")
list_results <- evaluate(x = eval_sets, 
                         method = models_to_evaluate, 
                         n = n_recommendations)
[Figure: UBCF results for different numbers of users]

Conclusion

Our user-based collaborative filtering model with the Pearson correlation as similarity measure and nn = 40 users delivers the best results. To test the model yourself and get movie suggestions for your own taste, I created a small Shiny App. However, there is no guarantee that the suggested movies really meet your individual taste. Not only is the underlying data set relatively small and potentially distorted by individual user ratings, but the tech giants also use other data such as age, gender, user behavior, etc. for their models. But what I can say is: Data Scientists who read this blog post also read the other blog posts by STATWORX.

Shiny-App

Here you can find the Shiny App. To get your own movie recommendation, select up to 10 movies from the dropdown list, rate them on a scale from 0 (= bad) to 5 (= good), and press the run button. Please note that the app is hosted on a free account of shinyapps.io, which makes it available for 25 hours per month. If the 25 hours are used up and the app is therefore no longer available this month, you will find the code here to run it in your local RStudio.

At STATWORX, we are excited that a promising new field of Machine Learning has evolved in recent years: Causal Machine Learning. In short, Causal Machine Learning is the scientific study of Machine Learning algorithms that allow estimating causal effects. Over the last few years, different Causal Machine Learning algorithms have been developed, combining the advances from Machine Learning with the theory of causal inference to estimate different types of causal effects. My colleague Markus has already introduced some of these algorithms in an earlier blog post. As Causal Machine Learning is a rather complex topic, I will write a series of blog posts to slowly dive into this new fascinating world of data science. In my first blog post, I gave an introduction to the topic, focusing on what Causal Machine Learning is and why it is important in practice and for the future of data science. In this second blog post, I will introduce the so-called Causal Forest, one of the most popular Causal Machine Learning algorithms for estimating heterogeneous treatment effects.

Why Heterogeneous Treatment Effects?

In Causal Forests, the goal is to estimate heterogeneity in treatment effects. As explained in my previous blog post, a treatment effect refers to a causal effect of a treatment or intervention on an outcome variable of scientific or political interest, for example, the causal effect of a subsidised training programme on earnings. As individual treatment effects are unobservable, in practice the focus is on estimating unbiased and consistent averages of the individual treatment effects. The most common such parameter is the average treatment effect, which is the mean of all individual treatment effects in the entire population of interest. However, treatment effects may sometimes vary widely between different subgroups of the population, be it larger or smaller than the average treatment effect. In some cases, it might therefore be more interesting to estimate these different, i.e. heterogeneous, treatment effects.
In most applications it is also interesting to look beyond the average effects in order to understand how the causal effects vary with observable characteristics. (Knaus, Lechner & Strittmatter, 2018)
The estimation of heterogeneous treatment effects can assist in answering questions like: For whom are treatment effects big or small? For which subgroup does a treatment generate beneficial or adverse effects? In the field of marketing, for example, the estimation of heterogeneous treatment effects can help to optimise resource allocation by answering the question of which customers respond the most to a certain marketing campaign, or for which customers the causal effect of intervention strategies on their churn behaviour is the highest. Or, when it comes to pricing, it might be interesting to quantify how a change in price has a varying impact on sales among different age or income groups.

Where Old Estimation Methods Fail

Estimating heterogeneous treatment effects is nothing new. Econometrics and other social sciences have long been studying which variables predict a smaller or larger than average treatment effect, which in statistical terms is also known as Moderation. One of the most traditional ways to find heterogeneous treatment effects is to use a Multiple Linear Regression with interaction terms between the variables of interest (i.e. the ones which might lead to treatment heterogeneity) and the treatment indicator. In this blog post, I will always assume that the data is from a randomised experiment, such that the assumptions to identify treatment effects are valid without further complications. We then conclude that the treatment effect depends on the variables whose interaction term is statistically significant. For example, if we have only one variable, the regression model would look as follows:

    \[Y = \beta_0 + \beta_1 w + \beta_2 x_1 + \beta_3 (w \cdot x_1),\]

where w is the treatment indicator and x_1 is the variable of interest. In that case, if beta_3 is significant, we know that the treatment effect depends on variable x_1. The treatment effect for each observation can then be calculated as

    \[\beta_1 + \beta_3 \cdot x_1,\]

which is dependent on the value of x_1 and therefore heterogeneous among the different observations. So why is there a need for more advanced methods to estimate heterogeneous treatment effects? The example above was very simple; it only included one variable. However, usually we have more than one variable which might influence the treatment effect. To see which variables predict heterogeneous treatment effects, we have to include many interaction terms, not only between each variable and the treatment indicator but also for all possible combinations of variables with and without the treatment indicator. If we have p variables and one treatment, this gives a total number of parameters of:

    \[\sum_{k = 0}^{p + 1} \binom{p + 1}{k}.\]

So, for example, if we had 5 variables, we would have to include a total of 64 parameters in our Linear Regression Model. This approach suffers from a lack of statistical power and could also cause computational issues. The use of a Multiple Linear Regression also imposes linear relationships unless more interactions with polynomials are included. Because Machine Learning algorithms can handle enormous numbers of variables and combine them in nonlinear and highly interactive ways, researchers have found ways to better estimate heterogeneous treatment effects by combining the field of Machine Learning with the study of causal inference.

Generalised Random Forests

Over recent years, different Machine Learning algorithms have been developed to estimate heterogeneous treatment effects. Most of them are based on the idea of Decision Trees or Random Forests, just like the one I focus on in this blog post: Generalised Random Forests by Athey, Tibshirani and Wager (2019). Generalised Random Forests follows the idea of Random Forests, and apart from heterogeneous treatment effect estimation, the algorithm can also be used for non-parametric quantile regression and instrumental variable regression. It keeps the main structure of Random Forests, such as the recursive partitioning, subsampling, and random split selection. However, instead of averaging over the trees, Generalised Random Forests estimates a weighting function and uses the resulting weights to solve a local GMM model. To estimate heterogeneous treatment effects, the algorithm has two important additional features, which distinguish it from standard Random Forests.

1. Splitting Criterion

The first important difference to Random Forests is the splitting criterion. In Random Forests, where we want to predict an outcome variable Y, the split at each tree node is performed by minimising the mean squared error of the outcome variable Y. In other words, the variable and value to split at each tree node are chosen such that the greatest reduction in the mean squared error with regard to the outcomes Y is achieved. After each tree partition has been completed, the tree’s prediction for a new observation x is obtained by letting it trickle down all the way from the tree’s root into a terminal node, and then taking the average of the outcomes Y of all the training observations that fell into the same node. The Random Forest prediction is then calculated as the average of the predicted tree values. In Causal Forests, we want to estimate treatment effects. As stated by the Fundamental Problem of Causal Inference, however, we can never observe a treatment effect at the individual level. Therefore, the prediction of a treatment effect is given by the difference in the average outcomes Y between the treated and the untreated observations in a terminal node. Without going into too much detail, to find the most heterogeneous but also accurate treatment effects, the splitting criterion is adapted such that it searches for a partitioning where the treatment effects differ the most, including a correction that accounts for how the splits affect the variance of the parameter estimates.
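In symbols, the leaf-level estimate described above is simply the difference between the treated and untreated outcome means within the terminal node L(x) into which a new observation x falls (the notation here is chosen purely for illustration):

    \[\hat{\tau}(x) = \frac{1}{|\{i: w_i = 1, x_i \in L(x)\}|} \sum_{i: w_i = 1, x_i \in L(x)} Y_i - \frac{1}{|\{i: w_i = 0, x_i \in L(x)\}|} \sum_{i: w_i = 0, x_i \in L(x)} Y_i\]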

2. Honesty

Random Forests are usually evaluated by applying them to a test set and measuring the accuracy of the predictions of Y using an error measure such as the mean squared error. Because we can never observe treatment effects, this form of performance measurement is not possible in Causal Forests. When estimating causal effects, one therefore evaluates their accuracy by examining the bias, the standard error, and the related confidence interval of the estimates. To ensure that an estimate is as accurate as possible, the bias should asymptotically disappear and the standard error, and thus the confidence interval, should be as small as possible. To enable this statistical inference in their Generalised Random Forest, Athey, Tibshirani and Wager introduce so-called honest trees. In order to make a tree honest, the training data is split into two subsamples: a splitting subsample and an estimating subsample. The splitting subsample is used to perform the splits and thus grow the tree. The estimating subsample is then used to make the predictions. That is, all observations in the estimating subsample are dropped down the previously grown tree until they fall into a terminal node. The prediction of the treatment effects is then given by the difference in the average outcomes between the treated and the untreated observations of the estimating subsample in the terminal nodes. With such honest trees, the estimates of a Causal Forest are consistent (i.e. the bias vanishes asymptotically) and asymptotically Gaussian, which, together with the estimator for the asymptotic variance, allows valid confidence intervals.

Causal Forest in Action

To show the advantages of Causal Forests compared to old estimation methods, in the following I will compare the Generalised Random Forest to a Regression with interaction terms in a small simulation study. I use simulated data to be able to compare the estimated treatment effects with the actual treatment effects, which, as we know, would not be observable in real data. To compare the two algorithms with respect to the estimation of heterogeneous treatment effects, I test them on two different data sets, one with and one without heterogeneity in the treatment effect:
Data Set   Heterogeneity      Heterogeneity Variables   Variables           Observations
1          No heterogeneity   none                      x_1, ..., x_{10}    20,000
2          Heterogeneity      x_1 and x_2               x_1, ..., x_{10}    20,000
This means that in the first data set, all observations have the same treatment effect. In this case, the average treatment effect and the heterogeneous treatment effects are the same. In the second data set, the treatment effect varies with the variables x_1 and x_2. Without going into too much detail here (I will probably write a separate blog post only about causal data generating processes), the relationship between these heterogeneity variables (x_1 and x_2) and the treatment effect is not linear. Both simulated data sets have 20,000 observations containing an outcome variable Y and 10 covariates with values between zero and one. To evaluate the two algorithms, the data sets are split into a train (75%) and a test set (25%). For the Causal Forest, I use causal_forest() from the grf package with tune.parameters = "all". I compare this to an lm() model, which includes all variables, the treatment indicator, and the necessary interaction terms of the heterogeneity variables and the treatment indicator.

Linear Regression Model for the data set with heterogeneity:

    \[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{10} x_{10} + \beta_{11} w + \beta_{12} (w \cdot x_1) + \beta_{13} (w \cdot x_2) + \beta_{14} (x_1 \cdot x_2) + \beta_{15} (w \cdot x_1 \cdot x_2)\]

Linear Regression Model for data set with no heterogeneity:

    \[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_{10} x_{10} + \beta_{11} w\]

where x_1, ..., x_{10} are the covariates, x_1 and x_2 are the heterogeneity variables, and w is the treatment indicator (i.e., w = 1 if treated and w = 0 if not treated). As already explained above, we usually do not know which variables affect the treatment effect and therefore have to include all possible interaction terms in the Linear Regression Model to see which variables lead to treatment effect heterogeneity. In the case of 10 variables, as we have here, this means we would have to include a total of 2048 parameters in our Linear Regression Model. However, since the heterogeneity variables are known in the simulated data, I only include the interaction terms for those variables here.
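As an aside, the comparison in this post is run in R with grf::causal_forest() and lm(). A rough Python analogue could look like the sketch below; it assumes the econml package and its CausalForestDML estimator, and the data-generating process shown is made up for illustration, it is not the one used for the results that follow.

# Rough Python analogue of the comparison (sketch only; the article's results were
# produced in R with grf::causal_forest() and lm()). Assumes the econml package.
import numpy as np
from econml.dml import CausalForestDML

rng = np.random.default_rng(42)
n, p = 20000, 10
X = rng.uniform(size=(n, p))                    # 10 covariates between zero and one
w = rng.integers(0, 2, size=n)                  # randomised treatment indicator
tau = np.sin(np.pi * X[:, 0] * X[:, 1])         # assumed non-linear heterogeneous effect
y = X @ rng.normal(size=p) + tau * w + rng.normal(size=n)

cf = CausalForestDML(discrete_treatment=True, random_state=42)
cf.fit(y, w, X=X)

tau_hat = cf.effect(X)                          # estimated individual treatment effects
rmse = np.sqrt(np.mean((tau_hat - tau) ** 2))
print(f"RMSE of the estimated treatment effects: {rmse:.3f}")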
Data Set           Metric   grf    lm
No Heterogeneity   RMSE     0.01   0.00
Heterogeneity      RMSE     0.08   0.45
Looking at the results, we can see that without heterogeneity, the treatment effect is equally well predicted by the Causal Forest (RMSE of 0.01) and the Linear Regression (RMSE of 0.00). However, in the data set with heterogeneous treatment effects, the Causal Forest is far more accurate (RMSE of 0.08) than the Linear Regression (RMSE of 0.45). As expected, the Causal Forest seems to be better at detecting the underlying non-linear relationship between the heterogeneity variables and the treatment effect than the Linear Regression Model, which can also be seen in the plots below. Thus, even if we already know which variables influence the treatment effect and only need to include the necessary interaction terms, the Linear Regression Model is still less accurate than the Causal Forest due to its lack of modelling flexibility.
[Figure: Treatment effect hex plots]

Outlook

I hope that this blog post has helped you to understand what Causal Forests are and what advantages they bring in estimating heterogeneous treatment effects compared to old estimation methods. In my upcoming blog posts on Causal Machine Learning, I will explore this new field of data science further. I will, for example, take a look at the problems of using classical Machine Learning algorithms to estimate causal effects in more detail or introduce different data generating processes to evaluate Causal Machine Learning methods in simulation studies.

References

  • Athey, S., Tibshirani, J., & Wager, S. (2019). Generalised random forests. The Annals of Statistics, 47(2), 1148-1178.
  • Knaus, M. C., Lechner, M., & Strittmatter, A. (2018). Machine learning estimation of heterogeneous causal effects: Empirical monte carlo evidence. arXiv:1810.13237v2.

A major problem arises when comparing forecasting methods and models across different time series. This is a challenge we regularly face at STATWORX. Unit-dependent measures like the MAE (Mean Absolute Error) and the RMSE (Root Mean Squared Error) turn out to be unsuitable and hardly helpful if the time series are measured in different units. However, if this is not the case, both measures provide valuable information. The MAE is perfectly interpretable as it embodies the average absolute deviation from the actual values. The RMSE, on the other hand, is not that easy to interpret, more vulnerable to extreme values, but still often used in practice.

    \[MAE = \frac{1}{n} \sum_{i=1}^{n} |\mathrm{Actual}_i - \mathrm{Forecast}_i|\]

    \[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\mathrm{Actual}_i - \mathrm{Forecast}_i)^2}\]

One of the most commonly used measures that avoids this problem is called MAPE (Mean Absolute Percentage Error). It solves the problem of the mentioned approaches as it does not depend on the unit of the time series. Furthermore, decision-makers without a statistical background can easily interpret and understand this measure. Despite its popularity, the MAPE was and still is criticized.

    \[MAPE = \frac{1}{n} \sum_{i=1}^{n} \left|\frac{\mathrm{Actual}_i - \mathrm{Forecast}_i}{\mathrm{Actual}_i}\right| \cdot 100\]

In this article, I evaluate these critical arguments and prove that at least some of them are highly questionable. The second part of my article concentrates on true weaknesses of the MAPE, some of them well-known but others hiding in the shadows. In the third section, I discuss various alternatives and summarize under which circumstances the use of the MAPE seems to be appropriate (and when it’s not).

What the MAPE is FALSELY blamed for!

It Puts Heavier Penalties on Negative Errors Than on Positive Errors

Most sources dealing with the MAPE point out this “major” issue of the measure. The statement is primarily based on two different arguments. First, they claim that interchanging the actual value with the forecasted value proves their point (Makridakis 1993).

Case 1: Actual_1 = 150 and Forecast_1 = 100 (positive error)

    \[APE_1 = \left|\frac{\mathrm{Actual}_1 - \mathrm{Forecast}_1}{\mathrm{Actual}_1}\right| \cdot 100 = \left|\frac{150 - 100}{150}\right| \cdot 100 = 33.33\ \mathrm{percent}\]

Case 2: Actual_2 = 100 and Forecast_2 = 150 (negative error)

    \[APE_2 = \left|\frac{\mathrm{Actual}_2 - \mathrm{Forecast}_2}{\mathrm{Actual}_2}\right| \cdot 100 = \left|\frac{100 - 150}{100}\right| \cdot 100 = 50\ \mathrm{percent}\]

It is true that Case 1 (positive error of 50) is related to a lower APE (Absolute Percentage Error) than Case 2 (negative error of 50). However, the reason here is not that the error is positive or negative but simply that the actual value changes. If the actual value stays constant, the APE is equal for both types of errors (Goodwin & Lawton 1999). This is clarified by the following example.

Case 3: Actual_3 = 100 and Forecast_3 = 50

    \[APE_3 = \left|\frac{\mathrm{Actual}_3 - \mathrm{Forecast}_3}{\mathrm{Actual}_3}\right| \cdot 100 = \left|\frac{100 - 50}{100}\right| \cdot 100 = 50\ \mathrm{percent}\]

Case 4: Actual_4 = 100 and Forecast_4 = 150

    \[APE_4 = \left|\frac{\mathrm{Actual}_4 - \mathrm{Forecast}_4}{\mathrm{Actual}_4}\right| \cdot 100 = \left|\frac{100 - 150}{100}\right| \cdot 100 = 50\ \mathrm{percent}\]

The second, equally invalid argument supporting the asymmetry of the MAPE arises from the assumption about the predicted data. As the MAPE is mainly suited to evaluating predictions on a ratio scale, the MAPE is bounded on the lower side by an error of 100% (Armstrong & Collopy 1992). However, this does not imply that the MAPE overweights or underweights some types of errors, but that these errors are simply not possible.

Its TRUE weaknesses!

It Fails if Some of the Actual Values Are Equal to Zero

This statement describes a well-known problem of the measure. This issue, together with the asymmetry argument discussed above, was the reason for the development of a modified form of the MAPE, the SMAPE (“Symmetric” Mean Absolute Percentage Error). Ironically, in contrast to the original MAPE, this modified form suffers from true asymmetry (Goodwin & Lawton 1999). I will clarify this argument in the last section of the article.

Particularly Small Actual Values Bias the MAPE

If any true values are very close to zero, the corresponding absolute percentage errors will be extremely high and therefore bias the informative value of the MAPE (Hyndman & Koehler 2006). The following graph clarifies this point. Although all three forecasts have the same absolute errors, the MAPE of the time series with only one extremely small value is approximately twice as high as the MAPE of the other forecasts. This implies that the MAPE should be used carefully if there are extremely small observations, and it directly motivates the last and often ignored weakness of the MAPE.
(Figure: three forecasts with identical absolute errors; the series containing one extremely small actual value shows a much higher MAPE)
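
A minimal R sketch (again with made-up numbers) reproduces this effect:

mape <- function(actual, forecast) mean(abs((actual - forecast) / actual)) * 100

actual_normal <- c(100, 100, 100, 100)
actual_small  <- c(100, 100, 100, 1)    # one extremely small actual value

# identical absolute errors of 5 in both cases
mape(actual_normal, actual_normal + 5)  # 5 percent
mape(actual_small,  actual_small + 5)   # 128.75 percent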

The MAPE Implies Only Which Forecast Is Proportionally Better

As mentioned at the beginning of this article, one advantage of using the MAPE to compare forecasts across different time series is its unit independence. However, it is essential to keep in mind that the MAPE only implies which forecast is proportionally better. The following graph shows three different time series and their corresponding forecasts. The only difference between them is their general level. The same absolute errors therefore lead to profoundly different MAPEs. This raises the question of whether it is reasonable to use such a percentage-based measure for the comparison of forecasts across different time series. If the time series do not move on somewhat comparable levels (as shown in the following graphic), using the MAPE to infer that a forecast is generally better for one time series than for another relies on the assumption that the same absolute errors are less problematic for time series on higher levels than for time series on lower levels:
"If a time series fluctuates around 100, then predicting 101 is way better than predicting 2 for a time series fluctuating around 1."
That might be true in some cases. In general, however, this is a questionable assumption, or at least one that people should always be aware of when using the MAPE to compare forecasts between different time series.
(Figure: three time series on different levels with identical absolute forecast errors but very different MAPEs)

Summary

In summary, the discussed findings show that the MAPE should be used with caution as an instrument for comparing forecasts across different time series. A necessary condition is that the time series contains only strictly positive values. Second, even a few extremely small actual values have the potential to bias the MAPE heavily. Last, the MAPE depends systematically on the level of the time series, as it is a percentage-based error. This article critically questions whether it is meaningful to generalize from being a proportionally better forecast to being a generally better forecast.

BETTER alternatives!

The discussion above implies that the MAPE alone is often not very useful when the objective is to compare accuracy between forecasts for different time series. Although relying on only one easily understandable measure appears comfortable, it comes with a high risk of drawing misleading conclusions. In general, it is recommended to use a combination of different measures. In addition to numerical measures, a visualization of the time series, including the actual and the forecasted values, always provides valuable information. However, if one single numeric measure is the only option, there are some excellent alternatives.

Scaled Measures

Scaled measures compare a measure of the focal forecast, for example the MAE, to the same measure of a benchmark method. Similar measures can be defined using the RMSE, the MAPE, or other measures. Common benchmark methods are the "random walk", the "naïve" method, and the "mean" method. These measures are easy to interpret, as they show how the focal model compares to the benchmark methods. However, it is important to keep in mind that relative measures rely on the selection of the benchmark method and on how well the time series can be forecasted by the selected method.

\[ \text{Relative MAE} = \frac{{\rm MAE}_{\rm focal\ model}}{{\rm MAE}_{\rm benchmark\ model}} \]
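
A minimal R sketch of the relative MAE, here using the naïve (last observed value) forecast as the benchmark; the numbers are made up for illustration:

actual         <- c(105, 110, 120, 118, 130)
focal_forecast <- c(102, 112, 117, 121, 127)
# naive benchmark: forecast each value with the previous actual value
naive_forecast <- c(NA, head(actual, -1))

mae_focal <- mean(abs(actual - focal_forecast))
mae_naive <- mean(abs(actual - naive_forecast), na.rm = TRUE)

relative_mae <- mae_focal / mae_naive   # < 1 means the focal model beats the benchmark
relative_mae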

Scaled Errors

Scaled error approaches also try to remove the scale of the data by comparing the forecasted values to those obtained by some benchmark forecast method, like the naïve method. The MASE (Mean Absolute Scaled Error), proposed by Hyndman & Koehler (2006), is defined slightly differently depending on the seasonality of the time series. In the simple case of a non-seasonal time series, the error of the focal forecast is scaled by the in-sample MAE of the naïve forecast method. One major advantage is that it can handle actual values of zero and that it is not biased by very extreme values. Once again, it is important to keep in mind that relative measures rely on the selection of the benchmark method and on how well the time series can be forecasted by the selected method.

\[ \text{Non-seasonal MASE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{{\rm Actual}_i - {\rm Forecast}_i}{\frac{1}{T-1}\sum_{t=2}^{T} |{\rm Actual}_t - {\rm Actual}_{t-1}|} \right| \]

\[ \text{Seasonal MASE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{{\rm Actual}_i - {\rm Forecast}_i}{\frac{1}{T-m}\sum_{t=m+1}^{T} |{\rm Actual}_t - {\rm Actual}_{t-m}|} \right| \]
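
A minimal R sketch of the non-seasonal MASE; the split into an in-sample (training) part and a forecasted part is assumed for illustration:

mase <- function(actual, forecast, insample) {
  # scale: in-sample MAE of the naive one-step forecast
  scale <- mean(abs(diff(insample)))
  mean(abs((actual - forecast) / scale))
}

insample <- c(100, 102, 98, 105, 103, 110)   # training part of the series
actual   <- c(112, 108, 115)                 # test part
forecast <- c(110, 110, 112)
mase(actual, forecast, insample)             # < 1: better than the in-sample naive method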

SDMAE

In my understanding, the basic idea of using the MAPE to compare forecasts between different time series is that the same absolute error is assumed to be less problematic for time series on higher levels than for time series on lower levels. Based on the examples shown earlier, I think that this idea is at least questionable. I argue that how good or bad a specific absolute error is judged should not depend on the general level of the time series but on its variation. Accordingly, the following measure, the SDMAE (Standard Deviation adjusted Mean Absolute Error), is a product of the discussed issues and my imagination. It can be used for evaluating forecasts for time series containing negative values and suffers neither from actual values being equal to zero nor from particularly small actual values. Note that this measure is not defined for time series that do not fluctuate at all. Furthermore, there might be other limitations of this measure that I am currently not aware of.

\[ SDMAE = \frac{{\rm MAE}_{\rm focal\ model}}{{\rm SD}_{\rm actual\ values}} \]
(Figure: SDMAE)
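
Since the SDMAE is simply the MAE divided by the standard deviation of the actual values, it is straightforward to compute; a minimal sketch with made-up numbers:

sdmae <- function(actual, forecast) {
  mean(abs(actual - forecast)) / sd(actual)
}

actual   <- c(100, 104, 96, 103, 99)
forecast <- c(101, 102, 98, 105, 97)
sdmae(actual, forecast)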

Summary

I suggest using a combination of different measures to get a comprehensive understanding of the performance of the different forecasts. I also suggest complementing the MAPE with a visualization of the time series, including the actual and the forecasted values, the MAE, and a scaled measure or scaled error approach. The SDMAE should be seen as an alternative approach that has not yet been discussed by a broader audience. I am thankful for your critical thoughts and comments on this idea.

Worse alternatives!

SMAPE

The SMAPE was created to solve and respond to the problems of the MAPE. However, it solves neither the problem of extremely small actual values nor the level dependency of the MAPE. The reason is that extremely small actual values are typically related to extremely small predictions (Hyndman & Koehler 2006). Additionally, and in contrast to the unmodified MAPE, the SMAPE introduces a problem of asymmetry (Goodwin & Lawton 1999). This is clarified by the following graphic, where the "APE" relates to the MAPE and the "SAPE" relates to the SMAPE. It shows that the SAPE is higher for positive errors than for negative errors and is therefore asymmetric. Several researchers recommend against using the SMAPE (Hyndman & Koehler 2006).

\[ SMAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{{\rm Actual}_i - {\rm Forecast}_i}{({\rm Actual}_i + {\rm Forecast}_i)/2} \right| \cdot 100 \]

On the asymmetry of the symmetric MAPE (Goodwin & Lawton 1999)
(Figure: APE vs. SAPE for positive and negative errors)
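
The asymmetry is easy to verify numerically; a minimal sketch with a single observation and an absolute error of 10 in both directions:

sape <- function(actual, forecast) {
  abs((actual - forecast) / ((actual + forecast) / 2)) * 100
}

sape(actual = 100, forecast = 110)  # negative error: ~9.52 percent
sape(actual = 100, forecast = 90)   # positive error of equal size: ~10.53 percent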

References

  • Goodwin, P., & Lawton, R. (1999). On the asymmetry of the symmetric MAPE. International Journal of Forecasting, 15(4), 405-408.
  • Hyndman, R. J., & Koehler, A. B. (2006). Another look at measures of forecast accuracy. International Journal of Forecasting, 22(4), 679-688.
  • Makridakis, S. (1993). Accuracy measures: theoretical and practical concerns. International Journal of Forecasting, 9(4), 527-529.
  • Armstrong, J. S., & Collopy, F. (1992). Error measures for generalizing about forecasting methods: Empirical comparisons. International Journal of Forecasting, 8(1), 69-80.
In my last blog post, I elaborated on the Bagging algorithm and showed its prediction performance via simulation. Here, I want to go into the details of how to simulate the bias and variance of a nonparametric regression fitting method using R. These kinds of questions arise here at STATWORX when developing, for example, new machine learning algorithms or testing established ones that should generalize well to new, unseen data.

Decomposing the mean squared error

We consider the basic regression setup where we observe a real-valued sample (y_i, x_i)_{i=1}^n and our aim is to predict the outcome y_i, i = 1, \dots, n, based on some predictor variables ("covariates") x_i via a regression fitting method. Usually, we can measure our target only with noise \epsilon_i,

\[ y_i = f(x_i) + \epsilon_i, \quad i = 1, \dots, n. \]

To measure our prediction accuracy, we will use the mean squared error (MSE), which can be decomposed as follows:

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} [E(\hat{f}(x_i)) - f(x_i)]^2 + \frac{1}{n} \sum_{i=1}^{n} Var(\hat{f}(x_i)) \]

\[ MSE = Bias^2 + Variance \]

Here, \hat{f}(x_i) is our prediction obtained by the regression method at hand. From the above expression, we observe that the MSE consists of two parts:
  • Bias^2: measures the (squared) difference between the true underlying process and the mean of our predictions, i.e. [E(hat{f}(x_i)) - f(x_i)]^2.
  • Variance: measures the variation of our predictions around its mean, i.e. Var(hat{f}(x_i)).
In general, it will not be possible to minimize both expressions as they are competing with each other. This is what is called the bias-variance tradeoff. More complex models (e.g. higher order polynomials in polynomial regression) will result in low bias while yielding high variance as we fit characteristic features of the data that are not necessary to predict the true outcome (and vice versa, cf. Figure 1).

Monte Carlo Setup & Simulation Code

To illustrate this, we consider a simple toy model

\[ y_i = 0.8 x_i + \sin(6 x_i) + \epsilon_i, \quad i = 1, \dots, n, \]

where n = 100, x_i \sim U([-0.5, 0.5]) and \epsilon_i \sim \mathcal{N}(0, 2). As a fitting procedure, we will use a cubic smoothing spline (smooth.spline). For the purpose of this blog post, we only need to know that a smoothing spline divides our univariate predictor space into n+1 intervals and fits a cubic polynomial in each of them to approximate our target. The complexity of the smoothing spline is controlled via its degrees of freedom (function argument df). If you are interested in the (impressive) mathematical details of smoothing splines, check out this script by Ryan Tibshirani. Figure 1 shows the bias-variance tradeoff described above. For relatively low degrees of freedom, we obtain a model (red line) which is too simple and does not approximate the true data generating process well (black dashed line); note that for very low degrees of freedom, the fit approaches the (linear) least-squares fit. Here, the bias will be relatively large while the variance remains low. On the other hand, choosing relatively high degrees of freedom leads to a model which overfits the data (green line). In this case, we clearly observe that the model fits characteristic features of the data which are not relevant for approximating the true data generating process.
Figure 1: In-sample fit of a (cubic) smoothing spline with varying degrees of freedom
To make this tradeoff more rigorous, we explicitly plot the bias and the variance. For this, we conduct a Monte Carlo simulation. As a side note, to run the code snippets below, you only need the stats package, which is part of base R. For validation purposes, we use a training and a test dataset (*_train and *_test, respectively). On the training set, we construct the algorithm (obtain coefficient estimates, etc.), while on the test set, we make our predictions.
# set a random seed to make the results reproducible
set.seed(123)

n_sim <- 200
n_df <- 40
n_sample <- 100

# setup containers to store the results
prediction_matrix <- matrix(NA, nrow = n_sim, ncol = n_sample)
mse_temp <- matrix(NA, nrow = n_sim, ncol = n_df)
results <- matrix(NA, nrow = 3, ncol = n_df)

# Train data -----
x_train <- runif(n_sample, -0.5, 0.5)
f_train <- 0.8*x_train+sin(6*x_train)

epsilon_train <- replicate(n_sim, rnorm(n_sample, 0, sqrt(2)))
y_train <- replicate(n_sim,f_train) + epsilon_train

# Test data -----
x_test <- runif(n_sample, -0.5, 0.5)
f_test <- 0.8*x_test+sin(6*x_test)
The bias-variance tradeoff can be simulated in R using two for-loops. The outer one controls the complexity of the smoothing spline (counter: df_iter). The Monte Carlo simulation with 200 iterations (n_sim), which yields the prediction matrix for the variance and bias, is run in the inner loop.
# outer for-loop
for (df_iter in seq(n_df)){
  # inner for-loop
  for (mc_iter in seq(n_sim)){
    cspline <- smooth.spline(x_train, y_train[, mc_iter], df=df_iter+1)
    cspline_predict <- predict(cspline, x_test)
    prediction_matrix[mc_iter, 1:n_sample] <- cspline_predict$y
    mse_temp[mc_iter, df_iter] <- mean((cspline_predict$y - f_test)^2)
  }

  var_matrix <- apply(prediction_matrix, 2, FUN = var)
  bias_matrix <- apply(prediction_matrix, 2, FUN = mean)

  squared_bias <- (bias_matrix - f_test)^2

  results[1, df_iter] <- mean(var_matrix)
  results[2, df_iter] <- mean(squared_bias)
}

results[3,1:n_df] <- apply(mse_temp, 2, FUN = mean)
To model E(\cdot) and Var(\cdot) from the MSE equation above, we have to approximate these theoretical (population) terms by means of a Monte Carlo simulation (the inner for-loop). We run B = 200 (n_sim) Monte Carlo iterations and save the predictions obtained by the smooth.spline object in a (B \times n) prediction matrix. To approximate E(\hat{f}(x_i)) at each test sample point x_i, i = 1, \dots, n, by \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x_i), we take the average of each column. Here, \hat{f}_b(x_i) denotes the prediction of the algorithm at sample point x_i in iteration b. Similar considerations can be made to obtain an approximation for Var(\hat{f}(x_i)).

Bias-variance tradeoff as a function of the degrees of freedom

Figure 2 shows the simulated bias-variance tradeoff (as a function of the degrees of freedom). We clearly recognize the complexity considerations of Figure 1. The bias decreases quickly towards zero, while the variance increases roughly linearly with the degrees of freedom. Hence, for higher degrees of freedom, the MSE is driven mostly by the variance. Comparing this figure with Figure 1, we note that for df = 3 the bias contributes substantially more than the variance to the MSE. If we increase the degrees of freedom to df = 35, the bias tends to zero, characteristic features of the data are fitted, and the MSE consists mostly of the variance.
Figure 2: Bias-Variance Tradeoff of a (cubic) smoothing spline
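
For completeness, here is a minimal sketch of how a figure like Figure 2 could be drawn from the results matrix computed above; the naming follows the simulation code, but the exact styling of the original figure may differ:

# rows of 'results': 1 = variance, 2 = squared bias, 3 = MSE; columns = degrees of freedom
df_grid <- seq(n_df) + 1
plot(df_grid, results[3, ], type = "l", col = "black",
     xlab = "Degrees of freedom", ylab = "Error",
     ylim = range(results), main = "Bias-Variance Tradeoff")
lines(df_grid, results[2, ], col = "red")    # squared bias
lines(df_grid, results[1, ], col = "blue")   # variance
legend("top", legend = c("MSE", "Squared bias", "Variance"),
       col = c("black", "red", "blue"), lty = 1)
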
With small modifications, you can use this code to explore the bias-variance tradeoff of other regression fitting and machine learning methods, such as Boosting or Random Forests. I leave it to you to find out which hyperparameters induce the bias-variance tradeoff in these algorithms. You can find the full code on my GitHub account. If you spot any mistakes or if you are interested in discussing applied statistics and econometrics topics, feel free to contact me.

References

The simulation setup in this blog post closely follows the one from Bühlmann and Yu (2003).
  • Bühlmann, P. and Yu, B. (2003). Boosting with the L2-loss: Regression and classification. Journal of the American Statistical Association, 98, 324–339.

Motivation

There are dozens of machine learning algorithms out there. It is impossible to learn all their mechanics; however, many algorithms sprout from the most established ones, e.g. ordinary least squares, gradient boosting, support vector machines, tree-based algorithms, and neural networks. At STATWORX we discuss algorithms daily to evaluate their usefulness for a specific project. In any case, understanding these core algorithms is key to most machine learning algorithms in the literature.

While I like reading machine learning research papers, the maths is sometimes hard to follow. That is why I like implementing the algorithms in R by myself. Of course this means digging through the maths and the algorithms as well. However, you can challenge your understanding of the algorithm directly.

In my two subsequent blog posts, I will introduce two machine learning algorithms in 150 lines of R code. This blog post will be about regression trees, which are the foundation of most tree-based algorithms. You can find the other blog post about coding gradient boosted machines from scratch on our blog. The algorithms will cover all core mechanics while being very generic. You can find all the code on my GitHub.

Gathering all puzzle pieces

(Figure: regression tree)

Surely, there are tons of great articles out there which explain regression trees theoretically, accompanied by a hands-on example. This is not the objective of this blog post. If you are interested in a hands-on tutorial with all the necessary theory, I strongly recommend this tutorial. The objective of this blog post is to establish the theory of the algorithm by writing simple R code. You do not need any prior knowledge of the algorithm to follow. The only thing you need to know is our objective: we want to estimate our real-valued target (y) with a set of real-valued features (X).

Most probably you are already familiar with decision trees, a machine learning algorithm for solving classification tasks. As the name states, regression trees solve regression problems, i.e. estimation with continuously scaled targets. These kinds of trees are the key part of every tree-based method, since the way you grow a tree is largely the same across them. The differing parts among the implementations are mostly about the splitting rule. In this tutorial we will program a very simple, but generic implementation of a regression tree.

Fortunately, we do not have to cover much maths in this tutorial, because the algorithm itself is rather a technical than a mathematical challenge. With that said, the technical path I have chosen might not be the most efficient one, but I tried to trade off efficiency against simplicity.

Anyway, as most of you might know, decision or regression trees are rule-based approaches. Meaning, we are trying to split the data into partitions conditional on our feature space. The data partitioning is done with the help of a splitting criterion. There is no common ground on how to do these splits; rather, there are multiple different splitting criteria with different pros and cons. We will focus on a rather simple criterion in this tutorial. Bear with me, here comes some maths.

\[ SSE = \sum_{i \in S_1} (y_i - \bar{y}_1)^2 + \sum_{i \in S_2} (y_i - \bar{y}_2)^2 \]

Ok, so what does this state? This is the sum of squared errors determined in two different subsets (S_1 and S_2). As the name suggests, this is something we want to minimize. In fact, it is the sum of the squared distances between the target values and their mean within each data subset. In every node of our regression tree we calculate the SSE for every potential split we could do in our data, for every feature we have, to figure out the best split we can achieve.

Let us have a look at the R Code:

# This is the splitting criterion we minimize (SSE [Sum Of Squared Errors]):
# SSE = sum{i in S_1} (y_i - bar(y)_1)^2 + sum{i in S_2} (y_i - bar(y)_2)^2
sse_var <- function(x, y) {
  splits <- sort(unique(x))
  sse <- c()
  for (i in seq_along(splits)) {
    sp <- splits[i]
    sse[i] <- sum((y[x < sp] - mean(y[x < sp]))^2) +
              sum((y[x >= sp] - mean(y[x >= sp]))^2) 
  }
  split_at <- splits[which.min(sse)]
  return(c(sse = min(sse), split = split_at))
}

The function takes two inputs: our numeric feature x and our real-valued target y. We then calculate the SSE for every unique value of x. This means we calculate the SSE for every possible data subset we could obtain conditional on the feature. Often we want to cover more than one feature in our problem, which means that we have to run this function for every feature. As a result, the best splitting rule has the lowest SSE among all possible splits of all features. Once we have determined the best splitting rule, we can split our data into two subsets according to this criterion, which is nothing other than feature x < split_at and x >= split_at. We call these two subsets children, and they can in turn be split into subsets again.
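
A quick, hypothetical usage example (the data is simulated only for illustration) shows how sse_var finds the split point for a single feature:

# hypothetical usage example of sse_var on simulated data
set.seed(42)
x <- runif(50)
y <- ifelse(x < 0.4, 1, 5) + rnorm(50, sd = 0.5)  # clear jump at x = 0.4

sse_var(x, y)
# returns the minimal SSE and the corresponding split point (close to 0.4)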

Let us spend some more words on the SSE though, because it reveals our estimator. In this implementation, our estimator in the leaf is simply the average value of our target within this data subset. This is the simplest version of a regression tree. However, with some additional work you can apply more sophisticated models, e.g. an ordinary least squares fit.
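
As a hint of what such an extension could look like, here is a hypothetical sketch of a leaf estimator that fits an ordinary least squares model inside a leaf instead of taking the mean; it is not part of the implementation below:

# hypothetical alternative leaf estimator (not used below): OLS fit instead of the leaf mean
leaf_ols_fit <- function(leaf_data, formula) {
  fit <- lm(formula, data = leaf_data)
  # return the in-leaf predictions of the linear model
  predict(fit, newdata = leaf_data)
}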

The Algorithm

Enough with the talking, let's get to the juice. In the following you will see the algorithm in all of its beauty. Afterwards we will break down the algorithm into easy-to-digest code chunks.

#' reg_tree
#' Fits a simple regression tree with SSE splitting criterion. The estimator function
#' is the mean.
#' 
#' @param formula an object of class formula
#' @param data a data.frame or matrix
#' @param minsize a numeric value indicating the minimum size of observations
#'                in a leaf
#'
#' @return \itemize{
#' \item tree - the tree object containing all splitting rules and observations
#' \item fit - our fitted values
#' \item formula - the underlying formula
#' \item data - the underlying data
#' }
#' @export
#'
#' @examples # Complete runthrough see: www.github.com/andrebleier/cheapml
reg_tree <- function(formula, data, minsize) {
  
  # coerce to data.frame
  data <- as.data.frame(data)
  
  # handle formula
  formula <- terms.formula(formula)
  
  # get the design matrix
  X <- model.matrix(formula, data)
  
  # extract target
  y <- data[, as.character(formula)[2]]
  
  # initialize while loop
  do_splits <- TRUE
  
  # create output data.frame with splitting rules and observations
  tree_info <- data.frame(NODE = 1, NOBS = nrow(data), FILTER = NA,
                          TERMINAL = "SPLIT",
                          stringsAsFactors = FALSE)
  
  # keep splitting until there are only leafs left
  while(do_splits) {
    
    # which parents have to be splitted
    to_calculate <- which(tree_info$TERMINAL == "SPLIT")
    
    for (j in to_calculate) {
      
      # handle root node
      if (!is.na(tree_info[j, "FILTER"])) {
        # subset data according to the filter
        this_data <- subset(data, eval(parse(text = tree_info[j, "FILTER"])))
        # get the design matrix
        X <- model.matrix(formula, this_data)
      } else {
        this_data <- data
      }
      
      # estimate splitting criteria
      splitting <- apply(X, MARGIN = 2, FUN = sse_var, y = y)
      
      # get the min SSE
      tmp_splitter <- which.min(splitting[1, ])
      
      # define maxnode
      mn <- max(tree_info$NODE)
      
      # paste filter rules
      tmp_filter <- c(paste(names(tmp_splitter), ">=", 
                            splitting[2,tmp_splitter]),
                      paste(names(tmp_splitter), "<", 
                            splitting[2,tmp_splitter]))
      
      # Error handling! check if the splitting rule has already been invoked
      split_here  <- !sapply(tmp_filter,
                             FUN = function(x,y) any(grepl(x, x = y)),
                             y = tree_info$FILTER)
      
      # append the splitting rules
      if (!is.na(tree_info[j, "FILTER"])) {
        tmp_filter  <- paste(tree_info[j, "FILTER"], 
                             tmp_filter, sep = " & ")
      } 
      
      # get the number of observations in current node
      tmp_nobs <- sapply(tmp_filter,
                         FUN = function(i, x) {
                           nrow(subset(x = x, subset = eval(parse(text = i))))
                         },
                         x = this_data)  
      
      # insufficient minsize for split
      if (any(tmp_nobs <= minsize)) {
        split_here <- rep(FALSE, 2)
      }
      
      # create children data frame
      children <- data.frame(NODE = c(mn+1, mn+2),
                             NOBS = tmp_nobs,
                             FILTER = tmp_filter,
                             TERMINAL = rep("SPLIT", 2),
                             row.names = NULL)[split_here, ]
      
      # overwrite state of current node
      tree_info[j, "TERMINAL"] <- ifelse(all(!split_here), "LEAF", "PARENT")
      
      # bind everything
      tree_info <- rbind(tree_info, children)
      
      # check if there are any open splits left
      do_splits <- !all(tree_info$TERMINAL != "SPLIT")
    } # end for
  } # end while
  
  # calculate fitted values
  leafs <- tree_info[tree_info$TERMINAL == "LEAF", ]
  fitted <- c()
  for (i in seq_len(nrow(leafs))) {
    # extract index
    ind <- as.numeric(rownames(subset(data, eval(parse(text = leafs[i, "FILTER"])))))
    # estimator is the mean y value of the leaf
    fitted[ind] <- mean(y[ind])
  }
  
  # return everything
  return(list(tree = tree_info, fit = fitted, formula = formula, data = data))
}

Uff, that was a lot of scrolling. Ok, so what do we have here? At first glance we see two loops, a while loop and a for loop, which are essential to the algorithm. But let us start bit by bit. Let's have a look at the first code chunk.

  # coerce to data.frame
  data <- as.data.frame(data)
  
  # handle formula
  formula <- terms.formula(formula)
  
  # get the design matrix
  X <- model.matrix(formula, data)
  
  # extract target
  y <- data[, as.character(formula)[2]]
  
  # initialize while loop
  do_splits <- TRUE
  
  # create output data.frame with splitting rules and observations
  tree_info <- data.frame(NODE = 1, NOBS = nrow(data), FILTER = NA,
                          TERMINAL = "SPLIT",
                          stringsAsFactors = FALSE)

Essentially, this is everything before the while loop. Here, we see some input handling in the beginning and the extraction of our design matrix X and our target y. All our features are within this design matrix. do_splits is the condition for our while loop, which we will cover in a bit. The data frame tree_info is the key storage element in our algorithm, because it contains every piece of information we need about our tree. The essential piece of this object is the FILTER column. This column saves the paths (filters) we have to apply to our data to get to a leaf (a terminal node) of our regression tree. We initiate this data.frame with NODE = 1, since at the beginning we are in the root node of our tree with the whole dataset at our disposal. Furthermore, there is a column called TERMINAL, which controls the state of the node. We have three different states: SPLIT, LEAF and PARENT. When we describe a node with the SPLIT state, we mark it for a potential split. The state PARENT indicates that we have already split this node. Lastly, the state LEAF marks terminal nodes of our regression tree.

Reaching the treetop

When do we reach such a terminal node? In many implementations there is a minimum size parameter, where we determine valid splits by the number of observations in the resulting children. If the children contain fewer data points than the minimum size, the split is invalid and will not be done. Imagine not having this parameter: we could end up with leafs covering only one observation. Another termination rule applies if the lowest SSE belongs to a split we have already invoked within the branch. A split at such a point would be nonsense, since we would end up with the same subset over and over again. Basically, this is everything there is to the algorithm. The while and for loop just ensure that we estimate and create every node in our tree_info. Thus, you can perceive tree_info as a sort of job list, since we create new jobs within this data frame. Let us walk through the actual R code.

  # keep splitting until there are only leafs left
  while(do_splits) {
    
    # which parents have to be splitted
    to_calculate <- which(tree_info$TERMINAL == "SPLIT")
    
    for (j in to_calculate) {
      
      # handle root node
      if (!is.na(tree_info[j, "FILTER"])) {
        # subset data according to the filter
        this_data <- subset(data, eval(parse(text = tree_info[j, "FILTER"])))
        # get the design matrix
        X <- model.matrix(formula, this_data)
      } else {
        this_data <- data
      }
      
      # estimate splitting criteria
      splitting <- apply(X,  MARGIN = 2, FUN = sse_var, y = y)
      
      # get the min SSE
      tmp_splitter <- which.min(splitting[1,])
      
      # define maxnode
      mn <- max(tree_info$NODE)
      
      # paste filter rules
      tmp_filter <- c(paste(names(tmp_splitter), ">=", 
                            splitting[2,tmp_splitter]),
                      paste(names(tmp_splitter), "<", 
                            splitting[2,tmp_splitter]))
      
      # Error handling! check if the splitting rule has already been invoked
      split_here  <- !sapply(tmp_filter,
                             FUN = function(x,y) any(grepl(x, x = y)),
                             y = tree_info$FILTER)
      
      # append the splitting rules
      if (!is.na(tree_info[j, "FILTER"])) {
        tmp_filter  <- paste(tree_info[j, "FILTER"], 
                             tmp_filter, sep = " & ")
      } 
      
      # get the number of observations in current node
      tmp_nobs <- sapply(tmp_filter,
                         FUN = function(i, x) {
                           nrow(subset(x = x, subset = eval(parse(text = i))))
                         },
                         x = this_data)  
      
      # insufficient minsize for split
      if (any(tmp_nobs <= minsize)) {
        split_here <- rep(FALSE, 2)
      }
      
      # create children data frame
      children <- data.frame(NODE = c(mn+1, mn+2),
                             NOBS = tmp_nobs,
                             FILTER = tmp_filter,
                             TERMINAL = rep("SPLIT", 2),
                             row.names = NULL)[split_here,]
      
      # overwrite state of current node
      tree_info[j, "TERMINAL"] <- ifelse(all(!split_here), "LEAF", "PARENT")
       
      # bind everything
      tree_info <- rbind(tree_info, children)
      
      # check if there are any open splits left
      do_splits <- !all(tree_info$TERMINAL != "SPLIT")
    } # end for
  } # end while

The while loop covers our tree depth and the for loop covers all splits within this certain depth. Within the while loop we seek every row of our tree_info which we still have to estimate, i.e. all nodes in the "SPLIT" state. In the first iteration this is the first row, our root node. The for loop iterates over all these potential splitting nodes. The first if condition ensures that we filter our data according to the tree depth. Of course, there is no filter in the root node, which is why we take the data as is. But imagine calculating possible splits for a parent in depth level two. The filter would look similar to this one: "feature_1 > 5.33 & feature_2 <= 3.22".

How do we apply splitting?

Afterwards we seek the minimum SSE by applying the sse_var function to every feature. Note that in this version we can only handle numeric features. Once we have found the best splitting variable, here named tmp_splitter, we build the according filter rule in the object tmp_filter. We still have to check whether this is a valid split, i.e. we have not invoked this split for this branch yet and we have sufficient observations in our children. Our indicator split_here rules whether we split our node. Well, that is about it. In the last passage of the loop we prepare the output for this node: handling the state of the calculated node and adding the children information to our job list tree_info. After the for loop we have to check whether our tree is fully grown. The variable do_splits checks whether there are any nodes left we have to calculate. We terminate the calculation if there are no "SPLIT" nodes left in our tree_info.

Extracting our estimates

  # calculate fitted values
  leafs <- tree_info[tree_info$TERMINAL == "LEAF", ]
  fitted <- c()
  for (i in seq_len(nrow(leafs))) {
    # extract index
    ind <- as.numeric(rownames(subset(data, eval(parse(
        					   text = leafs[i, "FILTER"])))))
    # estimator is the mean y value of the leaf
    fitted[ind] <- mean(y[ind])
  }

At the end of our calculation, we have a filter rule for every leaf in our tree. With the help of these filters, we can easily calculate the fitted values by applying each filter to our data and computing our fit, i.e. the mean of our target in this leaf. I am sure by now you can think of a way to implement more sophisticated estimators, which I will leave up to you.
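
Along the same lines, prediction for new observations could be implemented by re-applying the stored leaf filters to the new data. The helper below is a hypothetical sketch, not part of the original code, and assumes that the splitting features appear under their original names in both datasets:

# hypothetical prediction helper: apply the leaf filters to new data
predict_reg_tree <- function(model, newdata) {
  newdata <- as.data.frame(newdata)
  y       <- model$data[, as.character(model$formula)[2]]
  leafs   <- model$tree[model$tree$TERMINAL == "LEAF", ]
  pred    <- rep(NA_real_, nrow(newdata))
  for (i in seq_len(nrow(leafs))) {
    rule <- leafs[i, "FILTER"]
    # training observations in this leaf define the leaf estimate (their mean)
    train_in_leaf <- eval(parse(text = rule), envir = model$data)
    # new observations falling into the same leaf receive that estimate
    new_in_leaf   <- eval(parse(text = rule), envir = newdata)
    pred[new_in_leaf] <- mean(y[train_in_leaf])
  }
  pred
}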

Well, that’s a regression tree with a minimum size restriction. I have created a little run-through with data from my simulation package on my GitHub, which you can check out to try everything on your own. Make sure to check out my other blog post about coding gradient boosted machines from scratch.

Introduction

One of the highlights of this year’s H2O World was a Kaggle Grandmaster Panel. The attendees, Gilberto Titericz (Airbnb), Mathias Müller (H2O.ai), Dmitry Larko (H2O.ai), Marios Michailidis (H2O.ai), and Mark Landry (H2O.ai), answered various questions about Kaggle and data science in general.

One of the questions from the audience was which tools and algorithms the Grandmasters frequently use. As expected, every single one of them named the gradient boosting implementation XGBoost (Chen and Guestrin 2016). This is not surprising, since XGBoost is probably the most widely used algorithm in data science at the moment.

The popularity of XGBoost manifests itself in various blog posts, including tutorials for R and Python, hyperparameter tuning for XGBoost, and even using XGBoost with Nvidia's CUDA GPU support.

At STATWORX, we also frequently leverage XGBoost's power for external and internal projects (see our Sales Forecasting Automotive Use Case). One question that we recently asked ourselves was how big the difference between the two base learners (also called boosters) offered by XGBoost really is. This post tries to answer this question in a systematic way.

Weak Learner

Gradient boosting can be interpreted as a combination of single models (so-called base learners or weak learners) into an ensemble model (Natekin and Knoll 2013).
In theory, any base learner can be used in the boosting framework, but some base learners have proven themselves to be particularly useful: linear and penalized models (Hastie et al. 2008), (B/P-)splines (Huang and Yang 2004), and especially decision trees (James et al. 2013). Among the less frequently used base learners are random effects (Tutz and Groll 2009), radial basis functions (Gomez-Verdejo et al. 2002), Markov random fields (Dietterich et al. 2004), and wavelets (Dubossarsky et al. 2016).

Chen and Guestrin (2016) describe XGBoost as an additive function which, given the data D = \{(x_i, y_i)\}, has the following form:

    \[ \hat{y}_i = \Phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \]

In their original paper, f_k(x), \forall k = 1, \dots, K, is defined as a classification or regression tree (CART). Apart from that, the alert reader of the technical documentation knows that one can alter the functional form of f_k(x) by using the booster argument in R/Python.

# Example of XGBoost for regression in R with trees (CART)
xgboost(data = train_DMatrix,
        nrounds = 100,
        objective = "reg:linear",
        eval_metric = "rmse",
        booster = "gbtree")

One can choose between decision trees (gbtree and dart) and linear models (gblinear). Unfortunately, there is only limited literature on the comparison of different base learners for boosting (see for example Joshi et al. 2002). To our knowledge, for the special case of XGBoost no systematic comparison is available.
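
To make the choice concrete, here is a minimal sketch of fitting both boosters on the same simulated data; the parameter values are purely illustrative and not the tuned ones from the simulation described below:

library(xgboost)

set.seed(1)
n <- 500
x <- matrix(rnorm(n * 5), ncol = 5)
y <- x[, 1]^2 + sin(3 * x[, 2]) + rnorm(n, sd = 0.3)   # non-linear toy target
dtrain <- xgb.DMatrix(data = x, label = y)

params_tree   <- list(booster = "gbtree",   objective = "reg:linear", eta = 0.3, max_depth = 3)
params_linear <- list(booster = "gblinear", objective = "reg:linear", lambda = 1, alpha = 0)

fit_tree   <- xgb.train(params = params_tree,   data = dtrain, nrounds = 100)
fit_linear <- xgb.train(params = params_linear, data = dtrain, nrounds = 100)

# in-sample RMSE for a quick, rough comparison
sqrt(mean((predict(fit_tree, dtrain)   - y)^2))
sqrt(mean((predict(fit_linear, dtrain) - y)^2))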

Simulation and Setup

In order to compare linear with tree base learners, we propose the following Monte Carlo simulation:

1) Draw a random number n from a uniform distribution [100, 2500].
2) Simulate four datasets, two for classification and two for regression, each having n observations.
3) On each dataset, train a boosting model with tree and linear base learners, respectively.
4) Calculate an appropriate error metric for each model on each dataset.

Repeat the outlined procedure m = 100 times.

As for simulation, we use the functions twoClassSim(), LPH07_1(), LPH07_2(), SLC14_1() from the caret package. In addition to the relevant features, a varying number of (correlated) random features was added. Note that in order to match real life data, all data generating processes involve non-linear components. For further details, we advise the reader to take a look at the caret package documentation.
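
As a rough illustration (arguments beyond the number of observations are left at their defaults here, which does not exactly match our setup), the data can be simulated as follows; LPH07_1() and LPH07_2() are called analogously:

library(caret)

set.seed(2)
n <- sample(100:2500, 1)        # 1) draw the number of observations

class_data <- twoClassSim(n)    # one of the classification datasets
reg_data   <- SLC14_1(n)        # one of the regression datasets

dim(class_data); dim(reg_data)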

For each dataset, we apply the same (random) splitting strategy, where 70% of the data goes to training, 15% is used for validation, and the last 15% is used for testing. Regarding hyperparameter tuning, we use a grid-search strategy in combination with 10-fold cross-validation on the training data. Regardless of the base learner type, the L1 (alpha) and L2 (lambda) regularization parameters were tuned using a shared parameter space.
For tree boosting, the learning rate (eta) was held constant at 0.3 while tuning the optimal tree size (max_depth). Finally, we used a fixed number of 1000 boosting iterations (nrounds) in combination with ten early stopping rounds (early_stopping_rounds) on the validation frame. The final performance was evaluated by applying the model with the best cross-validated parameters to the test dataset.
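
A hypothetical sketch of a single tuning run for the tree booster illustrates the early stopping on the validation frame; dtrain and dval are assumed to be xgb.DMatrix objects built from the split described above, and the grid values shown are just one candidate combination:

# assume dtrain and dval are xgb.DMatrix objects from the 70%/15% split
params <- list(booster   = "gbtree",
               objective = "reg:linear",
               eta       = 0.3,          # fixed learning rate
               max_depth = 4,            # one candidate tree size from the grid
               alpha     = 0.1,          # L1 regularization candidate
               lambda    = 1)            # L2 regularization candidate

fit <- xgb.train(params    = params,
                 data      = dtrain,
                 nrounds   = 1000,                  # maximum boosting iterations
                 watchlist = list(val = dval),
                 early_stopping_rounds = 10,
                 verbose   = 0)

fit$best_iteration   # iteration selected by early stopping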

Results

Figure 1 and Figure 2 show the distributions of the out-of-sample classification performance (AUC) and the out-of-sample regression errors (RMSE) for the respective datasets. Associated summary statistics can be found in Table 1.

Table 1: Error summary statistics by datasets and base learners
Base learner   Dataset   Type             Error metric   Average error   Error std.
Linear         1         Classification   AUC            0.904           0.031
Tree           1         Classification   AUC            0.934           0.090
Linear         2         Classification   AUC            0.733           0.087
Tree           2         Classification   AUC            0.730           0.062
Linear         3         Regression       RMSE           45.182          2.915
Tree           3         Regression       RMSE           17.207          9.067
Linear         4         Regression       RMSE           17.383          1.454
Tree           4         Regression       RMSE           6.595           3.104

For the first dataset, the models using tree learners are on average better than the models with linear learners. However, the tree models exhibit a greater variance. The relationships are reversed for the second dataset. On average, the linear models are slightly better and the tree models exhibit a lower variance.

(Figure 1: Distribution of the out-of-sample classification performance (AUC) by dataset and base learner)

In contrast to the classification case, for both regression datasets there is a substantial difference in performance in favor of the tree models. For the third dataset, the tree models are on average better than their linear counterparts, although the variance of the results is substantially higher for the tree models. The results are similar for the fourth dataset: the tree models are again better on average than their linear counterparts, but feature a higher variation.

(Figure 2: Distribution of the out-of-sample regression errors (RMSE) by dataset and base learner)

Summary

The results from a Monte Carlo simulation with 100 artificial datasets indicate that XGBoost with tree and linear base learners yields comparable results for classification problems, while tree learners are superior for regression problems. Based on this result, there is no single recommendation on which model specification to use when trying to minimize model bias. In addition, tree-based XGBoost models suffer from higher estimation variance compared to their linear counterparts. This finding is probably related to the more sophisticated parameter space of tree models. The complete code can be found on GitHub.

References
