Machine Learning Impact

The conceptual foundations of Machine Learning (ML) were developed in the second half of the 20th century, but computational limitations and the scarcity of data delayed the enthusiasm around artificial intelligence (AI) until recent years. Since then, computers have become exponentially faster, and cloud services with nearly limitless resources have emerged. This progress in computational power, combined with the abundance of data, makes Machine Learning algorithms applicable in many fields today.

AI systems are beating human domain experts at complex games, such as the board game Go or video games like Dota 2. Surprisingly, the algorithms sometimes find ways to solve a task that human experts have not even considered. In this sense, humans can learn from the algorithms' behavior.

All these success stories have to be put into context. ML algorithms are well suited for specialized tasks; however, as of today, they still generalize poorly.

One recent exception is an enormous model in Natural Language Processing – the processing of human language (e.g., English) by computers. The model, called GPT-3, has performed exceptionally well across multiple tasks. Making models applicable to a broad range of tasks remains a central objective of the AI research community.

Machine Learning is a branch of Artificial Intelligence

In the early days of artificial intelligence, applications were built on rule-based programs, meaning humans encoded their knowledge into the computer. This approach is extremely rigid, since only scenarios the developer anticipated are covered and no learning takes place. With the significant increase in computing power and the accompanying growth in data, algorithms can now learn tasks without such human instruction. The terms algorithm and model are used interchangeably here.

The process of extracting knowledge from data is called Machine Learning, and it is a subarea of AI. There are various Machine Learning models, and they all use different approaches. Most of them are based on two elements: parameters and an objective function. The objective function returns a value that signals the performance of the model, and the parameters can be thought of as adjustable screws. The goal is to find the parameters that yield the best possible performance of the model on a specific dataset.
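
To make the idea of parameters and an objective function concrete, here is a minimal sketch in Python (plain NumPy, with toy data invented for illustration) that adjusts the parameters of a simple linear model to minimize the mean squared error:

import numpy as np

# toy dataset: one feature x and a target y
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

w, b = 0.0, 0.0              # the adjustable "screws" (parameters)
learning_rate = 0.01

for step in range(1000):
    prediction = w * x + b
    error = prediction - y
    objective = np.mean(error ** 2)          # objective function: mean squared error
    # nudge the parameters in the direction that lowers the objective
    w -= learning_rate * np.mean(2 * error * x)
    b -= learning_rate * np.mean(2 * error)

print(w, b, objective)       # parameters that (approximately) minimize the objective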

The format of the data determines which algorithms are applicable. Data can be structured or unstructured. Structured data is arranged in a table-like form, whereas unstructured data represents images, audio, or text.

Moreover, data can be labeled or unlabeled. In the case of labeled data, as one could guess, each data sample has a tag. For example, in figure 1, each image in the labeled dataset is tagged with a description of the animal seen in the picture. Unlabeled data is simply a dataset without any tags; as you can also see in figure 1, the unlabeled dataset carries no information about what each image represents.

Figure 1: Illustration of labeled and unlabeled data using images.

When working with unstructured data, there often isn't a natural tag that can be collected. Usually, humans have to go through all examples and tag them with predefined labels. However, models need a lot of data to learn a task – similar to humans, who take in a lot of information in their first few years of life before they succeed in walking and talking. That is what motivated Fei-Fei Li, former Director of Stanford's AI Lab, to create ImageNet, a large database of cleanly labeled images. Currently, ImageNet encompasses more than 14 million images in more than 20,000 categories – you can, for example, find several hundred images showing a banana. ImageNet became the largest database of labeled images, and it is the starting point for most state-of-the-art Computer Vision models.

We are encountering ML models in our daily lives. Some are practical, like Google Translate; others are fun, like Snapchat Filters. Our interaction with artificial intelligence will most likely increase in the next few years. Given the potential impact of ML models on our future lives, let me present to you the five branches of ML and their key concepts.

Supervised Learning

What is Supervised Learning?

In Supervised Learning tasks, the training of a model is based on data with known labels. Those models take data as input and yield a prediction as output. The predictions are then compared to the truth, which is depicted by the labels. The objective is to minimize the discrepancy between truth and prediction.

Supervised tasks can be divided into two areas: Classification and Regression. Classification problems predict which class an input belongs to – for instance, whether an image shows a dog or a cat. You can further differentiate between binary classification, where only two classes are involved, and multiclass classification, involving more than two classes. Regression problems, on the other hand, predict a real-valued number. A typical Regression problem is sales forecasting, e.g., predicting how many products will be sold next month.
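
The difference between the two problem types can be illustrated with a minimal scikit-learn sketch (made-up data, not from the article):

from sklearn.linear_model import LinearRegression, LogisticRegression

# classification: the target is a class label (e.g., 0 = cat, 1 = dog)
X_cls = [[4.0, 30.0], [5.0, 28.0], [30.0, 60.0], [28.0, 55.0]]   # e.g., weight and height
y_cls = [0, 0, 1, 1]
classifier = LogisticRegression().fit(X_cls, y_cls)
print(classifier.predict([[29.0, 58.0]]))       # -> a class

# regression: the target is a real-valued number (e.g., units sold next month)
X_reg = [[1.0], [2.0], [3.0], [4.0]]            # e.g., marketing spend
y_reg = [120.0, 150.0, 180.0, 210.0]
regressor = LinearRegression().fit(X_reg, y_reg)
print(regressor.predict([[5.0]]))               # -> a number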

Use Case

Farmers grow and supply perishable goods. One of the decisions a tomato farmer has to make after the harvest is how to bundle the product. Tomatoes that look appealing should be offered to end-users, while tomatoes with minor cosmetic flaws can be sold to intermediate producers, e.g., tomato sauce manufacturers. Inedible crops, in turn, should be filtered out and used as natural fertilizer.

This is a specialized and repetitive job that Machine Learning algorithms can automate easily. The quality of the tomatoes can be classified using Computer Vision. Each tomato is scanned by a sensor and evaluated by a model. The model assigns each tomato to a specific group.

Before using such models in production, they have to be trained on data. The input for the model would be images of tomatoes with corresponding labels (e.g., end-user, intermediate producer, fertilizer). The output of the model would be a probability for each group, measuring the model's certainty. A high probability signals that the model is confident about the classification. In cases where the model is not sure, a human expert can take a second look. This way, clear cases are prefiltered automatically, while inconclusive cases are assessed individually.
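
A minimal sketch of this human-in-the-loop idea (hypothetical class names, random stand-in features, and an arbitrary confidence threshold; a real pipeline would use image features produced by a Computer Vision model):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

classes = ["end-user", "intermediate producer", "fertilizer"]

# stand-in training data: one feature vector per tomato image with labels 0, 1, 2
X_train = np.random.rand(300, 16)
y_train = np.random.randint(0, 3, size=300)
model = RandomForestClassifier().fit(X_train, y_train)

X_new = np.random.rand(5, 16)                 # features of freshly scanned tomatoes
probabilities = model.predict_proba(X_new)

for probs in probabilities:
    if probs.max() >= 0.8:                    # confident prediction -> automatic routing
        print("assign to:", classes[probs.argmax()])
    else:                                     # inconclusive -> human expert takes a second look
        print("send to manual review")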

Deploying the model could make crop classification considerably more efficient. Moreover, the use case translates to numerous quality control and quality management problems.

Frameworks

The maturity of frameworks for Supervised Learning is high compared to other areas of Machine Learning. Most relevant programming languages have mature libraries dedicated to Supervised Learning tasks. Furthermore, cloud providers and AI platforms lower the hurdle to benefit from Supervised Learning models with user-friendly interfaces and tools.

Considerations

For Supervised Learning tasks, labeled data is necessary, and in many cases it is expensive to gather. In the use case above, for example, at least one human expert had to label each sample by hand. It gets even more costly when considerable expertise is required for labeling, such as recognizing tumors on X-ray images. Therefore, various companies have specialized in providing labeling services. Even the big cloud providers – Microsoft, Amazon, and Google – offer these services; the most prominent one is Amazon Mechanical Turk.

However, data is abundant – but often unlabeled. In the field of Unsupervised Learning, algorithms were developed to exploit exactly this unlabeled data.

Unsupervised Learning

What is Unsupervised Learning?

Unsupervised Learning algorithms focus on solving problems without depending on labeled data. Contrary to Supervised Learning, there is no ground truth available. The performance of the model is evaluated on the input data itself.

This field of Machine Learning can be divided into three subareas – dimensionality reduction, clustering, and anomaly detection – which are presented below.

Datasets can be enormous due to the number of examples and the number of features they contain. In general, the number of examples should be much higher than the number of features to ensure that models can find patterns in the data. Problems arise when the number of features outweighs the number of examples.

Dimensionality reduction algorithms tackle this problem by creating an abstraction of the original dataset while keeping as much information as possible. That means that a dataset with 100 features and just 50 examples can be compressed to a new dataset with 10 features and 50 examples. The new features are constructed in a way that they contain as much information as possible. A small amount of information from the original dataset is sacrificed to ensure that a model yields reliable predictions.
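
A minimal scikit-learn sketch mirroring the numbers above (random stand-in data with 50 examples and 100 features, compressed to 10 new features):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(50, 100)                  # 50 examples, 100 features

pca = PCA(n_components=10)                   # construct 10 new, information-dense features
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (50, 10)
print(pca.explained_variance_ratio_.sum())   # share of the original variance that is retained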

In clustering approaches, we seek to partition the observations into distinct subgroups. The goal is to obtain clusters where observations within clusters are quite similar, while observations in different clusters vary.

Each observation in the dataset can be thought of as a point in space, where the recorded features determine the location of the point. For any two observations, a distance can be calculated, which signals how similar these two points are.

Clustering algorithms initially create random clusters and iteratively adjust these to minimize the distances of points within a cluster. The resulting clusters contain observations with similar characteristics.
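
A minimal clustering sketch with scikit-learn (synthetic observations; the number of clusters is chosen by the analyst):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 synthetic observations scattered around 3 hidden group centers
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])          # cluster assignment of the first observations
print(kmeans.cluster_centers_)      # location of each cluster center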

In anomaly detection, the goal is to identify observations that seem odd given the rest of the dataset. These observations are also called outliers or exceptions. An example is fraudulent bank activity, where a payment is attempted from a different country than usual. The attempt is identified as an anomaly and triggers a verification process.

Use Case

Big retail companies often provide loyalty cards to collect value points. After each purchase, the card can be scanned to collect these points. Each month, the customer receives a report about their shopping behavior: Which products did they buy most often? What is the share of sustainable products they purchased? The customer also receives a discount or cashback voucher – the more points collected, the higher the value of the voucher. This can increase customer loyalty and satisfaction, since customers are rewarded for their behavior. The collected data about customers' shopping behavior, in turn, represents a lot of potential value for the retail company.

It is fair to assume that there is some heterogeneity among customers, since each has their own shopping behavior. Clustering could be used to identify groups with similar preferences. This is an Unsupervised Learning problem because the groups are unknown beforehand.

For example, three clusters might emerge: one containing customers with a vegetarian diet, one containing meat-eaters, and one containing customers who mix meat and vegetarian substitutes. Based on these three customer profiles, marketing campaigns can be customized for each cluster.

Figure 2: Clustering customers into three groups based on their shopping behavior.

Frameworks/Maturity

Dimensionality reduction, clustering, and anomaly detection algorithms are relatively mature approaches and available in most Machine Learning packages.

Considerations

Recent research puts a focus on unsupervised methods. The main reason is that data is abundant, but only a tiny share of it is labeled. Self-Supervised and Semi-Supervised Learning approaches arose from the goal of combining labeled and unlabeled data to learn a task.

Self-Supervised Learning

What is Self-Supervised Learning?

Self-Supervised Learning is a relatively new approach in Machine Learning, mainly applied in Computer Vision and Natural Language Processing. These fields require large datasets to learn a task, but as discussed earlier, it is expensive to obtain a large, cleanly labeled dataset.

Large amounts of unlabeled data are generated every day – tweets, Instagram posts, and many others. The idea of Self-Supervised Learning is to take advantage of all this available data. A supervised learning task is constructed from the unlabeled data itself: the key is that the labels for the supervised task are derived from the input. In other words, the data is transformed so that a piece of the input serves as the label, and the model's objective is to predict that missing piece. This way, the data supervises itself.

There are different ways to transform input data. A simple one is rotation. As depicted in figure 3, a rotation is applied to each input image. The rotated image is the input for the model, and the applied rotation is the target to predict. To predict different rotations of the same image, the model must learn to recognize high-level image semantics – such as sky, terrain, heads, and eyes, and the relative position of these parts.

Figure 3: Illustration of Self-Supervised Learning using rotation of the input image. The model learns to predict the applied rotation.

Ultimately, the performance of the model at predicting the rotation of the image is irrelevant; it is just a means to an end. While solving the rotation prediction task, the model implicitly learns a generally useful understanding of images. In the context of language models, an initial objective could be to predict the next word in a text; through this task, the model already learns the structure of a language. This part of the model can then be extracted and reused for a more specialized task using Supervised Learning. The benefit is that the model has already learned a broad understanding and does not need to start from scratch.
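
A minimal sketch of the rotation pretext task in PyTorch (assuming a recent torchvision version and images already loaded as tensors; the classifier itself is omitted):

import torch
from torchvision.transforms import functional as TF

def rotation_pretext_batch(images):
    # images: unlabeled tensor of shape (N, C, H, W)
    # returns: rotated images plus the index of the applied rotation as label
    angles = [0, 90, 180, 270]
    inputs, labels = [], []
    for img in images:
        for label, angle in enumerate(angles):
            inputs.append(TF.rotate(img, angle))
            labels.append(label)
    return torch.stack(inputs), torch.tensor(labels)

# any image classifier with four output classes can be trained on these pairs;
# afterwards its feature extractor is reused for the actual downstream task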

Use Case

Speech recognition models translate human speech into text. Speech recognition is used in many services, most prominently in home-assistant systems like Alexa.

In a Self-Supervised setting, unlabeled data (for example, audiobooks) can be used for pretraining. Through an auxiliary task, the model learns a broad understanding of speech and language. That part of the model is extracted and combined with Supervised Learning techniques. Hence, Self-Supervised Learning can be used to improve the performance of Supervised Learning approaches.

Frameworks

Few frameworks provide direct access to Self-Supervised Learning models. PyTorch, for example, provides pre-trained models that can be used for Computer Vision tasks. Apart from that, the models can be reproduced from scratch – but that is quite a challenging task and can be costly.

Considerations

Recent papers show promising results, but they often focus on one area. For instance, Self-Supervised Learning yields provable improvements for image classification, whereas translating this progress to object detection has proven to be tricky and far from straightforward. Nevertheless, the approaches and improvements are exciting.

Semi-Supervised Learning

What is Semi-Supervised Learning?

Supervised and Unsupervised Learning are combined in Semi-Supervised Learning.

There are different approaches to Semi-Supervised Learning. Usually, it is divided into multiple steps.

First, a model is trained in a Self-Supervised manner. For instance, the model learns a good understanding of a language using unlabeled text data and an auxiliary task (a simple example would be predicting the next word in a text). The parts of the model responsible only for solving the auxiliary task can then be cut off; the remaining model contains a high-level understanding of the language.

The truncated model is then trained on the labeled data in a Supervised Learning fashion. Since the model already learned some understanding of a language, it does not need a massive amount of labeled data to learn another specialized task connected to language understanding. Semi-Supervised Learning is primarily applied in the fields of Computer Vision and Natural Language Processing.
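
A minimal sketch of the second, supervised step using the Hugging Face transformers library (the checkpoint name, the two-class setup, and the two example sentences are illustrative assumptions):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 1) load a model that was already pretrained on large unlabeled text corpora
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# 2) fine-tune on the (much smaller) labeled dataset
inputs = tokenizer(["great product", "poor quality"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
outputs = model(**inputs, labels=labels)
outputs.loss.backward()          # this loss is minimized over the labeled examples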

Use Case

People share their thoughts and opinions on social media daily. These comments in text form can be valuable for companies: tracking customer satisfaction over time generates a deeper understanding of the customer experience. Sentiment analysis, a subfield of Natural Language Processing, can be used to classify customer views based on tweets, Facebook comments, or blog posts. Before a model can classify sentiment from text, it is pretrained on text documents from Wikipedia, books, and many other resources; usually, the model's task is to predict the next word in a sentence. The objective is to learn the structure of a language first, before specializing in a particular task. After the pretraining, the model parameters are fine-tuned on a labeled dataset. In this scenario, the model is trained on a dataset of tweets, each tagged with whether the sentiment is negative, neutral, or positive.

Frameworks

Semi-Supervised Learning covers all approaches that combine Unsupervised and Supervised Learning. Hence, practitioners can use well-established frameworks from both areas.

Considerations

Semi-Supervised Learning combines unlabeled and labeled data to increase the predictive performance of the model. A prerequisite, however, is that the amount of unlabeled data is substantially larger than the amount of labeled data; otherwise, plain Supervised Learning could be used.

Reinforcement Learning

What is Reinforcement Learning?

In the Reinforcement Learning setting, an agent interacts with an environment. The agent observes the state of the environment and, based on its observations, chooses an action. Depending on the chosen action, it receives a reward: if the action was good, the reward is high, and vice versa. The goal of the agent is to maximize its future cumulative reward – in other words, to find the best action for each state.

In many cases, the interaction of the agent with the environment is simulated. That way, the agent can experience millions of states and learn the correct behavior.
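
A minimal sketch of this interaction loop with OpenAI's gym library (random actions instead of a learned policy; the exact return values of env.step and env.reset depend on the gym version):

import gym

env = gym.make("CartPole-v1")        # a simple simulated environment
observation = env.reset()

for _ in range(100):
    action = env.action_space.sample()                     # a learned policy would choose here
    observation, reward, done, info = env.step(action)     # the reward signals how good the action was
    if done:                                               # episode over -> start again
        observation = env.reset()

env.close()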

Figure 4: Illustration of the learning process in Reinforcement Learning.

Use Case

Personalized advertisement can improve the efficiency of a new marketing campaign. Instead of displaying ads with a discount for a product, say a car, internet users could be led down a sales funnel before being presented with the final deal. In the first step, the ad could describe the availability of favorable financing terms, then, on the next visit, praise an excellent service department, and in the end, present the final discount. This might lead to more clicks from a user over repeated visits and, if well implemented, to more sales.

In the case described above, an agent would choose which ad to show depending on the user's profile. The profile is based on the user's interests and preferences derived from online activity. The agent adjusts its actions – displaying different ads – based on the user's feedback. In this setting, the agent does not get a signal for every action it takes, since it takes a while until the user decides to purchase a car; showing a specific ad might catch the user's interest but won't immediately translate into a sale. This is called the credit assignment problem. The agent can, however, wait and observe the final outcome: if the episode results in a sale, the actions taken are reinforced for the same input; otherwise, the agent is discouraged from taking the same actions for that input.

Frameworks

Within the field of Reinforcement Learning, various approaches exist; hence, there is no single standardized framework available. However, gym and baselines – both developed by OpenAI, a nonprofit AI research company – have established themselves as important starting points for Reinforcement Learning applications.

Considerations

The agent is learning based on many interactions with the environment. Conducting these interactions in real-time would take a long time and is often infeasible – e.g., a car autopilot cannot learn in the real world because it is not safe. Therefore, simulations of the real world are created, where the agent can learn through trial and error. To train an agent successfully, the gap between simulation and reality has to be as small as possible.

Apart from the simulation, the design of the reward is crucial to the agent's performance. In maximizing its reward, the agent can learn unintended behavior that leads to higher rewards but does not solve the task.

The learning procedure of the agent is sample-inefficient, meaning it needs a lot of interactions to learn a task.

Reinforcement Learning shows promising results in many areas. However, few real-world applications are based on it. This might change in a few years.

Conclusion

The Future of Artificial Intelligence

Artificial Intelligence has come a long way – from rule-based scripts to Machine Learning algorithms that are beating human experts. The field has grown substantially in recent years. ML models are used to tackle different domain problems, and their impact on our lives is increasing steadily.

Nonetheless, there are still important questions that remain unanswered. To create more reliable models, we have to address these questions.

A crucial point is the measurement of intelligence. We do not have a general definition of intelligence yet. The current models that we describe as artificially intelligent are learning by memorizing the data we provide. That already yields astonishing results. Introducing a metric of intelligence and optimizing it will lead to far more powerful models in the future. It is exciting to experience the progress in AI.

Resources

  1. “Google AI defeats human Go champion”
  2. “OpenAI Five Defeats Dota 2 World Champions”
  3. https://www.alexirpan.com/2018/02/14/rl-hard.html
  4. Reinforcement Learning: An Introduction, Sutton, R. and Barto, A., 2018

Fran Peric

Management Summary

In modern companies, information in text form can be found in many places in day-to-day business. Depending on the business context, this can involve invoices, emails, customer input (such as reviews or inquiries), product descriptions, explanations, FAQs, and applications. Until recently, these information sources were reserved mainly for human beings, as the understanding of a text is a technologically challenging problem for machines.
Due to recent achievements in deep learning, several different NLP (“Natural Language Processing”) tasks can now be solved with outstanding quality.
In this article, you will learn through five practical examples how NLP applications solve various business problems and how they have increased efficiency and driven innovation in their fields of application.

Introduction

Natural Language Processing (NLP) is undoubtedly an area that has received special attention in the Big Data environment in the recent past. Interest in the topic, as measured by Google searches, has more than doubled in the last three years. This shows that innovative NLP technologies have long since ceased to be an issue only for big players such as Apple, Google, or Amazon; instead, a general democratization of the technology can be observed. One of the reasons for this is that, according to an IBM estimate, about 80% of "global information" is not available in structured databases but as unstructured, natural-language text. NLP will play a key role in making this information usable; thus, the successful use of NLP technologies will become one of the success factors for digitization in companies.

To give you an idea of the possibilities NLP opens up in the business context today, I will present five practical use cases and explain the solutions behind them in the following.

What is NLP? – A Short Overview

Although NLP had already occupied linguists and computer scientists as a research topic in the 1950s, it led a barely visible existence on the application side throughout the 20th century.

The main reason for this was the limited availability of the necessary training data. Although the amount of unstructured data in the form of texts increased exponentially, especially with the rise of the Internet, there was still a lack of data suitable for model training. This can be explained by the fact that early NLP models mostly had to be trained with supervision (so-called supervised learning). Supervised learning, however, requires that the training data be provided with a dedicated target variable. This means that, for example, in the case of text classification, the text corpus must be manually annotated by humans before model training.

This changed at the end of the 2010s, when a new generation of artificial neural networks led to a paradigm shift. These so-called "Language Models" are (pre-)trained by Facebook, Google, and others on huge text corpora by randomly masking individual words in the texts and predicting them in the course of training. This is so-called self-supervised learning, which no longer requires a separate target variable. In the course of this training, the models learn a contextual understanding of texts.

The advantage of this approach is that the same model can be adapted to various downstream tasks (e.g., text classification, sentiment analysis, named entity recognition) with the help of the learned contextual understanding. This process is called transfer learning. In practice, these pre-trained models can be downloaded, so that only the fine-tuning for the specific application must be done with additional data. Consequently, high-performance NLP applications can now be developed with little development effort.

To learn more about Language Models (especially the so-called Transformer Models like “BERT”, resp. “roBERTa”, etc.) as well as trends and obstacles in the field of NLP, please read the article on NLP trends by our colleague Dominique Lade. [https://www.statworx.com/de/blog/neue-trends-im-natural-language-processing-wie-nlp-massentauglich-wird/].

The 5 Use Cases

Text Classification in the Recruitment Process

A medical research institute wants to make its recruitment process of study participants more efficient.

For testing a new drug, different, interdependent requirements are placed on the persons in question (e.g., age, general health status, presence/absence of previous illnesses, medications, genetic dispositions, etc.). Checking all these requirements is very time-consuming: usually, it takes about one hour per potential study participant to review and assess the relevant information. The main reason for this is that the clinical notes contain patient information that goes beyond structured data such as laboratory values and medication: unstructured information in text form can also be found in the medical reports, physicians' letters, and discharge reports. Evaluating the latter data in particular requires a lot of reading time and is therefore very time-consuming. To speed up the process, the research institute is developing a machine learning model that pre-selects promising candidates. The experts then only have to validate the proposed group of people.

The NLP Solution

From a methodological point of view, this problem is a so-called text classification: based on a text, a prediction is made for a previously defined target variable. To train the model, it is necessary – as usual in supervised learning – to annotate the data, in this case the medical documents, with the target variable. Since a classification problem has to be solved here (suitable or unsuitable study participant), the experts manually assess the suitability for the study for some persons in the pool. If a person is suitable, they are marked with a one (= positive case), otherwise with a zero (= negative case). Based on these training examples, the model can then learn the relationship between the persons' medical documents and their suitability.

To cope with the complexity of the problem, a correspondingly complex model called ClinicalBERT is used. This is a language model based on BERT (Bidirectional Encoder Representations from Transformers) that was additionally trained on a data set of clinical texts. ClinicalBERT can thus generate so-called representations of the entire medical documentation for each person. In the last step, the neural network of ClinicalBERT is completed by a task-specific component – in this case, binary classification: for each person, a suitability probability should be output. Through a corresponding linear layer, the high-dimensional text representation is finally transformed into a single number, the suitability probability. Using a gradient procedure, the model then learns the suitability probabilities based on the training examples.
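
A minimal sketch of this setup with the Hugging Face transformers library (emilyalsentzer/Bio_ClinicalBERT is assumed here as a publicly available ClinicalBERT checkpoint; the classification head still has to be fine-tuned on the annotated documents before the probabilities are meaningful):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# shortened dummy text standing in for one person's concatenated medical documentation
text = "Patient, 54 years old, type 2 diabetes, no prior cardiac events ..."
inputs = tokenizer(text, truncation=True, return_tensors="pt")

logits = model(**inputs).logits
suitability = torch.softmax(logits, dim=-1)[0, 1].item()
print(suitability)           # probability of being a suitable study participant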

Further Application Scenarios of Text Classification:

Text classification often takes place in the form of sentiment analysis. This involves classifying texts into predefined sentiment categories (e.g., negative/positive). This information is particularly important in the financial world or for social media monitoring. Text classification can also be used in various contexts where it is vital to sort documents according to their type (e.g., invoices, letters, reminders).

Named Entity Recognition for Usability Improvement of a News Page

A publishing house offers its readers a large number of articles on various topics on a news page. As part of optimization measures, the publisher wants to implement a better recommender system so that for each article, further suitable (complementary or similar) articles are suggested. In addition, the search function on the landing page is to be improved so that customers can quickly find the articles they are looking for.

To create a good data basis for these purposes, the publisher decided to use Named Entity Recognition (NER) to assign automated tags to the texts, improving both the recommender system and the search function. After successful implementation, significantly more suggested articles are clicked on, and the search function has become much more convenient. As a result, the readers spend substantially more time on the page.

The NLP Solution

To solve the problem, one must first understand how NER works:

NER is about assigning words or entire phrases to content categories. For example, “Peter” can be identified as a person, “Frankfurt am Main” is a place, and “24.12.2020” is a time specification. There are also much more complicated cases. For this purpose, compare the following pairs of sentences:

  1. In the past, Adam didn’t know how to parallel park. (park = from the verb “to park”)
  2. Yesterday I took my dog for a walk in the park. (park = open green area)

It is perfectly evident to humans that the word “park” has a different meaning in each of the two sentences. However, this seemingly simple distinction is anything but trivial for the computer. An entity recognition model could characterize the two sentences as follows:

  1. “[In the past] (time reference), [Adam] (person) didn’t know how to parallel [park] (verb).”
  2. “[Yesterday] (time reference), [I] (person) took my dog for a walk in the [park] (location).”

In the past, rule-based algorithms would have been used to solve the above NER problem, but here too, the machine learning approach is gaining ground:

The present multiclass classification problem of entity determination is again addressed using the BERT model. For this, the model is trained on an annotated data set in which the entities are manually labeled. The most comprehensive publicly accessible database of this kind in the English language is the Groningen Meaning Bank (GMB). After successful training, the model can correctly determine previously unknown words from the sentence context. For example, the model recognizes that prepositions like "in, at, after…" are often followed by a location, but more complex contexts are also used to determine the entity.
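
A minimal sketch with the transformers NER pipeline (the library's default English NER model is downloaded automatically; for the news page, a model fine-tuned on domain-specific entities would be used):

from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")

sentence = "Yesterday Peter took his dog for a walk in a park in Frankfurt am Main."
for entity in ner(sentence):
    print(entity["word"], "->", entity["entity_group"], round(float(entity["score"]), 2))

# typical output: "Peter -> PER", "Frankfurt am Main -> LOC"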

Further Application Scenarios of NER:

NER is a classic information retrieval task and is central to many other NLP applications, such as chatbots and question-answering systems. NER is also often used for text cataloging, where the type of a text is determined based on the recognized entities.

A Chatbot for a Long-Distance Bus Company

A long-distance bus company would like to increase its accessibility and expand its communication channels with the customer. In addition to its homepage and app, the company wants to offer customers a third channel, namely a WhatsApp chatbot. The goal is to perform specific actions in the conversation with the chatbot, such as searching for, booking, and canceling trips. In addition, the chatbot is intended to provide a reliable way of informing passengers about delays.

With the introduction of the chatbot, not only can existing passengers be reached more quickly, but contact can also be established with new customers who have not yet installed the app.

The NLP solution

Depending on the requirements that are placed on the chatbot, you can choose between different chatbot architectures.

Over the years, four main chatbot paradigms have been established: In the first generation, the inquiry was checked against well-known patterns, and correspondingly adapted, prefabricated answers were returned ("pattern matching"). More sophisticated is so-called "grounding", in which information extracted from knowledge libraries (e.g., Wikipedia) is organized in a network by Named Entity Recognition (see above). Such a network has the advantage that not only registered knowledge can be retrieved, but unregistered knowledge can also be inferred from the network structure. In "searching", question-answer pairs from the conversation history (or from previously recorded logs) are used directly to find a suitable answer. The most sophisticated approach is to use machine learning models that dynamically generate suitable answers ("generative models").

The best way to implement a modern chatbot with clearly defined competencies for the company is to use an existing framework such as Google Dialogflow. This is a platform for configuring chatbots that combines elements of all the previously mentioned chatbot paradigms. For this purpose, parameters such as intents, entities, and actions are defined.

An intent (“user intention”) is, for example, a timetable inquiry. By providing different example phrases (“How do I get from … to … from … to …”, “When is the next bus from … to …”) to a language model, the chatbot can assign even unseen input to the correct intent (see text classification).

Furthermore, different travel locations and times are defined as entities. If the chatbot now captures an intent with matching entities (see NER), an action – in this case a database query – can be triggered. Finally, an intent-specific answer with the relevant information is returned, adapted to all information in the chat history specified by the user ("stateful").

Further Application Scenarios of Chatbots:

There are many possible applications in customer service, depending on the complexity of the scenario – from the automatic preparation (e.g., sorting) of customer requests to the complete handling of a customer inquiry.

A Question-Answering System as a Voice Assistant for Technical Questions About the Automobile

An automobile manufacturer discovers that many of its customers do not get along well with the manuals that come with the cars. Often, finding the relevant information takes too long, or it is not found at all. Therefore, it was decided to offer a Voice Assistant to provide precise answers to technical questions in addition to the static manual. In the future, drivers will be able to speak comfortably with their center console when they want to service their vehicle or request technical information.

The NLP solution

Question-answering systems have been around for decades, as they sit at the forefront of artificial intelligence research. A question-answering system that would always find a correct answer, taking into account all available data, could also be called "General AI". A significant difficulty on the way to General AI is that the area the system needs to know about is unlimited. In contrast, question-answering systems provide good results when the area is delimited, as is the case with the automotive assistant; in general, the more specific the area, the better the results that can be expected.

For the implementation of the question-answering system, two data types from the manual are used: structured data, such as technical specifications of the components and key figures of the model, and unstructured data, such as instructions for action. In a preparatory step, all data is transformed into question-answer form using other NLP techniques (classification, NER). This data is then fed into a version of BERT that has already been pre-trained on a large question-answer data set ("SQuAD"). The model is thus able to answer questions that have already been fed into the system and to provide educated guesses for unseen questions.
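
A minimal sketch with the transformers question-answering pipeline (default model, which is a checkpoint fine-tuned on SQuAD; the context is an invented snippet standing in for the prepared manual data):

from transformers import pipeline

qa = pipeline("question-answering")

context = ("The tire pressure for the standard wheels should be 2.5 bar at the front "
           "and 2.7 bar at the rear. Check the pressure at least once a month.")

answer = qa(question="What is the correct tire pressure at the front?", context=context)
print(answer["answer"], answer["score"])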

Further Application Scenarios of Question-Answer Systems:

With the help of question-answer systems, company-internal search engines can be extended by functionalities. In e-commerce, answers to factual questions can be given automatically based on article descriptions and reviews.

Automatic Text Summaries (Text Generation) of Damage Descriptions for a Property Insurance

An insurance company wants to increase the efficiency of its claim settlement department. It has been noticed that some claims complaints from the customer lead to internal conflicts of responsibility. The reason for this is simple: customers usually describe the claims over several pages, and an increased training period is needed to be able to judge whether or not to process the case. Thus, it often happens that a damage description must be read thoroughly to understand that the damage itself does not need to be processed. Now, a system that generates automated summaries is to remedy this situation. As a result of the implementation, the claim handlers can now make responsibility decisions much faster.

The NLP solution

One can differentiate between two approaches to text summarization: In extraction, the most important sentences are identified in the input text and, in the simplest case, used directly as the summary. In abstraction, a model transforms the text into a newly generated summary text. The second approach is much more complex, since paraphrasing, generalization, or the inclusion of further knowledge is possible here; it therefore has a higher potential to generate meaningful summaries but is also more error-prone. Modern text summarization algorithms use the second approach or a combination of both methods.

A so-called sequence-to-sequence model is used to solve the insurance use case; it maps one word sequence (the damage description) to another word sequence (the summary). This is usually a recurrent neural network (RNN) trained on pairs of texts and their summaries. The training process models the probability of the next word given the preceding words (and, additionally, an "inner state" of the model). In this way, the model effectively writes the summary "from left to right" by successively predicting the next word. An alternative approach is to have the input encoded numerically by the Language Model BERT and to have a GPT decoder summarize the text autoregressively based on this numerical representation. In both cases, model parameters can be used to adjust how long the summary should be.
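
A minimal sketch with the transformers summarization pipeline (default English model; an insurer would instead fine-tune a comparable sequence-to-sequence model on its own damage descriptions and summaries):

from transformers import pipeline

summarizer = pipeline("summarization")

damage_description = (
    "On Friday evening a water pipe burst in the bathroom of our apartment. "
    "The water spread into the hallway and damaged the wooden floor as well as "
    "two cupboards. We immediately closed the main valve and informed the landlord."
)

summary = summarizer(damage_description, max_length=40, min_length=10)
print(summary[0]["summary_text"])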

Further Application Scenarios of Text Generation:

Such a scenario is conceivable in many places: Automated report writing, text generation based on retail sales data analysis, electronic medical record summaries, or textual weather forecasts from weather data are possible applications. Text generation is also used in other NLP use cases such as chatbots and Q&A systems.

Outlook

These five application examples – text classification, chatbots, question-answering systems, NER, and text summarization – show that there are many processes in all kinds of companies that can be optimized with NLP solutions.

NLP is not only an exciting field of research but also a technology whose applicability in the business environment is continually growing.

In the future, NLP will not only be a foundation of a data-driven corporate culture; it already holds considerable innovation potential through direct application – potential that is worth investing in.

At STATWORX, we already have years of experience in the development of customized NLP solutions. Here are two of our case studies on NLP: Social Media Recruiting with NLP & Supplier Recommendation Tool. We are happy to provide you with individual advice on this and many other topics.

Management Summary

OCR (Optical Character Recognition) is a major challenge for many companies. The OCR market is populated by various open-source as well as commercial providers. A well-known open-source tool for OCR is Tesseract, which is now provided by Google. Tesseract is currently available in version 4, which performs the OCR extraction using recurrent neural networks. However, Tesseract's OCR performance is still volatile and depends on various factors. A particular challenge is applying Tesseract to documents composed of different structures, e.g., texts, tables, and images. Invoices are one such document type and still pose particular challenges for OCR tools of all vendors. This article demonstrates how fine-tuning the Tesseract OCR (Optical Character Recognition) engine on a small sample of data can already lead to a considerable improvement in OCR performance on invoice documents. The process shown is not limited to invoices but applies to arbitrary document types. A use case is defined that aims at correctly extracting the entire text (words and numbers) from a fictitious but realistic German invoice document. It is assumed that the extracted information is intended for downstream accounting purposes; therefore, correct extraction of the numbers and of the euro sign is considered critical. The OCR performance of two Tesseract models for the German language is compared: the standard (non-fine-tuned) model and a fine-tuned variant. The standard model is obtained from the Tesseract OCR GitHub repository. The fine-tuned model is developed with the steps described in this article. A second German invoice, similar to the first one, is used for fine-tuning. Both the standard model and the fine-tuned model are evaluated on the same out-of-sample invoice to ensure a fair comparison. The OCR performance of the standard Tesseract model is comparatively poor for numbers. This applies in particular to numbers resembling the digits 1 and 7. The euro symbol is recognized incorrectly in 50% of the cases, making the result unsuitable for any downstream accounting application. The fine-tuned model shows a similar OCR performance for German words; the OCR performance for numbers, however, improves considerably: all numbers and every euro symbol are extracted correctly. It turns out that fine-tuning with minimal effort and a small amount of training data can achieve a large improvement in recognition performance. This makes Tesseract OCR, with its open-source licensing, an attractive solution compared to proprietary OCR software. Finally, recommendations are given for fine-tuning Tesseract LSTM models in case more training data is available.

Downloading the Tesseract Docker Container

The entire fine-tuning process of Tesseract's LSTM model is discussed in detail below. Since the installation and use of Tesseract can get complicated, we have prepared a Docker container that already includes all necessary installations.

Introduction

Tesseract 4 with its LSTM engine already works quite well out-of-the-box for simple texts. However, there are scenarios for which the standard model performs poorly. Examples include exotic fonts, images with backgrounds, or text in tables. Fortunately, Tesseract offers a way to fine-tune the LSTM engine in order to improve the OCR performance for more specific use cases.

Why OCR on Invoices Is a Challenge

Even though OCR is considered a solved problem in some areas, the error-free extraction of a large text corpus remains a challenge. This is especially true for OCR on documents with a high structural variance, such as invoices. These documents often consist of very different elements that pose challenges for Tesseract's OCR engine:

  1. Colored backgrounds and table structures are a challenge for page segmentation.
  2. Invoices typically contain rare characters such as the EUR or USD sign.
  3. Numbers cannot be checked against a language dictionary.

In addition, the margin for error is small: an exact extraction of the numerical data is often of utmost importance for downstream process steps. Problem (1) can usually be solved by selecting one of the 14 page segmentation modes provided by Tesseract. The latter two problems can often be solved by fine-tuning the LSTM engine on examples of similar documents.

Use Case Objective and Data

Two similar example invoices are examined in this article. The invoice shown in figure 1 is used to evaluate the OCR performance of both the standard and the fine-tuned Tesseract model. Particular attention is paid to the correct extraction of numbers. The second invoice, shown in figure 2, is used to fine-tune the LSTM model. Most invoice documents are written in a very legible font such as "Arial". To illustrate the benefits of fine-tuning, the initial OCR problem is made harder by considering invoices written in the font "Impact". "Impact" is a font that differs considerably from normal sans-serif fonts and leads to a higher error rate for Tesseract. It is shown below that, after fine-tuning on a very small amount of data, Tesseract delivers very satisfactory results despite this difficult font.
Figure 1: Invoice 1, used to evaluate the OCR performance of both models
Figure 2: Invoice 2, used to fine-tune the LSTM engine

Using the Tesseract 4.0 Docker Container

The setup for fine-tuning the Tesseract LSTM engine currently only works on Linux and can be a bit tricky. Therefore, a Docker container with a pre-installed Tesseract 4.0 and the compiled training tools and scripts is provided along with this article. Load the Docker image from the provided archive file or pull the container image via the provided link:
docker load -i docker/tesseract_image.tar
Once the image is loaded, start the container in "detached" mode:
docker run -d --rm --name tesseract_container tesseract:latest
Access the shell of the running container to replicate the following commands in this article:
docker exec -it tesseract_container /bin/bash

General Improvements of the OCR Performance

There are three ways to improve Tesseract's OCR performance even before fine-tuning the LSTM engine.

1. Image Preprocessing

Scanned documents can have a skewed orientation if they were not placed correctly on the scanner. Rotated images should be deskewed to optimize Tesseract's line segmentation performance. In addition, scanning can introduce image noise, which should be removed by a denoising algorithm. Note that, by default, Tesseract performs thresholding using Otsu's algorithm to binarize grayscale images into black and white pixels. A detailed treatment of image preprocessing is beyond the scope of this article and is not necessary to achieve satisfactory results for the given use case. The Tesseract documentation provides a practical overview.
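
A minimal preprocessing sketch in Python with OpenCV (deskewing is omitted; shown are denoising and Otsu binarization, which Tesseract would otherwise apply internally; the file name is taken from the later training example):

import cv2

image = cv2.imread("train_invoice.tiff", cv2.IMREAD_GRAYSCALE)

# remove scanning noise
denoised = cv2.fastNlMeansDenoising(image, h=10)

# binarize into black and white pixels using Otsu's method
_, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("train_invoice_preprocessed.tiff", binary)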

2. Page Segmentation

During page segmentation, Tesseract tries to identify rectangular text regions. Only these regions are selected for OCR in the next step. It is therefore important to capture all regions containing text so that no information is lost. Tesseract allows choosing from 14 different page segmentation methods, which can be displayed with the following command:
tesseract --help-psm
The default segmentation method expects an image similar to a book page. Due to the additional tabular structures in invoice documents, however, this mode cannot identify all text regions correctly. A better segmentation method is given by option 4: "Assume a single column of text of variable sizes". To illustrate the importance of a suitable page segmentation method, consider the result of using the default method "Fully automatic page segmentation, but no OSD" in figure 3:
Figure 3: The default segmentation method fails to detect all text regions
Note that the texts "Rechnungsinformationen:", "Pos." and "Produkt" were not segmented. In figure 4, a more suitable method leads to a perfect segmentation of the page.

3. Using Dictionaries, Word Lists, and Patterns for the Text

The LSTM models used by Tesseract were trained on large amounts of text in a specific language. The following command shows the languages currently available for Tesseract:
tesseract --list-langs 
Further language models are available by downloading the corresponding language.tessdata and placing it in the tessdata folder of the local Tesseract installation. The Tesseract repository on GitHub provides three variants of language models: normal, fast, and best. Only the fast and the best variants can be used for fine-tuning. As the names suggest, these are the fastest and the most accurate model variants, respectively. Further models have also been trained for special use cases, such as recognizing digits and punctuation only, and are listed in the references. Since the language of the invoices in this use case is German, the Docker image accompanying this article ships with the deu.tessdata model. For a given language, Tesseract's word list can be further extended or restricted to certain words or even characters. This topic is outside the scope of this article, as it is not necessary to achieve satisfactory results for the present use case.

Setting Up the Fine-Tuning Process

Three file types have to be created for fine-tuning:

1. TIFF Files

Tagged Image File Format, or TIFF, is an uncompressed image file format (in contrast to JPG or PNG, which are compressed file formats). TIFF files can be obtained from PNG or JPG formats with a conversion tool. Although Tesseract can work with PNG and JPG images, the TIFF format is recommended.

2. Box Files

To train the LSTM model, Tesseract uses so-called box files with the extension ".box". A box file contains the recognized text together with the coordinates of the bounding box in which the text is located. Box files contain six columns, corresponding to symbol, left, bottom, right, top, and page:
P 157 2566 1465 2609 0
r 157 2566 1465 2609 0
o 157 2566 1465 2609 0
d 157 2566 1465 2609 0
u 157 2566 1465 2609 0
k 157 2566 1465 2609 0
t 157 2566 1465 2609 0
  157 2566 1465 2609 0
P 157 2566 1465 2609 0
r 157 2566 1465 2609 0
e 157 2566 1465 2609 0
i 157 2566 1465 2609 0
s 157 2566 1465 2609 0
  157 2566 1465 2609 0
( 157 2566 1465 2609 0
N 157 2566 1465 2609 0
e 157 2566 1465 2609 0
t 157 2566 1465 2609 0
t 157 2566 1465 2609 0
o 157 2566 1465 2609 0
) 157 2566 1465 2609 0
  157 2566 1465 2609 0
Each character is on a separate line in the box file. The LSTM model accepts either the coordinates of individual characters or of an entire text line. In the example box file above, the text "Produkt Preis (Netto)" is visually located on the same line in the document. All characters have the same coordinates, namely the coordinates of the bounding box around this text line. Using line-level coordinates is much easier and is provided by default when the box file is generated with the following command:
cd /home/fine_tune/train
tesseract train_invoice.tiff train_invoice --psm 4 -l best/deu lstmbox
The first argument is the image file to be extracted, the second argument is the file name of the box file. The language parameter -l instructs Tesseract to use the German model for OCR. The parameter --psm instructs Tesseract to use the fourth page segmentation method. It is almost inevitable that the generated OCR box files contain errors in the symbol column. Every symbol in the training box file must therefore be checked by hand. This is a tedious process, as the box file of the demo invoice contains almost a thousand lines (one for each character in the invoice). To simplify the correction, the Docker container provides a Python script that draws the bounding boxes together with the OCR text onto the original invoice image, making it easier to compare the box file with the document. The result is shown in figure 4. The Docker container already contains the corrected box files, marked by the suffix "_correct".
Figure 4: Extracted text when applying the Tesseract model "deu"

3. lstmf Files

During fine-tuning, Tesseract extracts the text from the TIFF file and checks the prediction against the coordinates and the symbol in the box file. Tesseract does not use the TIFF and box file directly but expects a so-called lstmf file created from the two previous files. Note that, to create the lstmf file, the TIFF and box file must have the same name, e.g., train_invoice.tiff and train_invoice.box. The following command creates an lstmf file for the training invoice:
cd /home/fine_tune/train
tesseract train_invoice.tiff train_invoice lstm.train 
All lstmf files relevant for training must be listed by their relative path in a text file called deu.training_files.txt. In this use case, only one lstmf file is used for training, so deu.training_files.txt contains only one line, namely: eval/train_invoice_correct.lstmf. It is recommended to also create an lstmf file for the evaluation invoice. This way, the performance of the model can be assessed during the training process:
cd /home/fine_tune/eval
tesseract eval_invoice_correct.tiff eval_invoice_correct lstm.train

Evaluation of the Standard LSTM Model

OCR predictions from the German standard model "deu" are used as a benchmark. A precise overview of the OCR performance of the German standard model is obtained by generating a box file for the evaluation invoice and visualizing the OCR text with the Python script mentioned above. This script, which produces the file "eval_invoice_ocr deu.tiff", is located in the provided container at "/home/fine_tune/src/draw_box_file_data.py". The script expects as arguments the path to a TIFF file, the corresponding box file, and a name for the output TIFF file. The OCR text extracted by the German standard model is saved as eval/eval_invoice_ocr_deu.tiff and is shown in figure 1. At first glance, the text extracted by OCR looks good: the model extracts German characters such as ä, ö, ü, and ß correctly. In fact, there are only three cases in which words contain errors:
OCR | Truth
Jessel GmbH 8 Co | Jessel GmbH & Co
11 Glasbehälter | 1l Glasbehälter
Zeki64@hloch.com | Zeki64@bloch.com
The model already performs well on common German words, but it struggles with isolated symbols such as "&" and "l" as well as with words like "bloch" that are not contained in its word list. Prices and numbers pose a much bigger challenge for the model; extraction errors occur far more frequently here:
OCR | Truth
159,16 | 159,1€
1% | 7%
1305.816 | 1305.81€
227.66 | 227.6€
341.51 | 347.57€
1115.16 | 1115.7€
242.86 | 242.8€
1456.86 | 1456.8€
51.46 | 54.1€
1954.719€ | 1954.79€
The German standard model fails to extract the euro symbol € correctly in 9 out of 18 cases, which corresponds to an error rate of 50%.
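To quantify such symbol-level errors without counting by hand, one could compare the OCR output with the ground truth, for example as in the rough sketch below; the example strings are taken from the first rows of the table above, while the real evaluation covers the whole invoice.
# Minimal sketch: count how often a given symbol (here "€") is missing from the
# OCR output compared to the ground truth. Example strings are taken from the
# first rows of the table above.
ocr_lines = ["159,16", "1%", "1305.816", "227.66", "341.51"]
truth_lines = ["159,1€", "7%", "1305.81€", "227.6€", "347.57€"]

symbol = "€"
expected = sum(line.count(symbol) for line in truth_lines)
found = sum(min(o.count(symbol), t.count(symbol)) for o, t in zip(ocr_lines, truth_lines))

print(f"{symbol} missed in {expected - found} of {expected} cases")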

Finetuning the standard LSTM model

The standard LSTM model is now fine-tuned on the invoice shown in figure 2. The OCR performance is then evaluated on the evaluation invoice shown in figure 1, which was also used to benchmark the German standard model. To fine-tune the LSTM model, it must first be extracted from the file deu.traineddata. The following command extracts the LSTM model from the German standard model into the lstm_model directory:
cd /home/fine_tune
combine_tessdata -e tesseract/tessdata/best/deu.traineddata lstm_model/deu.lstm
Next, all files required for fine-tuning are assembled. They are also included in the Docker container:
  1. The training files train_invoice_correct.lstmf and deu.training_files.txt in the train directory.
  2. The evaluation files eval_invoice_correct.lstmf and deu.training_files.txt in the eval directory.
  3. The extracted LSTM model deu.lstm in the lstm_model directory.
The Docker container provides the script src/fine_tune.sh, which starts the fine-tuning process. Its content is:
/usr/bin/lstmtraining \
 --model_output output/fine_tuned \
 --continue_from lstm_model/deu.lstm \
 --traineddata tesseract/tessdata/best/deu.traineddata \
 --train_listfile train/deu.training_files.txt \
 --eval_listfile eval/deu.training_files.txt \
 --max_iterations 400
This command fine-tunes the extracted model deu.lstm on the file train_invoice.lstmf listed in train/deu.training_files.txt. Fine-tuning the LSTM model requires language-specific information, which is provided by the deu.traineddata file. The file eval_invoice.lstmf listed in eval/deu.training_files.txt is used to measure the OCR performance during training. Fine-tuning stops after 400 iterations; the whole training takes less than two minutes. The following command runs the script and logs the output to a file:
cd /home/fine_tune
sh src/fine_tune.sh > output/fine_tune.log 2>&1
The content of the log file after training is shown below:
src/fine_tune.log
Loaded file lstm_model/deu.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from lstm_model/deu.lstm
Loaded 20/20 lines (1-20) of document train/train_invoice_correct.lstmf
Loaded 24/24 lines (1-24) of document eval/eval_invoice_correct.lstmf
2 Percent improvement time=69, best error was 100 @ 0
At iteration 69/100/100, Mean rms=1.249%, delta=2.886%, char train=8.17%, word train=22.249%, skip ratio=0%, New best char error = 8.17 Transitioned to stage 1 wrote best model:output/deu_fine_tuned8.17_69.checkpoint wrote checkpoint.
-----
2 Percent improvement time=62, best error was 8.17 @ 69
At iteration 131/200/200, Mean rms=1.008%, delta=2.033%, char train=5.887%, word train=20.832%, skip ratio=0%, New best char error = 5.887 wrote best model:output/deu_fine_tuned5.887_131.checkpoint wrote checkpoint.
-----
2 Percent improvement time=112, best error was 8.17 @ 69
At iteration 181/300/300, Mean rms=0.88%, delta=1.599%, char train=4.647%, word train=17.388%, skip ratio=0%, New best char error = 4.647 wrote best model:output/deu_fine_tuned4.647_181.checkpoint wrote checkpoint.
-----
2 Percent improvement time=159, best error was 8.17 @ 69
At iteration 228/400/400, Mean rms=0.822%, delta=1.416%, char train=4.144%, word train=16.126%, skip ratio=0%, New best char error = 4.144 wrote best model:output/deu_fine_tuned4.144_228.checkpoint wrote checkpoint.
-----
Finished! Error rate = 4.144
During training, Tesseract saves a so-called model checkpoint after each iteration. The performance of the model at this checkpoint is tested on the evaluation data and compared with the current best result. If the result improves, i.e. the OCR error decreases, a labeled copy of the checkpoint is saved. The first number in the checkpoint's file name denotes the character error, the second number the training iteration. The final step is to reassemble the fine-tuned LSTM model so that one again obtains a "traineddata" model. Assuming the checkpoint from iteration 181 is selected, the following command converts the checkpoint "deu_fine_tuned4.647_181.checkpoint" into a fully functional Tesseract model "deu_fine_tuned.traineddata":
cd /home/fine_tune
/usr/bin/lstmtraining \
 --stop_training \
 --continue_from output/deu_fine_tuned4.647_181.checkpoint \
 --traineddata tesseract/tessdata/best/deu.traineddata \
 --model_output output/deu_fine_tuned.traineddata
To make it available to Tesseract, this model has to be copied into the tessdata directory of the local Tesseract installation. This has already been done in the Docker container. Verify that the fine-tuned model is available in Tesseract:
tesseract --list-langs

Evaluation of the fine-tuned LSTM model

The fine-tuned model is evaluated in the same way as the standard model: a box file of the evaluation invoice is created, and the OCR text is displayed on the image of the evaluation invoice using the Python script. The command for generating the box files has to be modified so that the fine-tuned model "deu_fine_tuned" is used instead of the standard model "deu":
cd /home/fine_tune/eval
tesseract eval_invoice.tiff eval_invoice --psm 4 -l deu_fine_tuned lstmbox
The OCR text extracted by the fine-tuned model is shown in figure 5 below.
Figure 5: OCR results of the fine-tuned LSTM model
As with the German standard model, the performance on German words remains good, but not perfect. To improve the performance on rare words, the model's word list could be extended with additional words.
OCR | Truth
Jessel GmbH 8 Co | Jessel GmbH & Co
1! Glasbehälte | 1l Glasbehälter
Zeki64@hloch.com | Zeki64@bloch.com
More importantly, the OCR performance on numbers has improved considerably: the fine-tuned model extracts all numbers and every occurrence of the € sign correctly.
OCR | Truth
159,1€ | 159,1€
7% | 7%
1305.81€ | 1305.81€
227.6€ | 227.6€
347.57€ | 347.57€
1115.7€ | 1115.7€
242.8€ | 242.8€
1456.8€ | 1456.8€
54.1€ | 54.1€
1954.79€ | 1954.79€

Conclusion and Outlook

This article has shown that the OCR performance of Tesseract can be improved considerably by fine-tuning. Especially for non-standard use cases, such as extracting text from invoice documents, fine-tuning can raise the OCR performance significantly. Besides its open-source licensing, the ability to fine-tune Tesseract's LSTM engine for specific use cases makes the framework an attractive tool, even for more demanding OCR scenarios. To further improve the results, it may make sense to fine-tune the model for more iterations. In this use case, the number of iterations was deliberately kept low because only a single document was used for fine-tuning. More iterations potentially increase the risk of overfitting the LSTM model on certain symbols, which in turn raises the error rate for other symbols. In practice, it is desirable to increase the number of iterations, provided that sufficient training data is available. The final OCR performance should always be verified on a separate, representative set of documents.

References

  • Tesseract training: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html
  • Image processing overview: https://tesseract-ocr.github.io/tessdoc/ImproveQuality#image-processing
  • Otsu thresholding: https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html
  • Tesseract digits comma model: https://github.com/Shreeshrii/tessdata_shreetest

As a data scientist, it is always tempting to focus on the newest technology, the latest release of your favorite deep learning network, or a fancy statistical test you recently heard of. While all of this is very important, and we here at STATWORX are proud to use the latest open-source machine learning tools, it is often more important to take a step back and have a closer look at the problem we want to solve. In this article, I want to show you the importance of framing your business question in a different way – the data science way. Once the problem is clearly defined, we are more than happy to apply the newest fancy algorithm. But let’s start from the beginning!

The Initial Problem

Management View

Let’s assume for a moment that you are a data scientist here at STATWORX. Monday morning, at 10 o’clock, the telephone rings, and a manager of an international bank is on the phone. After a bit of back and forth, the bank manager explains that they have a problem with defaulting loans and need a program that predicts which loans are going to default in the future. Unfortunately, he has to end the call now, but he’ll catch up with you later. In the meantime, you start to make sense of the problem.

Data Scientist View

While it’s clear to the bank manager that he has provided you with all the necessary information, you grab another cup of coffee, lean back in your chair, and recap the problem:
  • The bank lends money to customers today
  • The customer promises the bank to pay back the loan bit by bit over the next couple of months/years
  • Unfortunately, some of the customers are not able to do so and are going to default on the loan
So far, everything is fine. The bank will give you data from the past, and you are supposed to make a prediction. Fair enough, but what specifically was there to predict again? Do they need to know whether every single loan is going to default or not? Are they more concerned about the default trend across the whole bank?

Data Science Explanation

From a data science perspective, we differentiate between two sorts of problems: classification and regression tasks. The way we prepare the data and the models we apply are inherently different between the two. Classification problems, as the name suggests, assign data points to a specific category. For bank loans, one approach could be to construct two categories:
  • The loan defaulted
  • The loan is still performing
On the other hand, the output of a Regression problem is a continuous variable. In this case, this could be:
  • The percentage of loans which are going to default in a given month
  • The total amount of money the bank will lose in a given month
From here on, it’s paramount to evaluate with the clients which problem they actually want to solve. While it’s a lot of fun to play around with the best tech stack, it is of the highest importance never to forget about the business needs of the client. I’ll present two possible scenarios, one for the classification and one for the regression case.
Regression and Classification Task

Scenario Classification Problem

Management View

For the next day, you set up a phone conference with the manager and decision-makers of the bank to discuss the overall direction of the project. The management board of the bank decided that it is more important to focus on the default prediction of single loans instead of the overall default trend. Now you know that you have to solve a classification problem. Further, you ask the board what exactly they expect from the model. Manager A: I want the best-performing model possible! Manager B: As long as it predicts reality as accurately as possible, I’m happy 🙂 Manager C: As long as it catches every defaulted loan for sure… Manager A: … but of course, it should not predict too many loans wrong!

Data Scientist View

You try to match every requirement from the bank. Understandably, the bank wants a perfect model that makes little to no mistakes. Unfortunately, there is always some error, and you are still unsure which type of error is worse for the bank. To properly continue your work, it is important to define with the client which problem exactly to solve and, therefore, which error to minimize. Some options could be:
  • Catch every loan that will default
  • Make sure the model does not classify a performing loan as a defaulted loan
  • Some kind of weighted average between both of them
Have a look at the right chart above to see what this could look like.

Data Science Explanation

To generate predictions, you have to train a model on the given data. To tell the model how well it performed and to punish it for mistakes, it is necessary to define an error metric. The choice of the error metric always depends on the business case. From a technical point of view, it is possible to model nearly every business case; however, there are four metrics that are used in most classification problems.

    \[\text{Accuracy} = \frac{\#\,\text{correctly classified loans}}{\#\,\text{loans}}\]

This metric measures, as the name suggests, how accurately the model can predict the loan status. While this is the most basic metric one can think of, it is also a dangerous one. Let’s say the bank tells us that roughly 5% of the loans on the balance sheet default. If, for some reason, our model never predicts defaults, in other words, it classifies every loan as a non-defaulting loan, the accuracy is immediately 95/100 = 95%. For datasets where the classes are highly imbalanced, it is usually a good idea to discard accuracy.
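A tiny illustration of this pitfall (hypothetical labels, assuming scikit-learn is installed):
# Minimal sketch: accuracy is misleading on imbalanced data.
# 5 of 100 hypothetical loans default (label 1); the "model" never predicts a default.
from sklearn.metrics import accuracy_score

y_true = [1] * 5 + [0] * 95   # 5% of the loans default
y_pred = [0] * 100            # the model classifies every loan as non-defaulting

print(accuracy_score(y_true, y_pred))  # 0.95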

    \[\text{Recall} = \frac{\#\,\text{correctly classified defaulted loans}}{\#\,\text{loans that actually defaulted}}\]

Optimizing the machine learning algorithm for recall ensures that the algorithm catches as many defaulted loans as possible. On the flip side, an algorithm that correctly predicts all defaulted loans as defaults often achieves this by flagging too many loans as defaulted: many loans that are not going to default are also flagged as defaults.

    \[\text{Precision} = \frac{\#\,\text{correctly classified defaulted loans}}{\#\,\text{loans predicted as defaulted}}\]

High precision ensures that the loans the algorithm flags as defaults are very likely classified correctly. This comes at the expense of the overall number of loans that are flagged as defaults. Therefore, it might not be possible to flag every loan that is going to default, but the loans that are flagged as defaults are most likely really going to default.

    \[F_{\beta}\text{ score} = (1+\beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\]

Empirically speaking, an increase in recall is almost always associated with a decrease in precision, and vice versa. Often, it is desirable to balance precision and recall somehow. This can be done with the F-beta score.
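The three metrics above could be computed with scikit-learn roughly as follows; the labels are hypothetical, 1 marks a defaulted loan, and a beta larger than 1 puts more weight on recall than on precision:
# Minimal sketch: recall, precision and F-beta for hypothetical predictions.
# Label 1 = defaulted loan, label 0 = performing loan.
from sklearn.metrics import recall_score, precision_score, fbeta_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

print(recall_score(y_true, y_pred))           # 2 of 3 defaults caught -> 0.67
print(precision_score(y_true, y_pred))        # 2 of 4 flagged loans correct -> 0.50
print(fbeta_score(y_true, y_pred, beta=2.0))  # beta > 1 puts more weight on recall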

Scenario Regression Problem

Management View

During the phone conference (the same one as in the classification scenario), the decision-makers of the bank announced that they want to predict the overall default trend. While that’s already important information, you evaluate with the client what exactly their business need is. In the end, you’ll have a list of requirements: Manager A: It’s important to match the overall trend as closely as possible. Manager B: During normal times, I won’t pay too much attention to the model. However, it is absolutely necessary that the model performs well in extreme market situations. Manager C: To make it as easy and convenient to use as possible, and to be able to explain it to the regulating agency, it has to be as explainable as possible.

Data Science View

Similar to the last scenario, there is again a tradeoff. It is a business problem to define which error is worse. Is every deviation from the ground truth equally bad? Is a certain stability of the prediction error important? Does the client care about the volatility of the forecast? Does a baseline exist? Have a look at the left chart above to see what this could look like.

Data Science Explanation

Once again, there are several metrics one can choose from. The best metric always depends on the business need. Here are the most common ones:

    \[\text{Mean Absolute Error} = \frac{1}{n} \sum \left|\text{actual output} - \text{predicted output}\right|\]

The Mean Absolute Error (MAE) measures, as the name suggests, how far the predictions are off in absolute terms. While the number is easy to interpret, it treats every deviation in the same way. Over a 100-day interval, being off by 1 unit every single day yields the same MAE as predicting every day perfectly except for one day that is off by 100 units.

    \[\text{Mean Squared Error} = \frac{1}{n} \sum (\text{actual output} - \text{predicted output})^2\]

The Mean Squared Error (MSE) also calculates the difference between the actual and the predicted output. This time, the deviations are weighted: a few extreme errors are penalized more heavily than many small ones.

    \[R^2 = 1 - \frac{MSE(\text{model})}{MSE(\text{baseline})}\]

The R^2 compares the model under evaluation against a simple baseline model. The advantage is that the output is easy to interpret: a value of 1 describes a perfect model, while a value close to 0 (or even negative) describes a model with room for improvement. This metric is commonly used among economists and econometricians and is therefore, in some industries, a metric to consider. However, it is also relatively easy to achieve a high R^2, which makes models harder to compare.

    \[\text{Mean Absolute Percentage Error} = \frac{1}{n} \sum \left|\frac{\text{actual output} - \text{predicted output}}{\text{actual output}}\right| \cdot 100\]

The Mean Absolute Percentage Error (MAPE) also measures the absolute deviation of the predictions. In contrast to the MAE, the MAPE expresses the deviation in relative terms, which makes it very easy to interpret and to compare. The MAPE has its own set of drawbacks and caveats. Fortunately, my colleague Jan has already written an article about it; check it out here if you want to learn more.
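Assuming scikit-learn and NumPy are available, the four metrics could be computed like this; the numbers are made up for illustration:
# Minimal sketch: MAE, MSE, R^2 and MAPE for a hypothetical forecast.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([4.8, 5.1, 5.6, 7.9, 5.0])     # e.g. monthly default rates in percent
predicted = np.array([5.0, 5.0, 5.2, 6.5, 5.1])

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
r2 = r2_score(actual, predicted)
mape = np.mean(np.abs((actual - predicted) / actual)) * 100

print(mae, mse, r2, mape)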

Conclusion

In both cases, classification and regression, the “right” answer to the problem depends on how the problem is actually defined. Before applying the latest machine learning algorithm, it is crucial that the business question is well defined. Strong collaboration with the client team is necessary and is the key to achieving the best result for the client. There is no one-size-fits-all data science solution. Even though the underlying problem is the same for every stakeholder in the bank, it might be worth training several models for the different departments. It all boils down to the business needs! We still haven’t covered several other problems that might arise in subsequent steps: How is the default of a loan defined? What is the prediction horizon? Do we have enough data to cover all business cycles? Is the model used purely internally, or do we have to explain it to a regulating agency? Should we optimize the model for some kind of internal resource constraints? To discuss this and more, feel free to reach out to me at dominique.lade@statworx.com or send me a message via LinkedIn.

Recently, some colleagues and I attended the 2-day COVID-19 hackathon #wirvsvirus, organized by the German government. There, we developed a great application for simulating COVID-19 curves based on estimates of the effectiveness of governmental measures (FlatCurver). As there are many COVID-related dashboards and visualizations out there, I thought that gathering the underlying data from a single point of truth would be a minor issue. However, I soon realized that there are plenty of different data sources, mostly relying on the Johns Hopkins University COVID-19 case data. At first, I thought that’s great, but at a second glance, I revised my initial thought. The JHU datasets have some quirks that make them a bit cumbersome to prepare and analyze:
  • weird column names including special characters
  • countries and states “in the mix”
  • wide format, quite unhandy for data analysis
  • import problems due to line break issues
  • etc.
For all of you who have been or are working with COVID-19 time series data and want to step up your data-pipeline game, let me tell you: we have an API for that! The API uses official data from the European Centre for Disease Prevention and Control and delivers a clear and concise data structure for further processing, analysis, etc.

Overview of our COVID-19 API

Our brand-new COVID-19 API brings you the latest case-number time series right into your application or analysis, regardless of your development environment. For example, you can easily import the data into Python using the requests package:
import requests
import json
import pandas as pd

# POST to API
payload = {'country': 'Germany'} # or {'code': 'DE'}
URL = 'https://api.statworx.com/covid'
response = requests.post(url=URL, data=json.dumps(payload))

# Convert to data frame
df = pd.DataFrame.from_dict(json.loads(response.text))
Or, if you’re an R aficionado, use httr and jsonlite to grab the latest data and turn it into a cool plot:
library(httr)
library(dplyr)
library(jsonlite)
library(ggplot2)

# Post to API
payload <- list(code = "ALL")
response <- httr::POST(url = "https://api.statworx.com/covid",
                       body = toJSON(payload, auto_unbox = TRUE), encode = "json")

# Convert to data frame
content <- rawToChar(response$content)
df <- data.frame(fromJSON(content))

# Make a cool plot
df %>%
  mutate(date = as.Date(date)) %>%
  filter(cases_cum > 100) %>%
  filter(code %in% c("US", "DE", "IT", "FR", "ES")) %>%
  group_by(code) %>%
  mutate(time = 1:n()) %>%
  ggplot(., aes(x = time, y = cases_cum, color = code)) +
  xlab("Days since 100 cases") + ylab("Cumulative cases") +
  geom_line() + theme_minimal()
covid-race

Developing the API using Flask

Developing a simple web app using Python is straightforward using Flask. Flask is a web framework for Python. It allows you to create websites, web applications, etc. right from Python. Flask is widely used to develop web services and APIs. A simple Flask app looks something like this.
from flask import Flask
app = Flask(__name__)

@app.route('/')
def handle_request():
  """ This code gets executed """
  return 'Your first Flask app!'
In the example above, the app.route decorator defines at which URL our function should be triggered. You can specify multiple decorators to trigger different functions for each URL. You might want to check out our code in the Github repository to see how we built the API using Flask.
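To sketch how an endpoint could accept a JSON payload such as {"country": "Germany"}, a POST route might look roughly like the following; the route name, the filtering logic and the dummy record are assumptions, not the actual implementation from the repository:
# Minimal sketch of a POST endpoint that accepts a JSON payload such as
# {"country": "Germany"}. The data loading/filtering is only hinted at;
# this is not the actual implementation from the repository.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/covid', methods=['POST'])
def covid():
    payload = request.get_json(force=True)
    country = payload.get('country', 'ALL')
    # In the real service, the ECDC case data would be loaded and filtered here.
    records = [{'country': country, 'date': '2020-03-31', 'cases_cum': 0}]
    return jsonify(records)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=80)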

Deployment using Google Cloud Run

Developing the API using Flask is straightforward. However, building the infrastructure and auxiliary services around it can be challenging, depending on your specific needs. A couple of things you have to consider when deploying an API:
  • Authentication
  • Security
  • Scalability
  • Latency
  • Logging
  • Connectivity
We’ve decided to use Google Cloud Run, a container-based serverless computing framework on Google Cloud. Basically, GCR is a fully managed Kubernetes service that allows you to deploy scalable web services or other serverless functions based on your container. This is what our Dockerfile looks like:
# Use the official image as a parent image
FROM python:3.7

# Copy the file from your host to your current location
COPY ./main.py /app/main.py
COPY ./requirements.txt /app/requirements.txt

# Set the working directory
WORKDIR /app

# Run the command inside your image filesystem
RUN pip install -r requirements.txt

# Inform Docker that the container is listening on the specified port at runtime.
EXPOSE 80

# Run the specified command within the container.
CMD ["python", "main.py"]
You can develop your container locally and then push it to the container registry of your GCP project. To do so, you have to tag your local image using docker tag according to the following scheme: [HOSTNAME]/[PROJECT-ID]/[IMAGE]. The hostname is one of the following: gcr.io, us.gcr.io, eu.gcr.io, asia.gcr.io. Afterward, you can push the image using docker push, followed by your image tag. From there, you can easily connect the container to the Google Cloud Run service:
google cloud run
When deploying the service, you can define parameters for scaling, etc. However, this is not in the scope of this post. Furthermore, GCR allows mapping custom domains to your services. That’s why we have the neat API endpoint https://api.statworx.com/covid.

Conclusion

Building and deploying a web service is easier than ever. We hope that you find our new API useful for your projects and analyses regarding COVID-19. If you have any questions or remarks, feel free to contact us or to open an issue on Github. Lastly, if you make use of our free API, please add a link to our website, https://statworx-1727.demosrv.dev, to your project. Thanks in advance and stay healthy!

In our blog, we have talked a lot about calculating elasticities. But how do we use them in practice? A powerful way to utilize elasticities is to combine them with sales forecasts. In this blog post, we will use my STATWORX colleague Daniel’s lunch place as an example to illustrate the power of combining forecasting and elasticities. As Daniel has already mentioned, this is a small chain where you can put together your own salad. Using this example, Daniel explained the concept of price elasticity and used simulated prices to show different ways of calculating it. Similarly, the hypothetical optimal price was identified based on estimated price elasticities. If you want to know more about the details of price elasticities, you should read that post (see /blog/food-for-regression-using-sales-data-to-identify-price-elasticity/).

So, let’s go back to the lunch place. But this time, it’s not about determining a general optimal price, but about calculating the price needed to achieve specific sales targets. Every sales manager is interested in how to increase sales and, of course, how they might evolve in the future. For this, it is necessary to know what the sales depend on and what levers you have to increase them. Of course, you can rely on experience and your gut feeling. But nowadays, there are ways to support decision-making for specific actions significantly. So what is the situation in the lunch place? Suppose you want to grow the number of sales as much as possible. For that, you certainly will not just buy as many ingredients as fit into your warehouse in the expectation that everything will be consumed. Therefore, one must first be able to estimate how much will be sold in the near future. Sales forecasting can be used for this. But that alone does not necessarily bring significant added value. If we assume that sales targets have been determined, which of course should not be unrealistic, we have a clear reference point we can use. If the estimated sales are lower than planned, we can use the price elasticities to determine the price necessary to reach the target. Of course, there may also be a case in which the estimated sales are higher than planned. If you want to increase the profit margin, or because the stock situation does not allow significantly more sales, it is possible to make appropriate price adjustments.

Simulation of data

To keep it simple, we will use a single product, for which we take prices between 4.90 € and 5.90 €. At higher prices, there are fewer sales than at lower prices. In this way, it is possible to estimate price elasticities. The simulated sales at the respective prices then look like this:
fig-01-sales-by-price
Likewise, an upward trend was built into the data. At the same time, the number of sales in the summer months is higher than in the other months. Thus, we can now implement a simple form of Sales Forecasting.
fig-02-sales-by-week-final

Calculating price elasticities and prediction of sales

The price elasticities are estimated with a linear regression in which we assume a logarithmic relationship. A logarithmic relationship goes hand in hand with the assumption that demand grows exponentially as the price decreases and that demand cannot fall below zero:

    \[\log(\text{Sales}) = \alpha + \epsilon \cdot \log(\text{Price}) + d_{summer}\]

In this equation, alpha is the intercept, epsilon is the price elasticity, and d_{summer} is the dummy for the summer months. In practice, such a simple model will not lead to valid results, but it should suffice for illustration:
Coefficient | Estimate | Std. Error | CI 95% (lower) | CI 95% (upper)
Intercept | 8.809 | 0.467 | 7.891 | 9.726
log(Price) | -0.939 | 0.277 | -1.483 | -0.384
d_{summer} | 0.180 | 0.010 | 0.160 | 0.199
The confidence interval for the price effect is relatively wide, whereas the confidence interval for the dummy is narrow. However, this is primarily due to how the data was simulated. Now that we’ve estimated the price elasticity, all we need is an estimate of future sales. Future sales were forecast with a Holt-Winters model, which only takes the seasonality into account.
fig-03-sales-forecast
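A rough sketch of how such a log-log regression could be estimated in Python with statsmodels is shown below; the data is simulated on the fly, and the column names and the coefficients used for the simulation are assumptions:
# Minimal sketch: estimate the price elasticity with a log-log regression and a
# summer dummy. The data is simulated here; column names and the coefficients
# used for the simulation are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
price = rng.uniform(4.90, 5.90, n)
d_summer = rng.integers(0, 2, n)
sales = np.exp(8.8 - 0.94 * np.log(price) + 0.18 * d_summer + rng.normal(0, 0.05, n))

df = pd.DataFrame({"price": price, "sales": sales, "d_summer": d_summer})
model = smf.ols("np.log(sales) ~ np.log(price) + d_summer", data=df).fit()
print(model.params)  # the coefficient on np.log(price) is the estimated elasticity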

Adjusting prices according to sales targets

Now we have all the results needed to perform the last step, in which we calculate the prices required to compensate for possible differences. As a starting point, we take the Cobb-Douglas function; in addition, the effect of price and elasticity on quantity is visualized below:

    \[Q(p) = A \cdot p^{\epsilon}\]

fig-04-cobb-douglas-plot
If we define the sales target as Q_{0}, we can transform the Cobb-Douglas function as follows:

    \[Q_{0} - Q(p) = Q_{0} - A \cdot p^{\epsilon} = 0 \;\Rightarrow\; A \cdot p^{\epsilon} = Q_{0} \;\Rightarrow\; p^{\epsilon} = \frac{Q_{0}}{A} \;\Rightarrow\; p^{*} = \Big(\frac{Q_{0}}{A}\Big)^{\frac{1}{\epsilon}}\]

With this formula, we can now use our results to calculate the price for a specific target. Q_{0} represents the sales target, A represents the predicted sales, and epsilon is, of course, the estimated price elasticity of -0.939. The formula gives us a factor by which we would have to adjust the price to reach the sales target. In addition, the confidence intervals of the price elasticities can be used to calculate a range of price adjustments. In our simple example, we will now use the predicted next month with different targets. The price used for the calculation was taken from the last month, which was 5.40 €.
Date | Prediction (A) | Price | Target (Q_0) | Adj. Price (p^*) | Adj. Price (lower CI) | Adj. Price (upper CI)
2020-01-01 | 1510 | 5.40 € | 2000 | 4.00 € | 4.47 € | 2.65 €
2020-01-01 | 1510 | 5.40 € | 1800 | 4.48 € | 4.80 € | 3.46 €
2020-01-01 | 1510 | 5.40 € | 1300 | 6.33 € | 5.97 € | 7.90 €
2020-01-01 | 1510 | 5.40 € | 1000 | 8.38 € | 7.13 € | 15.37 €
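A minimal sketch of this calculation, roughly reproducing the first row of the table (the elasticity and confidence bounds are taken from the regression table above; small deviations from the reported prices stem from rounding of the coefficients):
# Minimal sketch: price required to hit a sales target, based on Q(p) = A * p^epsilon.
# Values roughly reproduce the first row of the table above; small deviations from
# the reported prices stem from rounding of the coefficients.
prediction_A = 1510                    # forecast sales at the current price
target_Q0 = 2000                       # sales target
current_price = 5.40                   # price of the last month in EUR
epsilon = -0.939                       # estimated price elasticity
eps_lower, eps_upper = -1.483, -0.384  # 95% confidence bounds of the elasticity

def adjusted_price(eps):
    # (Q0 / A)^(1/eps) is the factor by which the current price has to change
    return current_price * (target_Q0 / prediction_A) ** (1 / eps)

print(round(adjusted_price(epsilon), 2))    # ~4.00 EUR
print(round(adjusted_price(eps_lower), 2))  # ~4.47 EUR
print(round(adjusted_price(eps_upper), 2))  # ~2.60 EUR (table: 2.65 EUR)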
The results show a wide range of optimal price adjustments, with ranges partly well over 1 €. Moreover, the uncertainty range of the predicted sales is not even factored into these results. Of course, this is due to the data and models used here. Apart from that, the results also show the potential of such an approach in the decision-making process. On the one hand, prices can be adjusted to achieve specific sales targets. On the other hand, margins can be optimized if the targets are achieved as predicted.

Conclusion

Nevertheless, the example given here shows how price elasticity and sales forecasting can be combined to recommend specific price adjustments. Particularly for sales managers, such a methodology is interesting and can add value to the decision-making process, especially if the modeling and the associated price adjustments are made significantly more accurate. In practice, price adjustments for different products can be estimated in this way. Moreover, information about competitors, or assumptions about them, can be included in the models in addition to many other parameters. Similarly, seasonal price elasticities can be estimated, which can lead to even more meaningful price adjustments.

It is June and nearly half of the year is over, marking the middle between Christmas 2018 and 2019. Last year in autumn, I published a blog post about predicting Wham’s “Last Christmas” search volume using Google Trends data with different types of neural network architectures. Of course, now I want to know how good the predictions were compared to the actual search volumes. The following table shows the values predicted by the different network architectures, the true search volume data in the relevant time region from November 2018 until January 2019, as well as the relative prediction error in brackets:
month | MLP | CNN | LSTM | actual
2018-11 | 0.166 (0.21) | 0.194 (0.078) | 0.215 (0.023) | 0.21
2018-12 | 0.858 (0.057) | 0.882 (0.031) | 0.817 (0.102) | 0.91
2019-01 | 0.035 (0.153) | 0.034 (0.149) | 0.035 (0.153) | 0.03
There’s no clear winner in this game. For November, the LSTM model performs best with a relative error of only 2.3%. However, in the “main” month December, the LSTM drops in accuracy in favor of the 1-dimensional CNN with 3.1% error and the MLP with 5.7% error. Compared to November and December, January exhibits higher prediction errors of more than 10% regardless of the architecture. To bring a little more data science flavor into this post, I’ve created a short R script that presents the results in a cool “heatmap” style.
library(dplyr)
library(ggplot2)

# Define data frame for plotting
df_plot <- data.frame(MONTH=rep(c("2018-11", "2018-12", "2019-01"), 3),
                      MODEL = c(rep("MLP", 3), rep("CNN", 3), rep("LSTM", 3)),
                      PREDICTION = c(0.166, 0.858, 0.035, 0.194, 0.882, 0.034, 0.215, 0.817, 0.035),
                      ACTUAL = rep(c(0.21, 0.91, 0.03), 3))

# Do plot
df_plot %>%
  mutate(MAPE = round(abs(ACTUAL - PREDICTION) / ACTUAL, 3)) %>%
  ggplot(data = ., aes(x = MONTH, y = MODEL)) +
  geom_tile(aes(fill = MAPE)) +
  scale_fill_gradientn(colors = c('navyblue', 'darkmagenta', 'darkorange1')) +
  geom_text(aes(label = MAPE), color = "white") +
  theme_minimal()
prediction heat map
This year, I will (of course) redo the experiment using the newly acquired data. I am curious to find out if the prediction improves. In the meantime, you can sign up to our mailing list, bringing you the best data science, machine learning and AI reads and treats directly into your mailbox!