Management Summary

In modern companies, information in text form can be found in many places in day-to-day business. Depending on the business context, this can involve invoices, emails, customer input (such as reviews or inquiries), product descriptions, explanations, FAQs, and applications. Until recently, these information sources were reserved mainly for human beings, as understanding text is a technologically challenging problem for machines.
Due to recent achievements in deep learning, several different NLP (“Natural Language Processing”) tasks can now be solved with outstanding quality.
In this article, you will learn from five practical examples how NLP applications solve various business problems, delivering gains in efficiency and enabling innovation in their field of application.

Introduction

Natural Language Processing (NLP) is undoubtedly an area that has received special attention in the Big Data environment in the recent past. The interest in the topic, as measured by Google, has more than doubled in the last three years. This shows that innovative NLP technologies have long since ceased to be an issue only for big players such as Apple, Google, or Amazon. Instead, a general democratization of the technology can be observed. One of the reasons for this is that, according to an IBM estimate, about 80% of “global information” is not stored in structured databases but as unstructured, natural-language text. NLP will play a key role in the future when it comes to making this information usable. Thus, the successful use of NLP technologies will become one of the success factors for digitization in companies.

To give you an idea of the possibilities NLP opens up in the business context today, I will present five practical use cases below and explain the solutions behind them.

What is NLP? – A Short Overview

Although NLP had already occupied linguists and computer scientists as a research topic in the 1950s, it led a barely visible existence on the application side throughout the 20th century.

The main reason for this was the lack of suitable training data. Although the availability of unstructured data in the form of texts has generally increased exponentially, especially with the rise of the Internet, there was still a shortage of data suitable for model training. This can be explained by the fact that early NLP models mostly had to be trained under supervision (so-called supervised learning). Supervised learning, however, requires that the training data be provided with a dedicated target variable. This means that, for example, in the case of text classification, the text corpus must be manually annotated by humans before the model training.

This changed at the end of the 2010s, when a new generation of artificial neural networks led to a paradigm shift. These so-called language models are (pre-)trained on huge text corpora by companies such as Facebook and Google: individual words in the texts are randomly masked and predicted in the course of training. This is so-called self-supervised learning, which no longer requires a separate target variable. In the course of this training, the models learn a contextual understanding of texts.

The advantage of this approach is that the same model can be fine-tuned for various downstream tasks (e.g., text classification, sentiment analysis, named entity recognition) with the help of the learned contextual understanding. This process is called transfer learning. In practice, these pre-trained models can be downloaded, so that only the fine-tuning for the specific application must be done with additional data. Consequently, high-performance NLP applications can now be developed with little development effort.
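As an illustration of this transfer-learning workflow (not part of the use cases below), the following sketch fine-tunes a pre-trained language model for binary text classification with the Hugging Face transformers library; the model name, the two toy documents, and all hyperparameters are illustrative assumptions:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

# Download a pre-trained language model and attach a 2-class classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-german-cased", num_labels=2)

# Tiny annotated corpus (1 = positive case, 0 = negative case)
data = Dataset.from_dict({
    "text": ["Example document of a suitable case.", "Example document of an unsuitable case."],
    "label": [1, 0],
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=128))

# Only the fine-tuning happens here; the contextual understanding comes from pre-training
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()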

To learn more about language models (especially the so-called transformer models such as BERT or roBERTa) as well as trends and obstacles in the field of NLP, please read the article on NLP trends by our colleague Dominique Lade [https://www.statworx.com/de/blog/neue-trends-im-natural-language-processing-wie-nlp-massentauglich-wird/].

The 5 Use Cases

Text Classification in the Recruitment Process

A medical research institute wants to make its recruitment process of study participants more efficient.

For testing a new drug, different, interdependent requirements are placed on the persons in question (e.g., age, general health status, presence/absence of previous illnesses, medications, genetic dispositions, etc.). Checking all these requirements is very time-consuming. Usually, it takes about one hour per potential study participant to view and assess relevant information. The main reason for this is that the clinical notes contain patient information that exceeds structured data such as laboratory values and medication: Unstructured information in text form can also be found in the medical reports, physician’s letters, and discharge reports. Especially the evaluation of the latter data requires a lot of reading time and is therefore very time-consuming. To speed up the process, the research institute is developing a machine learning model that pre-selects promising candidates. The experts then only have to validate the proposed group of people.

The NLP Solution

From a methodological point of view, this problem is a so-called text classification: based on a text, a prediction is made for a previously defined target variable. To train the model, it is necessary – as usual in supervised learning – to annotate the data, in this case the medical documents, with the target variable. Since a classification problem has to be solved here (suitable or unsuitable study participant), the experts manually assess the suitability for the study for some of the persons in the pool. If a person is suitable, they are marked with a one (= positive case), otherwise with a zero (= negative case). Based on these training examples, the model can then learn the relationship between the persons’ medical documents and their suitability.

To cope with the complexity of the problem, a correspondingly complex model called ClinicalBERT is used. This is a language model based on BERT (Bidirectional Encoder Representations from Transformers) that was additionally trained on a data set of clinical texts. ClinicalBERT can thus generate so-called representations of all medical documents for each person. In the last step, the neural network of ClinicalBERT is completed by a task-specific component, in this case a binary classification: for each person, a suitability probability should be output. Through a corresponding linear layer, the high-dimensional text representation is finally transformed into a single number, the suitability probability. Using gradient descent, the model then learns the suitability probabilities based on the training examples.
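To make the described architecture more concrete, here is a minimal PyTorch sketch of such a task-specific head: a pre-trained encoder produces a document representation, and a linear layer maps it to a single suitability probability. It is a simplified stand-in, assuming the encoder behaves like a Hugging Face BERT model and exposes a last_hidden_state tensor; it is not the actual ClinicalBERT setup:
import torch
import torch.nn as nn

class SuitabilityClassifier(nn.Module):
    def __init__(self, encoder, hidden_size=768):
        super().__init__()
        self.encoder = encoder               # e.g., a pre-trained BERT-style model
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_representation = outputs.last_hidden_state[:, 0, :]  # representation of the [CLS] token
        logit = self.head(cls_representation)                    # high-dimensional representation -> one number
        return torch.sigmoid(logit)                              # suitability probability per document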

Further Application Scenarios of Text Classification:

Text classification often takes place in the form of sentiment analysis. This involves classifying texts into predefined sentiment categories (e.g., negative/positive). This information is particularly important in the financial world or for social media monitoring. Text classification can also be used in various contexts where it is vital to sort documents according to their type (e.g., invoices, letters, reminders).

Named Entity Recognition for Usability Improvement of a News Page

A publishing house offers its readers on a news page a large number of articles on various topics. In the course of optimization measures, one would like to implement a better recommender system so that for each article, further suitable (complementary or similar) articles are suggested. Also, the search function on the landing page is to be improved so that the customer can quickly find the article he or she is looking for.

To create a good data basis for these purposes, the publisher decided to use Named Entity Recognition (NER) to assign automated tags to the texts, improving both the recommender system and the search function. After successful implementation, significantly more suggested articles are clicked on, and the search function has become much more convenient. As a result, the readers spend substantially more time on the page.

The NLP Solution

To solve the problem, one must first understand how NER works:

NER is about assigning words or entire phrases to content categories. For example, “Peter” can be identified as a person, “Frankfurt am Main” is a place, and “24.12.2020” is a time specification. There are also much more complicated cases. For this purpose, compare the following pairs of sentences:

  1. In the past, Adam didn’t know how to parallel park. (park = from the verb “to park”)
  2. Yesterday I took my dog for a walk in the park. (park = open green area)

It is perfectly evident to humans that the word “park” has a different meaning in each of the two sentences. However, this seemingly simple distinction is anything but trivial for the computer. An entity recognition model could characterize the two sentences as follows:

  1. “[In the past] (time reference), [Adam] (person) didn’t know how to parallel [park] (verb).”
  2. “[Yesterday] (time reference), [I] (person) took my dog for a walk in the [park] (location).”

In the past, rule-based algorithms would have been used to solve the above NER problem, but here too, the machine learning approach is gaining ground:

The present multi-class classification problem of entity determination is again addressed using the BERT model. Additionally, the model is trained on an annotated data set in which the entities are manually identified. One of the most comprehensive publicly accessible databases for the English language is the Groningen Meaning Bank (GMB). After successful training, the model can correctly classify previously unseen words from the context given by the sentence. For instance, the model recognizes that prepositions like “in, at, after…” are often followed by a location, but more complex contexts are also used to determine the entity.
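For illustration, applying a pre-trained NER model to a sentence takes only a few lines with the transformers pipeline; the pipeline’s default English model is used here as an assumption and is not the publisher’s production setup:
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")  # loads a default pre-trained NER model
entities = ner("Yesterday Peter took his dog for a walk in a park in Frankfurt am Main.")
for entity in entities:
    print(entity["word"], "->", entity["entity_group"])  # e.g., Peter -> PER, Frankfurt am Main -> LOC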

Further Application Scenarios of NER:

NER is a classic information extraction task and is central to many other NLP applications, such as chatbots and question-answering systems. NER is also often used for text cataloging, where the type of a text is determined on the basis of the recognized entities.

A Chatbot for a Long-Distance Bus Company

A long-distance bus company would like to increase its accessibility and expand its communication channels with customers. In addition to its homepage and app, the company wants to offer a third channel, namely a WhatsApp chatbot. The goal is to let customers perform specific actions in conversation with the chatbot, such as searching for, booking, and canceling trips. In addition, the chatbot is intended to provide a reliable way of informing passengers about delays.

With the introduction of the chatbot, not only can existing passengers be reached more quickly, but contact can also be established with new customers who have not yet installed the app.

The NLP solution

Depending on the requirements that are placed on the chatbot, you can choose between different chatbot architectures.

Over the years, four main chatbot paradigms have been tried: in a first generation, the inquiry was checked against well-known patterns and correspondingly prefabricated answers were returned (“pattern matching”). More sophisticated is so-called “grounding”, in which information extracted from knowledge bases (e.g., Wikipedia) is organized in a network by Named Entity Recognition (see above). Such a network has the advantage that not only registered knowledge can be retrieved, but unregistered knowledge can also be inferred via the network structure. In “searching”, question-answer pairs from the conversation history (or from previously recorded logs) are used directly to find a suitable answer. The most flexible approach uses machine learning models to dynamically generate suitable answers (“generative models”).

For the company, the best way to implement such a modern chatbot with clearly definable competencies is to use an existing framework such as Google Dialogflow. This is a platform for configuring chatbots that combines elements of all the previously mentioned chatbot paradigms. For this purpose, parameters such as intents, entities, and actions are specified.

An intent (“user intention”) is, for example, a timetable inquiry. By giving different example phrases (“How do I get from … to …”, “When is the next bus from … to …”) to a language model, the chatbot can assign even unseen input to the correct intent (see text classification).

Furthermore, different travel locations and times are defined as entities. If the chatbot now captures an intent with matching entities (see NER), an action can be triggered, in this case a database query. Finally, an intent response with the relevant information is returned, adapted to all information the user has specified in the chat history (“stateful”).
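The following toy sketch illustrates this intent → entities → action flow outside of any framework; the timetable_db interface and all names are hypothetical, and a real implementation would rely on Dialogflow (or a comparable platform) instead:
def handle_message(intent, entities, timetable_db):
    if intent == "timetable_info":
        origin = entities.get("origin")
        destination = entities.get("destination")
        date = entities.get("date")
        if origin and destination and date:
            # Triggered action: database query for matching connections
            connections = timetable_db.search(origin, destination, date)
            return f"The next buses from {origin} to {destination} on {date}: {connections}"
        # Missing entities: ask a follow-up question instead of triggering the action
        return "Where would you like to travel from, and where to?"
    return "Sorry, I did not understand that."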

Further Application Scenarios of Chatbots:

There are many possible applications in customer service, depending on the complexity of the scenario: from the automatic preparation (e.g., sorting) of customer inquiries to the complete handling of a customer request.

A Question-Answering System as a Voice Assistant for Technical Questions About the Automobile

An automobile manufacturer discovers that many of its customers do not get along well with the manuals that come with the cars. Often, finding the relevant information takes too long, or it is not found at all. Therefore, it was decided to offer a Voice Assistant to provide precise answers to technical questions in addition to the static manual. In the future, drivers will be able to speak comfortably with their center console when they want to service their vehicle or request technical information.

The NLP solution

Question-answering systems have been around for decades and sit at the forefront of artificial intelligence research: a question-answering system that always finds a correct answer, taking all available data into account, could be considered “general AI”. A major difficulty on the way to general AI is that the domain such a system needs to know about is unlimited. In contrast, question-answering systems deliver good results when the domain is delimited, as is the case with the automotive assistant. In general, the more specific the domain, the better the results that can be expected.

For the implementation of the question-answering system, two types of data from the manual are used: structured data, such as technical specifications of the components and key figures of the model, and unstructured data, such as instructions. In a preparatory step, all data is transformed into question-answer form using other NLP techniques (classification, NER). This data is fed into a version of BERT that has already been pre-trained on a large question-answer data set (“SQuAD”). The model is thus able to answer questions that have already been fed into the system and to provide educated guesses for unseen questions.
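As a small illustration of extractive question answering, the transformers pipeline with its default SQuAD-tuned model can be used as follows; the manual snippet and the question are invented for this example and are not taken from the manufacturer’s data:
from transformers import pipeline

qa = pipeline("question-answering")  # default model fine-tuned on SQuAD
manual_snippet = (
    "The tire pressure for the standard 17-inch wheels should be 2.5 bar at the front "
    "and 2.7 bar at the rear when the vehicle is fully loaded."
)
answer = qa(question="What is the correct tire pressure at the rear?", context=manual_snippet)
print(answer["answer"], answer["score"])  # expected to point at "2.7 bar" with a confidence score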

Further Application Scenarios of Question-Answer Systems:

With the help of question-answer systems, company-internal search engines can be extended by functionalities. In e-commerce, answers to factual questions can be given automatically based on article descriptions and reviews.

Automatic Text Summaries (Text Generation) of Damage Descriptions for a Property Insurer

An insurance company wants to increase the efficiency of its claim settlement department. It has been noticed that some claims complaints from the customer lead to internal conflicts of responsibility. The reason for this is simple: customers usually describe the claims over several pages, and an increased training period is needed to be able to judge whether or not to process the case. Thus, it often happens that a damage description must be read thoroughly to understand that the damage itself does not need to be processed. Now, a system that generates automated summaries is to remedy this situation. As a result of the implementation, the claim handlers can now make responsibility decisions much faster.

The NLP solution

One can differentiate between two approaches to the text summarization problem: in extraction, the most important sentences are identified in the input text and, in the simplest case, used directly as the summary. In abstraction, the text is transformed by a model into a newly generated summary text. The second approach is much more complex, since paraphrasing, generalization, or the inclusion of further knowledge is possible here. It therefore has a higher potential to generate meaningful summaries but is also more error-prone. Modern text summarization algorithms use the second approach or a combination of both methods.

A so-called sequence-to-sequence model is used to solve the insurance use case; it maps one word sequence (the damage description) to another word sequence (the summary). This is usually a recurrent neural network (RNN) trained on pairs of texts and summaries. The training process is designed to model the probability of the next word depending on the previous words (and, additionally, an “inner state” of the model). In this way, the model effectively writes the summary “from left to right” by successively predicting the next word. An alternative approach is to have the input encoded numerically by the language model BERT and to have a GPT decoder autoregressively generate the summary based on this numerical representation. In both cases, model parameters can be used to adjust how long the summary should be.
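For illustration, abstractive summarization is available off the shelf via the transformers pipeline; the claim text below is invented, and the pipeline’s default model is a general-purpose English summarizer rather than an insurance-specific one:
from transformers import pipeline

summarizer = pipeline("summarization")  # default pre-trained encoder-decoder model
claim_description = (
    "On Monday evening a water pipe burst in the basement of the insured property. "
    "The water damaged the washing machine, the dryer and several boxes of stored clothing. "
    "The policyholder requests reimbursement for the damaged items and the repair of the pipe."
)
summary = summarizer(claim_description, max_length=40, min_length=10)
print(summary[0]["summary_text"])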

Further Application Scenarios of Text Generation:

Such a scenario is conceivable in many places: Automated report writing, text generation based on retail sales data analysis, electronic medical record summaries, or textual weather forecasts from weather data are possible applications. Text generation is also used in other NLP use cases such as chatbots and Q&A systems.

Outlook

These five application examples of text classification, chatbots, question-answer systems, NER, and text summaries show that there are many processes in all kinds of companies that can be optimized with NLP solutions.

NLP is not only an exciting field of research but also a technology whose applicability in the business environment is continually growing.

In the future, NLP will not only be a foundation of a data-driven corporate culture; it already holds considerable innovation potential through direct application, and it is worth investing in it.

At STATWORX, we already have years of experience in the development of customized NLP solutions. Here are two of our case studies on NLP: Social Media Recruiting with NLP & Supplier Recommendation Tool. We are happy to provide you with individual advice on this and many other topics.

Management Summary

OCR (Optical Character Recognition) is a major challenge for many companies. The OCR market is populated by various open-source and commercial providers. A well-known open-source tool for OCR is Tesseract, which is now provided by Google. Tesseract is currently available in version 4, which performs the OCR extraction using recurrent neural networks. However, the OCR performance of Tesseract is still volatile and depends on various factors. A particular challenge is applying Tesseract to documents that are composed of different structures, e.g., texts, tables, and images. Invoices are one such document type, and they still pose particular challenges for the OCR tools of all providers.

This article demonstrates how fine-tuning the Tesseract OCR (Optical Character Recognition) engine on a small sample of data can already yield a substantial improvement in OCR performance on invoice documents. The process presented here is not limited to invoices but can be applied to arbitrary document types. A use case is defined that aims at the correct extraction of the entire text (words and numbers) from a fictitious but realistic German invoice document. It is assumed that the extracted information is intended for downstream accounting purposes; correct extraction of the numbers and of the euro sign is therefore considered critical.

The OCR performance of two Tesseract models for the German language is compared: the standard model (not fine-tuned) and a fine-tuned variant. The standard model is obtained from the Tesseract OCR GitHub repository. The fine-tuned model is developed using the steps described in this article. A second German invoice, similar to the first one, is used for fine-tuning. Both the standard model and the fine-tuned model are evaluated on the same out-of-sample invoice to ensure a fair comparison.

The OCR performance of the standard Tesseract model is comparatively poor on numbers. This applies in particular to numbers resembling the digits 1 and 7. The euro symbol is recognized incorrectly in 50% of the cases, making the result unsuitable for any downstream accounting application. The fine-tuned model shows similar OCR performance for German words, but its OCR performance on numbers improves considerably: all numbers and every euro symbol are extracted correctly.

This shows that fine-tuning with minimal effort and a small amount of training data can achieve a large improvement in recognition performance. As a result, Tesseract OCR with its open-source licensing becomes an attractive solution compared to proprietary OCR software. Finally, recommendations are given for fine-tuning Tesseract LSTM models in case more training data is available.

Downloading the Tesseract Docker Container

The entire fine-tuning process of Tesseract’s LSTM model is discussed in detail below. Since installing and using Tesseract can become complicated, we have prepared a Docker container that already includes all necessary installations.

Introduction

Tesseract 4 with its LSTM engine already works quite well out of the box for simple texts. However, there are scenarios for which the standard model performs poorly. Examples are exotic fonts, images with backgrounds, or text in tables. Fortunately, Tesseract offers a way to fine-tune the LSTM engine in order to improve the OCR performance for more specific use cases.

Why OCR on Invoices Is Challenging

Even though OCR is considered a solved problem in some areas, the error-free extraction of a large text corpus is still challenging. This applies in particular to OCR on documents with a high structural variance, such as invoices. These documents often consist of very heterogeneous elements that challenge Tesseract’s OCR engine:

  1. Colored backgrounds and table structures are a challenge for page segmentation.
  2. Invoices typically contain rare characters such as the EUR or USD sign.
  3. Numbers cannot be checked against a language dictionary.

In addition, the margin for error is small: an exact extraction of the numerical data is often of utmost importance for subsequent process steps. Problem (1) can usually be solved by selecting one of the 14 page segmentation modes provided by Tesseract. The latter two problems can often be solved by fine-tuning the LSTM engine on examples of similar documents.

Use Case Objective and Data

Two similar sample invoices are examined in this article. The invoice shown in Figure 1 is used to evaluate the OCR performance of both the standard and the fine-tuned Tesseract model. Special attention is paid to the correct extraction of numbers. The second invoice, shown in Figure 2, is used to fine-tune the LSTM model. Most invoice documents are written in a very legible font such as “Arial”. To illustrate the benefits of fine-tuning, the initial OCR problem is made harder by considering invoices written in the font “Impact”. “Impact” is a font that differs considerably from normal sans-serif fonts and leads to a higher recognition error for Tesseract. It is shown below that, after fine-tuning on a very small amount of data, Tesseract delivers very satisfactory results despite this difficult font.
Figure 1: Invoice 1, used to evaluate the OCR performance of both models
Figure 2: Invoice 2, used to fine-tune the LSTM engine

Using the Tesseract 4.0 Docker Container

The setup for fine-tuning the Tesseract LSTM engine currently only works on Linux and can be a bit tricky. Therefore, a Docker container with a pre-installed Tesseract 4.0, the compiled training tools, and the scripts is provided along with this article. Load the Docker image from the provided archive file or pull the container image via the provided link:
docker load -i docker/tesseract_image.tar
Once the image is loaded, start the container in detached mode:
docker run -d --rm --name tesseract_container tesseract:latest
Access the shell of the running container to replicate the following commands from this article:
docker exec -it tesseract_container /bin/bash

General Improvements of the OCR Performance

There are three ways to improve Tesseract’s OCR performance even before fine-tuning the LSTM engine.

1. Image Preprocessing

Scanned documents can have a skewed orientation if they were not placed correctly on the scanner. Rotated images should be deskewed to optimize Tesseract’s line segmentation performance. In addition, scanning can introduce image noise, which should be removed by a denoising algorithm. Note that, by default, Tesseract performs thresholding using Otsu’s algorithm to binarize grayscale images into black and white pixels. A detailed treatment of image preprocessing is beyond the scope of this article and is not necessary to achieve satisfactory results for the given use case. The Tesseract documentation provides a practical overview.
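For illustration, a minimal preprocessing sketch with OpenCV could look as follows; OpenCV is not part of the Docker container described here, and the file names and parameters are illustrative assumptions:
import cv2

img = cv2.imread("eval_invoice.png", cv2.IMREAD_GRAYSCALE)

# Remove scanning noise before binarization
denoised = cv2.fastNlMeansDenoising(img, h=10)

# Binarize with Otsu's method (Tesseract applies Otsu internally by default;
# doing it explicitly here just makes the effect visible)
_, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("eval_invoice_preprocessed.png", binary)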

2. Page Segmentation

During page segmentation, Tesseract tries to identify rectangular text regions. Only these regions are selected for OCR in the next step. It is therefore important to capture all regions containing text so that no information is lost. Tesseract allows choosing from 14 different page segmentation methods, which can be displayed with the following command:
tesseract --help-psm
The default segmentation method expects an image similar to a book page. Due to the additional tabular structures in invoice documents, however, this mode cannot identify all text regions correctly. A better segmentation method is given by option 4: “Assume a single column of text of variable sizes”. To illustrate the importance of a suitable page segmentation method, consider the result of using the default method “Fully automatic page segmentation, but no OSD” in Figure 3:
Figure 3: The default segmentation method fails to detect all text regions
Note that the texts “Rechnungsinformationen:”, “Pos.” and “Produkt” were not segmented. In Figure 4, a more suitable method leads to a perfect segmentation of the page.
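As a side note, the same OCR call can also be made from Python via the pytesseract wrapper instead of the command line; pytesseract is not part of the setup described in this article, and the file name is illustrative:
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(
    Image.open("eval_invoice.tiff"),
    lang="deu",
    config="--psm 4",  # page segmentation mode 4: single column of text of variable sizes
)
print(text)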

3. Using Dictionaries, Word Lists, and Patterns

The LSTM models used by Tesseract were trained on large amounts of text in a specific language. The following command displays the languages currently available to Tesseract:
tesseract --list-langs 
Further language models are available by downloading the corresponding language.tessdata and placing it in the tessdata folder of the local Tesseract installation. The Tesseract repository on GitHub provides three variants of language models: normal, fast, and best. Only the fast and the best variants can be used for fine-tuning. As the names suggest, these are the fastest and the most accurate model variants, respectively. Further models have also been trained for special use cases, such as recognizing digits and punctuation only, and are listed in the references. Since the invoices in this use case are in German, the Docker image accompanying this article ships with the deu.tessdata model. For a given language, Tesseract’s word list can be further extended or restricted to specific words or even characters. This topic is beyond the scope of this article, as it is not necessary to achieve satisfactory results for the use case at hand.

Setting Up the Fine-Tuning Process

Three file types must be created for fine-tuning:

1. TIFF Files

Tagged Image File Format, or TIFF, is an uncompressed image file format (in contrast to JPG or PNG, which are compressed file formats). TIFF files can be obtained from PNG or JPG formats using a conversion tool. Although Tesseract can work with PNG and JPG images, the TIFF format is recommended.

2. Box Files

To train the LSTM model, Tesseract uses so-called box files with the extension “.box”. A box file contains the recognized text together with the coordinates of the bounding box in which the text is located. Box files contain six columns, corresponding to symbol, left, bottom, right, top, and page:
P 157 2566 1465 2609 0
r 157 2566 1465 2609 0
o 157 2566 1465 2609 0
d 157 2566 1465 2609 0
u 157 2566 1465 2609 0
k 157 2566 1465 2609 0
t 157 2566 1465 2609 0
  157 2566 1465 2609 0
P 157 2566 1465 2609 0
r 157 2566 1465 2609 0
e 157 2566 1465 2609 0
i 157 2566 1465 2609 0
s 157 2566 1465 2609 0
  157 2566 1465 2609 0
( 157 2566 1465 2609 0
N 157 2566 1465 2609 0
e 157 2566 1465 2609 0
t 157 2566 1465 2609 0
t 157 2566 1465 2609 0
o 157 2566 1465 2609 0
) 157 2566 1465 2609 0
  157 2566 1465 2609 0
Each character is located on a separate line of the box file. The LSTM model accepts either the coordinates of individual characters or of an entire text line. In the example box file above, the text “Produkt Preis (Netto)” is visually located on the same line in the document. All characters have the same coordinates, namely the coordinates of the bounding box around this text line. Using line-level coordinates is considerably easier and is provided by default when the box file is generated with the following command:
cd /home/fine_tune/train
tesseract train_invoice.tiff train_invoice --psm 4 -l best/deu lstmbox
The first argument is the image file to be extracted, the second argument is the file name of the box file. The language parameter -l instructs Tesseract to use the German model for OCR. The --psm parameter instructs Tesseract to use the fourth page segmentation method. It is almost inevitable that the generated OCR box files contain errors in the symbol column. Every symbol in the training box file must therefore be checked by hand. This is a tedious process, as the box file of the demo invoice contains almost a thousand lines (one for each character in the invoice). To simplify the correction, the Docker container provides a Python script that draws the bounding boxes together with the OCR text onto the original invoice image, making it easier to compare the box file with the document. The result is shown in Figure 4. The Docker container already contains the corrected box files, marked by the suffix “_correct”.
Figure 4: Extracted text when applying the Tesseract model “deu”
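A minimal sketch of such a visualization, similar in spirit to the script shipped in the container (the actual src/draw_box_file_data.py may differ), could look as follows; it assumes Pillow is available and the file names are illustrative:
from PIL import Image, ImageDraw

image = Image.open("train_invoice.tiff").convert("RGB")
draw = ImageDraw.Draw(image)
height = image.height

with open("train_invoice.box", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 6:
            continue
        symbol = parts[0]
        left, bottom, right, top = (int(v) for v in parts[1:5])
        # Tesseract box files use a bottom-left origin; PIL uses a top-left origin
        draw.rectangle([left, height - top, right, height - bottom], outline="red")
        draw.text((left, height - top - 12), symbol, fill="blue")

image.save("train_invoice_boxes.tiff")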

3. lstmf Files

During fine-tuning, Tesseract extracts the text from the TIFF file and checks its prediction against the coordinates and the symbol in the box file. Tesseract does not use the TIFF and box file directly but expects a so-called lstmf file created from the two previous files. Note that, to create the lstmf file, the TIFF and box file must have the same name, e.g., train_invoice.tiff and train_invoice.box. The following command creates an lstmf file for the training invoice:
cd /home/fine_tune/train
tesseract train_invoice.tiff train_invoice lstm.train 
All lstmf files relevant for training must be specified by their relative path in a text file named deu.training_files.txt. In this use case, only one lstmf file is used for training, so deu.training_files.txt contains only one line, namely: train/train_invoice_correct.lstmf. It is recommended to also create an lstmf file for the evaluation invoice. This way, the performance of the model can be assessed during the training process:
cd /home/fine_tune/eval
tesseract eval_invoice_correct.tiff eval_invoice_correct lstm.train

Evaluation of the Standard LSTM Model

OCR predictions from the German standard model “deu” are used as a benchmark. A precise overview of the OCR performance of the German standard model is obtained by generating a box file for the evaluation invoice and visualizing the OCR text with the Python script mentioned above. This script, which produces the file “eval_invoice_ocr deu.tiff”, is located in the provided container at “/home/fine_tune/src/draw_box_file_data.py”. The script expects as arguments the path to a TIFF file, the corresponding box file, and a name for the output TIFF file. The OCR text extracted by the German standard model is saved as eval/eval_invoice_ocr_deu.tiff and is shown in Figure 1. At first glance, the text extracted by OCR looks good. The model correctly extracts German characters such as ä, ö, ü, and ß. In fact, there are only three cases in which words contain errors:
OCR                 Truth
Jessel GmbH 8 Co    Jessel GmbH & Co
11 Glasbehälter     1l Glasbehälter
Zeki64@hloch.com    Zeki64@bloch.com
The model already performs well on common German words but has difficulties with isolated symbols such as “&” and “l” as well as with words like “bloch” that are not contained in the model’s word list. Prices and numbers are a much bigger challenge for the model. Here, extraction errors occur considerably more often:
OCR          Truth
159,16       159,1€
1%           7%
1305.816     1305.81€
227.66       227.6€
341.51       347.57€
1115.16      1115.7€
242.86       242.8€
1456.86      1456.8€
51.46        54.1€
1954.719€    1954.79€
The German standard model fails to extract the euro symbol € correctly in 9 out of 18 cases, which corresponds to an error rate of 50%.

Fine-Tuning the Standard LSTM Model

The standard LSTM model is now fine-tuned on the invoice shown in Figure 2. Afterwards, the OCR performance is evaluated on the evaluation invoice shown in Figure 1, which was also used to benchmark the German standard model. To fine-tune the LSTM model, it must first be extracted from the file deu.traineddata. The following command extracts the LSTM model from the German standard model into the directory lstm_model:
cd /home/fine_tune
combine_tessdata -e tesseract/tessdata/best/deu.traineddata lstm_model/deu.lstm
Next, all files necessary for fine-tuning are assembled. The files are also included in the Docker container:
  1. The training files train_invoice_correct.lstmf and deu.training_files.txt in the train directory.
  2. The evaluation files eval_invoice_correct.lstmf and deu.training_files.txt in the eval directory.
  3. The extracted LSTM model deu.lstm in the lstm_model directory.
The Docker container contains the script src/fine_tune.sh, which starts the fine-tuning process. Its content is:
/usr/bin/lstmtraining \
 --model_output output/fine_tuned \
 --continue_from lstm_model/deu.lstm \
 --traineddata tesseract/tessdata/best/deu.traineddata \
 --train_listfile train/deu.training_files.txt \
 --eval_listfile eval/deu.training_files.txt \
 --max_iterations 400
This command fine-tunes the extracted model deu.lstm on the file train_invoice_correct.lstmf specified in train/deu.training_files.txt. Fine-tuning the LSTM model requires language-specific information, which is contained in deu.tessdata. The file eval_invoice_correct.lstmf, specified in eval/deu.training_files.txt, is used to compute the OCR performance during training. Fine-tuning stops after 400 iterations; the entire training takes less than two minutes. The following command runs the script and logs the output to a file:
cd /home/fine_tune
sh src/fine_tune.sh > output/fine_tune.log 2>&1
The content of the log file after training is shown below:
src/fine_tune.log
Loaded file lstm_model/deu.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from lstm_model/deu.lstm
Loaded 20/20 lines (1-20) of document train/train_invoice_correct.lstmf
Loaded 24/24 lines (1-24) of document eval/eval_invoice_correct.lstmf
2 Percent improvement time=69, best error was 100 @ 0
At iteration 69/100/100, Mean rms=1.249%, delta=2.886%, char train=8.17%, word train=22.249%, skip ratio=0%, New best char error = 8.17 Transitioned to stage 1 wrote best model:output/deu_fine_tuned8.17_69.checkpoint wrote checkpoint.
-----
2 Percent improvement time=62, best error was 8.17 @ 69
At iteration 131/200/200, Mean rms=1.008%, delta=2.033%, char train=5.887%, word train=20.832%, skip ratio=0%, New best char error = 5.887 wrote best model:output/deu_fine_tuned5.887_131.checkpoint wrote checkpoint.
-----
2 Percent improvement time=112, best error was 8.17 @ 69
At iteration 181/300/300, Mean rms=0.88%, delta=1.599%, char train=4.647%, word train=17.388%, skip ratio=0%, New best char error = 4.647 wrote best model:output/deu_fine_tuned4.647_181.checkpoint wrote checkpoint.
-----
2 Percent improvement time=159, best error was 8.17 @ 69
At iteration 228/400/400, Mean rms=0.822%, delta=1.416%, char train=4.144%, word train=16.126%, skip ratio=0%, New best char error = 4.144 wrote best model:output/deu_fine_tuned4.144_228.checkpoint wrote checkpoint.
-----
Finished! Error rate = 4.144
During training, Tesseract saves a so-called model checkpoint after each iteration. The performance of the model at this checkpoint is tested on the evaluation data and compared with the current best result. If the result improves, i.e., the OCR error decreases, a labeled copy of the checkpoint is saved. The first number in the checkpoint’s file name represents the character error and the second number the training iteration. The last step is to reassemble the fine-tuned LSTM model into a “traineddata” model. Assuming the checkpoint from iteration 181 is selected, the following command converts the checkpoint “deu_fine_tuned4.647_181.checkpoint” into a fully functional Tesseract model “deu_fine_tuned.traineddata”:
cd /home/fine_tune
/usr/bin/lstmtraining \
 --stop_training \
 --continue_from output/deu_fine_tuned4.647_181.checkpoint \
 --traineddata tesseract/tessdata/best/deu.traineddata \
 --model_output output/deu_fine_tuned.traineddata
This model must be copied into the tessdata directory of the local Tesseract installation to make it available to Tesseract. This has already been done in the Docker container. Verify that the fine-tuned model is available in Tesseract:
tesseract --list-langs

Evaluation of the Fine-Tuned LSTM Model

The fine-tuned model is evaluated in the same way as the standard model: a box file of the evaluation invoice is created, and the OCR text is displayed on the image of the evaluation invoice using the Python script. The command for generating the box files must be modified so that the fine-tuned model “deu_fine_tuned” is used instead of the standard model “deu”:
cd /home/fine_tune/eval
tesseract eval_invoice.tiff eval_invoice --psm 4 -l deu_fine_tuned lstmbox
The OCR text extracted by the fine-tuned model is shown in Figure 5 below.
Figure 5: OCR results of the fine-tuned LSTM model
As with the German standard model, the performance on German words remains good but not perfect. To improve the performance on rare words, the model’s word list could be extended with further words.
OCR                 Truth
Jessel GmbH 8 Co    Jessel GmbH & Co
1! Glasbehälte      1l Glasbehälter
Zeki64@hloch.com    Zeki64@bloch.com
More importantly, the OCR performance on numbers has improved significantly: the fine-tuned model extracted all numbers and every occurrence of the € sign correctly.
OCR          Truth
159,1€       159,1€
7%           7%
1305.81€     1305.81€
227.6€       227.6€
347.57€      347.57€
1115.7€      1115.7€
242.8€       242.8€
1456.8€      1456.8€
54.1€        54.1€
1954.79€     1954.79€

Conclusion and Outlook

This article has shown that the OCR performance of Tesseract can be improved considerably by fine-tuning. Especially for non-standard use cases, such as text extraction from invoice documents, the OCR performance can be improved significantly in this way. In addition to the open-source licensing, the ability to fine-tune Tesseract’s LSTM engine for specific use cases makes the framework an attractive tool, even for more demanding OCR scenarios. To further improve the results, it can be useful to fine-tune the model for more iterations. In this use case, the number of iterations was deliberately limited because only one document was used for fine-tuning. More iterations potentially increase the risk of overfitting the LSTM model to certain symbols, which in turn increases the error rate for other symbols. In practice, it is desirable to increase the number of iterations, provided that sufficient training data is available. The final OCR performance should always be checked on a further, representative set of documents.

References

  • Tesseract training: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html
  • Image processing overview: https://tesseract-ocr.github.io/tessdoc/ImproveQuality#image-processing
  • Otsu thresholding: https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_thresholding/py_thresholding.html
  • Tesseract digits comma model: https://github.com/Shreeshrii/tessdata_shreetest
Reinforcement learning is currently one of the hottest topics in machine learning. For a recent conference we attended (the awesome Data Festival in Munich), we developed a reinforcement learning model that learns to play Super Mario Bros on NES, so that visitors who came to our booth could compete against the agent in terms of level completion time.
The promotion was a great success, and people enjoyed the “human vs. machine” competition. There was only one contestant who was able to beat the AI, by taking a secret shortcut that the AI wasn’t aware of. Also, developing the model in Python was a lot of fun, so I decided to write a blog post about it that covers some of the fundamental concepts of reinforcement learning as well as the actual implementation of our Super Mario agent in TensorFlow (beware, I’ve used TensorFlow 1.13.1; TensorFlow 2.0 was not released at the time of writing this article).

Recap: reinforcement learning

Most machine learning models have an explicit connection between inputs and outputs that does not change during training time. Therefore, it can be difficult to model or predict systems where the inputs or targets themselves depend on previous predictions. However, often the world around the model updates itself with every prediction made. What sounds quite abstract is actually a very common situation in the real world: autonomous driving, machine control, process automation, etc. – in many situations, decisions made by models have an impact on their surroundings and consequently on the next actions to be taken. Classical supervised learning approaches can only be used to a limited extent in such situations. To solve the latter, machine learning models are needed that are able to cope with time-dependent variation of inputs and outputs that are interdependent. This is where reinforcement learning comes into play. In reinforcement learning, the model (called the agent) interacts with its environment by choosing from a set of possible actions (the action space) in each state of the environment, causing either positive or negative rewards from the environment. Think of rewards as an abstract concept signaling whether the action taken was good or bad. Thereby, the reward issued by the environment can be immediate or delayed into the future. By learning from the combination of environment states, actions, and corresponding rewards (so-called transitions), the agent tries to reach an optimal set of decision rules (the policy) that maximizes the total reward gathered by the agent in each state.

Q-learning and Deep Q-learning

In reinforcement learning we often use a learning concept called Q-learning. Q-learning is based on so-called Q-values, which help the agent determine the optimal action given the current state of the environment. Q-values are “discounted” future rewards that our agent collects during training by taking actions and moving through the different states of the environment. The Q-values themselves are approximated during training, either by simple exploration of the environment or by using a function approximator such as a deep neural network (as in our case here). Mostly, we select in each state the action that has the highest Q-value, i.e., the highest discounted future reward, given the current state of the environment.

When using a neural network as a Q-function approximator, we learn by computing the difference between the predicted Q-values and the “true” Q-values, i.e., the representation of the optimal decision in the current state. Based on the computed loss, we update the network’s parameters using gradient descent, just like in any other neural network model. By doing this often, our network converges to a state where it can approximate the Q-values of the next state, given the current state of the environment. If the approximation is good enough, we simply select the action that has the highest Q-value. By doing so, the agent is able to decide in each situation which action generates the best outcome in terms of reward collection.

In most deep reinforcement learning models there are actually two deep neural networks involved: the online network and the target network. This is done because, during training, the loss function of a single neural network would be computed against steadily changing targets (Q-values) that are based on the network’s weights themselves. This makes the optimization problem harder or might prevent convergence altogether. The target network is basically a copy of the online network with frozen weights that are not directly trained. Instead, the target network’s weights are synchronized with the online network after a certain number of training steps. Enforcing “stable outputs” of the target network that do not change after each training step ensures that the target Q-values needed for computing the loss do not change constantly, which supports convergence of the optimization problem.
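As a small illustration of the target computation described above, the following sketch computes standard (single-network) Q-learning targets for a sampled batch of transitions; the function and variable names are illustrative and not taken from the code shown later:
import numpy as np

def q_targets(rewards, next_q_values, dones, gamma=0.9):
    # next_q_values: target-network predictions for next_state, shape (batch_size, n_actions)
    # For terminal states (done = 1) there is no future reward to discount.
    max_next_q = np.max(next_q_values, axis=1)
    return rewards + gamma * max_next_q * (1.0 - dones)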

Deep Double Q-learning

Another possible issue with Q-learning is that, due to the selection of the maximum Q-value for determining the best action, the model sometimes produces extraordinarily high Q-values during training. This is not always a problem, but it can become one if there is a strong concentration on certain actions that in turn leads to the neglect of less favorable but “worth-to-try” actions. If the latter are neglected all the time, the model might run into a locally optimal solution or, even worse, select the same actions all the time. One way to deal with this problem is to introduce an updated version of Q-learning called double Q-learning. In double Q-learning, the actions in each state are not simply chosen by selecting the action with the maximum Q-value of the target network. Instead, the selection process is split into three distinct steps: (1) the target network computes the target Q-values of the state after taking the action; (2) the online network computes the Q-values of the state after taking the action and selects the best action by finding the maximum Q-value; (3) the target Q-values are taken from the target network, but at the action indices selected by the online network. This ensures that Q-values cannot be systematically overestimated, because they are not updated based on themselves.
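The difference from standard Q-learning is easiest to see in code; this hedged sketch mirrors the steps above with illustrative numpy arrays (action selection by the online network, evaluation by the target network):
import numpy as np

def double_q_targets(rewards, online_next_q, target_next_q, dones, gamma=0.9):
    best_actions = np.argmax(online_next_q, axis=1)  # online network selects the action
    evaluated_q = target_next_q[np.arange(len(best_actions)), best_actions]  # target network evaluates it
    return rewards + gamma * evaluated_q * (1.0 - dones)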

Gym environments

In order to build a reinforcement learning application, we need two things: (1) an environment that the agent can interact with and learn from, and (2) the agent, which observes the state(s) of the environment and chooses appropriate actions using Q-values that (ideally) result in high rewards for the agent. An environment is typically provided as a so-called gym, a class that contains the necessary code to emulate the states and rewards of the environment as a function of the agent’s actions, as well as further information, e.g., about the possible action space. Here is an example of a simple environment class in Python:
class Environment:
    """ A simple environment skeleton """
    def __init__(self):
        # Initializes the environment
        pass

    def step(self, action):
        # Changes the environment based on the agent's action
        return next_state, reward, done, info

    def reset(self):
        # Resets the environment to its initial state
        pass

    def render(self):
        # Shows the state of the environment on screen
        pass
The environment has three major class functions: (1) step() executes the environment code as a function of the action selected by the agent and returns the next state of the environment, the reward with respect to the action, a done flag indicating whether the environment has reached its terminal state, as well as a dictionary of additional information about the environment and its state; (2) reset() resets the environment to its original state; and (3) render() prints the current state on the screen (for example showing the current frame of the Super Mario Bros game). For Python, a go-to place for finding gyms is OpenAI. It contains lots of different games and problems well suited for solving using reinforcement learning. Furthermore, there is an OpenAI project called Gym Retro that contains hundreds of Sega and SNES games, ready to be tackled by reinforcement learning algorithms.
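For illustration, here is a tiny interaction loop with a standard OpenAI Gym environment (CartPole, not the Mario setup), using the classic 4-tuple step() API of the gym versions available when this article was written:
import gym

env = gym.make("CartPole-v0")
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()            # random policy
    state, reward, done, info = env.step(action)  # old 4-tuple step API
env.close()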

Agent

The agent consumes the current state of the environment and selects an appropriate action based on the selection policy. The policy maps the state of the environment to the action to be taken by the agent. Finding the right policy is a key question in reinforcement learning and often involves the use of deep neural networks. The following agent simply observes the state of the environment and returns action = 1 if the state is larger than 0 and action = 0 otherwise.
class Agent:
    """ A simple agent """
    def __init__(self):
        pass

    def action(self, state):
        if state > 0:
            return 1
        else:
            return 0
This is of course a very simplistic policy. In practical reinforcement learning applications, the state of the environment can be very complex and high-dimensional. One example is video games: the state of the environment is determined by the pixels on screen and the previous actions of the player. Our agent needs to find a policy that maps the screen pixels to actions that generate rewards from the environment.

Environment wrappers

Gym environments contain most of the functionality needed to use them in a reinforcement learning scenario. However, there are certain features that do not come prebuilt into the gym, such as image downscaling, frame skipping and stacking, reward clipping, and so on. Luckily, there exist so-called gym wrappers that provide such utility functions. An example that can be used for many video games such as Atari or NES can be found here. For video game gyms it is very common to use wrapper functions in order to achieve a good performance of the agent. The example below shows a simple reward clipping wrapper.
import gym
import numpy as np

class ClipRewardEnv(gym.RewardWrapper):
    """ Example wrapper for reward clipping """
    def __init__(self, env):
        gym.RewardWrapper.__init__(self, env)

    def reward(self, reward):
        # Clip reward to {-1, 0, 1} by its sign
        return np.sign(reward)
From the example shown above you can see that it is possible to change the default behavior of the environment by “overwriting” its core functions. Here, rewards of the environment are clipped to {-1, 0, 1} using np.sign(), based on the sign of the reward.

The Super Mario Bros NES environment

For our Super Mario Bros reinforcement learning experiment, I’ve used gym-super-mario-bros. The API is straightforward and very similar to the OpenAI gym API. The following code shows a random agent playing Super Mario. This causes Mario to wiggle around on the screen and – of course – does not lead to a successful completion of the game.
from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT


# Make gym environment
env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = BinarySpaceToDiscreteSpaceEnv(env, SIMPLE_MOVEMENT)

# Play random
done = True
for step in range(5000):
    if done:
        state = env.reset()
    state, reward, done, info = env.step(env.action_space.sample())
    env.render()

# Close device
env.close()
The agent interacts with the environment by choosing random actions from the action space of the environment. The action space of a video game is actually quite large since you can press multiple buttons at the same time. Here, the action space is reduced to SIMPLE_MOVEMENT, which covers basic game actions such as run in all directions, jump, duck and so on. BinarySpaceToDiscreteSpaceEnv transforms the binary action space (dummy indicator variables for all buttons and directions) into a single integer. So for example the integer action 12 corresponds to pressing right and A (running).

Using a deep learning model as an agent

When playing Super Mario Bros on NES, humans see the game screen – more precisely, they see consecutive frames of pixels displayed at high speed. Our brains transform the raw sensory input from our eyes into signals that trigger corresponding actions (pressing buttons on the controller) that (hopefully) lead Mario to the finish line. When training the agent, the gym renders each game frame as a matrix of pixels, according to the respective action taken by the agent. In principle, those pixels can be used as input to any machine learning model. However, in reinforcement learning we often use convolutional neural networks (CNNs), which excel at image recognition problems compared to other ML models. I won’t go into technical detail about CNNs here; there’s a plethora of great intro articles to CNNs like this one.

Instead of using only the current game screen as input to the model, it is common to use multiple stacked frames as input to the CNN. By doing so, the model can process changes and “movements” on the screen between consecutive frames, which would not be possible when using only a single game frame. Here, the input tensor of our model is of size [84, 84, 4]. This corresponds to a stack of 4 grayscale frames, each of size 84×84 pixels – the standard preprocessing popularized by DeepMind’s DQN work.

The architecture of the deep learning model consists of three convolutional layers, followed by a flatten layer and one fully connected layer with 512 neurons, as well as an output layer consisting of actions = 6 neurons, which corresponds to the action space of the game (in this case RIGHT_ONLY, i.e., actions to move Mario to the right – enlarging the action space usually increases problem complexity and training time). If you take a closer look at the TensorBoard image below, you’ll notice that the model actually consists of not one but two identical convolutional branches: one is the online network branch, the other one is the target network branch. The online network is actually trained using gradient descent. The target network is not directly trained but periodically synchronized every copy = 10000 steps by copying the weights from the online branch to the target branch of the network. The target network branch is excluded from gradient descent training by using the tf.stop_gradient() function around the output layer of the branch. This stops the flow of gradients at the output layer so that they cannot propagate along the branch, and the weights are not updated.
The agent learns by (1) taking random samples of historical transitions, (2) computing the “true” Q-values based on the states of the environment after the action, next_state, using the target network branch and the double Q-learning rule, (3) discounting the target Q-values using gamma = 0.9 and (4) running a batch gradient descent step based on the network’s internal Q-prediction and the true Q-values supplied by target_q. In order to speed up the training process, the agent is not trained after each action but only every learn_each = 3 frames, which corresponds to a training step every 4th frame. In addition, not every frame is stored in the replay buffer but only each 4th frame. This is called frame skipping. More specifically, a max pooling operation is performed that aggregates the information of the last 4 consecutive frames. This is motivated by the fact that consecutive frames contain nearly the same information, which does not add new information to the learning problem and might introduce strongly autocorrelated datapoints.

Speaking of correlated data: our network is trained using gradient descent with adaptive moment estimation (ADAM) at a learning_rate = 0.00025, which requires i.i.d. datapoints in order to work well. This means that we cannot simply use all new transition tuples subsequently for training, since they are highly correlated. To solve this issue, we use a concept called experience replay buffer. Hereby, we store every transition of our game in a ring buffer (a deque object in Python), which is then randomly sampled from when we acquire our training data of batch_size = 32. By using a random sampling strategy and a large enough replay buffer, we can assume that the resulting datapoints are (hopefully) not correlated. The following codebox shows the DQNAgent class.
import time
import random
import numpy as np
from collections import deque
import tensorflow as tf
from matplotlib import pyplot as plt


class DQNAgent:
    """ DQN agent """
    def __init__(self, states, actions, max_memory, double_q):
        self.states = states
        self.actions = actions
        self.session = tf.Session()
        self.build_model()
        self.saver = tf.train.Saver(max_to_keep=10)
        self.session.run(tf.global_variables_initializer())
        self.memory = deque(maxlen=max_memory)
        self.eps = 1
        self.eps_decay = 0.99999975
        self.eps_min = 0.1
        self.gamma = 0.90
        self.batch_size = 32
        self.burnin = 100000
        self.copy = 10000
        self.step = 0
        self.learn_each = 3
        self.learn_step = 0
        self.save_each = 500000
        self.double_q = double_q

    def build_model(self):
        """ Model builder function """
        self.input = tf.placeholder(dtype=tf.float32, shape=(None, ) + self.states, name='input')
        self.q_true = tf.placeholder(dtype=tf.float32, shape=[None], name='labels')
        self.a_true = tf.placeholder(dtype=tf.int32, shape=[None], name='actions')
        self.reward = tf.placeholder(dtype=tf.float32, shape=[], name='reward')
        self.input_float = tf.to_float(self.input) / 255.
        # Online network
        with tf.variable_scope('online'):
            self.conv_1 = tf.layers.conv2d(inputs=self.input_float, filters=32, kernel_size=8, strides=4, activation=tf.nn.relu)
            self.conv_2 = tf.layers.conv2d(inputs=self.conv_1, filters=64, kernel_size=4, strides=2, activation=tf.nn.relu)
            self.conv_3 = tf.layers.conv2d(inputs=self.conv_2, filters=64, kernel_size=3, strides=1, activation=tf.nn.relu)
            self.flatten = tf.layers.flatten(inputs=self.conv_3)
            self.dense = tf.layers.dense(inputs=self.flatten, units=512, activation=tf.nn.relu)
            self.output = tf.layers.dense(inputs=self.dense, units=self.actions, name='output')
        # Target network
        with tf.variable_scope('target'):
            self.conv_1_target = tf.layers.conv2d(inputs=self.input_float, filters=32, kernel_size=8, strides=4, activation=tf.nn.relu)
            self.conv_2_target = tf.layers.conv2d(inputs=self.conv_1_target, filters=64, kernel_size=4, strides=2, activation=tf.nn.relu)
            self.conv_3_target = tf.layers.conv2d(inputs=self.conv_2_target, filters=64, kernel_size=3, strides=1, activation=tf.nn.relu)
            self.flatten_target = tf.layers.flatten(inputs=self.conv_3_target)
            self.dense_target = tf.layers.dense(inputs=self.flatten_target, units=512, activation=tf.nn.relu)
            self.output_target = tf.stop_gradient(tf.layers.dense(inputs=self.dense_target, units=self.actions, name='output_target'))
        # Optimizer
        self.action = tf.argmax(input=self.output, axis=1)
        self.q_pred = tf.gather_nd(params=self.output, indices=tf.stack([tf.range(tf.shape(self.a_true)[0]), self.a_true], axis=1))
        self.loss = tf.losses.huber_loss(labels=self.q_true, predictions=self.q_pred)
        self.train = tf.train.AdamOptimizer(learning_rate=0.00025).minimize(self.loss)
        # Summaries
        self.summaries = tf.summary.merge([
            tf.summary.scalar('reward', self.reward),
            tf.summary.scalar('loss', self.loss),
            tf.summary.scalar('max_q', tf.reduce_max(self.output))
        ])
        self.writer = tf.summary.FileWriter(logdir='./logs', graph=self.session.graph)

    def copy_model(self):
        """ Copy weights to target network """
        self.session.run([tf.assign(new, old) for (new, old) in zip(tf.trainable_variables('target'), tf.trainable_variables('online'))])

    def save_model(self):
        """ Saves current model to disk """
        self.saver.save(sess=self.session, save_path='./models/model', global_step=self.step)

    def add(self, experience):
        """ Add observation to experience """
        self.memory.append(experience)

    def predict(self, model, state):
        """ Prediction """
        if model == 'online':
            return self.session.run(fetches=self.output, feed_dict={self.input: np.array(state)})
        if model == 'target':
            return self.session.run(fetches=self.output_target, feed_dict={self.input: np.array(state)})

    def run(self, state):
        """ Perform action """
        if np.random.rand() < self.eps:
            # Random action
            action = np.random.randint(low=0, high=self.actions)
        else:
            # Policy action
            q = self.predict('online', np.expand_dims(state, 0))
            action = np.argmax(q)
        # Decrease eps
        self.eps *= self.eps_decay
        self.eps = max(self.eps_min, self.eps)
        # Increment step
        self.step += 1
        return action

    def learn(self):
        """ Gradient descent """
        # Sync target network
        if self.step % self.copy == 0:
            self.copy_model()
        # Checkpoint model
        if self.step % self.save_each == 0:
            self.save_model()
        # Break if burn-in
        if self.step < self.burnin:
            return
        # Break if no training
        if self.learn_step < self.learn_each:
            self.learn_step += 1
            return
        # Sample batch
        batch = random.sample(self.memory, self.batch_size)
        state, next_state, action, reward, done = map(np.array, zip(*batch))
        # Get next q values from target network
        next_q = self.predict('target', next_state)
        # Calculate discounted future reward
        if self.double_q:
            q = self.predict('online', next_state)
            a = np.argmax(q, axis=1)
            target_q = reward + (1. - done) * self.gamma * next_q[np.arange(0, self.batch_size), a]
        else:
            target_q = reward + (1. - done) * self.gamma * np.amax(next_q, axis=1)
        # Update model
        summary, _ = self.session.run(fetches=[self.summaries, self.train],
                                      feed_dict={self.input: state,
                                                 self.q_true: np.array(target_q),
                                                 self.a_true: np.array(action),
                                                 self.reward: np.mean(reward)})
        # Reset learn step
        self.learn_step = 0
        # Write
        self.writer.add_summary(summary, self.step)

Training the agent to play

First, we need to instantiate the environment. Here, we use the first level of Super Mario Bros, SuperMarioBros-1-1-v0, as well as a discrete event space with the RIGHT_ONLY action space. Additionally, we use a wrapper that applies frame resizing, stacking and max pooling, reward clipping as well as lazy frame loading to the environment.

When the training starts, the agent begins to explore the environment by taking random actions. This is done in order to build up initial experience that serves as a starting point for the actual learning process. After burnin = 100000 game frames, the agent slowly starts to replace random actions with actions determined by the CNN policy. This is called an epsilon-greedy policy: the agent takes a random action with probability epsilon and a policy-based action with probability (1 - epsilon). Here, epsilon decays during training by a factor of eps_decay = 0.99999975 per step until it reaches eps_min = 0.1, where it remains constant for the rest of the training process. It is important not to eliminate random actions from the training process completely, in order to avoid getting stuck in locally optimal solutions.

For each action taken, the environment returns four objects: (1) the next game state, (2) the reward for taking the action, (3) a flag indicating whether the episode is done and (4) an info dictionary containing additional information from the environment. After taking the action, a tuple of the returned objects is added to the replay buffer and the agent performs a learning step. After learning, the current state is updated with next_state and the loop continues. The while loop breaks if the done flag is True. This corresponds to either the death of Mario or the successful completion of the level. Here, the agent is trained for 10000 episodes.
import time
import numpy as np
from nes_py.wrappers import BinarySpaceToDiscreteSpaceEnv
import gym_super_mario_bros
from gym_super_mario_bros.actions import RIGHT_ONLY
from agent import DQNAgent
from wrappers import wrapper

# Build env (first level, right only)
env = gym_super_mario_bros.make('SuperMarioBros-1-1-v0')
env = BinarySpaceToDiscreteSpaceEnv(env, RIGHT_ONLY)
env = wrapper(env)

# Parameters
states = (84, 84, 4)
actions = env.action_space.n

# Agent
agent = DQNAgent(states=states, actions=actions, max_memory=100000, double_q=True)

# Episodes
episodes = 10000
rewards = []

# Timing
start = time.time()
step = 0

# Main loop
for e in range(episodes):

    # Reset env
    state = env.reset()

    # Reward
    total_reward = 0
    iter = 0

    # Play
    while True:

        # Show env (disabled)
        # env.render()

        # Run agent
        action = agent.run(state=state)

        # Perform action
        next_state, reward, done, info = env.step(action=action)

        # Remember transition
        agent.add(experience=(state, next_state, action, reward, done))

        # Update agent
        agent.learn()

        # Total reward
        total_reward += reward

        # Update state
        state = next_state

        # Increment
        iter += 1

        # If done break loop
        if done or info['flag_get']:
            break

    # Rewards
    rewards.append(total_reward / iter)

    # Print
    if e % 100 == 0:
        print('Episode {e} - '
              'Frame {f} - '
              'Frames/sec {fs} - '
              'Epsilon {eps} - '
              'Mean Reward {r}'.format(e=e,
                                       f=agent.step,
                                       fs=np.round((agent.step - step) / (time.time() - start)),
                                       eps=np.round(agent.eps, 4),
                                       r=np.mean(rewards[-100:])))
        start = time.time()
        step = agent.step

# Save rewards
np.save('rewards.npy', rewards)
After each game episode, the average reward in this episode is appended to the rewards list. Furthermore, different stats such as frames per second and the current epsilon are printed every 100 episodes.

Replay

During training, the program checkpoints the current network every save_each = 500000 frames and keeps the 10 latest models on disk. I’ve downloaded several model versions to my local machine during training and produced the following video.
It is so awesome to see the learning progress of the agent! The training process took approximately 20 hours on a GPU accelerated VM on Google Cloud.
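
If you want to watch a trained agent play yourself, a replay loop could look like the following sketch. It assumes the environment setup and the DQNAgent class from above; the checkpoint step in the restore path is hypothetical and depends on how long you trained.

# Replay sketch (assumes env and DQNAgent from above; the checkpoint step is hypothetical)
agent = DQNAgent(states=(84, 84, 4), actions=env.action_space.n, max_memory=100000, double_q=True)
agent.saver.restore(agent.session, './models/model-500000')

# Act greedily, i.e. no random exploration during replay
agent.eps, agent.eps_min = 0.0, 0.0

state = env.reset()
done = False
while not done:
    env.render()
    action = agent.run(state=state)
    state, reward, done, info = env.step(action=action)
env.close()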

Summary and outlook

Reinforcement learning is an exciting field in machine learning that offers a wide range of possible applications in science and business alike. However, the training of reinforcement learning agents is still quite cumbersome and often requires tedious tuning of hyperparameters and network architecture in order to work well. There have been recent advances, such as RAINBOW (a combination of multiple RL learning strategies), that aim at a more robust framework for training reinforcement learning agents, but the field is still an area of active research. Besides Q-learning, there are many other interesting training concepts in reinforcement learning that have been developed. If you want to try different RL agents and training approaches, I suggest you check out Stable Baselines, a great way to easily use state-of-the-art RL agents and training concepts.

If you are a deep learning beginner and want to learn more, you should check out our brand-new STATWORX Deep Learning Bootcamp, a 5-day in-person introduction to the field that covers everything you need to know in order to develop your first deep learning models: neural net theory, backpropagation and gradient descent, programming models in Python, TensorFlow and Keras, CNNs and other image recognition models, recurrent networks and LSTMs for time series data and NLP, as well as advanced topics such as deep reinforcement learning and GANs.

If you have any comments or questions on my post, feel free to contact me! Also, feel free to use my code (link to GitHub repo) or share this post with your peers on social platforms of your choice. If you’re interested in more content like this, join our mailing list, constantly bringing you fresh data science, machine learning and AI reads and treats from me and my team at STATWORX right into your inbox! Lastly, follow me on LinkedIn or my company STATWORX on Twitter, if you’re interested in more!

In a recent project at STATWORX, I’ve developed a large-scale deep learning application for image classification using Keras and TensorFlow. After developing the model, we needed to deploy it in a quite complex pipeline of data acquisition and preparation routines in a cloud environment. We decided to deploy the model on a prediction server that exposes the model through an API. Thereby, we came across NVIDIA TensorRT Server (TRT Server), a serious alternative to good old TF Serving (which is an awesome product, by the way!). After checking the pros and cons, we decided to give TRT Server a shot. TRT Server has several advantages over TF Serving, such as optimized inference speed, easy model management and resource allocation, versioning and parallel inference handling. Furthermore, TensorRT Server is not “limited” to TensorFlow (and Keras) models. It can serve models from all major deep learning frameworks, such as TensorFlow, MXNet, PyTorch, Theano, Caffe and CNTK.

Despite the load of cool features, I found it a bit cumbersome to set up the TRT Server. The installation and documentation are scattered across quite a few repositories, documentation guides and blog posts. That is why I decided to write this blog post about setting up the server and getting your predictions going!

NVIDIA TensorRT Server

TensorRT Inference Server is NVIDIA’s cutting-edge server product for putting deep learning models into production. It is part of NVIDIA’s TensorRT inference platform and provides a scalable, production-ready solution for serving your deep learning models from all major frameworks. It is based on NVIDIA Docker and contains everything that is required to run the server from inside the container. Furthermore, NVIDIA Docker allows for using GPUs inside a Docker container, which, in most cases, significantly speeds up model inference. Talking about speed – TRT Server can be considerably faster than TF Serving and allows for multiple inferences from multiple models at the same time, using CUDA streams to exploit GPU scheduling and serialization (see image below).

Visualization of model serialization and parallelism

With TRT Server you can specify the number of concurrent inference computations using so-called instance groups, which can be configured on the model level (see section “Model Configuration File”). For example, if you are serving two models and one model gets significantly more inference requests, you can assign more GPU resources to this model, allowing it to process more requests in parallel. Furthermore, instance groups allow you to specify whether a model should be executed on CPU or GPU, which can be a very interesting feature in more complex serving environments. Overall, TRT Server has a bunch of great features that make it interesting for production usage.

NVIDIA architecture

The image above illustrates the general architecture of the server. One can see the HTTP and gRPC interfaces that allow you to integrate your models into other applications that are connected to the server over LAN or WAN. Pretty cool! Furthermore, the server exposes a couple of sanity features such as health status checks, which also come in handy in production.

Setting up the Server

As mentioned before, TensorRT Server lives inside an NVIDIA Docker container. In order to get things going, you need to complete several installation steps (in case you are starting with a blank machine, like here). The overall process is quite long and requires a certain amount of general cloud, network and IT knowledge. I hope that the following steps make the installation and setup process clear to you.

Launch a Deep Learning VM on Google Cloud

For my project, I used a Google Deep Learning VM that comes with preinstalled CUDA as well as TensorFlow libraries. You can launch a cloud VM using the Google Cloud SDK or in the GCP console (which is pretty easy to use, in my opinion). Instructions for installing the GCP SDK can be found here. Please note that it might take some time before you can connect to the VM because of the CUDA installation process, which takes several minutes. You can check the status of the VM in the cloud logging console.

# Create project
gcloud projects create tensorrt-server

# Start instance with deep learning image
gcloud compute instances create tensorrt-server-vm \
    --project tensorrt-server \
    --zone your-zone \
    --machine-type n1-standard-4 \
    --create-disk='size=50' \
    --image-project=deeplearning-platform-release \
    --image-family tf-latest-gpu \
    --accelerator='type=nvidia-tesla-k80,count=1' \
    --metadata='install-nvidia-driver=True' \
    --maintenance-policy TERMINATE

After successfully setting up your instance, you can SSH into the VM using the terminal. From there you can execute all the necessary steps to install the required components.

# SSH into instance
gcloud compute ssh tensorrt-server-vm --project tensorrt-server --zone your-zone

Note: Of course, you have to adapt the script for your project and instance names.

Install Docker

After setting up the GCP cloud VM, you have to install the Docker service on your machine. The Google Deep Learning VM uses Debian as its operating system. You can use the following code to install Docker on the VM.

# Install Docker
sudo apt-get update
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
sudo apt-get update
sudo apt-get install docker-ce

You can verify that Docker has been successfully installed by running the following command.

sudo docker run --rm hello-world

You should see a greeting from the Docker container, which should look something like this:

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
d1725b59e92d: Already exists 
Digest: sha256:0add3ace90ecb4adbf7777e9aacf18357296e799f81cabc9fde470971e499788
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://hub.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/get-started/

Congratulations, you’ve just installed Docker successfully!

Install NVIDIA Docker

Unfortunately, Docker has no “out of the box” support for GPUs connected to the host system. Therefore, the installation of the NVIDIA Docker runtime is required to use TensorRT Server’s GPU capabilities within a containerized environment. NVIDIA Docker is also used for TF Serving, if you want to use your GPUs for model inference. The following figure illustrates the architecture of the NVIDIA Docker Runtime.

NVIDIA docker

You can see that the NVIDIA Docker Runtime is layered around the Docker engine, allowing you to use standard Docker as well as NVIDIA Docker containers on your system.

Since the NVIDIA Docker Runtime is a proprietary product of NVIDIA, you have to register at NVIDIA GPU Cloud (NGC) to get an API key in order to install and download it. To authenticate against NGC execute the following command in the server command line:

# Login to NGC
sudo docker login nvcr.io

You will be prompted for a username and an API key. As username you have to enter $oauthtoken; the password is the generated API key. After you have successfully logged in, you can install the NVIDIA Docker components. Following the instructions on the NVIDIA Docker GitHub repo, you can install NVIDIA Docker by executing the following script (Ubuntu 14.04/16.04/18.04, Debian Jessie/Stretch).

# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge -y nvidia-docker

# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | 
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | 
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image
sudo docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

Installing TensorRT Server

The next step, after successfully installing NVIDIA Docker, is to install TensorRT Server. It can be pulled from the NVIDIA Container Registry (NCR). Again, you need to be authenticated against NGC to perform this action.

# Pull TensorRT Server (make sure to check the current version)
sudo docker pull nvcr.io/nvidia/tensorrtserver:18.09-py3

After pulling the image, TRT Server is ready to be started on your cloud machine. The next step is to create a model that will be served by TRT Server.

Model Deployment

After installing the required technical components and pulling the TRT Server container, you need to take care of your model and the deployment. TensorRT Server manages its models in a folder on your server, the so-called model repository.

Setting up the Model Repository

The model repository contains your exported TensorFlow / Keras etc. model graphs in a specific folder structure. For each model in the model repository, a subfolder with the corresponding model name needs to be defined. Within those model subfolders, the model configuration file (config.pbtxt), the label definitions (labels.txt) as well as the model version subfolders are located. Those subfolders allow you to manage and serve different model versions. The file labels.txt contains strings of the target labels in the appropriate order, corresponding to the output layer of the model. Within the version subfolder, a file named model.graphdef (the exported protobuf graph) is stored. model.graphdef is actually a frozen TensorFlow graph that is created after exporting a TensorFlow model and needs to be named accordingly.

Remark: I did not manage to get a working serving from a tensorflow.python.saved_model.simple_save() or tensorflow.python.saved_model.builder.SavedModelBuilder() export with TRT Server due to some variable initialization error. We therefore use the “freezing graph” approach, which converts all TensorFlow variables inside a graph to constants and outputs everything into a single file (which is model.graphdef).

/models
|-   model_1/
|--      config.pbtxt
|--      labels.txt
|--      1/
|---        model.graphdef

Since the model repository is just a folder, it can be located anywhere the TRT Server host has a network connection to. For example, you can store your exported model graphs in a cloud repository or in a local folder on your machine. New models can be exported and deployed there in order to be servable through the TRT Server.

Model Configuration File

Within your model repository, the model configuration file (config.pbtxt) sets important parameters for each model on the TRT Server. It contains technical information about your servable model and is required for the model to be loaded properly. There are several things you can control here:

name: "model_1"
platform: "tensorflow_graphdef"
max_batch_size: 64
input [
   {
      name: "dense_1_input"
      data_type: TYPE_FP32
      dims: [ 5 ]
   }
]
output [
   {
      name: "dense_2_output"
      data_type: TYPE_FP32
      dims: [ 2 ]
      label_filename: "labels.txt"
   }
]
instance_group [
   {
      kind: KIND_GPU
      count: 4
   }
]

First, name defines the tag under which the model is reachable on the server. This has to be the name of your model folder in the model repository. platform defines the framework the model was built with. If you are using TensorFlow or Keras, there are two options: (1) tensorflow_savedmodel and (2) tensorflow_graphdef. As mentioned before, I used tensorflow_graphdef (see my remark at the end of the previous section). max_batch_size, as the name says, controls the maximum batch size for your predictions. input defines the node name of your model’s input layer (yes, you should name your layers and nodes in TensorFlow or Keras), its data_type, currently only supporting numeric types such as TYPE_FP16, TYPE_FP32, TYPE_FP64, and the input dims. Correspondingly, output defines the node name of your model’s output layer, its data_type and dims. You can specify a labels.txt file that holds the labels of the output neurons in the appropriate order. Since we only have two output classes here, the file simply looks like this:

class_0
class_1

Each row defines a single class label. Note that the file does not contain any header. The last section, instance_group, lets you assign specific GPU (KIND_GPU) or CPU (KIND_CPU) resources to your model. In the example file, there are 4 concurrent GPU threads assigned to the model, allowing for four simultaneous predictions.

Building a simple model for serving

In order to serve a model through TensorRT server, you’ll first need – well – a model. I’ve prepared a small script that builds a simple MLP for demonstration purposes in Keras. I’ve already used TRT Server successfully with bigger models such as InceptionResNetV2 or ResNet50 in production and it worked very well.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import InputLayer, Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.utils import to_categorical

# Make toy data
X, y = make_classification(n_samples=1000, n_features=5)

# Make target categorical
y = to_categorical(y)

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Scale inputs
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model definition
model_1 = Sequential()
model_1.add(Dense(input_shape=(X_train.shape[1], ),
                  units=16, activation='relu', name='dense_1'))
model_1.add(Dense(units=2, activation='softmax', name='dense_2'))
model_1.compile(optimizer='adam', loss='categorical_crossentropy')

# Early stopping
early_stopping = EarlyStopping(patience=5)
model_checkpoint = ModelCheckpoint(filepath='model_checkpoint.h5',
                                   save_best_only=True,
                                   save_weights_only=True)
callbacks = [early_stopping, model_checkpoint]

# Fit model and load best weights
model_1.fit(x=X_train, y=y_train, validation_data=(X_test, y_test),
            epochs=50, batch_size=32, callbacks=callbacks)

# Load best weights after early stopping
model_1.load_weights('model_checkpoint.h5')

# Export model
model_1.save('model_1.h5')

The script builds some toy data using sklearn.datasets.make_classification and fits a single layer MLP to the data. After fitting, the model gets saved for further treatment in a separate export script.

Freezing the graph for serving

Serving a Keras (TensorFlow) model works by exporting the model graph as a separate protobuf file (.pb file extension). A simple way to export the model into a single file that contains all the weights of the network is to “freeze” the graph and write it to disk. Thereby, all tf.Variables in the graph are converted to tf.constants, which are stored together with the graph in a single file. I’ve modified this script for that purpose.

import os
import shutil
import keras.backend as K
import tensorflow as tf
from keras.models import load_model
from tensorflow.python.framework import graph_util
from tensorflow.python.framework import graph_io

def freeze_model(model, path):
    """ Freezes the graph for serving as protobuf """
    # Remove folder if present
    if os.path.isdir(path):
        shutil.rmtree(path)
    # Rebuild the model repository folder structure including the version subfolder
    os.makedirs(path + '/1')
    shutil.copy('config.pbtxt', path)
    shutil.copy('labels.txt', path)
    # Disable Keras learning phase
    K.set_learning_phase(0)
    # Load model
    model_export = load_model(model)
    # Get Keras sessions
    sess = K.get_session()
    # Output node name
    pred_node_names = ['dense_2_output']
    # Dummy op to rename the output node
    dummy = tf.identity(input=model_export.outputs[0], name=pred_node_names[0])
    # Convert all variables to constants
    graph_export = graph_util.convert_variables_to_constants(
        sess=sess,
        input_graph_def=sess.graph.as_graph_def(),
        output_node_names=pred_node_names)
    graph_io.write_graph(graph_or_graph_def=graph_export,
                         logdir=path + '/1',
                         name='model.graphdef',
                         as_text=False)

# Freeze Model
freeze_model(model='model_1.h5', path='model_1')

# Upload to GCP
os.system('gcloud compute scp model_1 tensorrt-server-vm:~/models/ --project tensorrt-server --zone us-west1-b --recurse')

The freeze_model() function takes the path to the saved Keras model file model_1.h5 as well as the path for the graph to be exported. Furthermore, I’ve enhanced the function so that it builds the required model repository folder structure, containing the version subfolder, config.pbtxt as well as labels.txt, both stored in my project folder. The function loads the model and exports the graph into the defined destination. In order to do so, you need to define the output node’s name and then convert all variables in the graph to constants using graph_util.convert_variables_to_constants, which uses the respective Keras backend session that has to be fetched using K.get_session(). Furthermore, it is important to disable the Keras learning mode using K.set_learning_phase(0) prior to export. Lastly, I’ve included a small CLI command that uploads my model folder to the model repository /models on my GCP instance.

Starting the Server

Now that everything is installed, set up and configured, it is (finally) time to launch our TRT prediction server. The following command starts the NVIDIA Docker container and maps the model repository to the container.

sudo nvidia-docker run --rm --name trtserver -p 8000:8000 -p 8001:8001 \
-v ~/models:/models nvcr.io/nvidia/tensorrtserver:18.09-py3 trtserver \
--model-store=/models

--rm automatically removes the container when it exits, and --name gives the container a name (here trtserver). -p exposes ports 8000 (REST) and 8001 (gRPC) on the host and maps them to the respective container ports. -v mounts the model repository folder on the host, which is ~/models in my case, into the container at /models, which is then referenced by --model-store as the location to look for servable model graphs. If everything goes fine, you should see console output similar to the one below. If you don’t want to see the output of the server, you can start the container in detached mode using the -d flag on startup.

===============================
== TensorRT Inference Server ==
===============================

NVIDIA Release 18.09 (build 688039)

Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2018 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for the inference server.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

I1014 10:38:55.951258 1 server.cc:631] Initializing TensorRT Inference Server
I1014 10:38:55.951339 1 server.cc:680] Reporting prometheus metrics on port 8002
I1014 10:38:56.524257 1 metrics.cc:129] found 1 GPUs supported power usage metric
I1014 10:38:57.141885 1 metrics.cc:139]   GPU 0: Tesla K80
I1014 10:38:57.142555 1 server.cc:884] Starting server 'inference:0' listening on
I1014 10:38:57.142583 1 server.cc:888]  localhost:8001 for gRPC requests
I1014 10:38:57.143381 1 server.cc:898]  localhost:8000 for HTTP requests
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 235] RAW: Entering the event loop ...
I1014 10:38:57.880877 1 server_core.cc:465] Adding/updating models.
I1014 10:38:57.880908 1 server_core.cc:520]  (Re-)adding model: model_1
I1014 10:38:57.981276 1 basic_manager.cc:739] Successfully reserved resources to load servable {name: model_1 version: 1}
I1014 10:38:57.981313 1 loader_harness.cc:66] Approving load for servable version {name: model_1 version: 1}
I1014 10:38:57.981326 1 loader_harness.cc:74] Loading servable version {name: model_1 version: 1}
I1014 10:38:57.982034 1 base_bundle.cc:180] Creating instance model_1_0_0_gpu0 on GPU 0 (3.7) using model.savedmodel
I1014 10:38:57.982108 1 bundle_shim.cc:360] Attempting to load native SavedModelBundle in bundle-shim from: /models/model_1/1/model.savedmodel
I1014 10:38:57.982138 1 reader.cc:31] Reading SavedModel from: /models/model_1/1/model.savedmodel
I1014 10:38:57.983817 1 reader.cc:54] Reading meta graph with tags { serve }
I1014 10:38:58.041695 1 cuda_gpu_executor.cc:890] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I1014 10:38:58.042145 1 gpu_device.cc:1405] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
I1014 10:38:58.042177 1 gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
I1014 10:38:58.042192 1 gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
I1014 10:38:58.042200 1 gpu_device.cc:971]      0 
I1014 10:38:58.042207 1 gpu_device.cc:984] 0:   N 
I1014 10:38:58.067349 1 loader.cc:113] Restoring SavedModel bundle.
I1014 10:38:58.074260 1 loader.cc:148] Running LegacyInitOp on SavedModel bundle.
I1014 10:38:58.074302 1 loader.cc:233] SavedModel load for tags { serve }; Status: success. Took 92161 microseconds.
I1014 10:38:58.075314 1 gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7) with Cuda compute capability 3.7. The minimum required Cuda capability is 5.2.
I1014 10:38:58.075343 1 gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
I1014 10:38:58.075348 1 gpu_device.cc:971]      0 
I1014 10:38:58.075353 1 gpu_device.cc:984] 0:   N 
I1014 10:38:58.083451 1 loader_harness.cc:86] Successfully loaded servable version {name: model_1 version: 1}

There is also a warning in the log suggesting that you should start the container with the following arguments:

--shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864

You can do this of course. However, in this example I did not use them.

Installing the Python Client

Now it is time to test our prediction server. TensorRT Server comes with several client libraries that allow you to send data to the server and get predictions. The recommended method of building the client libraries is, again, Docker. To use the Docker container that contains the client libraries, you need to clone the respective GitHub repo using:

git clone https://github.com/NVIDIA/dl-inference-server.git

Then, cd into the folder dl-inference-server and run

docker build -t inference_server_clients .

This will build the container on your machine (it takes some time). To use the client libraries from within the container on your host, you need to mount a folder into the container. First, start the container in an interactive session (-it flag):

docker run --name tensorrtclient --rm -it -v /tmp:/tmp/host inference_server_clients

Then, run the following commands in the container’s shell (you may have to create /tmp/host first):

cp build/image_client /tmp/host/.
cp build/perf_client /tmp/host/.
cp build/dist/dist/tensorrtserver-*.whl /tmp/host/.
cd /tmp/host

The code above copies the prebuilt image_client and perf_client clients as well as the Python wheel into the mounted folder and makes them accessible from the host system. Lastly, you need to install the Python client library using

pip install tensorrtserver-0.6.0-cp35-cp35m-linux_x86_64.whl

on the container system. Finally! That’s it, we’re ready to go (it sounds easier than it was)!

Inference using the Python Client

Using Python, you can easily perform predictions using the client library. In order to send data to the server, you need an InferContext() from the tensorrtserver.api module, which takes the TRT Server IP and port as well as the desired model name. If you are using the TRT Server in the cloud, make sure that you have appropriate firewall rules allowing for traffic on ports 8000 and 8001.

from tensorrtserver.api import *
import numpy as np

# Some parameters
outputs = 2
batch_size = 1

# Init client
trt_host = '123.456.789.0:8000' # local or remote IP of TRT Server
model_name = 'model_1'
ctx = InferContext(trt_host, ProtocolType.HTTP, model_name)

# Sample some random data
data = np.float32(np.random.normal(0, 1, [1, 5]))

# Get prediction
# Layer names correspond to the names in config.pbtxt
response = ctx.run(
    {'dense_1_input': data}, 
    {'dense_2_output': (InferContext.ResultFormat.CLASS, outputs)},
    batch_size)

# Result
print(response)
# Output: {'output0': [[(0, 1.0, 'class_0'), (1, 0.0, 'class_1')]]}

Note: It is important that the data you are sending to the server matches the floating point precision previously defined for the input layer in the model configuration file. Furthermore, the names of the input and output layers must exactly match those of your model. If everything went well, ctx.run() returns a dictionary of predicted values, which you can further postprocess according to your needs.
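
For completeness, here is a small sketch of how the returned dictionary could be post-processed. It is based solely on the response format printed above (a list of (index, probability, label) tuples per sample), not on any official client documentation.

# Post-processing sketch, based on the response format shown above
for output_name, batch_results in response.items():
    for result in batch_results:
        # result is a list of (class_index, probability, label) tuples
        top_index, top_prob, top_label = max(result, key=lambda r: r[1])
        print('{}: predicted {} with probability {:.2f}'.format(output_name, top_label, top_prob))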

Conclusion and Outlook

Wow, that was quite a ride! However, TensorRT Server is a great product for putting your deep learning models into production. It is fast, scalable and full of neat features for production usage. I did not go into detail regarding inference performance. If you’re interested in more, make sure to check out this blog post from NVIDIA. I must admit that, in comparison to TRT Server, TF Serving is much handier when it comes to installation, model deployment and usage. However, compared to TRT Server, it lacks some functionalities that are handy in production. Bottom line: my team and I will definitely add TRT Server to our production tool stack for deep learning models.

If you have any comments or questions on my story, feel free to comment below! I will try to answer them. Also, feel free to use my code or share this story with your peers on social platforms of your choice.

If you’re interested in more content like this, join our mailing list, constantly bringing you new data science, machine learning and AI reads and treats from me and my team at STATWORX right into your inbox!

Lastly, follow me on LinkedIn or Twitter, if you’re interested to connect with me.

Last Christmas is one of the most popular Christmas tunes that ever were, are and will be out there. The song was written by the brilliant musician George Michael and released in 1984, at a time when Epic Records wanted to quickly put out a Christmas tune. According to Wikipedia, there are rumours that George Michael just changed the lyrics of an already composed tune named “Last Easter” to give it a more “winterly” feel. However, this has never been officially confirmed by the record company. Nonetheless, “Last Christmas” remains at the top of all Christmas pop songs – also at STATWORX, where “Last Christmas” is on heavy rotation during the holiday season!

george michael meme

In a recent meme on the web I saw that the Google Trends search volume for “last christmas” is beginning to kick in (first week of October), indicating that Christmas (and the voice of George Michael, backed by Christmas bells) is knocking at the door. In order to get ready for the “most wonderful time of the year”, I decided to build a small neural network in Keras that is able to perform a multistep forecast of the expected “last christmas” search volume this year (which may correlate with the number of plays on TV, radio, etc.).

google trends last christmas

The screenshot above shows the normalized search volume (ranging between 0 and 100) for “last christmas” in Germany from 2004 to October 2018. In winter 2017 there was an all-time high in search traffic for “Last Christmas”, maybe due to the tragic death of George Michael on December 25th, 2016.

Data Preparation

I’ve downloaded the search volume data from Google Trends as a CSV file and manually formatted the file header (it included a description of the data as well as some blank lines) as well as string values of “<1”, which I’ve replaced with numeric zeros. After preparing the file, I imported it into Python using pandas.read_csv().

Since neural networks work best with scaled data, i.e. data that ranges between a specific lower and upper bound, e.g. 0 and 1 or -1 and 1, I divided the normalized search volume by 100, y_norm = y / 100.
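
As a rough sketch, the loading and scaling step could look like this. The file name and the column name are assumptions for a manually cleaned Google Trends export, not the exact ones I used.

import pandas as pd
import numpy as np

# Load the cleaned Google Trends export (file and column names are assumptions)
df = pd.read_csv('last_christmas_trends.csv')
target = df['search_volume'].astype(float).values

# Scale the normalized search volume to the [0, 1] range
target_scaled = target / 100.0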

There are many neural network architectures that can be used in order to perform time series forecasting. Since the data is modeled in a simple input-output-style, of course, MLPs can be used. Furthermore, 1-dimensional convolutional networks can be employed. Last but not least, LSTMs are also applicable.

Multistep forecasting can be done in several ways: (1) building a separate forecasting model for each forecast timestep, (2) building a recursive AR(p)-style model that predicts the next value based on previous predictions, and (3) building a model that is able to predict multiple values into the future at the same time. Since neural networks can easily handle multiple outputs, I decided to go with approach (3).

Preparing data for multistep forecasting can be a bit cumbersome, especially when the input data consists of multiple timesteps and variables. In order to prepare my X and y data, I used the following snippet:

def prepare_data(target, window_X, window_y):
    """ Data preprocessing for multistep forecast """
    # Placeholders
    X, y = [], []
    n = len(target)
    # Iterators
    start_X = 0
    end_X = start_X + window_X
    start_y = end_X
    end_y = start_y + window_y
    # Build tensors
    for _ in range(n):
        if end_y < n:
            X.append(target[start_X:end_X])
            y.append(target[start_y:end_y])
        start_X += 1
        end_X = start_X + window_X
        start_y += 1
        end_y = start_y + window_y
    # Convert to arrays and return
    X = np.array(X)
    y = np.array(y)
    return X, y

The function transforms a single vector of values, target, into two 2-dimensional tensors X and y. The function arguments window_X and window_y define the number of input lags per observation and the number of output values (timesteps to be predicted), respectively. Let’s take a look at the tensors. X[:3] yields:

array([[ 2,  1,  0,  0,  1,  1,  1,  1,  2,  4, 21, 69],
       [ 1,  0,  0,  1,  1,  1,  1,  2,  4, 21, 69,  1],
       [ 0,  0,  1,  1,  1,  1,  2,  4, 21, 69,  1,  1]])

whereas y[:3] yields:

array([[1, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 1],
       [1, 1, 0, 0, 1, 1]])

One can see that the tensors contain sliding windows of fixed length over the vector target. Specifically, I used window_X = 12 and window_y = 6, which means that each row in our input tensor X consists of window_X = 12 elements that are used to forecast the next window_y = 6 timesteps. Note that the original time series contains T = 178 months of data, whereas X is of shape (160, 12). The reason for this is that values only get appended to the X and y tensors if the current end_y iterator is smaller than the total number of observations in the dataset. Otherwise, the last rows of y would be incomplete.
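
For reference, the tensors above result from a call like the following (a usage sketch with the window sizes described in the text).

# Usage sketch with the window sizes from the text
X, y = prepare_data(target, window_X=12, window_y=6)
print(X.shape, y.shape)   # (160, 12) and (160, 6) for the 178 monthly observations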

# Training and test
train = 100
X_train = X[:train]
y_train = y[:train]
X_valid = X[train:]
y_valid = y[train:]

For model training, I used the first 100 observations for training purposes and the remaining 60 observations for validation (early stopping). Furthermore, I built a tensor for prediction, X_test, which contains the last 12 observations of the time series (November 1st, 2017 – October 1st, 2018), in order to compute a prediction for the following 6 months.
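
The prediction input is not shown in the snippet above; a hypothetical construction of X_test from the last twelve observations could look like this.

# Hypothetical construction of the prediction input: the last window_X = 12 observations
X_test = np.array(target[-12:]).reshape(1, -1)   # shape (1, 12)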

Model Building

The following Python snippet shows the function for fitting three neural network models: a simple MLP, a 1-dimensional CNN and an LSTM (I will not go into detail about how the models exactly work – there are tons of great tutorials on the web). I wrote a single function for all three architectures. Thereby, I exploited a little trick I recently discovered: if you’re experimenting with different network architectures, you always have to make sure that the input tensors are of the appropriate shape. Since MLPs, CNNs and LSTMs require different dimensions of input tensors, it can be helpful to provide a single input layer that takes the original input tensor and then add a Reshape() layer that reshapes the input into the required dimensionality of the respective Dense(), Conv() or LSTM() layer. I know that, from a computational point of view, this is not very efficient; however, I do not care in this case 😉

from keras.models import Sequential
from keras.layers import InputLayer, Reshape, Dense, Conv1D, MaxPool1D, Flatten, LSTM
from keras.callbacks import EarlyStopping, ModelCheckpoint


def fit_model(type, X_train, y_train, X_test, y_test, batch_size, epochs):
    """ Training function for network """
    
    # Model input
    model = Sequential()
    model.add(InputLayer(input_shape=(X_train.shape[1], )))

    if type == 'mlp':
        model.add(Reshape(target_shape=(X_train.shape[1], )))
        model.add(Dense(units=64, activation='relu'))

    if type == 'cnn':
        model.add(Reshape(target_shape=(X_train.shape[1], 1)))
        model.add(Conv1D(filters=64, kernel_size=4, activation='relu'))
        model.add(MaxPool1D())
        model.add(Flatten())

    if type == 'lstm':
        model.add(Reshape(target_shape=(X_train.shape[1], 1)))
        model.add(LSTM(units=64, return_sequences=False))

    # Output layer
    model.add(Dense(units=64, activation='relu'))
    model.add(Dense(units=y_train.shape[1], activation='sigmoid'))

    # Compile
    model.compile(optimizer='adam', loss='mse')

    # Callbacks
    early_stopping = EarlyStopping(monitor='val_loss', patience=10)
    model_checkpoint = ModelCheckpoint(filepath='model.h5', save_best_only=True)
    callbacks = [early_stopping, model_checkpoint]

    # Fit model
    model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test),
              batch_size=batch_size, epochs=epochs,
              callbacks=callbacks, verbose=2)

    # Load best model
    model.load_weights('model.h5')

    # Return
    return model

The InputLayer() consumes the prepared input tensor X_train. Depending on the type of model, distinct reshape operations are carried out in order to provide proper tensor dimensions for the model. The MLP requires the input tensor to be 2-dimensional, whereas the CNN and LSTM models require 3-dimensional input tensors. After the respective model architecture layers, another Dense() layer and the output layer are added. Note that I’ve used a sigmoid activation for the output. This makes sure that the predicted values of the network range between 0 and 1, which corresponds to the normalized search volume data (another cool trick for predicting normalized outcomes or percentages). Dense activations are set to ReLU; optimization is performed using the ADAM optimizer. The model uses early stopping to stop model training when the validation error does not improve for 10 consecutive epochs. Then, thanks to model checkpointing, the best model is reloaded and returned.
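
To tie things together, a usage sketch for all three architectures might look like this. The batch size and the number of epochs are assumptions, not values taken from the original training runs.

# Usage sketch (batch size and epoch count are assumptions)
model_mlp = fit_model('mlp', X_train, y_train, X_valid, y_valid, batch_size=16, epochs=500)
model_cnn = fit_model('cnn', X_train, y_train, X_valid, y_valid, batch_size=16, epochs=500)
model_lstm = fit_model('lstm', X_train, y_train, X_valid, y_valid, batch_size=16, epochs=500)

# Multistep forecast for the next 6 months, one row of 6 values per model
pred_mlp = model_mlp.predict(X_test)
pred_cnn = model_cnn.predict(X_test)
pred_lstm = model_lstm.predict(X_test)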

Prediction

After model training, the predictions for the following 6 months are generated by feeding X_test into the respective model architecture. The following table shows the predicted values:

month MLP CNN LSTM
2018-11 0.165803 0.193593 0.214891
2018-12 0.857727 0.881791 0.817105
2019-01 0.034604 0.034484 0.034604
2019-02 0.007150 0.002432 0.007150
2019-03 0.012865 0.000508 0.012865
2019-04 0.013644 0.000502 0.013644

Overall, the predictions look quite similar between the models. However, there are certain systematic differences:

  • the CNN model predicts the highest search volume in December
  • the MLP seems to overestimate the search volume after the holiday season
  • the LSTM model shows the highest prediction in November and January but the lowest prediction in December

The following plot shows the search volume as well as the predictions for all three models from April 2016 to April 2019:

predictions last christmas

Conclusion and Outlook

Based on the predictions, there is an expected minor decline in search interest. This might be a predictor of lower “Last Christmas” song penetration. However, I did not find any data on radio plays etc. Of course, the model can be improved by incorporating further information into the network. Anyway, I had great fun building the model and working with the search volume data. At the beginning of November, I will check which model performed best when new actual data comes in. Besides this dataset, Google Trends is a great resource for including external information in your models. For example, in a recent use case at STATWORX, we used Google Trends data to incorporate the search interest for specific products into our sales forecasting models. You should give it a try!

If you want to play around with the data or the model, you can find everything on our GitHub repository.

If you have any comments or questions on my blog post, contact me, I will try to answer them. Also, feel free to use my code or share this story with your peers on social platforms of your choice. Follow me on LinkedIn or Twitter, if you want to stay in touch.

Make sure you frequently check the awesome STATWORX Blog for more interesting data science, ML and AI content straight from our office in Frankfurt, Germany!

If you’re interested in more quality content like this, join my mailing list, constantly bringing you new data science, machine learning and AI reads and treats from me and my team right into your inbox!

Introduction

Teaching machines to handle image data is probably one of the most exciting tasks in our daily routine at STATWORX. Computer vision in general is a path to many possibilities that some would consider intriguing. Besides learning from images, computer vision algorithms also enable machines to learn from any kind of video-sequence data. With autonomous driving on the line, learning from images and videos is probably one of the hottest topics right now.

learning images - so hot right now

In this post, I will show you how to get started with learning from image data. I will make use of Keras, a high-level API for TensorFlow, CNTK, and Theano. Keras is implemented in Python and in R. For this post I will work through the Python implementation.

Setup

I am using Python 3.6.5 and Keras is running with a TensorFlow backend. The dataset I will be using is the Airbus Ship Detection dataset from the recent Kaggle competition. To get started, we will be building a very simple network, a so-called autoencoder network. Autoencoders are simple networks in the sense that they are not aiming to predict any target. Rather, they aim to learn and reconstruct images or data in general. In his blog post, Venelin Valkov shows the following figure, which I think is quite cool:

mushroom encoder

The figure perfectly describes the intention of an autoencoder. The algorithm takes an input, compresses it, and then tries to reconstruct it. Why would we do this? Well, autoencoders have numerous interesting applications. First, they are reasonably good at detecting outliers. The idea is that you teach a machine to reconstruct non-outliers. Thus, when confronted with an outlier, the algorithm will probably have a hard time reconstructing that very observation. Second, autoencoders are fairly interesting when you are looking to reduce the dimensionality of your data. Speaking about images, you can think of it as a complexity reduction for the images. An algorithm is unlikely to reconstruct nuances of the image that are rather irrelevant to the content. Image recognition or classification algorithms are prone to overreacting to certain nuances of images, so denoising them might ease the learning procedure. Thus, autoencoders can serve as a powerful preprocessing tool for denoising your data.
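
To make this idea more tangible before we dive into the image data, here is a deliberately tiny, generic autoencoder sketch in Keras. It is purely illustrative and not the convolutional autoencoder built later in this post; the input dimension of 784 and the bottleneck of 32 units are arbitrary assumptions.

from keras.models import Sequential
from keras.layers import Dense

# Illustrative toy autoencoder: compress 784-dimensional inputs to 32 dimensions and back
autoencoder = Sequential()
autoencoder.add(Dense(units=32, activation='relu', input_shape=(784, )))   # encoder
autoencoder.add(Dense(units=784, activation='sigmoid'))                    # decoder
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Note: an autoencoder is trained to reproduce its own input, e.g.
# autoencoder.fit(x=X, y=X, epochs=10, batch_size=32)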

Data Preparation

Preparing your data is one of the most important tasks when training algorithms. Even more so when you are handling image data. Images typically require large amounts of storage, especially since computer vision algorithms usually need to be fed with a considerable amount of data. To cope with this issue, my colleagues and I typically make use of either large on-premise servers or cloud instances.

For this blog post, however, I am choosing to run everything on my local machine. Why? Well, if you are reading this and you are interested in taking your first steps in developing your own code to handle image data, I don’t want to bother you with the details of setting up cloud instances. If you are reading this and you are already experienced in working with this kind of problem, you will most likely work with cloud instances anyway and would be bored by my description as well. So, for this little experiment I am running everything on my local machine, and I organized the data as follows:


00_data
    | train
        | train_image_01
        | train_image_02
        | ...
    | test
        | test_image_01
        | ...

To read in the data, I am simply looping over the images, using the OpenCV implementation cv2 and the Keras preprocessing tools. I know, I know, Keras has this genius ImageDataGenerator module; however, I think it is important to understand the required input, so for this post I will make use of the OpenCV tools. The preprocessing is a little different than with other data. While we see something similar to this:

training images

A machine, however, does not see images, but rather data: each image is represented by a matrix of pixel values, so each picture is a data matrix. Unlike with other problems, where all data is compressed into one matrix, we need to account for this more complex setup. To deal with it, we can use the ndarray data type. Implemented in the numpy ecosystem, ndarrays provide a handy data type for multidimensional data. Thus, we convert all our images to numpy arrays and pack them together in ndarray format.

# import libs
import os
import pandas as pd
import numpy as np
import cv2 
import random
from keras.preprocessing.image import ImageDataGenerator, img_to_array

# set the path to the images
train_path = "00_data/train"
test_path = "00_data/test"

# list train image file names
train_images = os.listdir(f'{train_path}')

# list test image file names
test_images = os.listdir(f'{test_path}')

# take the first 10,000 training and 1,000 test images for this first run
train_images_first_run = train_images[0:10000]
test_images_first_run = test_images[0:1000]

# open up container for training_data
train_data = []
test_data = []

# loop over training images
for imgs in train_images_first_run:
    
    # load the image and resize it
    img = cv2.imread(f'{train_path}/{imgs}')
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img = cv2.resize(img, (128, 128))
    img = img_to_array(img)
    train_data.append(img)

# loop over testing images
for imgs in test_images_first_run:
    
    # load the image and resize it
    img = cv2.imread(f'{test_path}/{imgs}')
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    img = cv2.resize(img, (128, 128))
    img = img_to_array(img)
    test_data.append(img)

# convert data to np array
train_data = np.array(train_data, dtype = "float")
test_data = np.array(test_data, dtype = "float")
 
# rescale pixel values to [0, 1] and reshape to (n, 128, 128, 1)
x_train = train_data.astype('float32') / train_data.max()
x_test = test_data.astype('float32') / test_data.max()
x_train = np.reshape(x_train, (len(x_train), 128, 128, 1)) 
x_test = np.reshape(x_test, (len(x_test), 128, 128, 1)) 

We use the cv2 function cvtColor to change the color palette to an easier-to-interpret grayscale. Next, we resize the input to 128 x 128 and convert each image to a numpy array. Afterwards, we stack all the arrays together. Finally, we rescale the input data to values between 0 and 1. So let’s check out what the data looks like right now.

preprep images

Algorithm Design

The architecture of my autoencoder is somewhat arbitrary, I have to confess. To equip my network with some computer vision capabilities, I am adding convolutional layers. Convolutional layers are the essence of Convolutional Neural Networks (CNNs). I won’t be going into detail, because I could probably bore you with 20 pages about CNNs and still barely cover the basics. Thus, I am just assuming you kind of know what’s going on.

As I said, we are setting up a convolutional autoencoder. It sounds quite fancy, though Keras makes it ridiculously simple. A little disclaimer: I am quite aware that there are many other ways to set up the code, and the code below might offend you. Still, I checked the Keras documentation and tried to align my code with it. So if you are offended by my coding, don’t judge me… or at least not too much.

# import libraries
from keras.layers import Input, Dense, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model
import matplotlib.pyplot as plt
from keras.models import load_model

# define input shape
input_img = Input(shape=(128, 128, 1))

# encoding dimension
x = Conv2D(16, (3, 3), activation='relu', padding='same')(input_img)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

# decoding dimension
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((4, 4))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

# build model
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')

As I said before, the design is somewhat arbitrary; however, those of you who work with these kinds of networks probably know that the preliminary architecture quite often is. Let us go through the code above, and again, I know there are a million ways to set up this model. First, of course, I am importing all the modules I need. Then I am defining the input shape: we resized the images to 128 x 128 and gray-scaled them, so the third dimension has the value 1. Second, I am defining the encoding layers, the first part of the autoencoder network, using three convolutional layers to compress the input. The decoding part is built with three convolutional layers as well. I am using relu as the activation function and sigmoid for the last layer. Once the layers are set up, I am just stacking them together with the Keras Model function, using adadelta as the optimizer and binary crossentropy as the loss function. So let’s have a look at our model’s architecture the Keras way:

>>>autoencoder.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 128, 128, 1)       0
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 128, 128, 16)      160
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 64, 64, 16)        0
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 64, 64, 8)         1160
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 32, 32, 8)         0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 32, 32, 8)         584
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 16, 16, 8)         0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 16, 16, 8)         584
_________________________________________________________________
up_sampling2d_1 (UpSampling2 (None, 32, 32, 8)         0
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 32, 32, 8)         584
_________________________________________________________________
up_sampling2d_2 (UpSampling2 (None, 128, 128, 8)       0
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 128, 128, 1)       73
=================================================================
Total params: 3,145
Trainable params: 3,145
Non-trainable params: 0
_________________________________________________________________

Results

To train the model, we make use of the fit() method for keras.engine.training.Model objects. To fit the model, we just need to specify the batch size and the number of epochs. Since I am running this on my local machine, I am choosing way too large a batch size and way too small an epoch number.

autoencoder.fit(x_train, x_train,
                epochs=100,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))

Our autoencoder is now trained and evaluated on the testing data. As a default, Keras provides extremely nice progress bars for each epoch. To evaluate the results, I am not going to bother you with a lot of metrics; instead, let’s compare the input images with the reconstructed ones. To do so, we can quickly loop over some test images and their reconstructions. First, we need to predict the reconstructed images; once again, Keras is incredibly handy.

decoded_imgs = autoencoder.predict(x_test)

The prediction is stored in a numpy ndarray and has the exact same structure as our prepped data. Now, let’s take a look at our reconstructed images:

n = 10
plt.figure(figsize=(20, 4))
for i in range(n):
    # display original
    ax = plt.subplot(2, n, i + 1)
    plt.imshow(x_test[i + 100].reshape(128, 128))
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    # display reconstruction
    ax = plt.subplot(2, n, i + 1 + n)
    plt.imshow(decoded_imgs[i + 100].reshape(128, 128))
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()

result images

Well, well, well… isn’t that impressive. Of course, this is not quite the result we are looking for, so we need to keep improving the model. The first steps to take are quite obvious: a smaller batch size, more epochs, more images, and of course a lot of iterations to adjust the architecture of the model, as sketched below.
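
As a rough sketch of what such a next iteration could look like, one might refit with a smaller batch size, a larger epoch budget, and early stopping so that the longer run does not overfit; the values below are placeholders, not tuned settings:

from keras.callbacks import EarlyStopping

# stop training once the validation loss has not improved for 10 epochs
early_stop = EarlyStopping(monitor='val_loss', patience=10)

autoencoder.fit(x_train, x_train,
                epochs=500,          # larger budget; early stopping decides when to quit
                batch_size=32,       # smaller batches than the 256 used above
                shuffle=True,
                validation_data=(x_test, x_test),
                callbacks=[early_stop])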

This is quite a nice finding, though. You see, the code above is incredibly simple. Thanks to implementations such as Keras, it is becoming increasingly easy to build even the most complex algorithms. However, the design is still very complicated and requires a lot of time and experience.

Everybody talks about them, many people know how to use them, few people understand them: Long Short-Term Memory Neural Networks (LSTM). At STATWORX, with the beginning of the hype around AI and projects with large amounts of data, we also started using this powerful tool to solve business problems.

In short, an LSTM is a special type of recurrent neural network – i.e. a network able to access its internal state to process sequences of inputs – which is really handy if you want to exploit some time-like structure in your data. Use cases for recurrent networks range from guessing the next frame in a video to stock prediction, but you can also use them to learn and produce original text. And this shall already be enough information about LSTMs from my side. I won’t bother you with yet another introduction into the theory of LSTMs, as there are more than enough great blog posts about their architecture (Kudos to Andrej Karpathy for this very informative piece of work which you should definitely read if you are not already bored by neural networks :)).

Especially inspired by the blog mentioned above, I thought about playing with a use case for LSTMs that actually has no intended use at all. LSTMs are good at learning text, so I thought it might be fun to let a character-level LSTM learn to write R code. It was not so important that the code be semantically correct or even solve a particular problem; having an NN that is able to produce (more or less) syntactically correct code is already enough.

So, on my journey to CodeR, an NN that makes my workforce totally obsolete, I will let you participate in the three major steps of getting an RNN to write R code:

  1. Get enough text training data
  2. Build and train CodeR with that data
  3. Let CodeR write majestic R code

If you would like to try it yourself or follow along with the subsequent steps, you can get the code from my GitHub repository.

Step 1: Data Acquisition

Where to get enough Data?

Ultimately, CodeR needs data; a lot of data. Plus, the data should be of good quality and not be too heterogeneous so that CodeR is able to learn the structure from the given text. Since R is open source, the first address to search for good R code is GitHub. GitHub offers you an API to access information about its repositories, but for the flexibility and data I needed, I found the API too restrictive. That’s why I decided to scrape the webpage myself using Hadley Wickham’s rvest package.

Scrape GitHub

The goal is simple: clone all R repositories from famous R users. Of course, you could manually define R contributors that seem to be good programmers, but chances are you miss out on someone who has good and influential packages to offer. Remember that we need a lot of code and that it isn’t much of a problem to reduce the data afterwards (which I in fact did).

Get trending R user names

So, let’s start by getting the names of the trending users. If you visit https://github.com/trending/developers/r?since=monthly, you see a list of all trending users. On June 14, 2018, it looked like this:

trending-r-users

If you inspect the HTML code, you quickly see that the actual user names are the href attribute of a link surrounded by <h2> tags, so we use rvest to dig through that structure.

# required packages: rvest for scraping, glue for string interpolation, magrittr for the pipe
library(rvest)
library(glue)
library(magrittr)

git_url <- "https://github.com"

trending_user <- glue("{git_url}/trending/developers/r?since=monthly") %>%
  read_html() %>%
  html_nodes(., "h2") %>%
  html_nodes(., "a") %>%
  html_attr(., "href") %>% 
  gsub("/", "", .)
trending_user
[1] "hadley"        "rstudio"       "yihui"         ...

It’s good to see that the names match the expected result from the webpage :).

Get R repository names

In the next step, taking user i, we need to get all repositories of i that are her own (i.e. not forked) R repositories. When checking the URL that lets you inspect all repos of a user (e.g. https://github.com/hadley?page=1&tab=repositories), you realize that you need to go through all pages of a user’s repository tab. I wrote a function that does that and also makes sure that:

  • The repo’s main language is R
  • If the repo is forked, the repo will be assigned to the original author

With that function, it is easy to extract all R repo names from our trending users:

repos <- list()
for (user in trending_user) {
  cat("User: ", user, "n")
  repos[[user]] <- get_r_repos(user)  # The actual magic
}
repos %<>% unlist() %>% unique()

Clone R repositories

Now that we have a bunch of repository names, the last step is to clone all those repos and to clean them so that they only contain R files. I have decided to clean a repo directly after I have cloned it since I am going to download a lot of data and don’t want to use too much space on my hard drive. The example code below clones the repo where you can find all of the code above (you are welcome ;)).

repo <- "tkrabel/rcoder"
system(glue("git clone https://github.com{repo}.git"),
       wait = TRUE)

After having cloned all repos, I simply smash their content together into one big text file (r_scripts_text.txt), as sketched below.
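
The concatenation itself is unspectacular; a small helper along the following lines would do the job. I sketch it in Python here, and the repos/ directory name is my own assumption about where the cloned repositories live:

import glob
import os

# collect every .R file below the directory holding the cloned (and cleaned) repos
r_files = glob.glob(os.path.join('repos', '**', '*.R'), recursive=True)

# smash their contents together into one big text file
with open('r_scripts_text.txt', 'w', encoding='utf-8') as out:
    for path in r_files:
        with open(path, 'r', encoding='utf-8', errors='ignore') as f:
            out.write(f.read())
            out.write('\n')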

Step 2: Teach the Baby to Walk

So, we have a big text file now that is ready to be inspected by CodeR so that it can learn to produce its own good pieces of code. But how does the training actually work? There are a few steps that need to be taken care of here:

  1. Prepare the data in a way it can actually be learned by an LSTM
  2. Construct the network’s architecture
  3. The actual training step

The general idea behind step 1 is to slice the text data into overlapping sequences of characters with a pre-specified size s corresponding to the “time horizon”. For example, imagine a text file containing the string "STATWORX ROCKS!" and let s = 3, meaning that you want the LSTM to use the last three characters to predict the fourth one. From this text file, you generate data that looks like this.

x1 x2 x3 y
‘S’ ‘T’ ‘A’ ‘T’
‘T’ ‘A’ ‘T’ ‘W’
‘A’ ‘T’ ‘W’ ‘O’
… … … …
‘C’ ‘K’ ‘S’ ‘!’

In the next step, you have to represent each character as a numeric object so that your model can actually work with it. The most popular way is to represent characters as unit vectors. Making it more tangible, remember that in the sentence above we have 11 distinct characters (including the blank space and the exclamation mark). The so-called vocabulary {'S', 'T', 'A', 'W', 'O', 'R', 'X', ' ', 'C', 'K', '!'} is utilized to represent each character by an 11-dimensional unit vector with the 1 at its respective character position, e.g. $S = (1, 0, \dots, 0)^\top$ (because 'S' is the first character of the vocabulary), $T = (0, 1, 0, \dots, 0)^\top$, and so on. With these transformations, we finally have data our model can learn from.
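
As a toy illustration of both steps (sequence slicing and one-hot encoding), the preparation could look roughly like the snippet below. The variable names are mine, the vocabulary order is arbitrary, and the real preprocessing is handled by the package introduced further down:

import numpy as np

text = "STATWORX ROCKS!"
s = 3  # time horizon: use the last three characters to predict the fourth

vocabulary = sorted(set(text))                       # 11 distinct characters
char_to_index = {c: i for i, c in enumerate(vocabulary)}

# slice the text into overlapping input sequences and their target characters
sequences = [text[i:i + s] for i in range(len(text) - s)]
targets = [text[i + s] for i in range(len(text) - s)]

# one-hot encode: X has shape (samples, s, vocabulary size), y has shape (samples, vocabulary size)
X = np.zeros((len(sequences), s, len(vocabulary)))
y = np.zeros((len(targets), len(vocabulary)))
for i, (seq, target) in enumerate(zip(sequences, targets)):
    for t, char in enumerate(seq):
        X[i, t, char_to_index[char]] = 1
    y[i, char_to_index[target]] = 1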

Step 2 (building the model) is a breeze with the R keras package; in fact, it took only 9 lines of code to build an LSTM with one input layer, 2 hidden LSTM layers with 128 units each, and a softmax output layer, making it four layers in total. I kept the model that “simple” because I knew it was going to take a long time to learn. However, the results were not satisfying even after longer training times, so I decided to look for ways of training networks on better (free) hardware in order to configure much more complex models. My search brought me to Google Colaboratory, an environment that runs in the cloud and offers GPU support. The GPU support in particular gave training a huge speed boost. However, for all R enthusiasts out there, Google’s Colab has a drawback: it is a Jupyter Notebook environment and therefore requires you to write Python code, which makes my use case somewhat ironic since I now use Python to train a network which writes R code. Well, in the end, I suppose, we all have to make some sacrifices :)!
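
For reference, an architecture like the one described at the beginning of the previous paragraph (an input layer, two LSTM layers with 128 units each, and a softmax output) can be sketched in Keras roughly as follows. I show the Python flavor here for consistency with the rest of the code in this article, and the vocabulary size and sequence length are illustrative values:

from keras.models import Sequential
from keras.layers import LSTM, Dense

vocab_size = 95    # number of distinct characters in the corpus (illustrative)
seq_length = 40    # number of characters used to predict the next one

model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(seq_length, vocab_size)))
model.add(LSTM(128))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')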

As I started translating my code into Python, I found that there is a very useful package, textgenrnn, that lets you build and train a model very easily. The advantage of the package is that its functions handle the whole data preparation step for you. The only thing you need to do is specify the raw input text file from which the model learns and configure the model; the rest is done for you (credits go to Max Woolf for this great piece of work).
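
In its simplest form, training with textgenrnn boils down to a few lines like the ones below; the epoch count is a placeholder, and the notebook linked in the next paragraph contains the configuration I actually used:

from textgenrnn import textgenrnn

# a fresh character-level model; textgenrnn takes care of tokenization, batching, and padding
textgen = textgenrnn()
textgen.train_from_file('r_scripts_text.txt', num_epochs=10)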

If you want to build your own version of CodeR, just copy this notebook to your Google Drive and follow the instructions.

Step 3: Let CodeR Talk to Us

After we have a trained version of CodeR, it is time to let it write some code. Starting with a blank sheet, CodeR is asked to sample the first character, which is a random one. In the next step, we feed that created character back to the model in order to write the next character. After that, we always use up to the last 40 characters as an input for the prediction of the next element in the text sequence.
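
This sampling step is also where the temperature parameter, discussed in the next paragraph, comes into play: conceptually, it rescales the predicted character distribution before a character is drawn, with low values sharpening the distribution towards the most likely characters and high values flattening it. A minimal numpy sketch of that reweighting (my own illustration, not the package’s internal code):

import numpy as np

def sample_with_temperature(probabilities, temperature=0.5):
    """Sample a character index from a predicted distribution, rescaled by the temperature."""
    logits = np.log(probabilities + 1e-8) / temperature
    rescaled = np.exp(logits) / np.sum(np.exp(logits))
    return np.random.choice(len(rescaled), p=rescaled)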

There is a parameter in the corresponding textgenrnn function that determines CodeR’s creativity while writing R code (the so-called temperature). The higher the value, the more creative, i.e. diverse, the text. However, the results are not checked for syntactic correctness, so choosing too high a temperature leads to more syntax errors. On the other hand, lower temperature values (e.g. 0.5) make CodeR more conservative in its predictions, staying closer to what it has learned. For a value of temperature = 0.5, CodeR knows how to pass any code review:

partition_by = NULL,
                                                                                                                                                                                                                                                                                                                                                                                                  NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 

BATMAAAAAAAN!!! Looks like CodeR is lost in an infinite NA loop. Note the comma after the first statement. It is an artifact of having some shiny code in the code base. This is, of course, an issue, as it leads to a very heterogeneous text base, destabilizing the RNN’s learning process. Making the training data more homogeneous in terms of syntax is definitely a topic for future refinements.

But let’s look at some code now that is funny and, quite frankly, impressive. It will also teach us something about the training set. Staying with a temperature value of 0.5, CodeR mainly produces, well, NA loops, but also writes a lot of roxygen comments.

#' @param x An object to transform to a string.
#'
#' @param x An object of column names
#' @param categorical_column A string with a model format.
#' @param ... Export to make setting to a selection of the length of the scale.
#' @param ordered If not supported values are stored in the data state on a post container.
#' @param conf.level character vector of structure (a, b, separate the second library in the first directory for the global environment for each vector of the same as a single argument.
#'
#' @param ... Additional arguments to exponentiate for the new name of the document, not a context
#'   the code{searchPackages} returns a function that dependencies to create a character vector of the specified values. The name of the command packages in the top level normal for a single static when the default to the selection

Quite entertaining, isn’t it? I like the way CodeR is using totally confusing argument names, such as categorical_column for a “string with a model format”.

With temperature values around 1, CodeR starts writing its first syntactically correct functions (although the semantics might be a problem). Let’s look at some snippets I found in the output.

format.rdnore_configNames <- function(x) {
  class(x) <- y_string[[mull_bt]]
  x
}
addToN <- function(packagePath, ..., recursive = FALSE) {
  assert_that(is_string(id))

  base <- as.data.frame(names(purrr::packagesjitter))
  stop("parents has columns")
  layer_class = c(list(values, command), top = "Tar")
}

counts.latex <- function(x, ...) {
  if(is.null(x)) {
    stop("ROC Git classes:")
  }
}

spark_guess_value <- function(pd, stringsAsFactors = FALSE) {
  if (!is.null(varify))
    cat("Installing ", cols, "/", xdate)

  return(selfregistry)
}

It is impressive that CodeR correctly sets blank spaces and braces most of the time. I only needed to mildly correct it when it set a backtick instead of a quotation mark. It can also use dplyr functions and the magrittr pipe (which is great since I am a big fan of the pipe, as you can read here: https://statworx-1727.demosrv.review/de/blog/show-me-your-pipe/).

rf_car <- function(operator, input_col, config, default.uniques) %>%
  group_by(minorm) %>%
  summarise(week = 100)

Of course, this is just the tip of the iceberg, and there was a lot of unusable code CodeR produced. So, to be fair, let's take a look at the lines that didn't make it into the hall of fame.

#' @import knit_print.j installed file
#' @param each HTML pages (in numbers matching metadata),
      # in the seed location
      if(tfitem %in% cl) {
        unknown <<- 100 : mutate(contsList)
      v = integer(1)
    }

    unused(
      {retries_cred_, revdeps, message_format = ","))


      if (flattenToBinour && !renamed) {

    # define validations for way top aggregated instead, constants that does not want
#' arguments  code{list}.
#'
#' @inheritParams dontrun{
#' # Default environment is supplied
#'
#' @keywords internal
str_replace_all <- function(pkg_lines, list(token = path),
                           list(using := force_init(), compare, installInfoname, ") %>%
  ` %>% #'
  modifyList(list(2, coord = FALSE)) #'

If you set the temperature to 2, it becomes wildly creative.

x <- xp>$scenqNy89L'<JW]
#' Clear tuning
#r
# verifican
ignore <- wwMap.com:(.p/qafffs.tboods,4max LNh,	rmAR',5R}/6)  Y/AS_M(SB423eyt
mf(,9] **.4L2;3) # v1.3mDE); *}

g3 <%yype_3X-C(r63,JAE)Zsd <- 1

Summary and outlook

LSTMs offer many interesting and amusing use cases. As a free-time side project, I try to leverage the recurrent structure of an LSTM in order to train a model I call CodeR to write, well …, R code. The results are truly entertaining and informative, as they reveal some of the training data set’s structure. For example, we see that R code contains roxygen comments to a large extent, which makes sense as we included many R packages in the training set.

One point for further improvement of CodeR is definitely to remove all the shiny code from the training set in order to make the syntax more homogeneous and thereby improve CodeR's output quality. Furthermore, it may be worthwhile to remove all roxygen comments.

If you have any ideas about what to do next with CodeR, any suggestions on how to improve my code, or if you just want to leave a comment, please feel free to shoot me a message. Especially if you trained a version of CodeR yourself, don't hesitate to share your favorite lines of code with me. I would also be very curious whether you could improve CodeR's output quality by altering the training set (e.g. in the way described above). If you want to learn more about keras, check out our open workshop!

Google AutoML Vision is a state-of-the-art cloud service from Google that builds deep learning models for image recognition fully automated and from scratch. In this post, Google AutoML Vision is used to build an image classification model on the Zalando Fashion-MNIST dataset, a recent variant of the classical MNIST dataset, which is considered more difficult for ML models to learn than digit MNIST.

During the benchmark, both AutoML Vision training modes, “free” (0 $, limited to 1 hour of computing time) and “paid” (approx. 500 $, 24 hours of computing time), were used and evaluated:

The free AutoML model achieved a macro AUC of 96.4% and an accuracy score of 88.9% on the test set at a computing time of approx. 30 minutes (early stopping). The paid AutoML model achieved a macro AUC of 98.5% on the test set with an accuracy score of 93.9%.

Introduction

Recently, there has been growing interest in automated machine learning solutions. Products like H2O Driverless AI or DataRobot, just to name a few, aim at corporate customers and continue to make their way into professional data science teams and environments. For many use cases, AutoML solutions can significantly speed up time-to-model cycles and therefore allow for faster iteration and deployment of models (and actually start saving / making money in production).

Automated machine learning solutions will transform the data science and ML landscape substantially in the next 3-5 years. In the process, many ML models or applications that nowadays require human input or expertise will likely be partly or fully automated by AI / ML models themselves. This will likely also yield a decline in overall demand for “classical” data science profiles in favor of more engineering- and operations-related data science roles that bring models into production.

A recent example of the rapid advancements in automated machine learning is the development of deep learning image recognition models. Not too long ago, building an image classifier was a very challenging task that only a few people were actually capable of doing. Due to computational, methodological, and software advances, the barriers have been dramatically lowered to the point where you can build your first deep learning model with Keras in 10 lines of Python code and get “okayish” results.

Undoubtedly, there will still be many ML applications and cases that cannot be (fully) automated in the near future. Those cases will likely be more complex, because basic ML tasks, such as fitting a classifier to a simple dataset, can and will easily be automated by machines.

At this point, the first attempts at moving in the direction of machine learning automation are being made. Google, as well as other companies, is investing in AutoML research and product development. One of the first professional automated ML products on the market is Google AutoML Vision.

Google AutoML Vision

Google AutoML Vision (at this point in beta) is Google’s cloud service for automated machine learning for image classification tasks. Using AutoML Vision, you can train and evaluate deep learning models without any knowledge of coding, neural networks, or the like.

AutoML Vision operates in the Google Cloud and can be used either through a graphical user interface or via REST, the command line, or Python. AutoML Vision implements strategies from Neural Architecture Search (NAS), currently a scientific field of high interest in deep learning research. NAS is based on the idea that another model, typically a neural network or reinforcement learning model, designs the architecture of the neural network that aims to solve the machine learning task. Cornerstones of NAS research are the papers by Zoph et al. (2017) and Pham et al. (2018). The latter has also been implemented in the Python package autokeras (currently in a pre-release phase) and makes neural architecture search feasible on desktop computers with a single GPU, as opposed to the 500 GPUs used in Zoph et al.

The idea that an algorithm is able to discover neural network architectures seems very promising; however, it is still kind of limited due to computational constraints (I hope you don’t mind that I consider a 500-1000 GPU cluster a computational constraint). But how well does neural architecture search actually work in a pre-market-ready product?

Benchmark

In the following section, Google AutoML Vision is used to build an image recognition model based on the Fashion-MNIST dataset.

Dataset

The Fashion-MNIST dataset is supposed to serve as a “drop-in replacement” for the traditional MNIST dataset and has been open-sourced by the research department of Europe’s online fashion giant Zalando (check the Fashion-MNIST GitHub repo and the Zalando research website). It contains 60,000 training and 10,000 test images of 10 different clothing categories (tops, pants, shoes etc.). Just like MNIST, each image is a 28×28 grayscale image, sharing the same image size and train/test structure. Below are some examples from the dataset:

The makers of Fashion-MNIST argue that the traditional MNIST dataset is nowadays too simple a task to solve – even simple convolutional neural networks achieve >99% accuracy on the test set, whereas classical ML algorithms easily score >97%. For this and other reasons, Fashion-MNIST was created.

The Fashion-MNIST repo contains helper functions for loading the data as well as some scripts for benchmarking and testing your models. Also, there’s a neat visualization of an embedding of the data on the repo. After cloning, you can import the Fashion-MNIST data using a simple Python function (check the code in the next section) and start building your model.

Using Google AutoML Vision

Preparing the data

AutoML offers two ways of data ingestion: (1) upload a zip file that contains the training images in different folders corresponding to the respective labels, or (2) upload a CSV file that contains the Google Cloud Storage (GS) filepaths, labels, and optionally the data partition for the training, validation, and test sets. I decided to go with the CSV file because you can define the data partition (flag names are TRAIN, VALIDATION, and TEST) in order to keep control over the experiment. Below is the required structure of the CSV file that needs to be uploaded to AutoML Vision (without the header!).

partition file label
TRAIN gs://bucket-name/folder/image_0.jpg 0
TRAIN gs://bucket-name/folder/image_1.jpg 2
VALIDATION gs://bucket-name/folder/image_22201.jpg 7
VALIDATION gs://bucket-name/folder/image_22202.jpg 9
TEST gs://bucket-name/folder/image_69998.jpg 4
TEST gs://bucket-name/folder/image_69999.jpg 1

Just like MNIST, the Fashion-MNIST data contains the pixel values of the respective images. To actually upload image files, I developed a short Python script that takes care of the image creation, export, and upload to GCP. The script iterates over each row of the Fashion-MNIST dataset, exports the image, and uploads it into a Google Cloud Storage bucket.

import os
import gzip
import numpy as np
import pandas as pd
from google.cloud import storage
from keras.preprocessing.image import array_to_img


def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path,
                               '%s-labels-idx1-ubyte.gz'
                               % kind)
    images_path = os.path.join(path,
                               '%s-images-idx3-ubyte.gz'
                               % kind)

    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                               offset=8)

    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)

    return images, labels


# Import training data
X_train, y_train = load_mnist(path='data', kind='train')
X_test, y_test = load_mnist(path='data', kind='t10k')

# Split validation data
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=10000)

# Dataset placeholder
files = pd.DataFrame({'part': np.concatenate([np.repeat('TRAIN', 50000),
                                              np.repeat('VALIDATION', 10000),
                                              np.repeat('TEST', 10000)]),
                      'file': np.repeat('file', 70000),
                      'label': np.repeat('label', 70000)})

# Stack training and test data into single arrays
X_data = np.vstack([X_train, X_valid, X_test])
y_data = np.concatenate([y_train, y_valid, y_test])

# GS path
gs_path = 'gs://secret/fashionmnist'

# Storage client
storage_client = storage.Client.from_service_account_json(json_credentials_path='secret.json')
bucket = storage_client.get_bucket('secret-bucket')

# Export each image, upload it to GCS, and record its path and label
for i, x in enumerate(X_data):
    # Console print
    if i % 1000 == 0:
        print('Uploading image {image}'.format(image=i))
    # Reshape and export image
    img = array_to_img(x=x.reshape(28, 28, 1))
    img.save(fp='fashionmnist' + '/' + 'image_' + str(i) + '.jpg')
    # Add info to data frame
    files.iloc[i, 1] = gs_path + '/' + 'image_' + str(i) + '.jpg'
    files.iloc[i, 2] = y_data[i]
    # Upload to GCP
    blob = bucket.blob('fashionmnist/' + 'image_' + str(i) + '.jpg')
    blob.upload_from_filename('fashionmnist/' + 'image_' + str(i) + '.jpg')
    # Delete image file
    os.remove('fashionmnist/' + 'image_' + str(i) + '.jpg')

# Export CSV file
files.to_csv(path_or_buf='fashionmnist.csv', header=False, index=False)

The function load_mnist is from the Fashion-MNIST repository and imports the training and test arrays into Python. After importing the training set, 10,000 examples are sampled and stored as validation data using train_test_split from sklearn.model_selection. The training, validation, and test arrays are then stacked into X_data in order to have a single object for iteration. A placeholder DataFrame is initialized to store the information (partition, filepath, and label) required by AutoML Vision. storage from google.cloud connects to GCP using a service account json file (which I will, of course, not share here). Finally, the main process takes place, iterating over X_data, generating an image for each row, saving it to disk, uploading it to GCP, and deleting the image since it is no longer needed. Lastly, I uploaded the exported CSV file into the Google Cloud Storage bucket of the project.

Getting into AutoML

AutoML Vision is currently in beta, which means that you have to apply before trying it out. Since my colleagues and I are currently exploring the usage of automated machine learning in a computer vision project for one of our customers, I already have access to AutoML Vision through the GCP console.

The start screen looks pretty unspectacular at this point. You can start by clicking on “Get started with AutoML” or read the documentation, which is pretty basic so far but informative, especially when you’re not familiar with basic machine learning concepts such as train-test splits, overfitting, precision / recall etc.

After you get started, Google AutoML takes you to the dataset dialog, which is the first step on the road to the final AutoML model. So far, nothing to report here. Later, this is where you will find all of your imported datasets.

Generating the dataset

After hitting “+ NEW DATASET” AutoML takes you to the “Create dataset” dialog. As mentioned before, new datasets can be added using two different methods, shown in the next image.

I’ve already uploaded the images from my computer into the GS bucket, along with the CSV file containing the GS filepaths, the partition information, and the corresponding labels. In order to add the dataset to AutoML Vision, you must specify the filepath to the CSV file that contains the image GS filepaths etc.

In the “Create dataset” dialog, you can also enable multi-label classification if you have multiple labels per image, which is a very helpful feature. After hitting “CREATE DATASET”, AutoML iterates over the provided file names and builds the dataset for modeling. What exactly it does is neither visible nor documented. This import process may take a while, so it shows you the funky “Knight Rider” progress bar.

After the import is finished, you will receive an email from GCP informing you that the import of the dataset is completed. I find this helpful because you don’t have to keep the browser window open and stare at the progress bar all the time.

The email looks a bit weird, but hey, it’s still beta…

Training a model

Back to AutoML. The first thing you see after building your dataset is the imported images. In this example, the images are a bit pixelated because they are only 28×28 in resolution. You can navigate through the different labels using the nav bar on the left side and also manually add labels to so far unlabeled images. Furthermore, you can request a human labeling service if you do not have any labels that come with your images. Additionally, you can create new labels if you need to add a category etc.

Now let’s get serious. After going to the “TRAIN” dialog, AutoML informs you about the frequency distribution of your labels. It recommends a minimum count of $n=100$ labels per class (which I find quite low). Also, it seems that it shows you the frequencies of the whole dataset (train, validation, and test together). A frequency plot grouped by data partition would be more informative at this point, in my opinion.

A click on “start training” takes you to a popup window where you can define the model name and allocate the training budget (computing time / money) you are willing to invest. You have the choice between “1 compute hour”, which is free for 10 models every month, or “24 compute hours (higher quality)”, which comes with a price tag of approx. 480 $ (1 hour of AutoML computing costs 20 $). However, if the architecture search converges at an earlier point, you only pay for the amount of computing time that has been consumed so far, which I find reasonable and fair. Lastly, there is also the option to define a custom training time, e.g. 5 hours.

In this experiment, I tried both: the “free” version of AutoML, but I also went “all-in” and selected the 24-hour option to achieve the best model possible (“paid model”). Let’s see what you can expect from a 480 $ cutting-edge AutoML solution. After hitting “START TRAINING”, the familiar Knight Rider screen appears, telling you that you can close the browser window and let AutoML do the rest. Naise.

Results and evaluation

First, let’s start with the free model. It took approx. 30 minutes of training and seemed to have converged to a solution very quickly. I am not sure what exactly AutoML does when it evaluates convergence criteria, but it seems to differ between the free and paid models, because the free model already converged after around 30 minutes of computing while the paid model did not.

The overall model metrics of the free model look pretty decent: an average precision of 96.4% on the test set at a macro class-1 precision of 90.9% and a recall of 87.7%. The current accuracy benchmark on the Fashion-MNIST dataset is 96.7% (WRN40-4, 8.9M parameters), followed by 96.3% (WRN-28-10 + Random Erasing), while the accuracy of the low-budget model is only 89.0%. So the free AutoML model is pretty far away from the current Fashion-MNIST benchmark. Below, you’ll find the screenshot of the free model’s metrics.

The model metrics of the paid model look significantly better. It achieved an average precision of 98.5% on the test set at a macro class-1 precision of 95.0% and a recall of 92.8%, as well as an accuracy score of 93.9%. Those results are close to the current benchmark, however not as close as I had hoped. Below, you’ll find the screenshot of the paid model’s metrics.

The “EVALUATE” tab also shows you further detailed metrics, such as precision / recall curves, as well as sliders for classification cutoffs that impact the respective model metrics. At the bottom of the page you’ll find the confusion matrix with relative frequencies of correct and misclassified examples. Furthermore, you can check images of false positives and negatives per class (which is very helpful if you want to understand why and when your model is doing something wrong). Overall, the model evaluation functionalities are limited but user friendly. As a more advanced user, of course, I would like to see more advanced features, but considering the target group and the status of development, I think it is pretty decent.

Prediction

After fitting and evaluating your model, you can use several methods to predict new images. First, you can use the AutoML user interface to upload new images from your local machine. This is a great way for inexperienced users to apply their model to new images and get predictions. For advanced users and developers, AutoML Vision exposes the model through an API on the GCP while taking care of all the technical infrastructure in the background. A simple Python script shows the basic usage of the API:

import sys
from google.cloud import automl_v1beta1


# Define client from service account json
client = automl_v1beta1.PredictionServiceClient.from_service_account_json(filename='automl-XXXXXX-d3d066fe6f8c.json')

# Endpoint
name = 'projects/automl-XXXXXX/locations/us-central1/models/ICNXXXXXX'

# Import a single image
with open('image_10.jpg', 'rb') as ff:
    img = ff.read()

# Define payload
payload = {'image': {'image_bytes': img}}

# Prediction
request = client.predict(name=name, payload=payload, params={})
print(request)

# Console output
payload {
  classification {
    score: 0.9356002807617188
  }
  display_name: "a_0"
}

As a third method, it is also possible to curl the API on the command line, if you want to go full nerdcore. I think the automated API exposure is a great feature because it lets you integrate your model into all kinds of scripts and applications. Furthermore, Google takes care of all the nitty-gritty things that come into play when you want to scale the model to hundreds or thousands of simultaneous API requests in a production environment.

Conclusion and outlook

In a nutshell, even the free model achieved pretty good results on the test set, given that the actual amount of time invested in the model was only a fraction of the time it would have taken to build the model manually. The paid model achieved significantly better results, however at a price of 480 $. Obviously, the paid service is targeted at data science professionals and companies.

AutoML Vision is only a part of a set of new AutoML applications that come to the Google Cloud (check these announcements from Google Next 18), further shaping the positioning of the platform in the direction of machine learning and AI.

In my personal opinion, I am confident that automated machine learning solutions will continue to make their way into professional data science projects and applications. With automated machine learning, you can (1) build baseline models for benchmarking your custom solutions, (2) iterate use cases and data products faster and (3) get quicker to the point, when you actually start to make money with your data – in production.

Google AutoML Vision is a state-of-the-art cloud service from Google that is able to build deep learning models for image recognition completely fully automated and from scratch. In this post, Google AutoML Vision is used to build an image classification model on the Zalando Fashion-MNIST dataset, a recent variant of the classical MNIST dataset, which is considered to be more difficult to learn for ML models, compared to digit MNIST.

During the benchmark, both AutoML Vision training modes, “free” (0 $, limited to 1 hour computing time) and “paid” (approx. 500 $, 24 hours computing time) were used and evaluated:

Thereby, the free AutoML model achieved a macro AUC of 96.4% and an accuracy score of 88.9% on the test set at a computing time of approx. 30 minutes (early stopping). The paid AutoML model achieved a macro AUC of 98.5% on the test set with an accuracy score of 93.9%.

Introduction

Recently, there is a growing interest in automated machine learning solutions. Products like H2O Driverless AI or DataRobot, just to name a few, aim at corporate customers and continue to make their way into professional data science teams and environments. For many use cases, AutoML solutions can significantly speed up time-2-model cycles and therefore allow for faster iteration and deployment of models (and actually start saving / making money in production).

Automated machine learning solutions will transform the data science and ML landscape substantially in the next 3-5 years. Thereby, many ML models or applications that nowadays require respective human input or expertise will likely be partly or fully automated by AI / ML models themselves. Likely, this will also yield a decline in overall demand for “classical” data science profiles in favor of more engineering and operations related data science roles that bring models into production.

A recent example of the rapid advancements in automated machine learning this is the development of deep learning image recognition models. Not too long ago, building an image classifier was a very challenging task that only few people were acutally capable of doing. Due to computational, methodological and software advances, barriers have been dramatically lowered to the point where you can build your first deep learning model with Keras in 10 lines of Python code and getting “okayish” results.

Undoubtly, there will still be many ML applications and cases that cannot be (fully) automated in the near future. Those cases will likely be more complex because basic ML tasks, such as fitting a classifier to a simple dataset, can and will easily be automated by machines.

At this point, first attempts in moving into the direction of machine learning automation are made. Google as well as other companies are investing in AutoML research and product development. One of the first professional automated ML products on the market is Google AutoML Vision.

Google AutoML Vision

Google AutoML Vision (at this point in beta) is Google’s cloud service for automated machine learning for image classification tasks. Using AutoML Vision, you can train and evaluate deep learning models without any knowledge of coding, neural networks or whatsoever.

AutoML Vision operates in the Google Cloud and can be used either based on a graphical user interface or via, REST, command line or Python. AutoML Vision implements strategies from Neural Architecture Search (NAS), currently a scientific field of high interest in deep learning research. NAS is based on the idea that another model, typically a neural network or reinforcement learning model, is designing the architecture of the neural network that aims to solve the machine learning task. Cornerstones in NAS research were the paper from Zoph et at. (2017) as well as Pham et al. (2018). The latter has also been implemented in the Python package autokeras (currently in pre-release phase) and makes neural architecture search feasible on desktop computers with a single GPU opposed to 500 GPUs used in Zoph et al.

The idea that an algorithm is able to discover architectures of a neural network seems very promising, however is still kind of limited due to computational contraints (I hope you don’t mind that I consider a 500-1000 GPU cluster as as computational contraint). But how good does neural architecture search actually work in a pre-market-ready product?

Benchmark

In the following section, Google AutoML vision is used to build an image recognition model based on the Fashion-MNIST dataset.

Dataset

The Fashion-MNIST dataset is supposed to serve as a “drop-in replacement” for the traditional MNIST dataset and has been open-sourced by Europe’s online fashion giant Zalando‘s research department (check the Fashion-MNIST GitHub repo and the Zalando reseach website). It contains 60,000 training and 10,000 test images of 10 different clothing categories (tops, pants, shoes etc.). Just like in MNIST, each image is a 28×28 grayscale image. It shares the same image size and structure of training and test images. Below are some examples from the dataset:

The makers of Fashion-MNIST argue, that nowadays the traditional MNIST dataset is a too simple task to solve – even simple convolutional neural networks achieve >99% accuracy on the test set whereas classical ML algorithms easily score >97%. For this and other reasons, Fashion-MNIST was created.

The Fashion-MNIST repo contains helper functions for loading the data as well as some scripts for benchmarking and testing your models. Also, there’s a neat visualization of an ebmedding of the data on the repo. After cloning, you can import the Fashion-MNIST data using a simple Python function (check the code in the next section) and start to build your model.

Using Google AutoML Vision

Preparing the data

AutoML offers two ways of data ingestion: (1) upload a zip file that contains the training images in different folders, corresponding to the respective labels or (2) upload a CSV file that contains the Goolge cloud storage (GS) filepaths, labels and optionally the data partition for training, validation and test set. I decided to go with the CSV file because you can define the data partition (flag names are TRAIN, VALIDATION and TEST) in order to keep control over the experiment. Below is the required structure of the CSV file that needs to be uploaded to AutoML Vision (without the header!).

partition file label
TRAIN gs://bucket-name/folder/image_0.jpg 0
TRAIN gs://bucket-name/folder/image_1.jpg 2
VALIDATION gs://bucket-name/folder/image_22201.jpg 7
VALIDATION gs://bucket-name/folder/image_22202.jpg 9
TEST gs://bucket-name/folder/image_69998.jpg 4
TEST gs://bucket-name/folder/image_69999.jpg 1

Just like MNIST, Fashion-MNIST data contains the pixel values of the respective images. To actually upload image files, I developed a short python script that takes care of the image creation, export and upload to GCP. The script iterates over each row of the Fashion-MNIST dataset, exports the image and uploads it into a Google Cloud storage bucket.

import os
import gzip
import numpy as np
import pandas as pd
from google.cloud import storage
from keras.preprocessing.image import array_to_img


def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    labels_path = os.path.join(path,
                               '%s-labels-idx1-ubyte.gz'
                               % kind)
    images_path = os.path.join(path,
                               '%s-images-idx3-ubyte.gz'
                               % kind)

    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,
                               offset=8)

    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)

    return images, labels


# Import training data
X_train, y_train = load_mnist(path='data', kind='train')
X_test, y_test = load_mnist(path='data', kind='t10k')

# Split validation data
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=10000)

# Dataset placeholder
files = pd.DataFrame({'part': np.concatenate([np.repeat('TRAIN', 50000),
                                              np.repeat('VALIDATION', 10000),
                                              np.repeat('TEST', 10000)]),
                      'file': np.repeat('file', 70000),
                      'label': np.repeat('label', 70000)})

# Stack training and test data into single arrays
X_data = np.vstack([X_train, X_valid, X_test])
y_data = np.concatenate([y_train, y_valid, y_test])

# GS path
gs_path = 'gs://secret/fashionmnist'

# Storgae client
storage_client = storage.Client.from_service_account_json(json_credentials_path='secret.json')
bucket = storage_client.get_bucket('secret-bucket')

# Fill matrix
for i, x in enumerate(X_data):
    # Console print
    if i % 1000 == 0:
        print('Uploading image {image}'.format(image=i))
    # Reshape and export image
    img = array_to_img(x=x.reshape(28, 28, 1))
    img.save(fp='fashionmnist' + '/' + 'image_' + str(i) + '.jpg')
    # Add info to data frame
    files.iloc[i, 1] = gs_path + '/' + 'image_' + str(i) + '.jpg'
    files.iloc[i, 2] = y_data[i]
    # Upload to GCP
    blob = bucket.blob('fashionmnist/' + 'image_' + str(i) + '.jpg')
    blob.upload_from_filename('fashionmnist/' + 'image_' + str(i) + '.jpg')
    # Delete image file
    os.remove('fashionmnist/' + 'image_' + str(i) + '.jpg')

# Export CSV file
files.to_csv(path_or_buf='fashionmnist.csv', header=False, index=False)

The function load_mnist is from the Fashion-MNIST repository and imports the training and test arrays into Python. After importing the training set, 10,000 examples are sampled and sotored as validation data using train_test_split from sklean.model_selection. The training, validation and test arrays are then stacked into X_data in order to have a single object for iteration. A placeholder DataFrame is initialized to store the required information (partition, filepath and label), required by AutoML Vision. storage from google.cloud connects to GCP using a service account json file (which I will, of course, not share here). Finally, the main process takes place, iterating over X_data, generating an image for each row, saving it to disk, uploading it to GCP and deleting the image since it is no longer needed. Lastly, I uploaded the exported CSV file into the Google Cloud storage bucket of the project.

Getting into AutoML

AutoML Vision is currently in Beta, which means that you have to apply before trying it out. Since me and my colleagues are currently exploring the usage of automated machine learning in a computer vision project for one of our customers, I already have access to AutoML Vision through the GCP console.

The start screen looks pretty unspectacular at this point. You can start by clicking on “Get started with AutoML” or read the documentation, which is pretty basic so far but informative, especially when you’re not familiar with basic machine learning concepts such as train-test-splits, overfitting, prcision / recall etc.

After you started, Google AutoML takes you to the dataset dialog, which is the first step on the road to the final AutoML model. So far, nothing to report here. Later, you will find here all of your imported datasets.

Generating the dataset

After hitting “+ NEW DATASET” AutoML takes you to the “Create dataset” dialog. As mentioned before, new datasets can be added using two different methods, shown in the next image.

I’ve already uploaded the images from my computer as well as the CSV file containing the GS filepaths, partition information as well as the corresponding labels into the GS bucket. In order to add the dataset to AutoML Vision you must specify the filepath to the CSV file that contains the image GS-filepaths etc.

In the “Create dataset” dialog, you can also enable multi-label classification, if you have multiple labels per image, which is also a very helpful feature. After hitting “CREATE DATASET”, AutoML iterates over the provided file names and builds the dataset for modeling. What exactly is does, is neither visible nor documented. This import process may take a while, so it is showing you the funky “Knight Rider” progress bar.

After the import is finished, you will receive an email from GCP informing you that the import of the dataset is completed. I find this helpful because you don’t have to keep the browser window open and stare at the progress bar the whole time.

The email looks a bit weird, but hey, it’s still beta…

Training a model

Back to AutoML. The first thing you see after building your dataset is the imported images. In this example, the images are a bit pixelated because they are only 28×28 pixels in resolution. You can navigate through the different labels using the nav bar on the left side and manually add labels to so far unlabeled images. Furthermore, you can request a human labeling service if your images do not come with any labels, and you can create new labels if you need to add a category.

Now let’s get serious. After going to the “TRAIN” dialog, AutoML informs you about the frequency distribution of your labels. It recommends a minimum count of $n=100$ labels per class (which I find quite low). Also, it seems to show you the frequencies of the whole dataset (train, validation, and test together). A grouped frequency plot by data partition would be more informative at this point, in my opinion.
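
Such a grouped view is easy to produce yourself from the import CSV, for example with pandas (a minimal sketch, assuming the three-column layout used above):

import pandas as pd

# Read the import CSV (partition, GS filepath, label; no header row)
files = pd.read_csv('fashionmnist.csv', header=None, names=['partition', 'path', 'label'])

# Label frequencies grouped by data partition
freq = files.groupby(['partition', 'label']).size().unstack(fill_value=0)
print(freq)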

A click on “start training” takes you to a popup window where you can define the model name and allocate the training budget (computing time / money) you are willing to invest. You have the choice between “1 compute hour”, which is free for 10 models every month, and “24 compute hours (higher quality)”, which comes with a price tag of approx. 480 $ (1 hour of AutoML computing costs 20 $). However, if the architecture search converges at an earlier point, you only pay for the computing time that has actually been consumed, which I find reasonable and fair. Lastly, there is also the option to define a custom training time, e.g. 5 hours.
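
As a quick sanity check of these numbers (assuming the 20 $ per compute hour rate stated above and that billing stops once the search converges):

# Back-of-the-envelope cost estimate at 20 $ per compute hour
hourly_rate = 20
print(24 * hourly_rate)  # 480 $ if the full 24-hour budget is used
print(5 * hourly_rate)   # 100 $ for a custom 5-hour budget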

In this experiment, I tried both options: the “free” version of AutoML, and the “all-in” 24-hours option to achieve the best model possible (“paid model”). Let’s see what you can expect from a 480 $ cutting-edge AutoML solution. After hitting “START TRAINING”, the familiar Knight Rider screen appears, telling you that you can close the browser window and let AutoML do the rest. Naise.

Results and evaluation

First, let’s start with the free model. It took approx. 30 minutes of training and seemed to converge to a solution very quickly. I am not sure what exactly AutoML does when it evaluates convergence criteria, but it seems to differ between the free and the paid model, because the free model already converged after around 30 minutes of computing while the paid model did not.

The overall model metrics of the free model look pretty decent: an average precision of 96.4% on the test set at a macro class-1 precision of 90.9% and a recall of 87.7%. The current accuracy benchmark on the Fashion-MNIST dataset is 96.7% (WRN40-4, 8.9M parameters), followed by 96.3% (WRN-28-10 + Random Erasing), while the accuracy of the low-budget model is only 89.0%. So the free AutoML model is pretty far away from the current Fashion-MNIST benchmark. Below, you’ll find the screenshot of the free model’s metrics.

The model metrics of the paid model look significantly better: it achieved an average precision of 98.5% on the test set at a macro class-1 precision of 95.0% and a recall of 92.8%, as well as an accuracy score of 93.9%. Those results are close to the current benchmark, however not as close as I had hoped. Below, you’ll find the screenshot of the paid model’s metrics.

The “EVALUATE” tab also shows you further detailed metrics such as precision / recall curves as well as sliders for classification cutoffs that impact the respective model metrics. At the bottom of the page you’ll find the confusion matrix with relative frequencies of correct and misclassified examples. Furthermore, you can check images of false positives and false negatives per class (which is very helpful if you want to understand why and when your model is doing something wrong). Overall, the model evaluation functionalities are limited but user-friendly. As a more advanced user, of course, I would like to see more sophisticated features, but considering the target group and the status of development, I think it is pretty decent.

Prediction

After fitting and evaluating your model, you can use several methods to predict new images. First, you can use the AutoML user interface to upload new images from your local machine. This is a great way for inexperienced users to apply their model to new images and get predictions. For advanced users and developers, AutoML Vision exposes the model through an API on the GCP while taking care of all the technical infrastructure in the background. A simple Python script shows the basic usage of the API:

import sys
from google.cloud import automl_v1beta1


# Define client from service account json
client = automl_v1beta1.PredictionServiceClient.from_service_account_json(filename='automl-XXXXXX-d3d066fe6f8c.json')

# Endpoint
name = 'projects/automl-XXXXXX/locations/us-central1/models/ICNXXXXXX'

# Import a single image
with open('image_10.jpg', 'rb') as ff:
    img = ff.read()

# Define payload
payload = {'image': {'image_bytes': img}}

# Prediction
response = client.predict(name=name, payload=payload, params={})
print(response)

# Console output
payload {
  classification {
    score: 0.9356002807617188
  }
  display_name: "a_0"
}

As a third method, it is also possible to curl the API from the command line, if you want to go full nerdcore. I think the automated API exposure is a great feature because it lets you integrate your model into all kinds of scripts and applications. Furthermore, Google takes care of all the nitty-gritty things that come into play when you want to scale the model to hundreds or thousands of simultaneous API requests in a production environment.
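
To sketch what such an integration could look like, here is a minimal example that wraps the prediction call from above into a reusable helper and extracts the top label from the response (the service account file and the model resource name are placeholders; the response fields follow the structure of the console output shown above):

from google.cloud import automl_v1beta1

def predict_image(client, model_name, image_path):
    """Send a single image to the AutoML Vision endpoint and return (label, score)."""
    with open(image_path, 'rb') as f:
        payload = {'image': {'image_bytes': f.read()}}
    response = client.predict(name=model_name, payload=payload, params={})
    # The response payload mirrors the console output shown above
    top = response.payload[0]
    return top.display_name, top.classification.score

# Usage (placeholder credentials and model name)
client = automl_v1beta1.PredictionServiceClient.from_service_account_json(
    filename='automl-XXXXXX-d3d066fe6f8c.json')
model_name = 'projects/automl-XXXXXX/locations/us-central1/models/ICNXXXXXX'
label, score = predict_image(client, model_name, 'image_10.jpg')
print(label, score)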

Conclusion and outlook

In a nutshell, even the free model achieved pretty good results on the test set, given that the actual amount of time invested in the model was only a fraction of the time it would have taken to build the model manually. The paid model achieved significantly better results, however at a price tag of 480 $. Obviously, the paid service is targeted at data science professionals and companies.

AutoML Vision is only one part of a set of new AutoML applications coming to the Google Cloud (check these announcements from Google Next 18), further shaping the positioning of the platform in the direction of machine learning and AI.

Personally, I am confident that automated machine learning solutions will continue to make their way into professional data science projects and applications. With automated machine learning, you can (1) build baseline models for benchmarking your custom solutions, (2) iterate on use cases and data products faster, and (3) get more quickly to the point where you actually start to make money with your data – in production.