Here at STATWORX, a Data Science and AI consulting company, we thrive on creating data-driven solutions that can be acted on quickly and translate into real business value. We provide many of our solutions in the form of web applications to our customers to support them in their data-driven decision-making.

Containerization Allows Flexible Solutions

At the start of a project, we typically need to decide where and how to implement the solution we are going to create. There are several good reasons to deploy our solutions directly into the customer's IT infrastructure instead of hosting them externally. Our data science solutions often use sensitive data, and by deploying directly to the customer's infrastructure, we avoid data-related compliance or security issues. Furthermore, it allows us to build pipelines that automatically extract new data points from the source and incorporate them into the solution, so that it is always up to date. However, this also imposes some constraints on us: we need to work with the infrastructure provided by our customers. On the one hand, that requires us to develop solutions that can exist in all sorts of different environments. On the other hand, we need to adapt to changes in the customer's infrastructure quickly and efficiently. Both can be achieved by containerizing our solutions.

The Advantages of Containerization

Containerization has evolved as a lightweight alternative to virtualization. It involves packaging up software code and all its dependencies in a “container” so that the software can run on practically any infrastructure. Traditionally, an application was developed in a specific computing environment and then transferred to the production environment, often resulting in many bugs and errors, especially when these environments did not mirror each other, for example, when an application was moved from a local desktop computer to a virtual machine or from a Linux to a Windows operating system. A container platform like Docker allows us to store the whole application, with all the necessary code, system tools, libraries, and settings, in a container that can be shipped to and work uniformly in any environment. We can develop our applications in Docker from the start and do not have to worry about the specific infrastructure environment provided by our customers.
[Image: Docker Today]
There are some other advantages that come with using Docker in comparison to traditional virtual machines for the deployment of data science applications.
  • Efficiency – As containers share the host machine's OS kernel and do not require a guest OS per application, they use the provided infrastructure more efficiently, resulting in lower infrastructure costs.
  • Speed – The start of a container does not require a Guest OS reboot; it can be started, stopped, replicated, and destroyed in seconds. That speeds up the development process, the time to market, and the operational speed. Releasing new software or updates has never been so fast: Bugs can be fixed, and new features implemented in hours or days.
  • Scalability – Horizontal scaling allows additional containers to be started and stopped depending on the current demand.
  • Security – Docker provides the strongest default isolation capabilities in the industry. Containers run isolated from each other, which means that if one crashes, other containers serving the same application will keep running.

The Key Benefits of a Microservices Architecture

In connection with Docker, we use another emerging approach for delivering data science solutions: instead of providing one monolithic application that comes with all the required functionality, we create small, independent services that communicate with each other and together make up the complete application. Usually, we develop WebApps for our customers. As shown in the graphic, the WebApp communicates directly with the different backend microservices. Each one is designed for a specific task and exposes a REST API that allows for different HTTP requests. A mobile app, in contrast, accesses the backend microservices only indirectly: an API Gateway routes the requests to the desired microservices. It can also provide an API endpoint that invokes several backend microservices and aggregates the results, and it can be used for access control, caching, and load balancing. If suitable, you might also decide to place an API Gateway between the WebApp and the backend microservices.
[Image: Microservices]
In summary, splitting the application into small microservices has several advantages for us:
  • Agility – As services operate independently, we can update or fix bugs for a specific microservice without redeploying the entire application.
  • Technology freedom – Different microservices can be based on different technologies or languages, thus allowing us to use the best of all worlds.
  • Fault isolation – If an individual microservice becomes unavailable, it will not crash the entire application. Only the function provided by the specific microservice will not be provided.
  • Scalability – Services can be scaled independently. It is possible to scale only the services that carry the workload without scaling the entire application.
  • Reusability of services – Often, the functionality of the services we create is also requested by other departments and for other use cases. We then expose the services' application programming interfaces so that they can also be used independently of the focal application.

Containerized Microservices – The Best of Both Worlds!

The combination of Docker with a clean microservices architecture allows us to combine the advantages of both. Each microservice lives in its own Docker container. We deliver fast solutions that are consistent across environments, efficient in terms of resource consumption, and easily scalable and updatable. We are not bound to a specific infrastructure and can adjust to changes quickly and efficiently.
[Image: Containerized Microservices]

Conclusion

Often the deployment of a data science solution is one of the most challenging tasks within data science projects. But without a proper deployment, there won’t be any business value created. Hopefully, I was able to help you figure out how to optimize the implementation of your data science application. If you need further help bringing your data science solution into production, feel free to contact us!


In this blog post, I want to present two models of how to secure a REST API. Both models work with JSON Web Tokens (JWT). We suppose that token creation has already happened somewhere else, e.g., on the customer side. We at STATWORX are often interested in the verification part itself. The first model extends a running unsecured interface without changing any code in it. This model does not even rely on Web Tokens and can be applied in a broader context. The second model demonstrates how token verification can be implemented inside a Flask application.

About JWT

JSON Web Tokens are, in essence, JSON objects. They contain authority information like the issuer, the type of signing, and the validity span, as well as custom information like the username, user groups, and further personal display data. To be more precise, a token consists of three parts: the header, the payload, and the signature. Header and payload are JSON objects; all three parts are base64url encoded and put together in the following fashion, with dots separating the parts:
base64url(header).base64url(payload).base64url(signature)
All information is transparent and readable to whoever has the token (just decode the parts). Its security and validity stem from its digital signature, which guarantees that a Web Token is from its issuer. There are two ways of digital signing: with a secret, or with a public-private key pair in various forms and algorithms. This choice also influences the verification possibilities: a symmetric algorithm with a password can only be verified by the owner of the password – the issuer – whereas for asymmetric algorithms, the public part can be distributed and, therefore, used for verification. Besides this, JWT offers additional advantages and features. Here is a brief overview:
  • Decoupling: The application or the API will not need to implement a secure way to exchange passwords and verify them against a backend.
  • Purpose oriented: Tokens can be issued for a particular purpose (defined within the payload) and are only valid for this purpose.
  • Less password exchange: Tokens can be used multiple times without password interactions.
  • Expiration: Tokens expire. Even in the event of theft and criminal intent, the amount of information that can be obtained is limited to its validity span.
More information about JWT, and also a debugger, can be found at https://jwt.io.
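To see this transparency in action, here is a minimal Python sketch (not part of the patterns below) that decodes the header and payload of the signed example token that also appears in the Bearer example further down:
import base64
import json

def b64url_decode(part):
    # base64url encoding strips the padding; add it back before decoding
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

token = ("eyJhbGciOiJIUzI1NiJ9"
         ".eyJpbmZvIjoiSSdtIGEgc2lnbmVkIHRva2VuIn0"
         ".rjnRMAKcaRamEHnENhg0_Fqv7Obo-30U4bcI_v-nfEM")

header, payload, signature = token.split(".")
print(json.loads(b64url_decode(header)))   # {'alg': 'HS256'}
print(json.loads(b64url_decode(payload)))  # {'info': "I'm a signed token"}
Note that only verifying the signature requires a key; reading the content does not.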

Authorization Header

Let us now look at how authorization information is exchanged within HTTP requests. To authenticate against a server, you can use the request header Authorization, of which various types exist. Below are two common examples:
  • Basic
  Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
where the latter part is a base64 encoded string of user:password
  • Bearer
  Authorization: Bearer eyJhbGciOiJIUzI1NiJ9.eyJpbmZvIjoiSSdtIGEgc2lnbmVkIHRva2VuIn0.rjnRMAKcaRamEHnENhg0_Fqv7Obo-30U4bcI_v-nfEM
where the latter part is a JWT. Note that this information is transmitted in plain text unless the HTTPS protocol is used.
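A few lines of Python make clear why the Basic value must not be mistaken for encryption; this is just an illustration, not part of the patterns below:
import base64

# the Basic value is nothing but base64 of "user:password"
credentials = base64.b64encode(b"Aladdin:open sesame").decode()
print(credentials)                    # QWxhZGRpbjpvcGVuIHNlc2FtZQ==

# ...and anyone can reverse it
print(base64.b64decode(credentials))  # b'Aladdin:open sesame'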

Pattern 1 – Sidecar Verification

Let's start with the first pattern that I'd like to introduce. In this section, I'll present a setup with nginx that uses its sub-request feature. It is especially useful in cases where the implementation of an API cannot be changed. Nginx is a versatile tool: it acts as a web server, a load balancer, a content cache, and a reverse proxy. According to the Web Server Survey of April 2020, nginx is the most used web server for public web sites. With nginx and sub-requests, each incoming request has to pass through nginx, and the header, including the Authorization part, is passed to a sub-module (let's call it Auth Service). This sub-module checks the validity of the Authorization before the request is sent through to the actual REST API or declined. The decision depends on the status code returned by the Auth Service:
  • 20x: nginx passes the request over to the actual resource service
  • 401 or 403: nginx denies the access and sends the response from the authentication service instead
As you might have noticed, this setup does not exclusively build on JWTs, and therefore other authorization types can be used.
[Image: schema of the sub-request model – 1) the request is passed to the Auth Service for verification, 2) it is passed on to the REST API if the verification was successful.]

Nginx configuration

Two amendments need to be made to the nginx configuration for token validation:
  • Add an (internal) directive that forwards to the authorization service, which verifies the request
  • Add the authentication parameter to the directive that needs to be secured
The authorization directive is configured like this:
location /auth {
        internal;
        proxy_pass              http://auth-service/verify;
        proxy_pass_request_body off;
        proxy_set_header        Content-Length "";
        proxy_set_header        X-Original-URI $request_uri;
}
This cuts off the body and sends the rest of the request to the authorization service. Next, the configuration of the directive which forwards to your REST API:
location /protected {
        auth_request    /auth;
        proxy_pass      http://rest-service;
}
Note that auth_request points to the authorization directive we set up before. Please see the nginx documentation for additional information on the sub-request pattern.
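With both directives in place and the Auth Service from the next section running, a protected call can be exercised with curl; host and token are placeholders:
curl -i -H "Authorization: Bearer <token>" http://localhost/protected
The -i flag prints the response status, so you can directly observe the 20x or 401/403 decision described above.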

Basic Implementation of the Auth Service

In the sub-module, we use the jwcrypto library for the token operations. It offers all the features and algorithms needed for the authentication task. Many other libraries can do similar things; an overview can be found at jwt.io. Suppose the token was created with an asymmetric cryptographic algorithm like RSA and you have access to the public key (here called public.pem). Following our nginx configuration, we will add the directive /verify that does the verification job:
from jwcrypto import jwt, jwk
from flask import Flask, request

app = Flask(__name__)

# Load the public key
with open('public.pem', 'rb') as pemfile:
    public_key = jwk.JWK.from_pem(pemfile.read())


def parse_token(auth_header):
    # your implementation of Authorization Header extraction
    pass


@app.route("/verify")
def verify():
    token_raw = parse_token(request.headers["Authorization"])
    try:
        # decode the token; an invalid signature raises an exception
        token = jwt.JWT(jwt=token_raw, key=public_key)
        return "this is a secret information", 200
    except Exception:
        return "", 403
The script consists of three parts: reading the public key when the API starts, extracting the header information (a sketch follows below), and the actual verification, which is embedded in a try-except block.
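The extraction itself is not given in the original script; a minimal sketch, assuming the Bearer scheme from above, could look like this:
def parse_token(auth_header):
    # expect a header of the form "Bearer <token>"
    scheme, _, token = auth_header.partition(" ")
    if scheme != "Bearer" or not token:
        raise ValueError("invalid Authorization header")
    return token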

Pattern 2 – Verify within the API

In this section, we will implement the verification within our Flask API. There are packages available in the Flask universe, like flask_jwt; however, they do not offer the full scope of features, in particular not for our case here. Instead, we again use the jwcrypto library as before. It is further assumed that you have access to the public key, but this time a specific directive (here called /secured) is protected, and access should only be granted to admin users.
import json

from jwcrypto import jwt, jwk
from flask import Flask, request

app = Flask(__name__)

# Load the public key
with open('public.pem', 'rb') as pemfile:
    public_key = jwk.JWK.from_pem(pemfile.read())


@app.route("/secured")
def secured():
    # parse_token as defined in the previous example
    token_raw = parse_token(request.headers["Authorization"])
    try:
        # decode the token; an invalid signature raises an exception
        token = jwt.JWT(jwt=token_raw, key=public_key)
        # jwcrypto returns the validated claims as a JSON string
        claims = json.loads(token.claims)
        if "admin" in claims["groups"]:
            return "this is a secret information", 200
        else:
            return "", 403
    except Exception:
        return "", 403
The setup is the same as in the previous example, but with an additional check: whether the user is part of the admin group.
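A quick way to exercise the secured directive from a client is a GET request with the token attached; a minimal sketch with Python's requests, where the host is a placeholder and the token is assumed to have been issued elsewhere:
import requests

token_raw = "<token issued elsewhere>"
resp = requests.get("http://rest-service/secured",
                    headers={"Authorization": f"Bearer {token_raw}"})
print(resp.status_code)  # 200 for members of the admin group, 403 otherwise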

Summary

Two patterns were discussed, both of which use Web Tokens to authenticate requests. While the first pattern's charm lies in decoupling authorization from the API functionality, the second approach is more compact. It fits situations where the number of APIs is low or where the overhead caused by a separate service would be too high. The sidecar pattern, in turn, is perfect when you provide several APIs and want to unify authorization in one separate service. Web Tokens are used in popular authorization schemes like OAuth and OpenID Connect. In another blog post, I will focus on how they work and how they connect to the verification presented here. I hope you have enjoyed the post. Happy coding!

In my previous blog post, I showed you how to run your R-scripts inside a Docker container. For many of the projects we work on here at STATWORX, we end up using the RShiny framework to build our R-scripts into interactive applications. Using containerization for the deployment of ShinyApps has a multitude of advantages. There are the usual suspects such as easy cloud deployment, scalability, and easy scheduling, but it also addresses one of RShiny's essential drawbacks: Shiny creates only a single R session per app, meaning that if multiple users access the same app, they all work with the same R session, leading to a multitude of problems. With the help of Docker, we can circumvent this issue by starting a container instance for every user, giving every user access to their own instance of the app and their own corresponding R session. If you're not familiar with building R-scripts into a Docker image or with Docker terminology, I recommend you first read my previous blog post. So let's move on from simple R-scripts and run entire ShinyApps in Docker!

The Setup

Setting up a project

It is highly advisable to use RStudio's project setup when working with ShinyApps, especially when using Docker. Not only do projects make it easy to keep your RStudio neat and tidy, but they also allow us to use the renv package to set up a package library for our specific project. This will come in especially handy when installing the needed packages for our app to the Docker image. For demonstration purposes, I decided to use an example app created in a previous blog post, which you can clone from the STATWORX GitHub repository. It is located in the “example-app” subfolder and consists of the three typical scripts used by ShinyApps (global.R, ui.R, and server.R) as well as files belonging to the renv package library. If you choose to use the example app linked above, you won't have to set up your own RStudio project; you can instead open “example-app.Rproj”, which opens the project context I have already set up. If you choose to work along with an app of your own and haven't created a project for it yet, you can set one up by following the instructions provided by RStudio.

Setting up a package library

The RStudio project I provided already comes with a package library stored in the renv.lock file. If you prefer to work with your own app, you can create your own renv.lock file by installing the renv package from within your RStudio project and executing renv::init(). This initializes renv for your project and creates a renv.lock file in your project root folder. You can find more information on renv over at RStudio’s introduction article on it.

The Dockerfile

The Dockerfile is once again the central piece of creating a Docker image. We now aim to repeat this process for an entire app, where we previously only built a single script into an image. The step from a single script to a folder with multiple scripts is small, but there are some significant changes needed to make our app run smoothly.
# Base image https://hub.docker.com/u/rocker/
FROM rocker/shiny:latest

# system libraries of general use
## install debian packages
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
    libxml2-dev \
    libcairo2-dev \
    libsqlite3-dev \
    libmariadbd-dev \
    libpq-dev \
    libssh2-1-dev \
    unixodbc-dev \
    libcurl4-openssl-dev \
    libssl-dev

## update system libraries
RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get clean

# copy necessary files
## app folder
COPY /example-app ./app
## renv.lock file
COPY /example-app/renv.lock ./renv.lock

# install renv & restore packages
RUN Rscript -e 'install.packages("renv")'
RUN Rscript -e 'renv::consent(provided = TRUE)'
RUN Rscript -e 'renv::restore()'

# expose port
EXPOSE 3838

# run app on container start
CMD ["R", "-e", "shiny::runApp('/app', host = '0.0.0.0', port = 3838)"]

The base image

The first difference is in the base image. Because we’re dockerizing a ShinyApp here, we can save ourselves a lot of work by using the rocker/shiny base image. This image handles the necessary dependencies for running a ShinyApp and comes with multiple R packages already pre-installed.

Necessary files

It is necessary to copy all relevant scripts and files for your app to your Docker image, so the Dockerfile does precisely that by copying the entire folder containing the app to the image. We can also make use of renv to handle package installation for us, which is why we first copy the renv.lock file to the image separately. The renv package itself is installed using the Dockerfile's ability to execute R code via RUN Rscript -e. This allows us to then call renv directly and restore our package library inside the image with renv::restore(). Now our entire project package library will be installed in our Docker image, with the exact same version and source of all the packages as in your local development environment. All this with just a few lines of code in our Dockerfile.

Starting the App at Runtime

At the very end of our Dockerfile, we tell the container to execute the following R-command:
shiny::runApp('/app', host = '0.0.0.0', port = 3838)
The first argument allows us to specify the file path to our scripts, which in our case is /app. For the exposed port, I have chosen 3838, as this is the default for Shiny Server, but it can be freely changed to whatever suits you best. With the final command in place, every container based on this image will start the app automatically at runtime (and of course close it again once it's terminated).

The Finishing Touches

With the Dockerfile set up, we're now almost finished. All that remains is building the image and starting a container from it.

Building the image

We open the terminal, navigate to the folder containing our new Dockerfile, and start the building process:
docker build -t my-shinyapp-image . 

Starting a container

After the building process has finished, we can now test our newly built image by starting a container:
docker run -d --rm -p 3838:3838 my-shinyapp-image
And there it is, running on localhost:3838.
[Image: docker-shiny-app-example]
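If port 3838 is already taken on your machine, you can map the container port to any free host port; this is standard Docker port mapping, reusing the image name from above:
docker run -d --rm -p 80:3838 my-shinyapp-image
The app is then reachable on localhost:80, while Shiny still listens on 3838 inside the container.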

Outlook

Now that you have your ShinyApp running inside a Docker container, it is ready for deployment! Having containerized our app already makes this process a lot easier, and there are further tools we can employ to ensure state-of-the-art security, scalability, and seamless deployment. Stay tuned until next time, when we'll go deeper into the full range of RShiny and Docker capabilities by introducing ShinyProxy.

Getting the data in the quantity, quality, and format you need is often the most challenging part of data science projects. But it's also one of, if not the most important part. That's why my colleagues and I at STATWORX tend to spend a lot of time setting up good ETL processes. Thanks to frameworks like Airflow, this isn't just a Data Engineer's prerogative anymore. If you know a bit of SQL and Python, you can orchestrate your own ETL process like a pro. Read on to find out how!

ETL does not stand for Extraterrestrial Life

At least not in Data Science and Engineering. ETL stands for Extract, Transform, Load and describes a set of database operations. Extracting means we read the data from one or more data sources. Transforming means we clean, aggregate, or combine the data to get it into the shape we want. Finally, we load it into a destination database. Does your ETL process consist of Karen sending you an Excel sheet that you can spend your workday just scrolling down? Or do you have to manually query a database every day, tweaking your SQL queries to the occasion? If yes, venture a peek at Airflow.

How Airflow can help with ETL processes

Airflow is a Python-based framework that allows you to programmatically create, schedule, and monitor workflows. These workflows consist of tasks and dependencies that can be automated to run on a schedule. If anything fails, there are logs and error-handling facilities to help you fix it. Using Airflow can make your workflows more manageable, transparent, and efficient. And yes, I'm talking to you, fellow Data Scientists! Getting access to up-to-date, high-quality data is far too important to leave it only to the Data Engineers 😉 (we still love you). The point is, if you're working with data, you'll profit from knowing how to wield this powerful tool.

How Airflow works

To learn more about Airflow, check out this blog post from my colleague Marvin. It will get you up to speed quickly. Marvin explains in more detail how Airflow works and what advantages/disadvantages it has as a workflow manager. Also, he has an excellent quick-start guide with Docker. What matters to us is knowing that Airflow’s based on DAGs or Directed Acyclic Graphs that describe what tasks our workflow consists of and how these are connected.
Note that in this tutorial, we're not actually going to deploy the pipeline; otherwise, this post would be even longer, and you probably have friends and family that like to see you. So today is all about creating a DAG step by step. If you care more about the deployment side of things, stay tuned though! I plan to do a step-by-step guide on that in the next post. As a small solace, know that you can test every task of your workflow with:
airflow test [your dag id] [your task id] [execution date]
There are more options, but that’s all we need for now.
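For instance, once the DAG built below is in place, its first task could be tested like this (blog and check_bq_data_exists are the DAG and task ids defined further down; the date is arbitrary):
airflow test blog check_bq_data_exists 2020-02-20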

What you need to follow this tutorial

This tutorial shows you how you can use Airflow in combination with BigQuery and Google Cloud Storage to run a daily ETL process. So what you need is a Google Cloud account. If you already have one, you can hit the ground running! If not, consider opening one. You do need to provide a credit card, but don't worry, this tutorial won't cost you anything: if you sign up new, you get a free yearly trial period, and even if that one's expired, we're staying well within the bounds of Google's Always Free Tier. Finally, it helps if you know some SQL. I know it's something most Data Scientists don't find too sexy, but the more you use it, the more you like it. I guess it's the orthopedic shoes of Data Science: a bit ugly, sure, but unbeatable at what it's designed for. If you're not familiar with SQL or dread it like the plague, don't sweat it. Each query's name says what it does.

What BigQuery and Google Cloud Storage are

BigQuery and Cloud Storage are some of the most popular products of the Google Cloud Platform (GCP). BigQuery is a serverless cloud data warehouse that allows you to analyze up to petabytes of data at high speeds. Cloud Storage, on the other hand, is just that: a cloud-based object storage. Grossly simplified, we use BigQuery as a database to query and Cloud Storage as a place to save the results. In more detail, our ETL process:
  • checks for the existence of data in BigQuery and Google Cloud Storage
  • queries a BigQuery source table and writes the result to a table
  • ingests the Cloud Storage data into another BigQuery table
  • merges the two tables and writes the result back to Cloud Storage as a CSV

Connecting Airflow to these services

If you set up a Google Cloud account you should have a JSON authentication file. I suggest putting this in your home directory. We use this file to connect Airflow to BigQuery and Cloud Storage. To do this, just copy and paste these lines in your terminal, substituting your project ID and JSON path. You can read more about connections here.
# for bigquery
airflow connections -d --conn_id bigquery_default

airflow connections -a --conn_id bigquery_default --conn_uri 'google-cloud-platform://:@:?extra__google_cloud_platform__project=[YOUR PROJECT ID]&extra__google_cloud_platform__key_path=[PATH TO YOUR JSON]'


# for google cloud storage
airflow connections -d --conn_id google_cloud_default

airflow connections -a --conn_id google_cloud_default --conn_uri 'google-cloud-platform://:@:?extra__google_cloud_platform__project=[YOUR PROJECT ID]&extra__google_cloud_platform__key_path=[PATH TO YOUR JSON]'

Writing our DAG

Task 0: Setting the start date and the schedule interval

We are ready to define our DAG! This DAG consists of multiple tasks, or things our ETL process should do. Each task is instantiated by a so-called operator. Since we're working with BigQuery and Cloud Storage, we take the appropriate Google Cloud Platform (GCP) operators. Before defining any tasks, we specify the start date and schedule interval. The start date is the date when the DAG first runs; in this case, I picked February 20th, 2020. The schedule interval is how often the DAG runs, i.e., on what schedule. You can use cron notation here, a timedelta object, or one of Airflow's cron presets (e.g., '@daily'). Tip: you normally want to keep the start date static to avoid unpredictable behavior (so no datetime.now() shenanigans, even though it may seem tempting).
# set start date and schedule interval
start_date = datetime(2020, 2, 20)
schedule_interval = timedelta(days=1)
You can find the complete DAG file on our STATWORX GitHub. There you also see all the config parameters I set, e.g., what our project, dataset, buckets, and tables are called, as well as all the queries. We gloss over them here, as they're not central to understanding the DAG.
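One of those parameters is the default_args dictionary that the DAG definition below expects. The concrete values live in the config on GitHub; a minimal sketch, reusing the start_date defined above and with assumed values otherwise, could look like this:
from datetime import timedelta

# minimal sketch; start_date is the variable defined above, the other
# values are assumptions
default_args = {
    'owner': 'airflow',
    'start_date': start_date,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}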

Task 1: Check that there is data in BigQuery

We set up our DAG taking advantage of Python's context manager. You don't have to do this, but it saves some typing. A little detail: I'm setting catchup=False because I don't want Airflow to do a backfill on my data.
# write dag
with DAG(dag_id='blog', default_args=default_args, schedule_interval=schedule_interval, catchup=False) as dag:

    t1 = BigQueryCheckOperator(task_id='check_bq_data_exists',
                               sql=queries.check_bq_data_exists,
                               use_legacy_sql=False)
We start by checking if the data for the date we're interested in is available in the source table. Our source table, in this case, is the Google public dataset bigquery-public-data.austin_waste.waste_and_diversion. To perform this check, we use the aptly named BigQueryCheckOperator and pass it an SQL query. If the check_bq_data_exists query returns even one non-null row of data, we consider it successful, and the next task can run. Notice that we're making use of macros and Jinja templating to dynamically insert dates into our queries; these templates are rendered at runtime.
check_bq_data_exists = """
SELECT load_id
FROM `bigquery-public-data.austin_waste.waste_and_diversion`
WHERE report_date BETWEEN DATE('{{ macros.ds_add(ds, -365) }}') AND DATE('{{ ds }}')
"""

Task 2: Check that there is data in Cloud Storage

Next, let's check that the CSV file is in Cloud Storage. If you're following along, just download the file from the STATWORX GitHub and upload it to your bucket. We pretend that this CSV gets uploaded to Cloud Storage by another process and contains data that we need (in reality, I just extracted it from the same source table). So we don't care how the CSV got into the bucket; we just want to know: is it there? This can easily be verified in Airflow with the GoogleCloudStorageObjectSensor, which checks for the existence of a file in Cloud Storage. Notice the indent, because it's still part of our DAG context. Defining the task itself is simple: just tell Airflow which object to look for in your bucket.
    t2 = GoogleCloudStorageObjectSensor(task_id='check_gcs_file_exists',
                                        bucket=cfg.BUCKET,
                                        object=cfg.SOURCE_OBJECT)

Task 3: Extract data and save it to BigQuery

If the first two tasks succeed, then all the data we need is available! Now let’s extract some data from the source table and save it to a new table of our own. For this purpose, there’s none better than the BigQueryOperator.
    t3 = BigQueryOperator(task_id='write_weight_data_to_bq',
                          sql=queries.write_weight_data_to_bq,
                          destination_dataset_table=cfg.BQ_TABLE_WEIGHT,
                          create_disposition='CREATE_IF_NEEDED',
                          write_disposition='WRITE_TRUNCATE',
                          use_legacy_sql=False)
This operator sends a query called write_weight_data_to_bq to BigQuery and saves the result in a table specified by the config parameter cfg.BQ_TABLE_WEIGHT. We can also set a create and write disposition if we so choose. The query itself pulls the total weight of dead animals collected every day by Austin waste management services for a year. If the thought of possum pancakes makes you queasy, just substitute ‘RECYCLING – PAPER’ for the TYPE variable in the config file.

Task 4: Ingest Cloud Storage data into BigQuery

Once we’re done extracting the data above, we need to get the data that’s currently in our Cloud Storage bucket into BigQuery as well. To do this, just tell Airflow what (source) object from your Cloud Storage bucket should go to which (destination) table in your BigQuery dataset. Tip: You can also specify a schema at this step, but I didn’t bother since the autodetect option worked well.
    t4 = GoogleCloudStorageToBigQueryOperator(task_id='write_route_data_to_bq',
                                              bucket=cfg.BUCKET,
                                              source_objects=[cfg.SOURCE_OBJECT],
                                              field_delimiter=';',
                                              destination_project_dataset_table=cfg.BQ_TABLE_ROUTE,
                                              create_disposition='CREATE_IF_NEEDED',
                                              write_disposition='WRITE_TRUNCATE',
                                              skip_leading_rows=1)

Task 5: Merge BigQuery and Cloud Storage data

Now we have both the BigQuery source table extract and the CSV data from Cloud Storage in two separate BigQuery tables. Time to merge them! How do we do this? The BigQueryOperator is our friend here. We just pass it a SQL query that specifies how we want the tables merged. By specifying the destination_dataset_table argument, it'll put the result into a table of our choosing.
    t5 = BigQueryOperator(task_id='prepare_and_merge_data',
                          sql=queries.prepare_and_merge_data,
                          use_legacy_sql=False,
                          destination_dataset_table=cfg.BQ_TABLE_MERGE,
                          create_disposition='CREATE_IF_NEEDED',
                          write_disposition='WRITE_TRUNCATE')
Here is the query. I know it looks long and excruciating, but trust me, there are Tinder dates worse than this ('So, uh, do you like SQL?' – 'Which one?'). If he/she follows it up with Lord of the Rings or 'Yes', propose!
What is it that this query is doing? Let’s recap: We have two tables at the moment. One is basically a time series on how much dead animal waste Austin public services collected over the course of a year. The second table contains information on what type of routes were driven on those days. As if this pipeline wasn’t weird enough, we now also want to know what the most common route type was on a given day. So we start by counting what route types were recorded on a given day. Next, we use a window function (... OVER (PARTITION BY ... ORDER BY ...)) to find the route type with the highest count for each day. In the end, we pull it out and using the date as a key, merge it to the table with the waste info.
prepare_and_merge_data = """
WITH
simple_route_counts AS (
SELECT report_date,
       route_type,
       count(route_type) AS count
FROM `my-first-project-238015.waste.route` 
GROUP BY report_date, route_type
),
max_route_counts AS (
SELECT report_date,
       FIRST_VALUE(route_type) OVER (PARTITION BY report_date ORDER BY count DESC) AS top_route,
       ROW_NUMBER() OVER (PARTITION BY report_date ORDER BY count desc) AS row_number
FROM simple_route_counts
),
top_routes AS (
SELECT report_date AS date,
       top_route,
FROM max_route_counts
WHERE row_number = 1
)
SELECT a.date,
       a.type,
       a.weight,
       b.top_route
FROM `my-first-project-238015.waste.weight` a
LEFT JOIN top_routes b
ON a.date = b.date
ORDER BY a.date DESC
"""

Task 6: Export result to Google Cloud Storage

Let’s finish off this process by exporting our result back to Cloud Storage. By now, you’re probably guessing what the right operator is called. If you guessed BigQueryToCloudStorageOperator, you’re spot on. How to use it though? Just specify what the source table and the path (uri) to the Cloud Storage bucket are called.
    t6 = BigQueryToCloudStorageOperator(task_id='export_results_to_gcs',
                                        source_project_dataset_table=cfg.BQ_TABLE_MERGE,
                                        destination_cloud_storage_uris=cfg.DESTINATION_URI,
                                        export_format='CSV')
The only thing left to do now is to determine how the tasks relate to each other, i.e. set the dependencies. We can do this using the >> notation which I find more readable than set_upstream() or set_downstream(). But take your pick.
    t1 >> t2 >> [t3, t4] >> t5 >> t6
The notation above says: if data is available in the BigQuery source table, check next if data is also available in Cloud Storage. If so, go ahead, extract the data from the source table and save it to a new BigQuery table. In addition, transfer the CSV file data from Cloud Storage into a separate BigQuery table. Once those two tasks are done, merge the two newly created tables. At last, export the merged table to Cloud Storage as a CSV.

Conclusion

That’s it! Thank you very much for sticking with me to the end! Let’s wrap up what we did: we wrote a DAG file to define an automated ETL process that extracts, transforms and loads data with the help of two Google Cloud Platform services: BigQuery and Cloud Storage. Everything we needed was some Python and SQL code. What’s next? There’s much more to explore, from using the Web UI, monitoring your workflows, dynamically creating tasks, orchestrating machine learning models, and, and, and. We barely scratched the surface here. So check out Airflow’s official website to learn more. For now, I hope you got a better sense of the possibilities you have with Airflow and how you can harness its power to manage and automate your workflows.


REST APIs have become a quasi-standard, be it to provide an interface to your application processes or to set up a flexible microservice architecture. Sooner or later, you might ask yourself what a proper testing schema would look like and which tools can support you. Some time ago, we at STATWORX asked ourselves this question. A toolset that helps us with this task is Postman and Newman, which I will present to you in this blog post. Many of you who regularly use and develop REST APIs might already be quite familiar with Postman. It's a handy and comfortable desktop tool that comes with some excellent features (see below). Newman, in contrast, is a command-line agent that runs previously defined requests. Because of its lean interface, it can be used in several situations; for instance, it can easily be integrated into the testing stages of pipelines.
In the following, I will explain how these siblings can be used to implement a neat testing environment. We start with Postman's feature set, then move on to the interaction with Newman. We will further have a look at a testing schema, touch on some test cases, and lastly integrate everything into a Jenkins pipeline.

About Postman

Postman is a convenient desktop tool for handling REST requests. Furthermore, Postman gives you the possibility to define test cases (in JavaScript), has a feature to switch environments, and provides you with Pre-Request steps to set up the stage before your calls. In the following, I will give you examples of some interesting features.

Collection and Requests

Requests are the basic unit in Postman, and everything else revolves around them. As said previously, Postman's GUI provides you with a comfortable way to define them: the request method can be picked from a drop-down list, header information is presented clearly, there is a helper for authorization, and much more.
You should define at least one collection per REST interface to bundle your requests. At the very end of the definition process, collections can be exported into JSON format. This result will later be exploited by Newman.

Environments

Postman also implements the concept of environment variables. This means that depending on where your requests are fired from, the variables adapt. The API's hostname is a good example of something that should be kept variable: in the development stage, it may just be your localhost, but it could differ in a dockerized environment.
The syntax for environment variables is double curly brackets. If you want to use the hostname variable, put it like this: {{ hostname }}
As with collections, environments can be exported into JSON files. We should keep this in mind when we move to Newman.

Tests

Each API request in Postman should come along with at least one test. I propose the following list as an orientation for what to test:
  • The status code: Check the status code according to your expectation: regular GET requests are supposed to return 200 OK, successful POST requests 201 Created. Authorization should be tested as well, and invalid client requests are supposed to return 40x. See below a POST request test:
pm.test("Successful POST request", function () {
     pm.expect(pm.response.code).to.be.oneOf([201,202]);
 });
  • Whether data is returned: Test if the response contains any data, as a first approximation.
  • The schema of returned data: Test if the structure of the response data fits the expectations: non-nullable fields, data types, names of properties. Find below an example of a schema validation:
pm.test("Response has correct schema", function () {
    var schema = {"type":"object",
                  "properties":{
                      "access_token":{"type":"string"},
                      "created_on":{"type":"string"},
                      "expires_seconds":{"type":"number"}
                  }};
    var jsonData = pm.response.json();
    pm.expect(tv4.validate(jsonData,schema)).to.be.true;
});
  • Values of returned data: Check if the values of the response data are sound; here, a check for non-negative values:
pm.test("Expires non negative", function() {
    pm.expect(pm.response.json().expires_seconds).to.be.above(0);
})
  • Header values: Check the response header if relevant information is stored there.
All tests have to be written in JavaScript. Postman ships with its own test library and with tv4 for schema validation.

Introduction to Newman

As mentioned before, Newman acts as an executor of what was defined in Postman. To generate results, Newman uses reporters. Reporters can be the command-line interface itself, but known standards such as JUnit are available as well. The simplest way to install Newman is via npm (the Node package manager); ready-to-use Docker images of NodeJS can be found on Docker Hub. Install the package via npm install -g newman. There are two ways to call Newman: the command-line interface and within JS code. We will only focus on the former.

Calling the CLI

To run a predefined test collection, use the command newman run. Please see the example below:
newman run \
    --reporters cli,junit \
    --reporter-junit-export /test/out/report.xml \
    -e /test/env/auth_jwt-docker.pmenv.json \
    /test/src/auth_jwt-test.pmc.json
Let us take a closer look: recall that we previously exported the collection and the environment from Postman. The environment can be attached with the -e option. Moreover, two reporters were specified: cli, which prints to the terminal, and junit, which additionally exports a report to the file report.xml. The CLI reporter prints the following (note that the first three test cases are those from the test schema proposal):
→ jwt-new-token
  POST http://tp_auth_jwt:5000/new-token/bot123 [201 CREATED, 523B, 42ms]
  ✓  Successful POST request
  ✓  Response has correct schema
  ✓  Expires non negative

→ jwt-auth
  POST http://tp_auth_jwt:5000/new-token/test [201 CREATED, 521B, 11ms]
  GET http://tp_auth_jwt:5000/auth [200 OK, 176B, 9ms]
  ✓  Status code is 200
  ✓  Login name is correct

→ jwt-auth-no-token
  GET http://tp_auth_jwt:5000/auth [401 UNAUTHORIZED, 201B, 9ms]
  ✓  Status is 401 or 403

→ jwt-auth-bad-token
  GET http://tp_auth_jwt:5000/auth [403 FORBIDDEN, 166B, 6ms]
  ✓  Status is 401 or 403

Integration into Jenkins

Newman's functionality can now be integrated into (almost?) any pipeline tool. For Jenkins, we create a Docker image based on NodeJS with Newman installed. Next, we either pack or mount both the environment and the collection file into the Docker container. When running the container, we use Newman as a command-line tool, just as we did before. To use this in a test stage of a pipeline, we have to make sure that the REST API is actually running when Newman is executed. In the following example, the functionalities were defined as targets of a Makefile:
  • run to run the REST API with all dependencies
  • test to run Newman container which itself runs the testing collections
  • rm to stop and remove the REST API
After the API has been tested, the JUnit report is digested by Jenkins with the command junit <report>. See below a pipeline snippet of a test run:
node {
    stage('Test') {
        try {
            sh "cd docker && make run"
            sh "sleep 5"
            sh "cd docker && make test"
            junit "source/test/out/report.xml"
        } catch (Exception e) {
            echo e.toString()
        } finally {
            sh "cd docker && make rm"
        }
    }
}

Summary

Now it's time to code tests for your own REST API. Please also try to integrate them into your build-test cycle and into your automation pipeline, because automation and defined processes are crucial to delivering reliable code and packages. I hope that with this blog post, you now have a better understanding of how Postman and Newman can be used to implement a test framework for REST APIs. Postman was used as the definition tool, whereas Newman was the runner of these definitions. Because of its lean nature, we have also seen that Newman is the tool of choice for your build pipeline. Happy coding!

We’re hiring!

Data Engineering is your jam and you're looking for a job? We're currently looking for Junior Consultants and Consultants in Data Engineering. Check the requirements and benefits of working with us on our career site. We're looking forward to your application!

Data operations is an increasingly important part of data science because it enables companies to feed large business data back into production effectively. We at STATWORX, therefore, operationalize our models and algorithms by translating them into Application Programming Interfaces (APIs). Representational State Transfer (REST) APIs are well suited to be implemented as part of a modern microservices infrastructure. They are flexible, easy to deploy, scale, and maintain, and they are accessible by multiple clients and client types at the same time. Their primary purpose is to simplify programming by abstracting the underlying implementation and only exposing the objects and actions that are needed for further development and interaction. An additional advantage of APIs is that they allow code written in different programming languages or by different development teams to be combined easily. This is because APIs are naturally separated from each other, and communication with and between APIs is handled via IP or URL (HTTP), typically using JSON or XML format. Imagine, e.g., an infrastructure where an API written in Python and one written in R communicate with each other and serve an application written in JavaScript.

In this blog post, I will show you how to translate a simple R script, which transforms tables from wide to long format, into a REST API with the R package Plumber, and how to run it locally or with Docker. I have created this example API for our trainee program, and it serves our new data scientists and engineers as a starting point to familiarize themselves with the subject.

Translate the R Script

Transforming an R script into a REST API is quite easy. All you need, in addition to R and RStudio, is the package Plumber and, optionally, Docker. REST APIs are interacted with by sending REST requests, the most commonly used of which are GET, PUT, POST, and DELETE. Here is the code of the example API, which transforms tables from wide to long or from long to wide format:
## transform wide to long and long to wide format
#' @post /widelong
#' @get /widelong
function(req) {
  # library
  require(tidyr)
  require(dplyr)
  require(magrittr)
  require(httr)
  require(jsonlite)

  # post body
  body <- jsonlite::fromJSON(req$postBody)

  .data <- body$.data
  .trans <- body$.trans
  .key <- body$.key
  .value <- body$.value
  .select <- body$.select

  # wide or long transformation
  if(.trans == 'l' || .trans == 'long') {
    .data %<>% gather(key = !!.key, value = !!.value, !!.select)
    return(.data)
  } else if(.trans == 'w' || .trans == 'wide') {
    .data %<>% spread(key = !!.key, value = !!.value)
    return(.data)
  } else {
    print('Please specify the transformation')
  }
}
As you can see, it is a standard R function, extended by the special Plumber comments @post and @get, which enable the API to respond to those types of requests. The path /widelong is added to both, so any incoming request has to address it. That is done because it is possible to stack several API functions which respond to different paths; we could, e.g., add another function with the path /naremove to our API, which removes NAs from tables.

The R function itself has one argument, req, which is used to receive a (POST) request body. In general, there are two different ways to send additional arguments and objects to a REST API: the header and the body. I decided to use a body only and no header at all, which makes the API cleaner and safer and allows us to send larger objects. A header could, e.g., be used to set some optional function arguments, but should otherwise be used sparingly. Using a body with the API is also the reason to allow for both GET and POST requests (@post, @get). While some clients prefer to send a body with a GET request when they do not permanently post something to the server, many other clients do not have the option to send a body with a GET request at all. In this case, it is mandatory to accept POST requests. Typical clients are applications, Integrated Development Environments (IDEs), and other APIs. By accepting both request types, our API gains greater response flexibility.

For the request-response format of the API, I decided to stick with JavaScript Object Notation (JSON), which is probably the most common format. It would also be possible to use Extensible Markup Language (XML) with Plumber instead. The decision for one or the other will most likely depend on which additional R packages you want to use and on which format the API's clients predominantly use. The R packages used to handle REST requests in my example API are jsonlite and httr. The three Tidyverse packages are used to do the table transformation to wide or long.

RUN the API

The finished REST API can be run locally with R or RStudio as follows:
library(plumber)

widelong_api <- plumber::plumb("./path/to/directory/widelongwide.R")
widelong_api$run(host = '127.0.0.1', port = 8000)
Upon starting the API, the Plumber package provides us with an IP address and a port, and a client, e.g., another R instance, can now begin to send REST requests. It also opens a browser tool called Swagger, which can be useful to check whether your API is working as intended. Once the development of an API is finished, I suggest building a Docker image and running it in a container. That makes the API highly portable and independent of its host system. Since we want to use most APIs in production and deploy them to, e.g., a company server or the cloud, this is especially important. Here is the Dockerfile to build the Docker image of the example API:
FROM trestletech/plumber

# Install dependencies
RUN apt-get update --allow-releaseinfo-change && apt-get install -y \
    liblapack-dev \
    libpq-dev

# Install R packages
RUN R -e "install.packages(c('tidyr', 'dplyr', 'magrittr', 'httr', 'jsonlite'), 
repos = 'http://cran.us.r-project.org')"

# Add API
COPY ./path/to/directory/widelongwide.R /widelongwide.R

# Make port available
EXPOSE 8000

# Entrypoint
ENTRYPOINT ["R", "-e", 
"widelong <- plumber::plumb('widelongwide.R'); 
widelong$run(host = '0.0.0.0', port= 8000)"]

CMD ["/widelongwide.R"]

Send a REST Request

The wide-long example API can generally respond to any client sending a POST or GET request with a body in JSON format that contains a table in CSV format and all needed information on how to transform it. One example is a web application, which I have written for our trainee program to supplement the wide-long API.
The application is written in R Shiny, which is a great R package to transform your static plots and outputs into an interactive dashboard. If you are interested in how to create dashboards in R, check out other posts on our STATWORX Blog. Last but not least, here is an example of how to send a REST request from R or RStudio:
library(httr)
library(jsonlite)
options(stringsAsFactors = FALSE)

# url for local testing
url <- "http://127.0.0.1:8000"

# url for docker container
url <- "http://0.0.0.0:8000"

# read example stock data
.data <- read.csv('./path/to/data/stocks.csv')

# create example body
body <- list(
  .data = .data,
  .trans = "w",
  .key = "stock",
  .value = "price",
  .select = c("X","Y","Z")
)

# set API path
path <- 'widelong'

# send POST Request to API
raw.result <- POST(url = url, path = path, body = body, encode = 'json')

# check status code
raw.result$status_code

# retrieve transformed example stock data
.t_data <- fromJSON(rawToChar(raw.result$content))
As you can see, it is quite easy to make REST Requests in R. If you need some test data, you could use the stocks data example from the Tidyverse.

Summary

In this blog post, I showed you how to translate a simple R script, which transforms tables from wide to long format, into a REST API with the R package Plumber, and how to run it locally or with Docker. I hope you enjoyed the read and learned something about operationalizing R scripts into REST APIs. You are welcome to copy and use any code from this blog post to create your own REST APIs with R. Until then, stay tuned and visit our STATWORX Blog again soon.

We’re hiring!

Data Engineering is your jam and you're looking for a job? We're currently looking for Junior Consultants and Consultants in Data Engineering. Check the requirements and benefits of working with us on our career site. We're looking forward to your application!

Livy is a REST web service for submitting Spark jobs or accessing – and thus sharing – long-running Spark sessions from a remote place. Instead of tedious configuration and installation of your Spark client, Livy takes over the work and provides you with a simple and convenient interface. We at STATWORX use Livy to submit Spark jobs from Apache's workflow tool Airflow to volatile Amazon EMR clusters. Besides, several colleagues with different scripting-language skills share a running Spark cluster. Another great aspect of Livy is that you can choose from a range of scripting languages: Java, Scala, Python, R. As is the case for Spark itself, which one of them you should/can use depends on your use case (and on your skills).
[Image: Livy architecture – https://livy.incubator.apache.org/]
Apache Livy is still in the Incubator state, and its code can be found at the Git project.

When you should use it

Since REST APIs are easy to integrate into your application, you should use Livy when:
  • multiple clients want to share a Spark Session.
  • the clients are lean and should not be overloaded with installation and configuration.
  • you need a quick setup to access your Spark cluster.
  • you want to integrate Spark into an app on your mobile device.
  • you have volatile clusters, and you do not want to adapt configuration every time.
  • a remote workflow tool submits spark jobs.

Preconditions

Livy is generally user-friendly, and you do not really need much preparation. All you basically need is an HTTP client to communicate with Livy's REST API. REST APIs are known to be easy to access (states and lists are accessible even by browsers), and HTTP(S) is a familiar protocol (status codes to handle exceptions, actions like GET and POST, etc.) while providing all the security measures needed. Since Livy is an agent for your Spark requests and carries your code (either as script snippets or packages for submission) to the cluster, you actually have to write code (or have someone write it for you, or have a package ready for submission at hand). I opted to mainly use Python as the Spark script language in this blog post and also to interact with the Livy interface itself. Some examples were executed via curl, too.

How to use Livy

There are two modes to interact with the Livy interface:
  • Interactive Sessions provide a running session that you can send statements to. Provided that resources are available, these will be executed, and output can be obtained. This mode can be used to experiment with data or to have quick calculations done.
  • Jobs/Batch submit code packages like programs. A typical use case is a regular task equipped with some arguments and workload done in the background. This could be a data preparation task, for instance, which takes input and output directories as parameters.
In the following, we will have a closer look at both cases and the typical process of submission. Each case will be illustrated by examples.

Interactive Sessions

Let’s start with an example of an interactive Spark session. Throughout the example, I use python and its requests package to send requests to and retrieve responses from the REST API. As mentioned before, you do not have to follow this path, and you could use your preferred HTTP client instead (provided that it also supports POST and DELETE requests). To start a Spark session, there is a bunch of parameters to configure (you can look up the specifics in the Livy documentation), but for this blog post, we stick to the basics and only specify the session’s name and the kind of code. If you have already submitted Spark code without Livy, parameters like executorMemory and (YARN) queue might sound familiar, and in case you run more elaborate tasks that need extra packages, you will definitely know that the jars parameter needs configuration as well. To initiate the session, we have to send a POST request to the directive /sessions along with the parameters.
import json
import requests

LIVY_HOST = 'http://livy-server:8998'  # Livy listens on port 8998 by default

directive = '/sessions'
headers = {'Content-Type': 'application/json'}

data = {'kind':'pyspark','name':'first-livy'}

resp = requests.post(LIVY_HOST+directive, headers=headers, data=json.dumps(data))

if resp.status_code == requests.codes.created:
    session_id = resp.json()['id']
else:
    # session creation failed; CustomError is a placeholder for your own error handling
    raise CustomError()
Livy, in return, responds with an identifier for the session that we extract from its response. Note that the session might need some boot time until YARN (a resource manager in the Hadoop world) has allocated all the resources. Meanwhile, we check the state of the session by querying the directive /sessions/{session_id}/state – a minimal polling sketch follows below. Once the state is idle, we are able to execute commands against the session. To execute Spark code, statements are the way to go: the code is wrapped into the body of a POST request and sent to the right directive, sessions/{session_id}/statements.
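Such a readiness check could look like this – a sketch reusing the requests module and the variables defined above; the 2-second interval is an arbitrary choice:

import time

# poll the session state until YARN has allocated all resources
while requests.get(LIVY_HOST + f'/sessions/{session_id}/state').json()['state'] != 'idle':
    time.sleep(2)

Once the session reports idle, we can submit our first statement: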
directive = f'/sessions/{session_id}/statements'

data = {'code':'...'}

resp = requests.post(LIVY_HOST+directive, headers=headers, data=json.dumps(data))
As response message, we are provided with the following attributes:
  • id – identifies the statement
  • code – the code, once again, that has been executed
  • state – the state of the execution
  • output – the output of the statement
The statement passes through several states (see below), and depending on your code, your interaction (a statement can also be cancelled), and the resources available, it will more or less likely end up in the success state. The crucial point here is that we have control over the status and can act accordingly.
statement-state
By the way, cancelling a statement is done via a POST request to /sessions/{session_id}/statements/{statement_id}/cancel. It is time now to submit a statement: let us imagine being one of the classmates of Gauss, asked to sum up the numbers from 1 to 999. Luckily, you have access to a Spark cluster and – even more luckily – it has the Livy REST API running, which we are connected to via our mobile app. All we have to do is write the following Spark code:
import textwrap

code = textwrap.dedent("""df = spark.createDataFrame(list(range(1,1000)),'int')
df.groupBy().sum().collect()[0]['sum(value)']""")

code_packed = {'code':code}
This is all the logic we need to define. The rest is the execution against the REST API:
import time

directive = f'/sessions/{session_id}/statements'
resp = requests.post(LIVY_HOST+directive, headers=headers, data=json.dumps(code_packed))
if resp.status_code == requests.codes.created:
    stmt_id = resp.json()['id']

    while True:
        info_resp = requests.get(LIVY_HOST+f'/sessions/{session_id}/statements/{stmt_id}')
        if info_resp.status_code == requests.codes.ok:
            state = info_resp.json()['state']
            if state in ('waiting','running'):
                time.sleep(2)
            elif state in ('cancelling','cancelled','error'):
                raise CustomException()
            else:
                break
        else:
            raise CustomException()
    print(info_resp.json()['output'])

else:
    # something went wrong with the creation of the statement
    raise CustomException()
Every 2 seconds, we check the state of the statement and treat the outcome accordingly: we stop the monitoring as soon as the state equals available. Obviously, some more additions could be made: the error state would probably be treated differently from the cancel cases, and it would also be wise to set up a timeout to jump out of the loop at some point in time (see the sketch at the end of this section). Assuming the code was executed successfully, we take a look at the output attribute of the response:
{'status': 'ok', 'execution_count': 2, 'data': {'text/plain': '499500'}}
There we go, the answer is 499500. Finally, we kill the session again to free resources for others:
directive = f'/sessions/{session_id}'  # note: we delete the session itself, not its statements
requests.delete(LIVY_HOST+directive)
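As mentioned above, a timeout guard around the polling loop is advisable. A minimal sketch of such a helper, reusing LIVY_HOST from above – the function name, the defaults, and the use of TimeoutError are our own choices, not part of the Livy API:

import time
import requests

def wait_for_statement(session_id, stmt_id, timeout=120, poll_interval=2):
    """Poll a statement until it leaves the waiting/running states or the timeout hits."""
    deadline = time.time() + timeout
    directive = f'/sessions/{session_id}/statements/{stmt_id}'
    while time.time() < deadline:
        info = requests.get(LIVY_HOST + directive)
        info.raise_for_status()  # turn HTTP errors into exceptions
        if info.json()['state'] not in ('waiting', 'running'):
            return info.json()
        time.sleep(poll_interval)
    raise TimeoutError(f'statement {stmt_id} did not finish within {timeout} seconds')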

Job Submission

We now want to move to a more compact solution. Say we have a package ready to solve some sort of problem, packed as a jar or as a python script. What only needs to be added are some parameters – like input files, output directory, and some flags. For the sake of simplicity, we will make use of the well-known Wordcount example, which Spark gladly offers an implementation of: read a rather big file and determine how often each word appears. We again pick python as the Spark language. This time, curl is used as the HTTP client. As an example file, I have copied the Wikipedia entry found when typing in “Livy”. The text is actually about the Roman historian Titus Livius. I have moved to the AWS cloud for this example because it offers a convenient way to set up a cluster equipped with Livy, and files can easily be stored in S3 by an upload handler. Let’s now see how we should proceed:
curl -X POST --data '{"file":"s3://livy-example/wordcount.py","args":["s3://livy-example/livy_life.txt"]}' \
-H "Content-Type: application/json" http://livy-server:8998/batches
The structure is quite similar to what we have seen before. By passing over the batch to Livy, we get an identifier in return along with some other information like the current state.
{"id":1,"name":null,"state":"running","appId":"application_1567416002081_0005",...}
To monitor the progress of the job, there is also a directive to call: /batches/{batch_id}/state. Most probably, we first want to make sure that the job ran successfully. In all other cases, we need to find out what happened to our job; the directive /batches/{batch_id}/log can help here to inspect the run. Finally, the session is removed by:
curl -X DELETE http://livy-server:8998/batches/1 
which returns {"msg":"deleted"}, and we are done.
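For completeness, polling the batch state from the shell looks like this (assuming the batch id 1 from above):

curl http://livy-server:8998/batches/1/state

This should return a small JSON document such as {"id":1,"state":"success"}.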

Trivia

  • AWS’ Hadoop cluster service EMR supports Livy natively as a software configuration option.
aws-emr
  • Apache’s notebook tool Zeppelin supports Livy as an Interpreter, i.e. you write code within the notebook and execute it directly against Livy REST API without handling HTTP yourself.
  • Be cautious not to use Livy in every case when you want to query a Spark cluster: in case you want to use Spark as a query backend and access data via Spark SQL, rather check out the Spark Thrift Server instead of building around Livy.
  • Kerberos can be integrated into Livy for authentication purposes.
In the last Docker tutorial, Olli presented how to build a Docker image of base R scripts with rocker and how to run them in a container. Based on that, I’m going to discuss how to automate the process by using a bash/shell script. Since we usually use containers to deploy our apps at STATWORX, I created a small test app with R Shiny to be saved in a test container. It is, of course, possible to store any other application with this automated script as well. I also created a repository at our blog GitHub, where you can find all files and the test app. Feel free to test and use any of its content. If you are interested in writing a setup script file yourself, note that it is possible to use alternative programming languages such as python as well.

the idea behind it

In essence, the script automates the manual Docker workflow, which usually looks something like this:
$ docker-machine ls
NAME          ACTIVE   DRIVER       STATE     URL   SWARM   DOCKER    ERRORS
Dataiku       -        virtualbox   Stopped                 Unknown   
default       -        virtualbox   Stopped                 Unknown   
ShowCase      -        virtualbox   Stopped                 Unknown   
SQLworkshop   -        virtualbox   Stopped                 Unknown   
TestMachine   -        virtualbox   Stopped                 Unknown   

$ docker-machine start TestMachine
Starting "TestMachine"...
(TestMachine) Check network to re-create if needed...
(TestMachine) Waiting for an IP...
Machine "TestMachine" was started.
Waiting for SSH to be available...
Detecting the provisioner...
Started machines may have new IP addresses. You may need to re-run the `docker-machine env` command.

$ eval $(docker-machine env --no-proxy TestMachine)

$ docker ps -a
CONTAINER ID        IMAGE               COMMAND             CREATED             
cfa02575ca2c        testimage           "/sbin/my_init"     2 weeks ago         
STATUS                            PORTS                    NAMES
Exited (255) About a minute ago   0.0.0.0:2000->3838/tcp   testcontainer

$ docker start testcontainer
testcontainer

$ docker ps
...
Building and rebuilding Docker images over and over again every time you make some changes to your application can get a little tedious, especially if you type in the same old commands all the time. Olli discussed the advantage of creating an intermediary image for the most time-consuming processes, like installing R packages, to speed things up during content creation. That’s an excellent practice, and you should try to do this for every viable project. But how about speeding up the containerization itself? A small helper tool is needed that, once it’s written, does all the work for you.

the tools to use

To create Docker images and containers, you need to install Docker on your computer. If you want to test or use all the material provided in this blog post and on our blog github, you should also install VirtualBox, R and RStudio, if you do not already have them. If you use Windows (10) as your Operating System, you also need to install the Windows Subsystem for Linux. Alternatively, you can create your own script files with PowerShell or something similar.
The tool itself is a bash/shell script that builds and runs Docker containers for you. All you have to do to use it is copy the docker_setup executable into your project directory and execute it. The only thing the tool requires from you afterwards is some naming input. If the execution fails or produces errors for some reason, try running the tool via the terminal:
source ./docker_setup
To replicate or start a new bash/shell script yourself, open your preferred text editor, create a new text file, place the shebang #!/bin/bash at the very top of it, and save it. Next, open your terminal, navigate to the directory where you just saved your script, and change its mode by typing chmod +x your_script_name. To test whether it works correctly, you can, for example, add the line echo 'it works!' below the shebang.
#!/bin/bash
echo 'it works!'
If you want to check out all available mode options, visit the wiki and for a complete guide visit the Linux Shell Scripting Tutorial.

the code that runs it

If you open the docker_setup executable with your preferred text editor, the code might feel a little overwhelming or confusing at first, but it is pretty straightforward.
#!/bin/bash

# This is a setup bash for docker containers!
# activate via: chmod 0755 setup_bash or chmod +x setup_bash
# navigate to wd docker_contents
# execute in terminal via source ./setup_bash

echo ""
echo "Welcome, You are executing a setup script bash for docker containers."
echo ""

echo "Do you want to use the Default from the global configurations?"
echo ""
source global_conf.sh
echo "machine name = $machine_name"
echo "container = $container_name"
echo "image = $image_name"
echo "app name = $app_name"
echo "password = $password_name"
echo ""

docker-machine ls
echo ""
read -p "What is the name of your docker-machine [default]? " machine_name
echo ""
if [[ "$(docker-machine status $machine_name 2> /dev/null)" == "" ]]; then
    echo "creating machine..." 
        && docker-machine create $machine_name
else
    echo "machine already exists, starting machine..." 
        && docker-machine start $machine_name
fi
echo ""
echo "activating machine..."
eval $(docker-machine env --no-proxy $machine_name)
echo ""

docker ps -a
echo ""
read -p "What is the name of your docker container? " container_name
echo ""

docker image ls
echo ""
read -p "What is the name of your docker image? (lower case only!!) " image_name
echo ""
The main code structure rests on nested if statements. Contrary to a manual docker setup via the terminal, the script needs to account for many different possibilities and even leave some error margin. The first if statement, for example – shown in the snippet above – checks whether a requested docker-machine already exists. If the machine does not exist, it will be created. If it does exist, it is simply started for usage. The utilised code elements or commands are even more straightforward. The echo command returns some sort of information or a blank line for better readability. The read command allows for user input to be read and stored as a variable, which in turn enters all further code instances necessary (see the sketch below for a variant with a default value). Most other code elements are docker commands and are essentially the same as the ones entered manually via the terminal. If you are interested in learning more about docker commands, check the documentation and Olli‘s awesome blog post.
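As a side note, the read pattern can be combined with a fallback, which is essentially what default values from a global configuration boil down to – a sketch, not part of the current script:

read -p "What is the name of your docker image? [myimage] " image_name
# fall back to the default if the user just presses enter
image_name=${image_name:-myimage}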

the git repository

The focal point of the Git repository at our blog GitHub is the automated docker setup, but it also contains some other conveniences and hopefully will grow into an entire collection of useful scripts and bashes. I am aware that there are potentially better, faster, and more convenient solutions for everything included in the repository, but if we view it as an exercise and a form of creative exchange, I think we can get some use out of it.
The docker_error_logs executable allows for quick troubleshooting and storage of log files if your program or app fails to work within your docker container. The git_repair executable is not fully tested yet and should be used with care. The idea is to quickly check whether your local project or repository is connected to a corresponding GitHub repository, given a URL address, and if not, to eventually ‘repair’ the connection. It can further manage git pulls, commits, and pushes for you, but again, please use it carefully.
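Conceptually, the core of such an error-log helper boils down to dumping a container’s logs into a file; for instance (the file name is an arbitrary choice):

docker logs testcontainer > docker_error_logs.txt 2>&1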

the next project to come

As mentioned, I plan on further expanding the collection and usefulness of our blog GitHub soon. In the next step, I will add more convenience to the docker setup by adding a separate file that provides the option to write and store default values for repeated executions. So stay tuned and visit our STATWORX Blog again soon. Until then, happy coding.

Since its release in 2014, Docker has become an essential tool for deploying applications. At STATWORX, R is part of our daily toolset. Clearly, many of us were thrilled to learn about RStudio’s Rocker Project, which makes containerizing R code easier than ever. Containerization is useful in a lot of different situations. To me, it is very helpful when I’m deploying R code in a cloud computing environment, where the coded workflow needs to be run on a regular schedule. Docker is a perfect fit for this task for two reasons: On the one hand, you can simply schedule a container to be started at your desired interval. On the other hand, you always know what behavior and what output to expect, because of the static nature of containers. So if you’re tasked with deploying a machine-learning model that should regularly make predictions, consider doing so with the help of Docker. This blog entry will guide you through the entire process of getting your R script to run in a Docker container, one step at a time. For the sake of simplicity, we’ll be working with a local dataset. I’d like to start off by emphasizing that this blog entry is not a general Docker tutorial. If you don’t really know what images and containers are, I recommend that you take a look at the Docker Curriculum first. If you’re interested in running an RStudio session within a Docker container, then I suggest you pay a visit to the OpenSciLabs Docker Tutorial instead. This blog post specifically focuses on containerizing an R script and eventually executing it automatically each time the container is started, without any user interaction – thus eliminating the need for the RStudio IDE. The syntax used in the Dockerfile and the command line will only be treated briefly here, so it’s best to get familiar with the basics of Docker before reading any further.

What we’ll need

For the entire procedure we’ll be needing the following:
  • An R script which we’ll build into an image
  • A base image on top of which we’ll build our new image
  • A Dockerfile which we’ll use to build our new image
You can clone all following files and the folder structure I used from the STATWORX GitHub Repository.
Building an R script into an image

The R script

We’re working with a very simple R script that imports a dataframe, manipulates it, creates a plot based on the manipulated data and, in the end, exports both the plot and the data it is based on. The dataframe used for this example is the US 500 Records dataset provided by Brian Dunning. If you’d like to work along, I’d recommend copying this dataset into the 01_data folder.
library(readr)
library(dplyr)
library(ggplot2)
library(forcats)

# import dataframe
df <- read_csv("01_data/us-500.csv")

# manipulate data
plot_data <- df %>%
  group_by(state) %>%
  count()

# save manipulated data to output folder
write_csv(plot_data, "03_output/plot_data.csv")

# create plot based on manipulated data
plot <- plot_data %>% 
  ggplot()+
  geom_col(aes(fct_reorder(state, n), 
               n, 
               fill = n))+
  coord_flip()+
  labs(
    title = "Number of people by state",
    subtitle = "From US-500 dataset",
    x = "State",
    y = "Number of people"
  )+ 
  theme_bw()

# save plot to output folder
ggsave("03_output/myplot.png", width = 10, height = 8, dpi = 100)
This creates a simple bar plot based on our dataset:
example plot from US-500 dataset
We use this script not only to run R code inside a Docker container; we also want to run it on data from outside our container and afterward save our results locally.

The base image

The DockerHub page of the Rocker project lists all available Rocker repositories. Seeing as we’re using Tidyverse-packages in our script the rocker/tidyverse image should be an obvious choice. The problem with this repository is that it also includes RStudio, which is not something we want for this specific project. This means that we’ll have to work with the r-base repository instead and build our own Tidyverse-enabled image. We can pull the rocker/r-base image from DockerHub by executing the following command in the terminal:
docker pull rocker/r-base
This will pull the Base-R image from the Rocker DockerHub repository. We can run a container based on this image by typing the following into the terminal:
docker run -it --rm rocker/r-base
Congratulations! You are now running R inside a Docker container! The terminal was turned into an R console, which we can now interact with thanks to the -it argument. The --rm argument makes sure the container is automatically removed once we stop it. You’re free to experiment with your containerized R session, which you can exit by executing the q() function from the R console. You could, for example, start installing the packages you need for your workflow with install.packages(), but that’s usually a tedious and time-consuming task. It is better to build your desired packages into the image right away, so you don’t have to bother with manually installing the packages you need every time you start a container. For that, we need a Dockerfile.

The Dockerfile

With a Dockerfile, we tell Docker how to build our new image. A Dockerfile is a text file that must be called “Dockerfile” (without a file extension) and by default is assumed to be located in the build-context root directory (which in our case would be the “R-Script in Docker” folder). First, we have to define the image on top of which we’d like to build ours. Depending on how we’d like our image to be set up, we give it a list of instructions so that running containers will be as smooth and efficient as possible. In this case, I’d like to base our new image on the previously discussed rocker/r-base image. Next, we replicate the local folder structure, so we can specify the directories we want in the Dockerfile. After that, we copy the files which we want our image to have access to into said directories – this is how you get your R script into the Docker image. Furthermore, this allows us to avoid having to manually install packages after starting a container, as we can prepare a second R script that takes care of the package installation. Simply copying that R script is not enough; we also need to tell Docker to automatically run it when building the image. And that’s our first Dockerfile!
# Base image https://hub.docker.com/u/rocker/
FROM rocker/r-base:latest

## create directories
RUN mkdir -p /01_data
RUN mkdir -p /02_code
RUN mkdir -p /03_output

## copy files
COPY /02_code/install_packages.R /02_code/install_packages.R
COPY /02_code/myScript.R /02_code/myScript.R

## install R-packages
RUN Rscript /02_code/install_packages.R
Don’t forget to prepare and save your appropriate install_packages.R script, where you specify which R packages you need to be pre-installed in your image. In our case, the file would look like this:
install.packages("readr")
install.packages("dplyr")
install.packages("ggplot2")
install.packages("forcats")

Building and running the image

Now we have assembled all necessary parts for our new Docker image. Use the terminal to navigate to the folder where your Dockerfile is located and build the image with
docker build -t myname/myimage .
The process will take a while due to the package installation. Once it’s finished we can test our new image by starting a container with
docker run -it --rm -v ~/"R-Script in Docker"/01_data:/01_data -v ~/"R-Script in Docker"/03_output:/03_output myname/myimage
Using the -v arguments signals Docker which local folders to map to the created folders inside the container. This is important because we want to both get our dataframe inside the container and save our output from the workflow locally, so it isn’t lost once the container is stopped. This container can now interact with our dataframe in the 01_data folder and has a copy of our workflow script inside its own 02_code folder. Telling R to source("02_code/myScript.R") will run the script and save the output into the 03_output folder, from where it will also be copied to our local 03_output folder.
running a container based on our image

Improving on what we have

Now that we have tested and confirmed that our R script runs as expected when containerized, there are only a few things missing.
  1. We don’t want to manually have to source the script from inside the container, but have it run automatically whenever the container is started.
We can achieve this very easily by simply adding the following command to the end of our Dockerfile:
## run the script
CMD Rscript /02_code/myScript.R
This points towards the location of our script within the folder structure of our container, marks it as R code and then tells it to run whenever the container is started. Making changes to our Dockerfile, of course, means that we have to rebuild our image and that in turn means that we have to start the slow process of pre-installing our packages all over again. This is tedious, especially if chances are that there will be further revisions of any of the components of our image down the road. That’s why I suggest we
  2. Create an intermediary Docker image where we install all important packages and dependencies so that we can then build our final, desired image on top.
This way we can quickly rebuild our image within seconds, which allows us to freely experiment with our code without having to sit through Docker installing packages over and over again.

Building an intermediary image

The Dockerfile for our intermediary image looks very similar to our previous example. Because I decided to modify my install_packages.R script to include the entire tidyverse for future use, I also needed to install a few Debian packages the tidyverse depends upon. Not all of these are 100% necessary, but all of them should be useful in one way or another.
# Base image https://hub.docker.com/u/rocker/
FROM rocker/r-base:latest

## install debian packages
RUN apt-get update -qq && apt-get -y --no-install-recommends install \
    libxml2-dev \
    libcairo2-dev \
    libsqlite3-dev \
    libmariadbd-dev \
    libpq-dev \
    libssh2-1-dev \
    unixodbc-dev \
    libcurl4-openssl-dev \
    libssl-dev

## copy files
COPY 02_code/install_packages.R /install_packages.R

## install R-packages
RUN Rscript /install_packages.R
I build the image by navigating to the folder where my Dockerfile sits and executing the Docker build command again:
docker build -t oliverstatworx/base-r-tidyverse .
I have also pushed this image to my DockerHub, so if you ever need a base R image with the tidyverse pre-installed, you can simply build on top of my image without having to go through the hassle of building it yourself. Now that the intermediary image has been built, we can change our original Dockerfile to build on top of it instead of rocker/r-base and remove the package installation because our intermediary image already takes care of that. We also add the last line that automatically starts running our script whenever the container is started. Our final Dockerfile should look something like this:
# Base image https://hub.docker.com/u/oliverstatworx/
FROM oliverstatworx/base-r-tidyverse:latest

## create directories
RUN mkdir -p /01_data
RUN mkdir -p /02_code
RUN mkdir -p /03_output

## copy files
COPY /02_code/myScript.R /02_code/myScript.R

## run the script
CMD Rscript /02_code/myScript.R

The final touches

Since we built our image on top of an intermediary image with all our needed packages, we can now easily modify parts of our final image to our liking. I like making my R script less verbose by suppressing warnings and messages that are not of interest anymore (since I already tested the image and know that everything works as expected) and adding messages that tell the user which part of the script is currently being executed by the running container.
suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(forcats))

options(scipen = 999,
        readr.num_columns = 0)

print("Starting Workflow")

# import dataframe
print("Importing Dataframe")
df <- read_csv("01_data/us-500.csv")

# manipulate data
print("Manipulating Data")
plot_data <- df %>%
  group_by(state) %>%
  count()

# save manipulated data to output folder
print("Writing manipulated Data to .csv")
write_csv(plot_data, "03_output/plot_data.csv")

# create plot based on manipulated data
print("Creating Plot")
plot <- plot_data %>% 
  ggplot()+
  geom_col(aes(fct_reorder(state, n), 
               n, 
               fill = n))+
  coord_flip()+
  labs(
    title = "Number of people by state",
    subtitle = "From US-500 dataset",
    x = "State",
    y = "Number of people"
  )+ 
  theme_bw()

# save plot to output folder
print("Saving Plot")
ggsave("03_output/myplot.png", width = 10, height = 8, dpi = 100)
print("Worflow Finished")
After navigating to the folder where our Dockerfile is located, we rebuild our image once more with docker build -t myname/myimage . Once again, we start a container based on our image and map the 01_data and 03_output folders to our local directories. This way, we can import our data and save our created output locally:
docker run -it --rm -v ~/"R-Script in Docker"/01_data:/01_data -v ~/"R-Script in Docker"/03_output:/03_output myname/myimage
Congratulations, you now have a clean Docker image that not only automatically runs your R script whenever a container is started, but also tells you exactly which part of the code it is executing via console messages. Happy docking!