Speed up R Workflows on GitHub using Docker

Overview

To run an R script on GitHub, there has to be a suitable R environment in which all required packages are available. One approach is to set up the R environment each time the workflow runs (see e.g. here). However, setting up the environment takes a lot of computation time, which isn't free on GitHub. An alternative approach is to create a Docker container (including all required software and packages) and then execute the script within this container without regenerating it. This takes far less computation time.

The usual way of running an R script

Assume you have a GitHub repository with a script and a corresponding workflow file that runs it, e.g.:

├── .github
│   └── workflows
│       └── run_script.yml
└── script.R

Then you usually write the commands to install all dependencies of script.R into the run_script.yml file. These could be system dependencies (e.g. R itself) or R dependencies (e.g. the tidyverse).

GitHub will then set up a virtual machine every time the script is run. This can take a really long time for multiple reasons. One reason is that most R packages are compiled from source code, which is very time consuming.
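For illustration, the per-run setup described above might look roughly like this (a sketch: r-lib/actions/setup-r is a real action, but the package list is a placeholder for whatever script.R actually needs):

```yaml
# .github/workflows/run_script.yml -- the slow, Docker-free variant
name: run the script

on: [workflow_dispatch]

jobs:
  generate-data:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      # installs R itself on the fresh virtual machine
      - uses: r-lib/actions/setup-r@v2

      # compiles all R dependencies from source -- this is the slow part
      - name: Install packages
        run: install.packages(c("tidyverse", "rmarkdown"), repos = "https://cloud.r-project.org")
        shell: Rscript {0}

      - name: Run Script
        run: source("script.R")
        shell: Rscript {0}
```

Every step after checkout is repeated on every single run, which is exactly the cost we are about to eliminate.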

What we are going to do

Let’s first look at the resulting changes:

├── Dockerfile
├── .github
│   └── workflows
│       ├── run_script.yml
│       └── upload_docker.yml
└── script.R

The script.R file will stay the same. But we will change run_script.yml to run the script inside a Docker container. We will no longer set up the environment each time we run the script.

In the Dockerfile we define said environment, in a similar way to how we previously did in run_script.yml.

The workflow defined by upload_docker.yml will be used to build a Docker image from the Dockerfile and publish it to the GitHub Container Registry of the repository.

Setting up the Dockerfile

To get a Docker image into GitHub, a Dockerfile has to be written first.

Use the r-base Docker image (maintained by the Rocker project) as your base image.

FROM r-base:latest

Any libraries used in the scripts should be installed here. Make sure to go through the script you are trying to run, and list all the libraries used.

For example:

RUN R -e \
    'install.packages(c("tinytex", "profvis", "codebook", \
        "tidyverse", "pander", "haven", "rmarkdown", \
        "knitr", "flextable", "showtext", "mailR", "lubridate", "stringr"), \
    "/usr/lib/R/site-library", \
    Ncpus = 4)'
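A rough way to collect this list is to grep the scripts for library() and require() calls. This is only a quick sketch: it misses quoted package names and namespaced usages like pkg::fun, so still check the scripts manually:

```shell
# list all packages loaded via library() or require() in the R scripts
grep -hoE '(library|require)\([A-Za-z0-9._]+\)' *.R | sort -u
```

Run it in the directory containing your R scripts and translate the result into the install.packages() call above.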

To test the Dockerfile locally, install Docker on your computer. Once it is installed, run

docker build -t script_r_environment .

in the directory of your Dockerfile. This builds the image and names it “script_r_environment”, but you can name it something else. If it does not throw any errors you are good to go.

Unfortunately, this will probably throw errors, since many R packages have hidden system dependencies (installed using apt, since r-base is a Debian image). Make sure to find these by reading the error logs and doing online research.
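Two well-known examples of such hidden dependencies (your script's list will differ): the R packages curl and xml2 need system headers to compile, which on a Debian-based image are installed like this:

```dockerfile
# system headers required to compile the R packages 'curl' and 'xml2'
RUN apt-get update && apt-get install -y libcurl4-gnutls-dev libxml2-dev
```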

Note that figuring out which system packages are needed was the hardest part of this tutorial, but it is required even for setups that don't use Docker.

Let’s take a look at the complete Dockerfile used in our project.

FROM r-base:latest

# tex packages are installed in /root/bin, so we have to make sure those
# packages are accessible by adding that directory to the PATH variable.
ENV PATH="$PATH:/root/bin"

# system dependencies (apt-get since r-base is debian based)
RUN apt-get update; \
    apt-get install -y build-essential gfortran \
    libapparmor-dev libboost-all-dev libcairo2-dev libcurl4-gnutls-dev \
    libfontconfig1-dev libgsl-dev libjpeg-dev liblapack-dev libpng-dev \
    libproj-dev libsodium-dev libssl-dev libudunits2-dev libxml2-dev \
    mesa-common-dev libglu1-mesa-dev libharfbuzz-dev libfribidi-dev \
    default-jre default-jdk pandoc git gnupg

# R dependencies
RUN R -e \
    'install.packages("remotes", "/usr/lib/R/site-library", Ncpus = 4); \
        remotes::install_github("rubenarslan/formr", upgrade = FALSE); \
        install.packages(c("tinytex", "profvis", "codebook", "tidyverse", \
            "pander", "haven", "rmarkdown", "knitr", "flextable", \
            "showtext", "mailR", "lubridate", "stringr"), \
        "/usr/lib/R/site-library", \
        Ncpus = 4)'

# install tinytex for rmarkdown
RUN R -e \
    'tinytex::install_tinytex();'

# rmarkdown tex dependencies
RUN tlmgr install inter titling lastpage fancyhdr setspace \
    colortbl multirow wrapfig dejavu

In addition to system packages and R packages, there are some tex packages (required for creating PDF documents) installed using tlmgr. Installing tex packages is optional since they are automatically installed at runtime if needed. However, putting them inside the docker container reduces the runtime.

Defining a workflow to create and upload the image

You need a GitHub workflow that builds the image and uploads it to your repository's container registry. To do that, let's use a workflow file found here. The only difference is that our workflow runs on manual dispatch.

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

# GitHub recommends pinning actions to a commit SHA.
# To get a newer version, you will need to update the SHA.
# You can also reference a tag or branch, but the action may change without warning.

name: Create and publish a Docker image

on: [workflow_dispatch]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-push-image:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Log in to the Container registry
        uses: docker/login-action@f054a8b539a109f9f41c372932f1ae047eff08c9
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@98669ae865ea3cffbcbaa878cf57c20bbf1c6c38
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

      - name: Build and push Docker image
        uses: docker/build-push-action@ad44023a93711e3deb337508980b4b5e9bcdc5dc
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

Generating the Docker image in GitHub

Put the Dockerfile in the root and add the other workflow file:

├── Dockerfile
├── .github
│   └── workflows
│       ├── run_script.yml
│       └── upload_docker.yml
└── script.R

Now open your GitHub repository in a browser. Under the Actions tab, click on the upload_docker workflow to run it. The workflow takes around 20 minutes; afterwards the image is built and published to your repository's container registry.

Running the script

Let's now create a workflow file (or change the previous version). Instead of setting up the whole environment in the workflow file, the script is run in the container you have previously defined. Running the script inside the container greatly decreases runtime and makes the workflow run more reliably.

name: run the script

on: #whatever you had previously

jobs:
  generate-data:
    runs-on: ${{ matrix.config.os }}
    container: ghcr.io/${{ github.repository }}:main # put in the correct branch name
    name: ${{ matrix.config.os }} 
    
    strategy:
      fail-fast: false
      matrix:
        config:
          - {os: ubuntu-latest}
    
    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2
        with:
          token: ${{ secrets.GITHUB_TOKEN }}

      - name: Run Script
        run: |
          source("script.R")
        shell: Rscript {0}

Notice that you can run as many scripts as you like. To do that, simply add more steps to the workflow file:

    steps:
      # Checks-out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2
        with:
          token: ${{ secrets.GITHUB_TOKEN }}

      - name: Run Script1
        run: |
          source("script1.R")
        shell: Rscript {0}

      - name: Run Script2
        run: |
          source("script2.R")
        shell: Rscript {0}

      - name: Run Script3
        run: |
          source("script3.R")
        shell: Rscript {0}

In theory we are now done. However, there could be hidden runtime dependencies. Make sure to read the logs in GitHub to determine the missing software.

Alternatively, test the script locally (on your computer) in Docker by running the following code in the terminal:

docker run -v "$PWD:$PWD" -w "$PWD" script_r_environment Rscript script.R

Make sure you're in the same directory as script.R when running the command, and make sure script_r_environment is up to date (otherwise run the build command again).

Extra: Committing generated file to Github from container

In case your script generates a file you want to upload to GitHub, you have to take some extra steps. In our case we wanted to upload some PDF files.

First, make sure git is installed in the Docker container by adding it as a system dependency in the Dockerfile and rebuilding the image.
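If your Dockerfile doesn't already install it (the complete Dockerfile shown earlier does, via apt-get), the addition is a single instruction:

```dockerfile
# git is needed inside the container so the workflow can commit and push
RUN apt-get update && apt-get install -y git
```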

In run_script.yml add the following lines after the ‘Run Script’ step:

    
      - name: Run Script
        run: |
          source("script.R")
        shell: Rscript {0}

      - run: git config --global --add safe.directory "$GITHUB_WORKSPACE"

      - name: Commit files
        run: |
          git config --local user.email "actions@github.com"
          git config --local user.name "GitHub Actions"
          git add .
          git diff-index --quiet HEAD || (git commit -m "upload pdfs" && git push)

Maintaining the repo

If you add more libraries to the R scripts, you have to rebuild the Docker image, i.e. update the Dockerfile and run the upload_docker workflow again.
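For example, if a script starts using one additional package (glue here is just an illustration), the Dockerfile gains one more RUN instruction before you rerun the upload_docker workflow:

```dockerfile
# new R dependency added after the initial image was built
RUN R -e 'install.packages("glue", "/usr/lib/R/site-library", Ncpus = 4)'
```

Appending a new instruction at the end (rather than editing the existing install.packages call) lets Docker reuse the cached earlier layers, so the rebuild is faster.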

Conclusion

In our case, this reduced the time it takes to run the scripts from 12 minutes to less than 2 minutes.

An added benefit is much better stability. While setting up the environment from scratch often fails for various reasons, running a prebuilt Docker container rarely does.
