Speed up R Workflows on GitHub using Docker
Overview
To run an R script on GitHub, there has to be a suitable R environment in which all required packages are available. One approach is to set up the R environment from scratch every time the workflow runs. However, this setup takes a lot of computation time, which isn't free on GitHub. An alternative approach is to create a Docker container (including all required software and packages) and then execute the script within this container without regenerating it. This takes far less computation time.
The usual way of running an R script
Assume you have a GitHub repository with a script and a corresponding workflow file that runs it, e.g.:
├── .github
│   └── workflows
│       └── run_script.yml
└── script.R
Usually, you write the commands that install all dependencies of script.R into the run_script.yml file. These can be system dependencies (e.g. R itself) or R package dependencies (e.g. the tidyverse).
GitHub will then set up a virtual machine every time the script is run. This can take a really long time for multiple reasons. One reason is that most R packages are compiled from source code, which is very time-consuming.
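For reference, a conventional run_script.yml of this kind might look roughly like the sketch below. The action names (`r-lib/actions/setup-r` is the commonly used helper for installing R) and the package list are illustrative, not taken from this project:

```yaml
name: run the script
on: [workflow_dispatch]
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: r-lib/actions/setup-r@v2   # installs R itself on the runner
      - name: Install R packages         # compiled from source on Linux: slow
        run: install.packages(c("tidyverse", "rmarkdown"))
        shell: Rscript {0}
      - run: source("script.R")
        shell: Rscript {0}
```

Every step before the final one is pure environment setup, repeated on every run; that is the work we are about to move into a Docker image.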
What we are going to do
Let’s first look at the resulting changes:
├── Dockerfile
├── .github
│   └── workflows
│       ├── run_script.yml
│       └── upload_docker.yml
└── script.R
The script.R file will stay the same, but we will change run_script.yml to run the script inside a Docker container, so we no longer set up the environment each time the script runs. In the Dockerfile we define said environment, in a similar way to how we previously did it in run_script.yml. The workflow defined by upload_docker.yml will be used to generate a Docker image from the Dockerfile and upload it to the GitHub repository.
Setting up the Dockerfile
To get a Docker image onto GitHub, a Dockerfile has to be written.
Use the r-base Docker image by the Rocker project as your base image.
FROM r-base:latest
Any libraries used in the scripts should be installed here. Make sure to go through the script you are trying to run and list all the libraries it uses.
For example:
RUN R -e \
    'install.packages(c("tinytex", "profvis", "codebook", \
    "tidyverse", "pander", "haven", "rmarkdown", \
    "knitr", "flextable", "showtext", "mailR", "lubridate", "stringr"), \
    "/usr/lib/R/site-library", \
    Ncpus = 4)'
To test the Dockerfile locally, install Docker on your computer. Once it is installed, run
docker build -t script_r_environment .
in the directory containing your Dockerfile. This builds the image and names it "script_r_environment" (you can pick another name). If it does not throw any errors, you are good to go.
Unfortunately, this will probably throw errors at first, since many R packages have hidden system dependencies (installed using apt, since r-base is a Debian-based image). Make sure to find these by reading the error logs and doing some online research.
Note that figuring out which packages are needed was the hardest part of this tutorial, but it is required even for setups that don't use Docker.
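A quick (if imperfect) heuristic for compiling that package list is to grep the scripts for `library()`/`require()` calls. This sketch misses packages used via `pkg::fn` or loaded dynamically, so still skim the code afterwards:

```shell
# List every package referenced via library()/require() in the R scripts,
# as a starting point for the install.packages() call in the Dockerfile.
# (Misses pkg::fn usage and dynamically loaded packages.)
grep -hoE "(library|require)\([A-Za-z0-9.]+\)" *.R \
  | sed -E 's/^(library|require)\(//; s/\)$//' \
  | sort -u
```

Each resulting name then needs to appear in the `install.packages()` vector of the Dockerfile.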
Let’s take a look at the complete Dockerfile used in our project.
FROM r-base:latest

# tex packages are installed in /root/bin, so we have to make sure those
# packages are accessible by adding that directory to the PATH variable.
ENV PATH="$PATH:/root/bin"

# system dependencies (apt-get since r-base is Debian-based)
RUN apt-get update; \
    apt-get install -y build-essential gfortran \
    libapparmor-dev libboost-all-dev libcairo2-dev libcurl4-gnutls-dev \
    libfontconfig1-dev libgsl-dev libjpeg-dev liblapack-dev libpng-dev \
    libproj-dev libsodium-dev libssl-dev libudunits2-dev libxml2-dev \
    mesa-common-dev libglu1-mesa-dev libharfbuzz-dev libfribidi-dev \
    default-jre default-jdk pandoc git gnupg

# R dependencies
RUN R -e \
    'install.packages("remotes", "/usr/lib/R/site-library", Ncpus = 4); \
    remotes::install_github("rubenarslan/formr", upgrade = FALSE); \
    install.packages(c("tinytex", "profvis", "codebook", "tidyverse", \
    "pander", "haven", "rmarkdown", "knitr", "flextable", \
    "showtext", "mailR", "lubridate", "stringr"), \
    "/usr/lib/R/site-library", \
    Ncpus = 4)'

# install tinytex for rmarkdown
RUN R -e 'tinytex::install_tinytex()'

# rmarkdown tex dependencies
RUN tlmgr install inter titling lastpage fancyhdr setspace \
    colortbl multirow wrapfig dejavu
In addition to system packages and R packages, some tex packages (required for creating PDF documents) are installed using tlmgr. Installing tex packages is optional, since they are automatically installed at runtime if needed. However, baking them into the Docker image reduces the runtime.
Defining a workflow to create and upload the image
You need a GitHub workflow that creates the image and uploads it to your GitHub repository. To do that, let's use a workflow file found here. The only difference is that the workflow should be run on dispatch.
# This workflow uses actions that are not certified by GitHub.
# They are provided by a third party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

# GitHub recommends pinning actions to a commit SHA.
# To get a newer version, you will need to update the SHA.
# You can also reference a tag or branch, but the action may change without warning.

name: Create and publish a Docker image

on: [workflow_dispatch]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  build-and-push-image:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Log in to the Container registry
        uses: docker/login-action@f054a8b539a109f9f41c372932f1ae047eff08c9
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@98669ae865ea3cffbcbaa878cf57c20bbf1c6c38
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
      - name: Build and push Docker image
        uses: docker/build-push-action@ad44023a93711e3deb337508980b4b5e9bcdc5dc
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
Generating the Docker image on GitHub
Put the Dockerfile in the root directory and add the new workflow file:
├── Dockerfile
├── .github
│   └── workflows
│       ├── run_script.yml
│       └── upload_docker.yml
└── script.R
Now open your GitHub repository in a browser. Under the Actions tab, click on the upload_docker workflow and run it. The workflow takes around 20 minutes; afterwards the image is generated and available in your GitHub repository.
Running the script
Let's now create a workflow file (or change the previous version). Instead of setting up the whole environment in the workflow file, the script is run in the container we previously defined. Moving the script into the container greatly decreases runtime and makes the workflow run more reliably.
name: run the script

on: # whatever you had previously

jobs:
  generate-data:
    runs-on: ${{ matrix.config.os }}
    container: ghcr.io/${{ github.repository }}:main # put in the correct branch name
    name: ${{ matrix.config.os }}
    strategy:
      fail-fast: false
      matrix:
        config:
          - {os: ubuntu-latest}
    steps:
      # Checks out your repository under $GITHUB_WORKSPACE, so your job can access it
      - uses: actions/checkout@v2
        with:
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Run Script
        run: |
          source("script.R")
        shell: Rscript {0}
Notice that you can run as many scripts as you like. To do that, simply add more steps to the workflow file:
steps:
  # Checks out your repository under $GITHUB_WORKSPACE, so your job can access it
  - uses: actions/checkout@v2
    with:
      token: ${{ secrets.GITHUB_TOKEN }}
  - name: Run Script1
    run: |
      source("script1.R")
    shell: Rscript {0}
  - name: Run Script2
    run: |
      source("script2.R")
    shell: Rscript {0}
  - name: Run Script3
    run: |
      source("script3.R")
    shell: Rscript {0}
In theory we are now done. However, there could still be hidden runtime dependencies. Make sure to read the logs on GitHub to determine what software is missing.
Alternatively, test the script locally (on your computer) in Docker by running the following command in the terminal:
docker run -v "$PWD:$PWD" -w "$PWD" script_r_environment Rscript script.R
Make sure you are in the same directory as script.R when running the command, and that script_r_environment is up to date (otherwise rerun the command that created it).
Extra: Committing a generated file to GitHub from the container
In case your script generates a file you want to upload to GitHub, you have to take some extra steps. In our case we wanted to upload some PDF files.
First, make sure git is installed in the Docker container by adding it as a system dependency in the Dockerfile and generating the image again.
In run_script.yml, add the following lines after the 'Run Script' step:
- name: Run Script
  run: |
    source("script.R")
  shell: Rscript {0}
- run: git config --global --add safe.directory "$GITHUB_WORKSPACE"
- name: Commit files
  run: |
    git config --local user.email "actions@github.com"
    git config --local user.name "GitHub Actions"
    git add .
    git diff-index --quiet HEAD || (git commit -m "upload pdfs" && git push)
Maintaining the repo
If you add more libraries to the R scripts, you have to rebuild the Docker container, i.e. update the Dockerfile and run the upload_docker workflow again.
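If you would rather not trigger the rebuild by hand, the `on:` trigger of upload_docker.yml can be extended so the image is rebuilt automatically whenever the Dockerfile changes. This is a sketch using the standard `push.paths` filter of GitHub Actions; keeping `workflow_dispatch` as a fallback is a good idea:

```yaml
on:
  workflow_dispatch:
  push:
    paths:
      - Dockerfile
```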
Conclusion
In our case, we reduced the time it takes to run the scripts from 12 minutes to less than 2 minutes.
An added benefit is much greater stability. While setting up the environment from scratch often fails for various reasons, running in a prebuilt Docker container rarely does.