Tianjian Qin

Integrating Pre-Trained PyTorch Models into Your R Package

Published on 20 Jun 2024, 15 min read
Image Credit: leonardo.ai

Integrating pre-trained PyTorch models into R can significantly enhance the flexibility and power of your R projects. This tutorial walks you through the process, using my R package EvoNN as an example. Before we dive in, ensure you have the latest versions of R and Python installed on your machine. Familiarity with Python virtual environments, PyTorch, and basic R package development is assumed. If you are new to these concepts, refer to the Python and PyTorch documentation. I also recommend reading Hadley Wickham's R package development guide beforehand.

Maintaining Python Dependencies

In EvoNN, we manually maintain a list of Python library dependencies (pkglist.csv) in the /inst directory. This list is distributed along with the package.

The list records the name and pinned version of each dependency, covering PyTorch and PyTorch Geometric.
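
As a rough illustration of how such a list can be consumed, the sketch below parses a small CSV and builds pinned requirement strings. The `package` and `version` columns and the leading index column mirror how the file is read in the .onLoad() example of this tutorial. The version numbers are placeholders, not EvoNN's actual pins, and `read_requirements` is a hypothetical helper.

```python
import csv
import io

# Hypothetical contents; "0.0.0" is a placeholder, not a real pinned version.
PKGLIST_CSV = """\
"","package","version"
"1","torch","0.0.0"
"2","torch_geometric","0.0.0"
"""


def read_requirements(csv_text):
    """Return pip-style 'name==version' strings from the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [f"{row['package']}=={row['version']}" for row in reader]


requirements = read_requirements(PKGLIST_CSV)
print(requirements)
```

Pinning exact versions keeps the environment reproducible, at the cost of having to update the list by hand whenever the model is retrained against newer libraries.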

Storing Pretrained Models and Python Scripts

The /inst directory also stores pre-trained neural network models (weights.pt) and Python scripts (import.py and function.py) that contain the necessary libraries, functions defining the neural network architecture, and data loading mechanisms.

For an intuitive impression of how the EvoNN package is structured, you can play around with the interactive tree viewer below to see the core files and their relative locations.

Preparing Python Source Files

We can divide our Python scripts into two parts. The first part, import.py, contains the necessary Python libraries to load the pre-trained model and perform the neural network estimation. The second part, function.py, contains the Python function that performs the neural network estimation.

For example, in the import.py file you can write this:

# import.py
import torch
import torch_geometric
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

And in the function.py file you can write something like this:

# function.py
# Not working code, demonstration purposes only
def py_function(py_tree):
    def create_dataset(tree):
        # Define the dataset creation process
        return dataset

    py_dataset = create_dataset(py_tree)

    # Define the neural network architecture
    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = GCNConv(16, 16)
            self.conv2 = GCNConv(16, 16)
            self.fc = nn.Linear(16, 1)

        def forward(self, data):
            x, edge_index = data.x, data.edge_index
            x = F.relu(self.conv1(x, edge_index))
            x = F.relu(self.conv2(x, edge_index))
            x = F.dropout(x, training=self.training)
            x = self.fc(x)
            return x

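    # Load the pre-trained model
    # Note: a bare "weights.pt" is resolved against the current working
    # directory; real code should pass the full path to the file in /inst
    model = Net()
    model.load_state_dict(torch.load("weights.pt", map_location="cpu"))
    model.eval()

    # Perform neural network estimation without tracking gradients
    with torch.no_grad():
        out = model(py_dataset)
    # Convert to a data dict of numpy arrays
    out = convert_to_numpy(out)
    return out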
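
The call that loads "weights.pt" above uses a relative path, which Python resolves against the process working directory rather than the package's installation folder. Below is a minimal sketch of one way around this, assuming the caller passes in the directory that holds the weights. The helper name `resolve_weights` and its parameters are hypothetical, not part of EvoNN.

```python
from pathlib import Path


def resolve_weights(model_dir, filename="weights.pt"):
    """Return an absolute path to the weights file, or raise if it is missing."""
    path = (Path(model_dir) / filename).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"weights file not found: {path}")
    return path
```

The returned path could then be handed to torch.load() in place of the bare file name. Failing early with a clear message is usually easier to debug from R than an error raised deep inside the model-loading code.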

You can of course combine the two parts or split the script in a completely different way; it is a matter of personal taste. The code will work as long as you source the snippets in the correct order, that is, the same order in which they appeared in the original (working) script.
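
To see why the order matters, here is a small sketch, assuming that sourced snippets are executed one after another in a single shared namespace. The snippets and the `run_in_order` helper are made up for illustration and stand in for import.py and function.py.

```python
IMPORT_SRC = "import math"

FUNCTION_SRC = """
TAU = 2 * math.pi  # evaluated immediately, so `math` must already exist

def circumference(radius):
    return TAU * radius
"""


def run_in_order(*sources):
    """Execute source snippets one after another in one shared namespace."""
    namespace = {}
    for src in sources:
        exec(src, namespace)
    return namespace


# Correct order: the import runs first, so the second snippet can use `math`.
ns = run_in_order(IMPORT_SRC, FUNCTION_SRC)
print(ns["circumference"](1.0))

# Wrong order: the top-level statement in the second snippet fails.
try:
    run_in_order(FUNCTION_SRC, IMPORT_SRC)
except NameError as err:
    print("wrong order:", err)
```

Names used only inside a function body are looked up when the function is called, so some reorderings happen to work. Keeping the original order avoids having to reason about which ones do.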

Environment Setup with zzz.R and .onLoad()

You might have noticed the weird zzz.R file. By convention, the .onLoad() function is defined in this file. R calls this function every time our package is loaded.

The .onLoad() function has two main tasks:

  • Check the Python environment and install the required dependencies if necessary.
  • Import Python functions into R.

If no virtual environment exists, we create a new one named "EvoNN" and install the required dependencies. If the virtual environment already exists, we need to verify that the necessary packages are installed with the correct versions. If everything is in order, we can activate the existing virtual environment directly.
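
The version check described above boils down to comparing two name-to-version mappings. The sketch below shows one way to do that comparison. It is not EvoNN's implementation: `find_mismatched` and `normalize` are hypothetical helpers, and the version numbers are placeholders.

```python
import re


def normalize(name):
    """Normalize a distribution name as in PEP 503 (case, '-', '_', '.')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_mismatched(installed, required):
    """Return 'name==version' pins for packages that are missing or differ.

    Both arguments map package names to version strings.
    """
    have = {normalize(name): version for name, version in installed.items()}
    return [
        f"{name}=={version}"
        for name, version in required.items()
        if have.get(normalize(name)) != version
    ]


# Placeholder versions for illustration only.
required = {"torch": "0.0.2", "torch_geometric": "0.0.1"}
installed = {"torch": "0.0.1", "torch-geometric": "0.0.1", "numpy": "0.0.3"}
mismatched = find_mismatched(installed, required)
print(mismatched)
```

Note that this compares version strings exactly, so a build that reports a local suffix after the version number would count as a mismatch. Whether that is desirable depends on how strictly the pre-trained weights depend on the library version.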

Below is a simplified version of the .onLoad() function:

# zzz.R
# Not working code, demonstration purposes only
py_function_to_r <- NULL # Global variable to store imported Python function

.onLoad <- function(libname, pkgname){
  # Read package version list
  pkglist <- utils::read.csv(system.file("pkglist.csv", package = "EvoNN"), row.names = 1)
  install_list <- paste0(pkglist$package, "==", pkglist$version)

  # Check if the EvoNN virtual environment exists
  env_exists <- reticulate::virtualenv_exists("EvoNN")
  if (env_exists) {
    # Check if the virtual environment has the required versions of the packages
    current_pkgs <- reticulate::py_list_packages("EvoNN")
    mismatched_pkgs <- compare_pkgs(current_pkgs, pkglist)

    # Reinstall the packages if they do not match
    if (length(mismatched_pkgs) > 0) {
      reticulate::virtualenv_install("EvoNN", packages = mismatched_pkgs)
    }
  } else {
    # Create the virtual environment if it does not exist
    reticulate::virtualenv_create("EvoNN", packages = install_list)
  }

  # Use the EvoNN virtual environment
  # Note that here we must explicitly set required = TRUE
  reticulate::use_virtualenv("EvoNN", required = TRUE)
  # Import Python dependencies
  reticulate::source_python(system.file(paste0("model/", "import.py"), package = "EvoNN"))
  # Import Python functions
  reticulate::source_python(system.file(paste0("model/", "function.py"), package = "EvoNN"))
  # Assign the function needed (assume it is defined as py_function) to the global environment
  py_function_to_r <<- reticulate::py$py_function
}

In the above code block this line is very important:

reticulate::use_virtualenv("EvoNN", required = TRUE)

Here we explicitly set required = TRUE because the default behavior of this argument in use_virtualenv(), use_python(), and use_condaenv() depends on where the function is called. In most contexts the default is equivalent to required = TRUE, but not inside .onLoad().

If we do not set required = TRUE, our on-load script may fail if users have already initialized their own Python environment in which the dependencies are missing.

Embedding Imported Python Function into R Function

So far, we have successfully imported the Python function into R. The next step is to embed the imported Python function into an R function. This R function will be the interface through which users call the Python function.

In EvoNN, we defined an nn_estimate() function that loads a phylogenetic tree into R, converts it to the desired data format, and then uses reticulate::r_to_py() to convert the R objects to Python objects. These objects are passed to the imported Python function, which returns the neural network estimates to R.

Below is a simplified version of the nn_estimate() function:

# function.R
# Not working code, demonstration purposes only
nn_estimate <- function(phylo_tree) {
    # Load Phylo Tree
    phylo_tree <- function_to_read_tree(phylo_tree)
    # Convert Phylo Tree to Desired Data Format
    phylo_tree <- function_to_convert_tree(phylo_tree)
    # Convert R object to Python object
    py_tree <- reticulate::r_to_py(phylo_tree)
    # Call the Imported Python function
    out <- py_function_to_r(py_tree)
    return(out)
}
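
On the Python side, a named R list typically arrives as a dict, so it can help to validate the payload before it reaches the model. The sketch below assumes a dict with the fields `x` and `edge_index`, names borrowed from the forward() method shown earlier. EvoNN's real data format may differ, and `check_tree_payload` is a hypothetical helper.

```python
def check_tree_payload(payload):
    """Validate a dict with node features `x` and an edge list `edge_index`.

    Returns the number of nodes and edges if the payload is consistent.
    """
    for key in ("x", "edge_index"):
        if key not in payload:
            raise KeyError(f"missing field: {key}")

    x = payload["x"]                           # one feature row per node
    sources, targets = payload["edge_index"]   # two parallel lists of node ids

    if len(sources) != len(targets):
        raise ValueError("edge_index rows must have equal length")

    n_nodes = len(x)
    for node in list(sources) + list(targets):
        # Python indices start at 0, so valid ids are 0 .. n_nodes - 1.
        if not 0 <= node < n_nodes:
            raise ValueError(f"edge refers to unknown node {node}")

    return {"n_nodes": n_nodes, "n_edges": len(sources)}


payload = {"x": [[0.1], [0.2], [0.3]], "edge_index": [[0, 1], [1, 2]]}
summary = check_tree_payload(payload)
print(summary)
```

A check like this is a convenient place to catch off-by-one mistakes, since R indexes from 1 while Python indexes from 0.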

Passing the R CMD Check

After completing the above steps, run R CMD check to ensure your package is free of errors and warnings. By following this guide, Python integration won't be an obstacle. However, if you encounter any problems during the R CMD check, refer to Using reticulate in an R Package for possible solutions, especially if you plan to upload your package to CRAN.

In the End

The implementation of Python integration into an R package using reticulate is highly flexible. Here, we introduced a manual example, but you can also let the reticulate package automatically configure the Python environment for you by adding some lines to your R package's DESCRIPTION file. Below is an example:

Package: rscipy
Title: An R Interface to scipy
Version: 1.0.0
Description: Provides an R interface to the Python package scipy.
Config/reticulate:
  list(
    packages = list(
      list(package = "scipy")
    )
  )
< ... other fields ... >

Installation

However, the autoconfiguration will always install miniconda to manage Python dependencies, which might not be the best practice for small projects or if we intend to run the code on a cluster computer. We may want to use Python's virtual environment instead.

The reticulate developers prefer not to configure the Python environment automatically for users. Instead, they suggest providing an installation function that gives users the freedom to choose where to set up the required virtual environment. They also suggest avoiding reticulate::source_python() within .onLoad(), because it modifies the user's global environment and forces reticulate to initialize Python, possibly before the user has selected the desired Python version. They recommend using the reticulate::import() function family to load dependencies instead. See the reticulate documentation on managing an R package's Python dependencies for more about these good practices.

These suggestions, while safe and robust, may pose challenges for both R package developers and their users. For developers, reticulate::import() doesn't always work, for example with some submodules required by the EvoNN package. For users, additional knowledge may be required to get everything set up correctly, especially for those who primarily use R and lack Python experience. This article therefore provides a manual workaround to relieve some of that pain. Unexpected issues may still occur on the user side, given the complexity of Python versions, packages, and environment management.
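
One way to make such setup problems easier to diagnose is a preflight check that reports which modules, including submodules, cannot be located. The sketch below uses only the Python standard library. `missing_modules` is a hypothetical helper, and the module names in the example are stand-ins rather than EvoNN's real dependencies.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # ImportError: a parent package is absent.
            # ValueError: the module is loaded but carries no spec.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


report = missing_modules(["json", "json.decoder", "no_such_pkg_xyz.sub"])
print(report)
```

Looking up a submodule imports its parent package as a side effect, so a check like this is best run after the intended environment has been selected.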

This article used simplified EvoNN code to demonstrate how to conveniently integrate pre-trained PyTorch models into your R projects. If you have any difficulty understanding or porting the examples, please visit the EvoNN GitHub repository for the complete codebase.
