Cold start optimisations

Many models can take several minutes to load the first time they run. The Pipeline SDK offers several features that reduce the impact of this on your inference requests.

Common reasons for long cold start times:

  • Downloading model weights. Download speeds are often unreliable, volatile, and slow (commonly from Hugging Face when using transformers)
  • Initialising model weights on the GPU
  • Loading libraries (torch, transformers, diffusers, tensorflow, etc.)

The typical procedure before running inference is:

  1. Start up a VM instance in the cloud
  2. Start the Docker container
  3. Initialise environment
  4. Download model weights
  5. Load model on GPU
  6. Ready for inference

The next sections cover the features available for decreasing or removing cold start times from your pipelines.

Preemptive caching

Both Catalyst and a pcore deployment automate the optimisation of steps 1-3 of the typical procedure, while steps 4-5 are optimised in the Pipeline SDK when you create a pipeline. Catalyst and pcore deployments automatically try to pre-load your models based on the current and historical traffic going to them. This is called preemptive caching. To make your pipelines compatible with this, the Pipeline SDK provides two arguments on the pipe decorator, on_startup and run_once; we will cover these shortly.

Here's a basic example StableDiffusion pipeline:

from pathlib import Path
from typing import List

import torch
from diffusers import StableDiffusionPipeline

from pipeline import Pipeline, Variable, entity, pipe
from pipeline.cloud import compute_requirements, environments, pipelines
from pipeline.objects import File
from pipeline.objects.graph import InputField, InputSchema

@entity
class StableDiffusionModel:
    @pipe(on_startup=True, run_once=True)
    def load(self):
        model_id = "runwayml/stable-diffusion-v1-5"
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id,
        )
        self.pipe = self.pipe.to(device)

    @pipe
    def predict(self, prompt: str) -> List[File]:
        images = self.pipe(prompt).images

        output_images = []
        for i, image in enumerate(images):
            path = Path(f"/tmp/sd/image-{i}.jpg")
            path.parent.mkdir(parents=True, exist_ok=True)
            image.save(str(path))
            output_images.append(File(path=path, allow_out_of_context_creation=True))

        return output_images


with Pipeline() as builder:
    prompt = Variable(str)

    model = StableDiffusionModel()
    model.load()
    output = model.predict(prompt)
    builder.output(output)

In the example above, the @pipe(on_startup=True, run_once=True) decorator on the load method shows the two keyword arguments in use, while the Pipeline itself is built as normal inside the with Pipeline() as builder: block.

The keyword arguments act as their names describe (a minimal sketch follows this list):

  • on_startup - run the pipe when the Pipeline is first run, or when the Pipeline's startup function is called
  • run_once - only run the pipe on the first call, and do not run it again
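
To make their behaviour concrete, here's a minimal sketch (the entity and method names are illustrative, not part of the SDK) of a warm-up pipe that executes once, no matter how many times the pipeline is run:

from pipeline import Pipeline, Variable, entity, pipe

@entity
class MyModel:
    @pipe(on_startup=True, run_once=True)
    def load(self):
        # Expensive one-off work (downloading/initialising weights) lives here.
        # run_once=True means this body executes only on the first call;
        # on_startup=True means it also runs when the pipeline's startup
        # function is called, e.g. by preemptive caching.
        self.ready = True

    @pipe
    def predict(self, prompt: str) -> str:
        # Runs on every inference request; load() has already completed
        return f"processed: {prompt}"

with Pipeline() as builder:
    prompt = Variable(str)

    model = MyModel()
    model.load()
    output = model.predict(prompt)
    builder.output(output)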

Files and directories

Libraries such as transformers use public mirrors to download model weights, which can be unstable and unreliable. It's common to see very volatile and unpredictable download speeds, often doubling the cold start times of models. To overcome this, it's possible to fully bundle the model before uploading the pipeline, and then use File and directory objects to store everything on Catalyst/pcore directly. Here's an example:

from pipeline.objects import File

@entity
class StableDiffusionModel:
    ...
    @pipe(on_startup=True, run_once=True)
    def load(self, model_file: File):
        import dill

        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Load the bundled weights from the file instead of downloading them
        with model_file.path.open("rb") as file:
            self.pipe = dill.load(file)
        self.pipe = self.pipe.to(device)
    ...

my_model = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

with Pipeline() as builder:
    prompt = Variable(str)

    my_file = File.from_object(my_model)  # Create a file object

    model = StableDiffusionModel()
    model.load(my_file)  # Pass file object to load function
    output = model.predict(prompt)
    builder.output(output)
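
For libraries that load weights from a directory rather than a single file, the same pattern can be applied with a directory object. The sketch below is illustrative only: it assumes the SDK exposes a Directory object analogous to File (as the section title suggests), and the exact constructor may differ, so check the SDK reference before relying on it:

from pipeline.objects import Directory  # assumed analogue of File

# One-off, before uploading the pipeline: save the weights locally in
# the standard diffusers layout so they can be bundled directly
StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).save_pretrained("./sd-weights")

@entity
class StableDiffusionModel:
    @pipe(on_startup=True, run_once=True)
    def load(self, weights_dir: Directory):
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # Load from the bundled directory instead of a public mirror
        self.pipe = StableDiffusionPipeline.from_pretrained(str(weights_dir.path))
        self.pipe = self.pipe.to(device)
    ...

with Pipeline() as builder:
    prompt = Variable(str)

    weights_dir = Directory(path="./sd-weights")  # constructor is an assumption here

    model = StableDiffusionModel()
    model.load(weights_dir)
    output = model.predict(prompt)
    builder.output(output)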