Cold start optimisations
Many models take several minutes to load the first time they are used. The Pipeline SDK offers several features that reduce the impact of this on your inference requests.
Common reasons for long cold start times:
- Large pipeline docker image sizes, most likely due to large python dependencies (torch, transformers, diffusers, tensorflow etc)
- Downloading model weights. This is normally due to unreliable, volatile, and slow download speeds (commonly from Huggingface when using transformers)
- Initialising model weights on the GPU
The typical procedure before running inference is:
1. Startup VM instance in the cloud
2. Pull the pipeline docker image from the registry
3. Start docker container
4. Download model weights
5. Load model on GPU
6. Ready for inference
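To see which of these steps dominates for a given model, each one can be timed separately. The following is a minimal sketch using only the Python standard library. The `timed_stage` helper and the two stage functions are hypothetical names introduced for illustration, and the sleeps are placeholders for the real download and GPU load work.

```python
import time
from contextlib import contextmanager

# Hypothetical helper (not part of the Pipeline SDK): maps stage name -> seconds taken.
stage_durations = {}


@contextmanager
def timed_stage(name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_durations[name] = time.perf_counter() - start


def download_weights():
    time.sleep(0.02)  # placeholder for fetching weights over the network


def load_on_gpu():
    time.sleep(0.01)  # placeholder for moving weights onto the GPU


with timed_stage("download_weights"):
    download_weights()

with timed_stage("load_on_gpu"):
    load_on_gpu()

# Report each stage so the slowest one is easy to spot.
for name, seconds in stage_durations.items():
    print(f"{name}: {seconds:.3f}s")
```

Replacing the placeholder bodies with the real loading code gives a rough picture of where the cold start time goes before any optimisation is applied.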
The next sections cover the features available for reducing or removing cold start times from your pipelines.
Preemptive caching
When you deploy a model to Mystic, steps 1-3 above are optimised behind the scenes for you, while steps 4-5 can be optimised in the Pipeline SDK when you create a pipeline. Mystic will automatically try to load your models based on the current and historical traffic going to them. This is called preemptive caching. To make your pipelines compatible with this, the pipe decorator in the Pipeline SDK accepts two arguments, on_startup and run_once, which are covered shortly.
Here's a basic example of a Stable Diffusion pipeline:
from pathlib import Path
from typing import List

import torch
from diffusers import StableDiffusionPipeline

from pipeline import Pipeline, Variable, entity, pipe
from pipeline.cloud import compute_requirements, environments, pipelines
from pipeline.objects import File
from pipeline.objects.graph import InputField, InputSchema


@entity
class StableDiffusionModel:
    # Load the model when the Pipeline starts up, and only once
    @pipe(on_startup=True, run_once=True)
    def load(self):
        model_id = "runwayml/stable-diffusion-v1-5"
        # Use the GPU when one is available, otherwise fall back to the CPU
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id,
        )
        self.pipe = self.pipe.to(device)

    @pipe
    def predict(self, prompt: str) -> List[File]:
        images = self.pipe(prompt).images

        # Save each generated image to disk and return it as a File
        output_images = []
        for i, image in enumerate(images):
            path = Path(f"/tmp/sd/image-{i}.jpg")
            path.parent.mkdir(parents=True, exist_ok=True)
            image.save(str(path))
            output_images.append(File(path=path, allow_out_of_context_creation=True))

        return output_images


with Pipeline() as builder:
    prompt = Variable(str)

    model = StableDiffusionModel()
    model.load()

    # predict takes only the prompt
    output = model.predict(prompt)

    builder.output(output)
The pipe decorator on the load method uses the two keyword arguments on_startup and run_once, while the Pipeline itself is built as normal in the with Pipeline() as builder block.
The keyword arguments act as their names describe:
- on_startup - Run the pipe when the Pipeline is first run, or when the Pipeline's startup function is called
- run_once - Only run the function on the first call, and do not run it again
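The run_once behaviour can be pictured with a plain Python decorator. This is only an illustration of the semantics described above, not how the Pipeline SDK implements them. The `run_once` decorator, `load_model` and `predict` below are hypothetical stand-ins, and returning the cached result on later calls is an assumption of this sketch.

```python
import functools


def run_once(func):
    """Hypothetical stand-in: run func on the first call, then reuse its result."""
    sentinel = object()
    result = sentinel

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        nonlocal result
        if result is sentinel:
            # First call: do the expensive work and remember the result
            result = func(*args, **kwargs)
        return result

    return wrapper


calls = {"load": 0}  # counts how often the expensive load actually runs


@run_once
def load_model():
    calls["load"] += 1
    return {"weights": "loaded"}  # placeholder for a real model object


def predict(prompt):
    model = load_model()  # cheap after the first call
    return f"{prompt} -> {model['weights']}"


for prompt in ["a cat", "a dog", "a bird"]:
    predict(prompt)

print(calls["load"])  # prints 1: the load ran once across three predictions
```

The point of the pattern is that every inference request after the first skips the expensive load and goes straight to prediction.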