Deploy Mistral 7B with vLLM

Mistral 7B is an open-source LLM from Mistral AI, released in September 2023. This example demonstrates how to achieve faster inference with both the base and instruct models by using the open-source project vLLM.

Below is a table outlining the performance of the models (all models run in float16 with a single conversation being processed). All tests used up to 25 GB of VRAM on an A100 40GB costing $2/hour on Mystic.

Batch size    Cost per 1k tokens    Tokens/s
1             $0.012                146
10            $0.00139              400
60            $0.000309             1.8k

You can also play with the models right now to test out the performance for yourself.

All of the source code and files for this example can be found in the Pipeline SDK repo here.

If you haven't set up your machine for working with Pipelines yet, please follow the Getting Started guide before doing this tutorial.

Setup

First we make a directory and then initialize our project:

mkdir mistral-7b-pipeline && cd mistral-7b-pipeline
pipeline container init

This gives us a basic pipeline.yaml and a new_pipeline.py file to start building our pipeline from.

YAML

We need to make some changes to our YAML file to set up our Pipeline environment. First, we add the following container commands, which make sure gcc is installed:

container_commands:
    - apt-get update
    - apt-get install -y gcc

Next, we configure our Python dependencies; we need the following for Mistral:

    requirements:
      - "pipeline-ai"
      - "torch==2.0.1"
      - "transformers"
      - "diffusers==0.19.3"
      - "accelerate==0.21.0"
      - "vllm==0.2.1.post1"

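If you want to sanity-check that these pins resolve correctly inside the container, a short optional Python check like the one below works (run it inside the container environment; the versions printed should match the pins above):

import torch
import transformers
import vllm

# Print the resolved library versions to confirm they match the pinned requirements.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("vllm:", vllm.__version__)
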
Mistral 7B is pretty small, so it can run on a single A100 with 40 GB of VRAM:

accelerators: ["nvidia_a100"]
accelerator_memory: null

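To confirm that the GPU is actually visible from Python before loading the model, a quick optional PyTorch check along these lines can help:

import torch

# Confirm a CUDA device is visible and report its name and total memory.
assert torch.cuda.is_available(), "No CUDA device visible"
props = torch.cuda.get_device_properties(0)
print(torch.cuda.get_device_name(0))
print(f"{props.total_memory / 1024**3:.1f} GiB of VRAM")
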
Lastly, you'll want to change pipeline_name to include your Mystic username so that you can upload this Pipeline later.

pipeline_name: mystic_user/mistral-7b

Our final YAML looks like this:

runtime:
  container_commands:
    - apt-get update
    - apt-get install -y gcc
  python:
    version: "3.10"
    requirements:
      - "pipeline-ai"
      - "torch==2.0.1"
      - "transformers"
      - "diffusers==0.19.3"
      - "accelerate==0.21.0"
      - "vllm==0.2.1.post1"
    cuda_version: "11.4"
accelerators: ["nvidia_a100"]
accelerator_memory: null
pipeline_graph: new_pipeline:my_pipeline
pipeline_name: mystic_user/mistral-7b

Python Code

Let's start with the imports; we'll be using the vllm library to make running our model super easy:

from typing import List

from vllm import LLM, SamplingParams

from pipeline import Pipeline, entity, pipe
from pipeline.objects.graph import InputField, InputSchema, Variable

Next we'll define an input schema to wrap up all our inference parameters:

class ModelKwargs(InputSchema):
    # Only temperature, top_p and max_tokens are passed to vLLM's SamplingParams
    # in the inference pipe below; do_sample, use_cache and top_k are accepted
    # but not used by this pipeline.
    do_sample: bool | None = InputField(default=False)
    use_cache: bool | None = InputField(default=True)
    temperature: float | None = InputField(default=0.6)
    top_k: float | None = InputField(default=50)
    top_p: float | None = InputField(default=0.9)
    max_tokens: int | None = InputField(default=100, ge=1, le=4096)
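
A caller only needs to override the fields it cares about; anything left out falls back to the defaults above. As a purely illustrative example, a kwargs payload might look like this (field names match the schema, values are hypothetical):

# Hypothetical override: any field not supplied keeps its default from the schema above.
kwargs_payload = {
    "temperature": 0.8,  # sample a little more freely than the 0.6 default
    "max_tokens": 256,   # allow a longer completion than the default 100 tokens
}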

Now we'll implement the actual Pipeline. This is fairly straightforward thanks to vLLM; note that the model is loaded only once, at startup.

@entity
class Mistral7B:
    @pipe(on_startup=True, run_once=True)
    def load_model(self) -> None:
        # gpu_memory_utilization=0.5 caps vLLM at roughly half of the GPU's memory
        # (weights plus KV cache), leaving headroom on the 40GB A100.
        self.llm = LLM("mistralai/Mistral-7B-Instruct-v0.2", gpu_memory_utilization=0.5)

    @pipe
    def inference(self, prompts: list, kwargs: ModelKwargs) -> List[str]:
        sampling_params = SamplingParams(
            temperature=kwargs.temperature,
            top_p=kwargs.top_p,
            max_tokens=kwargs.max_tokens,
        )

        result = self.llm.generate(prompts, sampling_params)

        return [t.outputs[0].text for t in result]

Our model takes a list of prompts and outputs a list of generated strings. If you want to learn more about how the actual inference call works, take a look at the vLLM documentation.
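
For reference, here is a minimal standalone sketch of the equivalent plain-vLLM call outside the Pipeline wrapper, using the same model and sampling values as above (handy for a quick local experiment if you have a suitable GPU):

from vllm import LLM, SamplingParams

# Same model and sampling settings as used in the Pipeline above.
llm = LLM("mistralai/Mistral-7B-Instruct-v0.2", gpu_memory_utilization=0.5)
params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=100)

outputs = llm.generate(["My name is"], params)
print(outputs[0].outputs[0].text)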

Finally, let's connect the inputs and outputs of our model and package it:

with Pipeline() as builder:
    prompt = Variable(list, default=["My name is"])
    kwargs = Variable(ModelKwargs)

    _pipeline = Mistral7B()
    _pipeline.load_model()
    out = _pipeline.inference(prompt, kwargs)

    builder.output(out)


my_pipeline = builder.get_pipeline()

And that's it!

Testing

Your Pipeline should now be ready to test. If you want to run it locally, you'll need a GPU with enough VRAM available. You can run your container locally with pipeline container up, or push it to Mystic and run the Pipeline from your dashboard.
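
Once the Pipeline is live on Mystic, you can also call it from Python. The snippet below is a sketch under the assumption that your version of the pipeline-ai SDK exposes a run_pipeline helper in pipeline.cloud.pipelines and that the pipeline pointer follows the username/name:version pattern; check the Getting Started guide if your SDK version differs:

from pipeline.cloud.pipelines import run_pipeline

# Hypothetical pointer; replace with your own username, pipeline name and version.
result = run_pipeline(
    "mystic_user/mistral-7b:v1",
    ["My name is"],
    {"temperature": 0.6, "top_p": 0.9, "max_tokens": 100},
)
print(result)  # inspect the returned object for the generated text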