Mistral AI 7B vLLM inference guide

A guide to running the Mistral AI 7B LLM with vLLM

Mistral 7B is an open-source LLM from Mistral AI, released in September 2023. This example demonstrates how to achieve faster inference with both the base and instruct variants of the model using the open-source project vLLM.

The table below outlines the performance of the model (all runs use float16, processing a single conversation per request). All tests used up to 25GB of VRAM on an A100 40GB, which costs $2/hour on Mystic.

Batch size    Cost per 1k tokens    Tokens/s
1             $0.0121               46
10            $0.00139              400
60            $0.000309             1.8k
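
As a rough sanity check, these cost figures are consistent with deriving them directly from the $2/hour A100 price and the measured throughput. The snippet below reproduces the cost column under that assumption (the derivation itself is not stated in the benchmark):

# Hypothetical derivation of the "Cost per 1k tokens" column, assuming cost
# scales purely with wall-clock time on a $2/hour A100 40GB.
GPU_PRICE_PER_HOUR = 2.00  # USD

def cost_per_1k_tokens(tokens_per_second: float) -> float:
    dollars_per_second = GPU_PRICE_PER_HOUR / 3600
    return 1000 * dollars_per_second / tokens_per_second

for batch_size, throughput in [(1, 46), (10, 400), (60, 1800)]:
    print(f"batch {batch_size}: ${cost_per_1k_tokens(throughput):.3g} per 1k tokens")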

You can also play with the models right now to test the performance for yourself.

Inference pipeline creation

The requirements for the Python environment are:

torch==2.0.1
git+https://github.com/huggingface/transformers@211f93aab95d1c683494e61c3cf8ff10e1f5d6b7
diffusers==0.19.3
accelerate==0.21.0
vllm==0.2.0

To wrap the model so it runs in this environment, we use a pipeline entity class that loads the model weights on startup and then serves the inference requests:

from vllm import LLM

from pipeline import entity, pipe


@entity
class Mistral7B:
    @pipe(on_startup=True, run_once=True)
    def load_model(self) -> None:
        # Load the model weights once, when the pipeline starts up
        self.llm = LLM(
            "mistralai/Mistral-7B-Instruct-v0.1",
            dtype="bfloat16",
        )
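
If you want to sanity-check the model outside of the pipeline wrapper, the same vLLM calls work standalone. This is just a sketch: it assumes you run it on a GPU with enough free VRAM to hold the weights.

from vllm import LLM, SamplingParams

# Standalone vLLM usage with the same model and dtype as the entity above
llm = LLM("mistralai/Mistral-7B-Instruct-v0.1", dtype="bfloat16")

outputs = llm.generate(
    ["[INST] What is the capital of France? [/INST]"],
    SamplingParams(temperature=0.6, top_p=0.9, max_tokens=50),
)
print(outputs[0].outputs[0].text)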

Inference

To perform inference with the vLLM model, we need to add an additional @pipe:

from vllm import SamplingParams
from pipeline.objects.graph import InputField, InputSchema, Variable


class ModelKwargs(InputSchema):
    temperature: float | None = InputField(default=0.6)
    top_p: float | None = InputField(default=0.9)
    max_tokens: int | None = InputField(default=100, ge=1, le=4096)


@entity
class Mistral7B:
    ...

    @pipe
    def inference(self, prompts: list, kwargs: ModelKwargs) -> list[str]:
        # Map the validated input schema onto vLLM sampling parameters
        sampling_params = SamplingParams(
            temperature=kwargs.temperature,
            top_p=kwargs.top_p,
            max_tokens=kwargs.max_tokens,
        )

        result = self.llm.generate(prompts, sampling_params)

        # Return the first generated completion for each prompt
        return [t.outputs[0].text for t in result]
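
ModelKwargs gives callers a typed way to override the sampling defaults. Assuming InputSchema accepts keyword overrides in the same way as a pydantic model, usage looks like this (the values are purely illustrative):

# Use the defaults defined in the schema
defaults = ModelKwargs()

# Or override them, e.g. for longer, more deterministic generations
focused = ModelKwargs(temperature=0.2, top_p=0.95, max_tokens=512)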

We can now create the full inference pipeline:

from pipeline import Pipeline, Variable

with Pipeline() as builder:
    # Graph inputs: a list of prompts and the sampling kwargs
    prompt = Variable(
        list,
        default=["[INST] <<SYS>> answer any question <</SYS>> What is love? [/INST]"],
    )
    kwargs = Variable(ModelKwargs)

    # Wire the entity's pipes into the graph
    _pipeline = Mistral7B()
    _pipeline.load_model()
    out = _pipeline.inference(prompt, kwargs)

    builder.output(out)

my_pipeline = builder.get_pipeline()

We can then run it locally by adding the following at the end of the Python file:

output = my_pipeline.run(
    ["My name is"],
    ModelKwargs(),
)
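
The local run returns whatever the graph outputs; for this pipeline that should be the list of generated strings from the inference pipe, so printing it is the quickest check (the exact return shape may vary between library versions):

# Inspect the generated text returned by the local run
print(output)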

Remote inference

📘 Make sure you're logged in before continuing

You need to authenticate your local system with Catalyst or your pcore deployment before running the following code:

pipeline cluster login catalyst YOUR_TOKEN -a

To run inference remotely on Catalyst or your enterprise pcore deployment, we need to create an environment and upload the pipeline:

from pipeline.cloud import environments
my_env = environments.create_environment(
    "<your-username>/mistral-7b-vllm",
    python_requirements=[
        "torch==2.0.1",
        "git+https://github.com/huggingface/transformers@211f93aab95d1c683494e61c3cf8ff10e1f5d6b7",
        "diffusers==0.19.3",
        "accelerate==0.21.0",
        "vllm==0.2.0",
    ],
)

You can then upload the pipeline (make sure you haven't run it locally before uploading):

from pipeline.cloud.compute_requirements import Accelerator
from pipeline.cloud.pipelines import upload_pipeline, run_pipeline

result = upload_pipeline(
    my_pipeline,
    "<your-username>/Mistral-7B-Instruct-v0.1",
    "<your-username>/mistral-7b-vllm",
    minimum_cache_number=1,
    required_gpu_vram_mb=35_000,
    accelerators=[
        Accelerator.nvidia_a100,
    ],
)

run_pipeline(
    result.id,
    [
        "[INST] <<SYS>> answer any question <</SYS>> What is love? [/INST]",
    ],
    ModelKwargs(),
)