Scaling configuration

Control the scaling parameters of your Pipeline to best suit your products needs - scale to 0 or 1k GPUs without writing code or touching a cloud provider.

πŸ“˜

This is premium feature, register and read more here: https://www.mystic.ai/bring-your-own-cloud

Overview

Scaling Configurations allow you to define how your Pipeline should scale up or down in the cloud. As your traffic will vary, it's important to use an appropriate configuration to make sure you have a good cost/performance trade off according to your products need. Scaling Configurations also allow you to set minimum and maximum bounds for scaling to make sure a model can always be ready and does not scale up too high.

Key features:

  • Scale to 0 - If you want a pipeline to scale down completely with no traffic to minimise costs
  • Minimum & Maximum cache control - Upper and lower bounds for pipeline caching
  • Scale up/down speeds - This allows you to scale up or down based on traffic within a time period you set

Scaling configurations are accessible via generic CRUD endpoints in all products offered by Mystic, including the enterprise PipelineCore, as well as controllable via the python SDK and command line interface.

General definition

Scaling configurations all have the following fields:

  • type string Enum, values (read more about types below in Scaling types):
    • windows - Apply a weighted average to user defined windows to calculate the target number of nodes
  • args dictionary, specific to what type is defined

Concurrency

Most approaches to scaling use the concept of concurrency to calculate the ideal number of nodes for a pipeline. The concurrency is defined as:

concurrency = frequency of requests * duration of requests

As this is a decimal based number, we convert it to the ideal number of machines by always rounding up (ceiling) the number:

num_nodes = ceil( concurrency )

By calculating this we know the number of nodes that should be present given a calculated concurrency.

Scaling types

There are multiple scaling algorithms provided by Mystic on all of our cluster types - currently the only configurable public one is windows. Scaling algorithms allow you to select different approaches to responding to inbound traffic over time in a way that's most appropriate for your product or cost/performance needs.

Scaling types coming to the public soon:

  • pid - Proportional, Integral and Differential control. This allows more controlled responsiveness to drastic changes in traffic
  • exponential_decay - This algorithm reduces noise with in noisy traffic
  • historic_lookback - This looks back at the traffic in a given time period, such as a day, week, or other arbitrary time offset.

windows

The windows scaling type takes in an array of times weight a weight value to calculate a weighted average for scaling. For example the following args look back and calculate their concurrency 1, 5, and 30 minutes:

[
  {
  "lookback_time": 60,
  "weight": 0.6
	},
  {
  "lookback_time": 300,
  "weight": 0.3
	},
  {
  "lookback_time": 1800,
  "weight": 0.1
	},
]

Fields:

  • lookback_time - Time to view run traffic
  • weight - the weight to apply to the value for the time period

Note: Weights must sum to 1.

Create and link scaling configurations

To create a scaling configuration that keeps a pipeline alive based on traffic in the past 5 minutes via the CLI you can simply run:

pipeline create scalings username/my_new_scaling_config windows --args '{"windows": [{"lookback_time": 300,"weight": 1}]}' --min-nodes 0 --max-nodes 10

This can then be attached to a pipeline by running:

pipeline edit pl pipeline_123456789 -s username/my_new_scaling_config