Files and directories

ML workflows often depend on one or more files, such as model weights. You may also need to send a large file, such as an image or video, as part of a run request. The Pipeline SDK provides two objects that make this easier:

  • File - An object that references a local file and allows it to be used remotely
  • Directory - Like a file, but for a full directory and its contents

As a bonus, if you need to use a file that's already hosted online you can use:

  • FileURL - An object that references a file to be downloaded via a given URL by the end server

You can create these objects programmatically or through the CLI, either as part of the pipeline definition or directly in a Run.

These file objects are used in two primary ways:

  • Static definition - For when you need a file to be accessible every time the pipeline is run or initialised, e.g. model weights.
  • Run time input - For when the file is different on every run, e.g. passing in an image to process or a large audio file to transcribe.

📘

You can also return File or Directory objects from a run; they are kept in our hosted storage. See File responses.

Static pipeline files

Using a File object in the pipeline definition sends local files along with the pipeline upload, which helps with cold start optimisation. Use it for files that are needed in some way on every run, typically model weights.

You can add a File or Directory to your pipeline as follows:

from pipeline import File, Pipeline, Variable, entity, pipe, FileURL
from pipeline.cloud.pipelines import upload_pipeline


@entity
class MainClass:
    @pipe(run_once=True, on_startup=True)
    def my_func(self, my_file: File) -> None:
        self.text = my_file.path.read_text()

    @pipe
    def get_text(self, rand: str) -> str:
        return self.text

with Pipeline() as builder:
    rand = Variable(str)

    my_file = File("./local_file.txt")
    # OR reference a file that is already hosted online
    my_file = FileURL("https://storage.mystic.ai/run_files/c8/5c/c85c1853-d1cd-40d5-9aaa-6e2cfaca48d5/image-0.jpg")
    
    en = MainClass()

    en.my_func(my_file)
    res = en.get_text(rand)
    builder.output(res)

my_pl = builder.get_pipeline()

remote_pipeline = upload_pipeline(my_pl, "file-test-pipeline", "my-environment")

In this example a local text file is uploaded along with the pipeline, to either a pcore deployment or Catalyst, by the upload_pipeline call on the last line. Note how the contents of the File object are accessed (Directory works the same way): its path attribute is a pathlib.Path, so the standard pathlib file manipulation features are available out of the box.
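To illustrate what a pathlib.Path offers, here is a minimal sketch that uses only the Python standard library. The temporary file stands in for `my_file.path`, and `describe_file` is a hypothetical helper written for this example; it is not part of the Pipeline SDK.

```python
import tempfile
from pathlib import Path


def describe_file(path: Path) -> dict:
    """Collect a few facts about a file using only pathlib.Path features."""
    return {
        "name": path.name,                  # final path component
        "suffix": path.suffix,              # extension, including the dot
        "size_bytes": path.stat().st_size,  # size on disk in bytes
        "text": path.read_text(encoding="utf-8"),
    }


# Stand-in for `my_file.path`: a temporary file created just for this demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "local_file.txt"
    demo_path.write_text("hello pipeline", encoding="utf-8")
    info = describe_file(demo_path)

print(info["name"], info["suffix"], info["size_bytes"])
```

Inside a pipe you would presumably call the same Path methods directly on `my_file.path`.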

📘

If you're debugging and don't want to upload large files every time, see Uploading files and directories by CLI below.

Run time input

Using files in a run differs from using them in a pipeline definition. Instead of referencing a file on your local system, you pass the File or Directory class to Variable as its type.

API requests for File objects

A standard cURL request for a run looks something like this:

curl --request POST \
  --url http://mystic.ai/v3/runs \
  --header 'Authorization: Bearer API_TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
    "pipeline_id_or_pointer": "meta/llama-2-70B:latest",
    "async_run": false,
    "input_data": [
      {
        "type": "string",
        "value": "Hello my name is Paul, I like to"
      },
      {
        "type": "dictionary",
        "value": {
          "max_new_tokens": 20
        }
      }
    ]
  }'

And with a File:

curl --request POST \
  --url http://mystic.ai/v3/runs \
  --header 'Authorization: Bearer API_TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
    "pipeline_id_or_pointer": "paulh/read-my-file:v1",
    "async_run": false,
    "input_data": [
      {
        "type": "file",
        "value": null,
        "file_name": "my_file.txt",
        "file_path": "/pipeline_files/aa/aa/aa598uohkjgesrohuijger/my_file.txt"
      }
    ]
  }'

This assumes the file has already been uploaded to Catalyst or your pcore deployment (FileURLs are the exception, see below).
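If you build the request body in code rather than by hand, a small helper avoids JSON mistakes such as trailing commas. The sketch below only mirrors the field names shown in the cURL example above; `build_file_run_payload` is a hypothetical helper, not an SDK function, and it does not send the request.

```python
import json


def build_file_run_payload(pipeline_pointer: str, file_name: str, file_path: str) -> str:
    """Return the JSON body for a synchronous run with a single file input."""
    payload = {
        "pipeline_id_or_pointer": pipeline_pointer,
        "async_run": False,
        "input_data": [
            {
                "type": "file",
                "value": None,  # serialised as JSON null, as in the cURL example
                "file_name": file_name,
                "file_path": file_path,
            }
        ],
    }
    return json.dumps(payload, indent=2)


body = build_file_run_payload(
    "paulh/read-my-file:v1",
    "my_file.txt",
    "/pipeline_files/aa/aa/aa598uohkjgesrohuijger/my_file.txt",
)
print(body)
```

The resulting string can be passed as the `--data` argument, or as the body of a request made with the HTTP client of your choice.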

To upload the file over HTTP, use the following cURL request; its response can then be used in the run request above:

curl --request POST \
  --url http://mystic.ai/v3/pipeline_files \
  --header 'Authorization: Bearer API_TOKEN' \
  --form 'pfile=@/Users/paul/Desktop/my_file.txt'

The response contains both file_name and file_path for use in the run above. Alternatively, you can do all of the above directly in Python:

from pipeline import File, Pipeline, Variable, pipe
from pipeline.cloud.pipelines import run_pipeline, upload_pipeline

@pipe
def my_func(my_file: File) -> str:
    return my_file.path.read_text()


with Pipeline() as builder:
    var_1 = Variable(File)

    output = my_func(var_1)

    builder.output(output)

my_pl = builder.get_pipeline()


remote_pipeline = upload_pipeline(my_pl, "file_test", "numpy")
# Option 1
output = run_pipeline(remote_pipeline.id, open("my_file.txt", "rb"))
# Option 2
output = run_pipeline(remote_pipeline.id, File("my_file.txt"))

The upload and run requests are made for you by the run_pipeline calls on lines 21 and 23.

API requests for Directory objects

🚧

File and Directory objects are not differentiated outside of the Python SDK. If you annotate the @pipe input as File instead of Directory, the zip file will not be unzipped.

Directory objects are sent by first zipping the contents of a directory. For example, the following command puts file_1.txt and file_2.txt into my_zip_file.zip:

zip my_zip_file.zip file_1.txt file_2.txt
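If the zip command is not available, the same archive can be produced with Python's standard zipfile module. This is a minimal sketch rather than what the SDK does internally; `zip_directory` is a hypothetical helper, and the demo writes into a temporary directory so it can run anywhere.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_directory(source_dir: Path, zip_path: Path) -> list[str]:
    """Zip every file under source_dir and return the archived names.

    zip_path should live outside source_dir so the archive does not include itself.
    """
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file_path in sorted(source_dir.rglob("*")):
            if file_path.is_file():
                # Store names relative to source_dir, not absolute paths.
                archive.write(file_path, arcname=file_path.relative_to(source_dir).as_posix())
    with zipfile.ZipFile(zip_path) as archive:
        return archive.namelist()


with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / "my_dir"
    source.mkdir()
    (source / "file_1.txt").write_text("one", encoding="utf-8")
    (source / "file_2.txt").write_text("two", encoding="utf-8")
    archived = zip_directory(source, Path(tmp_dir) / "my_zip_file.zip")

print(archived)
```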

Once this is done you can upload the zip file in the same way you'd upload any file:

curl --request POST \
  --url http://mystic.ai/v3/pipeline_files \
  --header 'Authorization: Bearer API_TOKEN' \
  --form 'pfile=@my_zip_file.zip'

Uploading files and directories by CLI

You can upload either a File or Directory object by CLI. For a single file:

pipeline create file my_file.txt
# Pipeline File created with ID= file_34983498734

Or you can upload an entire directory:

pipeline create file -r ./my_dir
# Pipeline File created with ID= file_34983498734

File responses

There are two ways to return files from a Run: as a single File output variable, or as an array of File objects. Use standard Python type annotations to declare which:

from pipeline import File, pipe

@pipe
def get_file() -> File:
  # Path of a file written earlier in the run
  local_file_path = "/tmp/image.jpg"
  output_image = File(path=local_file_path, allow_out_of_context_creation=True)
  return output_image

# OR for an array

from typing import List

@pipe
def get_file() -> List[File]:
  local_file_path = "/tmp/image.jpg"
  local_file_path_2 = "/tmp/image2.jpg"
  output_image = File(path=local_file_path, allow_out_of_context_creation=True)
  output_image_2 = File(path=local_file_path_2, allow_out_of_context_creation=True)
  return [output_image, output_image_2]

When a File object is returned from a pipe, the run response contains a URL to that file along with other metadata. The file-specific variable object looks like this in a response:

{
  "type": "file",
  "value": null,
  "file": {
    "name": "image-0.jpg",
    "path": "run_files/31/7e/317e304d-e816-4036-86b2-7ad82b208b70/image-0.jpg",
    "url": "https://storage.mystic.ai/run_files/31/7e/317e304d-e816-4036-86b2-7ad82b208b70/image-0.jpg",
    "size": 22241
  }
}

For a list/array of File objects, an array variable is used as the root variable:

{
  "type": "array",
  "value": [
    {
      "type": "file",
      "value": null,
      "file": {
        "name": "image-0.jpg",
        "path": "run_files/31/7e/317e304d-e816-4036-86b2-7ad82b208b70/image-0.jpg",
        "url": "https://storage.mystic.ai/run_files/31/7e/317e304d-e816-4036-86b2-7ad82b208b70/image-0.jpg",
        "size": 22241
      }
    },
    {
      "type": "file",
      "value": null,
      "file": {
        "name": "image-1.jpg",
        "path": "run_files/45/2a/452a3204-beba-4ccc-949c-1af4e2fcd88c/image-1.jpg",
        "url": "https://storage.mystic.ai/run_files/45/2a/452a3204-beba-4ccc-949c-1af4e2fcd88c/image-1.jpg",
        "size": 16608
      }
    }
  ],
  "file": null
}
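On the client side, both response shapes can be handled by one small function. The sketch below assumes only the fields shown in the two examples above (`type`, `value`, and `file.url`); `collect_file_urls` is a hypothetical helper, not part of the SDK.

```python
from typing import Any


def collect_file_urls(variable: dict[str, Any]) -> list[str]:
    """Return every file URL found in a file variable or an array of them."""
    if variable.get("type") == "file":
        file_info = variable.get("file") or {}
        url = file_info.get("url")
        return [url] if url else []
    if variable.get("type") == "array":
        urls: list[str] = []
        for item in variable.get("value") or []:
            if isinstance(item, dict):
                urls.extend(collect_file_urls(item))
        return urls
    # Any other variable type carries no file.
    return []


base = "https://storage.mystic.ai/run_files"
first = {
    "type": "file",
    "value": None,
    "file": {"name": "image-0.jpg", "url": f"{base}/31/7e/317e304d-e816-4036-86b2-7ad82b208b70/image-0.jpg"},
}
second = {
    "type": "file",
    "value": None,
    "file": {"name": "image-1.jpg", "url": f"{base}/45/2a/452a3204-beba-4ccc-949c-1af4e2fcd88c/image-1.jpg"},
}
array_variable = {"type": "array", "value": [first, second], "file": None}

urls = collect_file_urls(array_variable)
print(urls)
```

Each returned URL can then be downloaded with whichever HTTP client you already use.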

FileURLs

FileURLs can be used as a static file reference for a pipeline, as discussed above, or as an ordinary input in a run. Let's take the pipeline below as our example:

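from pipeline import File, FileURL, Pipeline, Variable, pipe
from pipeline.cloud.pipelines import run_pipeline, upload_pipeline

@pipe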
def my_func(my_file: File) -> str:
    return my_file.path.read_text()


with Pipeline() as builder:
    var_1 = Variable(File)

    output = my_func(var_1)

    builder.output(output)

my_pl = builder.get_pipeline()
remote_pipeline = upload_pipeline(my_pl, "paulh/file_test", "numpy")

To perform a run with this pipeline using a FileURL in Python:

run_pipeline(
  remote_pipeline.id,  # the pipeline to run, as in the earlier run_pipeline calls
  FileURL(
    url="https://storage.mystic.ai/run_files/c8/5c/c85c1853-d1cd-40d5-9aaa-6e2cfaca48d5/image-0.jpg",
  ),
)

Or by curl:

curl --request POST \
  --url http://mystic.ai/v3/runs \
  --header 'Authorization: Bearer API_TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
    "pipeline_id_or_pointer": "paulh/file_test:v1",
    "async_run": false,
    "input_data": [
      {
        "type": "file",
        "value": null,
        "file_name": "image-0.jpg",
        "file_path": "https://storage.mystic.ai/run_files/c8/5c/c85c1853-d1cd-40d5-9aaa-6e2cfaca48d5/image-0.jpg"
      }
    ]
  }'

Although the input to my_func is annotated as File, passing a FileURL tells the local client not to read or load the remote file itself, but to pass it along as a reference for the server to download.