
Folder Structure

TL;DR

You need to respect the following file structure from the Vertex Pipeline Starter Kit:

vertex
├─ configs/
│  └─ {pipeline_name}
│     └─ {config_name}.json
└─ pipelines/
    └─ {pipeline_name}.py

A pipeline file looks like this:

```python title="vertex/pipelines/dummy_pipeline.py"
import kfp.dsl

@kfp.dsl.pipeline()
def dummy_pipeline():
    ...
```

You can use either .py, .json, .toml or .yaml files for your config files.
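
For example, a minimal JSON config for the pipeline above could look like this (assuming dummy_pipeline takes a model_name parameter; the file and parameter names are purely illustrative):

```json title="vertex/configs/dummy_pipeline/config.json"
{
    "model_name": "my-model"
}
```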

You must respect the following folder structure. If you already follow the Vertex Pipeline Starter Kit folder structure, it should be pretty smooth to use this tool:

vertex
├─ configs/
│  └─ {pipeline_name}
│     └─ {config_name}.json
└─ pipelines/
   └─ {pipeline_name}.py

About folder structure

You must have at least these files. If you need to share some config elements between pipelines, you can have a shared folder in configs and import them in your pipeline configs, as sketched below.
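
As an illustration, a pipeline config could import shared elements like this (the shared module, file names and common_params dict are all hypothetical; the exact import path depends on how you invoke the deployer):

```python title="vertex/configs/dummy_pipeline/config_dev.py"
# Hypothetical shared module living at vertex/configs/shared/common.py
from configs.shared.common import common_params

parameter_values = {
    **common_params,           # elements shared between pipelines
    "model_name": "my-model",  # elements specific to this pipeline
}
```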

If you're following a different folder structure, you can change the default paths in the pyproject.toml file. See Configuration section for more information.
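
For instance, an override might look like this (the section and key names below are assumptions; the Configuration section is the authoritative reference):

```toml title="pyproject.toml"
[tool.vertex_deployer]
pipelines_root_path = "my_project/pipelines"
config_root_path = "my_project/configs"
```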

Pipelines

Your file {pipeline_name}.py must contain a function called {pipeline_name} decorated with kfp.dsl.pipeline. In previous versions, this function had to be named pipeline, but it was changed to {pipeline_name} to avoid confusion with the kfp.dsl.pipeline decorator.

```python title="vertex/pipelines/dummy_pipeline.py"
import kfp.dsl

# New name to avoid confusion with the kfp.dsl.pipeline decorator
@kfp.dsl.pipeline()
def dummy_pipeline():
    ...

# Old name
@kfp.dsl.pipeline()
def pipeline():
    ...
```

Configs

Config files can be in .py, .json, .toml or .yaml format. They must be located in the configs/{pipeline_name} folder.

Why multiple formats?

.py files are useful to define complex configs (e.g. a list of dicts) while .json / .toml / .yaml files are useful to define simple configs (e.g. a string). This adds flexibility for the user and lets you adopt the deployer with almost no migration cost.

How to format them?

  • .py files must be valid Python files defining two important elements (see the sketch after this list):

    • parameter_values to pass arguments to your pipeline
    • input_artifacts if you want to retrieve and create input artifacts for your pipeline. See Vertex Documentation for more information.
  • .json files must be valid JSON files containing only one dict of key: value pairs representing parameter values.

  • .toml files must follow the same convention. Please note that TOML sections will be flattened, except for inline tables. Section names will be joined using the "_" separator, and this is not configurable at the moment. Example:

    ```toml
    [modeling]
    model_name = "my-model"
    params = { lambda = 0.1 }
    ```

    will be flattened to:

    ```json
    {
        "modeling_model_name": "my-model",
        "modeling_params": { "lambda": 0.1 }
    }
    ```
    
  • .yaml files must be valid YAML files containing only one dict of key: value pairs representing parameter values.
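
Here is a minimal sketch of a .py config file (the parameter names and the artifact resource name are placeholders, not values from the starter kit):

```python title="vertex/configs/dummy_pipeline/config_dev.py"
parameter_values = {
    "model_name": "my-model",
    "learning_rate": 0.1,
}

# Optional: map pipeline input names to existing Vertex ML Metadata
# artifact resource names (the value below is a placeholder).
input_artifacts = {
    "dataset": "projects/YOUR_PROJECT/locations/YOUR_REGION/metadataStores/default/artifacts/YOUR_ARTIFACT_ID",
}
```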

Why are sections flattened when using TOML config files?

Vertex Pipelines parameter validation and parameter logging to Vertex Experiments are based on the parameter name. If you did not flatten your sections, you would only be able to validate section names and check that they are of type dict.

Not very useful.
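
To illustrate the behavior (a sketch, not the deployer's actual implementation), a flattening function using tomlkit, which preserves the distinction between regular tables and inline tables, could look like this:

```python
import tomlkit
from tomlkit.items import Table

def flatten_toml(table, prefix=""):
    """Flatten regular TOML sections; keep inline tables intact."""
    flat = {}
    for key, value in table.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, Table):
            # Regular [section] tables are flattened recursively
            flat.update(flatten_toml(value, name))
        else:
            # Inline tables ({ ... }) and scalars are kept as-is
            flat[name] = value
    return flat

doc = tomlkit.parse('[modeling]\nmodel_name = "my-model"\nparams = { lambda = 0.1 }')
# Roughly: {"modeling_model_name": "my-model", "modeling_params": {"lambda": 0.1}}
flat = flatten_toml(doc)
```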

Why aren't input_artifacts supported in TOML / JSON config files?

Because it's low on the priority list. Feel free to open a PR if you want to add it.

How to name them?

{config_name}.py, {config_name}.json, {config_name}.toml or {config_name}.yaml. The config_name is up to you, but it must be unique for a given pipeline.
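
For example, a pipeline with one config per environment could be organized like this (the config names are illustrative):

vertex
└─ configs/
   └─ dummy_pipeline/
      ├─ dev.json
      ├─ staging.json
      └─ prod.json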

Settings

You will also need the following ENV variables, either exported or in a .env file (see example in example.env):

```bash
PROJECT_ID=YOUR_PROJECT_ID  # GCP Project ID
GCP_REGION=europe-west1  # GCP Region

GAR_LOCATION=europe-west1  # Google Artifact Registry Location
GAR_PIPELINES_REPO_ID=YOUR_GAR_KFP_REPO_ID  # Google Artifact Registry Repo ID (KFP format)

VERTEX_STAGING_BUCKET_NAME=YOUR_VERTEX_STAGING_BUCKET_NAME  # GCS Bucket for Vertex Pipelines staging
VERTEX_SERVICE_ACCOUNT=YOUR_VERTEX_SERVICE_ACCOUNT  # Vertex Pipelines Service Account
```

About env files

We're using env files and dotenv to load the environment variables. No default value for the --env-file argument is provided, to ensure that you don't accidentally deploy to the wrong project. An example.env file is provided in this repo. This also allows you to work with multiple environments thanks to env files (test.env, dev.env, prod.env, etc.).
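
In practice, switching environments could look like this (the command name and syntax below are assumptions for illustration; only the --env-file argument is confirmed above):

```bash
# Deploy the same pipeline against two different environments
vertex-deployer deploy dummy_pipeline --env-file dev.env
vertex-deployer deploy dummy_pipeline --env-file prod.env
```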