# Folder Structure

## TL;DR
You need to respect the following file structure from the Vertex Pipelines Starter Kit:

```
vertex
├─ configs/
│  └─ {pipeline_name}
│     └─ {config_name}.json
└─ pipelines/
   └─ {pipeline_name}.py
```
A pipeline file looks like this:
```python title="vertex/pipelines/dummy_pipeline.py"
import kfp.dsl


@kfp.dsl.pipeline()
def dummy_pipeline():
    ...
```

You can use `.py`, `.toml`, `.json`, or `.yaml` files for your config files.
You must respect the following folder structure. If you already follow the Vertex Pipelines Starter Kit folder structure, it should be pretty smooth to use this tool:

```
vertex
├─ configs/
│  └─ {pipeline_name}
│     └─ {config_name}.json
└─ pipelines/
   └─ {pipeline_name}.py
```
### About folder structure

You must have at least these files. If you need to share some config elements between pipelines, you can have a `shared` folder in `configs` and import them in your pipeline configs.

If you're following a different folder structure, you can change the default paths in the `pyproject.toml` file. See the Configuration section for more information.
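As a rough illustration, such an override might look like the sketch below. The section and key names here are hypothetical placeholders, not the tool's documented settings; the Configuration section lists the actual supported keys.

```toml
# Hypothetical sketch only: the real section and key names are documented
# in the Configuration section. This just illustrates where overrides live.
[tool.vertex_deployer]
pipelines_root_path = "my_project/pipelines"
config_root_path = "my_project/configs"
```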
## Pipelines

Your file `{pipeline_name}.py` must contain a function called `{pipeline_name}` decorated with `kfp.dsl.pipeline`.

In previous versions, the function used to be called `pipeline`, but it was renamed to `{pipeline_name}` to avoid confusion with the `kfp.dsl.pipeline` decorator.
```python title="vertex/pipelines/dummy_pipeline.py"
import kfp.dsl


# New name, to avoid confusion with the kfp.dsl.pipeline decorator
@kfp.dsl.pipeline()
def dummy_pipeline():
    ...


# Old name
@kfp.dsl.pipeline()
def pipeline():
    ...
```
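The pipeline function's parameters are what your config files feed. A minimal sketch, with hypothetical parameter names:

```python
import kfp.dsl


# Sketch: the parameter names here (model_name, learning_rate) are
# illustrative. They must match the keys of your config's
# parameter_values (see the Configs section below).
@kfp.dsl.pipeline()
def dummy_pipeline(model_name: str, learning_rate: float = 0.1):
    ...
```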
## Configs

Config files can be in `.py`, `.json`, `.toml`, or `.yaml` format. They must be located in the `configs/{pipeline_name}` folder.
### Why multiple formats?

`.py` files are useful for defining complex configs (e.g. a list of dicts), while `.json` / `.toml` / `.yaml` files are useful for defining simple configs (e.g. a string). Supporting multiple formats also adds flexibility for the user and lets you adopt the deployer with almost no migration cost.
### How to format them?

- `.py` files must be valid Python files with two important elements (see the sketch after this list):
    - `parameter_values` to pass arguments to your pipeline;
    - `input_artifacts` if you want to retrieve and create input artifacts for your pipeline. See the Vertex documentation for more information.
- `.json` files must be valid JSON files containing only one dict of key: value pairs representing parameter values.
- `.toml` files must be the same. Please note that TOML sections will be flattened, except for inline tables. Section names will be joined using the `"_"` separator, and this is not configurable at the moment. For example, this TOML file:

    ```toml
    [modeling]
    model_name = "my-model"
    params = { lambda = 0.1 }
    ```

    is equivalent to these parameter values:

    ```json
    {
        "modeling_model_name": "my-model",
        "modeling_params": { "lambda": 0.1 }
    }
    ```

- `.yaml` files must be valid YAML files containing only one dict of key: value pairs representing parameter values.
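Here is a minimal sketch of a Python config file. The file path, parameter names, and artifact value below are hypothetical placeholders:

```python title="vertex/configs/dummy_pipeline/config_dev.py"
# Sketch of a Python config file. Everything here is illustrative:
# parameter names must match your pipeline function's parameters.

# Arguments passed to the pipeline function
parameter_values = {
    "model_name": "my-model",
    "learning_rate": 0.1,
}

# Optional: input artifacts to retrieve and pass to the pipeline
input_artifacts = {
    "dataset": "an-artifact-resource-id",  # see the Vertex documentation for the expected format
}
```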
### Why are sections flattened when using TOML config files?

Vertex Pipelines parameter validation and parameter logging to Vertex Experiments are based on the parameter name. If you do not flatten your sections, you'll only be able to validate section names and check that they are of type `dict`, which is not very useful.
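The flattening itself boils down to joining nested keys, as in this minimal sketch (not the deployer's actual code; for simplicity it also flattens inline tables, which the deployer keeps intact):

```python
# Minimal sketch of section flattening: nested section names are joined
# with "_" into top-level parameter names. Unlike the deployer, this
# naive version also flattens inline tables.
def flatten(config: dict, parent: str = "") -> dict:
    flat = {}
    for key, value in config.items():
        name = f"{parent}_{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


assert flatten({"modeling": {"model_name": "my-model"}}) == {"modeling_model_name": "my-model"}
```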
### Why aren't `input_artifacts` supported in TOML / JSON config files?

Because it's low on the priority list. Feel free to open a PR if you want to add it.

### How to name them?

`{config_name}.py`, `{config_name}.json`, `{config_name}.toml`, or `{config_name}.yaml`. The `config_name` is free, but it must be unique for a given pipeline.
## Settings

You will also need the following environment variables, either exported or in a `.env` file (see the example in `example.env`):

```bash
PROJECT_ID=YOUR_PROJECT_ID                                  # GCP Project ID
GCP_REGION=europe-west1                                     # GCP Region
GAR_LOCATION=europe-west1                                   # Google Artifact Registry Location
GAR_PIPELINES_REPO_ID=YOUR_GAR_KFP_REPO_ID                  # Google Artifact Registry Repo ID (KFP format)
VERTEX_STAGING_BUCKET_NAME=YOUR_VERTEX_STAGING_BUCKET_NAME  # GCS Bucket for Vertex Pipelines staging
VERTEX_SERVICE_ACCOUNT=YOUR_VERTEX_SERVICE_ACCOUNT          # Vertex Pipelines Service Account
```
### About env files

We're using env files and dotenv to load the environment variables. No default value is provided for the `--env-file` argument, to ensure that you don't accidentally deploy to the wrong project. An `example.env` file is provided in this repo.

This also allows you to work with multiple environments thanks to env files (`test.env`, `dev.env`, `prod.env`, etc.).
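As a quick illustration of the dotenv mechanism, here is a sketch assuming the `python-dotenv` package; this is not the deployer's internal code:

```python
# Sketch: how dotenv-style loading works. The deployer handles this for
# you via the --env-file argument; this only shows the mechanism.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv("example.env")  # choose test.env, dev.env, prod.env, etc.
print(os.environ["PROJECT_ID"])
```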