Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs/re structure config docs #2421

Merged
merged 31 commits into from
Mar 28, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
31 commits
Select commit Hold shift + click to select a range
98b780d
Add docs to explain registering a custom resolver
merelcht Mar 13, 2023
fedc118
Fix incorrect docs + add example for custom envs|
merelcht Mar 13, 2023
4e87359
Small clarifications
merelcht Mar 13, 2023
83eb348
Move kedro run config docs to running a pipeline
merelcht Mar 13, 2023
3962093
Merge branch 'main' into docs/improve-config-docs
merelcht Mar 13, 2023
d9d5601
Fix lint
merelcht Mar 13, 2023
2b26109
Re-order config paragraphs
merelcht Mar 13, 2023
bbd1009
Small simplifications
merelcht Mar 14, 2023
a8e8245
Add extra header
merelcht Mar 14, 2023
19cd27b
rename temp
merelcht Mar 14, 2023
b9fa2b9
Merge branch 'main' into docs/re-structure-config-docs
merelcht Mar 14, 2023
af97bcc
Improve headers
merelcht Mar 14, 2023
557a919
clean up
merelcht Mar 14, 2023
7afca13
Divide config into basic and advanced + restructure basic content bet…
merelcht Mar 15, 2023
a18889d
Make config basics sections more complete + placeholder pages for cre…
merelcht Mar 16, 2023
b16d1d1
Convert explanations to how-tos advanced config
merelcht Mar 16, 2023
b686ec2
Re-arrange advanced topics
merelcht Mar 17, 2023
b26ffe2
Minor tweaks
merelcht Mar 17, 2023
917f54e
Add parameters and credentials pages
stichbury Mar 21, 2023
80809cc
Update basic config page
stichbury Mar 21, 2023
0a59f39
Another chunk of config docs improvements
stichbury Mar 21, 2023
775f7af
Merge branch 'main' into docs/re-structure-config-docs
stichbury Mar 21, 2023
ce4f5f1
Link to advanced how to's on basics and credentials pages
merelcht Mar 22, 2023
ed00281
Fix lint
merelcht Mar 22, 2023
1914132
Address review comments
merelcht Mar 23, 2023
617f96b
Merge branch 'main' into docs/re-structure-config-docs
merelcht Mar 23, 2023
ee5952a
Fix typo in TemplatedConfigLoader sections
merelcht Mar 23, 2023
7155ac8
Merge branch 'main' into docs/re-structure-config-docs
merelcht Mar 27, 2023
f994e28
Add links and update links to config pages
merelcht Mar 27, 2023
ce60f55
Update release notes
merelcht Mar 28, 2023
ed197d8
Merge branch 'main' into docs/re-structure-config-docs
merelcht Mar 28, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions RELEASE.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,12 @@
## Migration guide from Kedro 0.18.* to 0.19.*


# Upcoming Release
# Upcoming Release 0.18.8

## Major features and improvements

## Bug fixes and other changes
* Improvements to documentation about configuration.

## Breaking changes to the API

Expand All @@ -25,12 +26,11 @@
* Added new Kedro CLI `kedro jupyter setup` to setup Jupyter Kernel for Kedro.
* `kedro package` now includes the project configuration in a compressed `tar.gz` file.
* Added functionality to the `OmegaConfigLoader` to load configuration from compressed files of `zip` or `tar` format. This feature requires `fsspec>=2023.1.0`.
* Signficant improvements to on-boarding documentation that covers setup for new Kedro users. Also some major changes to the spaceflights tutorial to make it faster to work through. We think it's a better read. Tell us if it's not.

* Significant improvements to on-boarding documentation that covers setup for new Kedro users. Also some major changes to the spaceflights tutorial to make it faster to work through. We think it's a better read. Tell us if it's not.

## Bug fixes and other changes
* Added a guide and tooling for developing Kedro for Databricks.
* Implement missing dict-like interface for `_ProjectPipeline`.
* Implemented missing dict-like interface for `_ProjectPipeline`.

# Release 0.18.6

Expand Down
316 changes: 316 additions & 0 deletions docs/source/configuration/advanced_configuration.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,316 @@
# Advanced configuration
The documentation on [configuration](./configuration_basics.md) describes how to satisfy most common requirements of standard Kedro project configuration:

By default, Kedro is set up to use the [ConfigLoader](/kedro.config.ConfigLoader) class. Kedro also provides two additional configuration loaders with more advanced functionality: the [TemplatedConfigLoader](/kedro.config.TemplatedConfigLoader) and the [OmegaConfigLoader](/kedro.config.OmegaConfigLoader).
Each of these classes are alternatives for the default `ConfigLoader` and have different features. The following sections describe each of these classes and their specific functionality in more detail.

## TemplatedConfigLoader

Kedro provides an extension [TemplatedConfigLoader](/kedro.config.TemplatedConfigLoader) class that allows you to template values in configuration files. To apply templating in your project, set the `CONFIG_LOADER_CLASS` constant in your [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md):

```python
from kedro.config import TemplatedConfigLoader # new import

CONFIG_LOADER_CLASS = TemplatedConfigLoader
```

### Provide template values through globals
When using the `TemplatedConfigLoader` you can provide values in the configuration template through a `globals` file or dictionary.

Let's assume the project contains a `conf/base/globals.yml` file with the following contents:

```yaml
bucket_name: "my_s3_bucket"
key_prefix: "my/key/prefix/"

datasets:
csv: "pandas.CSVDataSet"
spark: "spark.SparkDataSet"

folders:
raw: "01_raw"
int: "02_intermediate"
pri: "03_primary"
fea: "04_feature"
```

To point your `TemplatedConfigLoader` to the globals file, add it to the the `CONFIG_LOADER_ARGS` variable in [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md):

```python
CONFIG_LOADER_ARGS = {"globals_pattern": "*globals.yml"}
```

Now the templating can be applied to the configuration. Here is an example of a templated `conf/base/catalog.yml` file:

```yaml
raw_boat_data:
type: "${datasets.spark}" # nested paths into global dict are allowed
filepath: "s3a://${bucket_name}/${key_prefix}/${folders.raw}/boats.csv"
file_format: parquet

raw_car_data:
type: "${datasets.csv}"
filepath: "s3://${bucket_name}/data/${key_prefix}/${folders.raw}/${filename|cars.csv}" # default to 'cars.csv' if the 'filename' key is not found in the global dict
```

Under the hood, `TemplatedConfigLoader` uses [`JMESPath` syntax](https://github.com/jmespath/jmespath.py) to extract elements from the globals dictionary.


Alternatively, you can declare which values to fill in the template through a dictionary. This dictionary could look like the following:

```python
{
"bucket_name": "another_bucket_name",
"non_string_key": 10,
"key_prefix": "my/key/prefix",
"datasets": {"csv": "pandas.CSVDataSet", "spark": "spark.SparkDataSet"},
"folders": {
"raw": "01_raw",
"int": "02_intermediate",
"pri": "03_primary",
"fea": "04_feature",
},
}
```

To point your `TemplatedConfigLoader` to the globals dictionary, add it to the `CONFIG_LOADER_ARGS` variable in [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md):

```python
CONFIG_LOADER_ARGS = {
"globals_dict": {
"bucket_name": "another_bucket_name",
"non_string_key": 10,
"key_prefix": "my/key/prefix",
"datasets": {"csv": "pandas.CSVDataSet", "spark": "spark.SparkDataSet"},
"folders": {
"raw": "01_raw",
"int": "02_intermediate",
"pri": "03_primary",
"fea": "04_feature",
},
}
}
```

If you specify both `globals_pattern` and `globals_dict` in `CONFIG_LOADER_ARGS`, the contents of the dictionary resulting from `globals_pattern` are merged with the `globals_dict` dictionary. In case of conflicts, the keys from the `globals_dict` dictionary take precedence.


## OmegaConfigLoader

[OmegaConf](https://omegaconf.readthedocs.io/) is a Python library designed for configuration. It is a YAML-based hierarchical configuration system with support for merging configurations from multiple sources.

From Kedro 0.18.5 you can use the [`OmegaConfigLoader`](/kedro.config.OmegaConfigLoader) which uses `OmegaConf` under the hood to load data.

```{note}
`OmegaConfigLoader` is under active development. It was first available from Kedro 0.18.5 with additional features due in later releases. Let us know if you have any feedback about the `OmegaConfigLoader`.
```

`OmegaConfigLoader` can load `YAML` and `JSON` files. Acceptable file extensions are `.yml`, `.yaml`, and `.json`. By default, any configuration files used by the config loaders in Kedro are `.yml` files.

To use `OmegaConfigLoader` in your project, set the `CONFIG_LOADER_CLASS` constant in your [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md):

```python
from kedro.config import OmegaConfigLoader # new import

CONFIG_LOADER_CLASS = OmegaConfigLoader
```

## Advanced Kedro configuration

This section contains a set of guidance for advanced configuration requirements of standard Kedro projects:

* [How to change which configuration files are loaded](#how-to-change-which-configuration-files-are-loaded)
* [How to ensure non default configuration files get loaded](#how-to-ensure-non-default-configuration-files-get-loaded)
* [How to bypass the configuration loading rules](#how-to-bypass-the-configuration-loading-rules)
* [How to use Jinja2 syntax in configuration](#how-to-use-jinja2-syntax-in-configuration)
* [How to do templating with the `OmegaConfigLoader`](#how-to-do-templating-with-the-omegaconfigloader)
* [How to use custom resolvers in the `OmegaConfigLoader`](#how-to-use-custom-resolvers-in-the-omegaconfigloader)
* [How to load credentials through environment variables](#how-to-load-credentials-through-environment-variables)

### How to change which configuration files are loaded
If you want to change the patterns that the configuration loader uses to find the files to load you need to set the `CONFIG_LOADER_ARGS` variable in [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md).
For example, if your `parameters` files are using a `params` naming convention instead of `parameters` (e.g. `params.yml`) you need to update `CONFIG_LOADER_ARGS` as follows:

```python
CONFIG_LOADER_ARGS = {
"config_patterns": {
"parameters": ["params*", "params*/**", "**/params*"],
}
}
```

By changing this setting, the default behaviour for loading parameters will be replaced, while the other configuration patterns will remain in their default state.

### How to ensure non default configuration files get loaded
You can add configuration patterns to match files other than `parameters`, `credentials`, `logging`, and `catalog` by setting the `CONFIG_LOADER_ARGS` variable in [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md).
For example, if you want to load Spark configuration files you need to update `CONFIG_LOADER_ARGS` as follows:

```python
CONFIG_LOADER_ARGS = {
"config_patterns": {
"spark": ["spark*/"],
}
}
```

### How to bypass the configuration loading rules
You can bypass the configuration patterns and set configuration directly on the instance of a config loader class. You can bypass the default configuration (catalog, parameters, credentials, and logging) as well as additional configuration.

```{code-block} python
:lineno-start: 10
:emphasize-lines: 8

from kedro.config import ConfigLoader
from kedro.framework.project import settings

conf_path = str(project_path / settings.CONF_SOURCE)
conf_loader = ConfigLoader(conf_source=conf_path)

# Bypass configuration patterns by setting the key and values directly on the config loader instance.
conf_loader["catalog"] = {"catalog_config": "something_new"}
```

### How to use Jinja2 syntax in configuration
From version 0.17.0, `TemplatedConfigLoader` also supports the [Jinja2](https://palletsprojects.com/p/jinja/) template engine alongside the original template syntax. Below is an example of a `catalog.yml` file that uses both features:

```
{% for speed in ['fast', 'slow'] %}
{{ speed }}-trains:
type: MemoryDataSet

{{ speed }}-cars:
type: pandas.CSVDataSet
filepath: s3://${bucket_name}/{{ speed }}-cars.csv
save_args:
index: true

{% endfor %}
```

When parsing this configuration file, `TemplatedConfigLoader` will:

1. Read the `catalog.yml` and compile it using Jinja2
2. Use a YAML parser to parse the compiled config into a Python dictionary
3. Expand `${bucket_name}` in `filepath` using the `globals_pattern` and `globals_dict` arguments for the `TemplatedConfigLoader` instance, as in the previous examples

The output Python dictionary will look as follows:

```python
{
"fast-trains": {"type": "MemoryDataSet"},
"fast-cars": {
"type": "pandas.CSVDataSet",
"filepath": "s3://my_s3_bucket/fast-cars.csv",
"save_args": {"index": True},
},
"slow-trains": {"type": "MemoryDataSet"},
"slow-cars": {
"type": "pandas.CSVDataSet",
"filepath": "s3://my_s3_bucket/slow-cars.csv",
"save_args": {"index": True},
},
}
```

```{warning}
Although Jinja2 is a very powerful and extremely flexible template engine, which comes with a wide range of features, we do not recommend using it to template your configuration unless absolutely necessary. The flexibility of dynamic configuration comes at a cost of significantly reduced readability and much higher maintenance overhead. We believe that, for the majority of analytics projects, dynamically compiled configuration does more harm than good.
```


### How to do templating with the `OmegaConfigLoader`
Templating or [variable interpolation](https://omegaconf.readthedocs.io/en/2.3_branch/usage.html#variable-interpolation), as it's called in `OmegaConf`, for parameters works out of the box if the template values are within the parameter files or the name of the file that contains the template values follows the same config pattern specified for parameters.
By default, the config pattern for parameters is: `["parameters*", "parameters*/**", "**/parameters*"]`.
Suppose you have one parameters file called `parameters.yml` containing parameters with `omegaconf` placeholders like this:

```yaml
model_options:
test_size: ${data.size}
random_state: 3
```

and a file containing the template values called `parameters_globals.yml`:
```yaml
data:
size: 0.2
```

Since both of the file names (`parameters.yml` and `parameters_globals.yml`) match the config pattern for parameters, the `OmegaConfigLoader` will load the files and resolve the placeholders correctly.

```{note}
Templating currently only works for parameter files, but not for catalog files.
```

### How to use custom resolvers in the `OmegaConfigLoader`
`Omegaconf` provides functionality to [register custom resolvers](https://omegaconf.readthedocs.io/en/2.3_branch/usage.html#resolvers) for templated values. You can use these custom resolves within Kedro by extending the [`OmegaConfigLoader`](/kedro.config.OmegaConfigLoader) class.
The example below illustrates this:

```python
from kedro.config import OmegaConfigLoader
from omegaconf import OmegaConf
from typing import Any, Dict


class CustomOmegaConfigLoader(OmegaConfigLoader):
def __init__(
self,
conf_source: str,
env: str = None,
runtime_params: Dict[str, Any] = None,
):
super().__init__(
conf_source=conf_source, env=env, runtime_params=runtime_params
)

# Register a customer resolver that adds up numbers.
self.register_custom_resolver("add", lambda *numbers: sum(numbers))

@staticmethod
def register_custom_resolver(name, function):
"""
Helper method that checks if the resolver has already been registered and registers the
resolver if it's new. The check is needed, because omegaconf will throw an error
if a resolver with the same name is registered twice.
Alternatively, you can call `register_new_resolver()` with `replace=True`.
"""
if not OmegaConf.has_resolver(name):
OmegaConf.register_new_resolver(name, function)
```

In order to use this custom configuration loader, you will need to set it as the project configuration loader in `src/<package_name>/settings.py`:

```python
from package_name.custom_configloader import CustomOmegaConfigLoader

CONFIG_LOADER_CLASS = CustomOmegaConfigLoader
```

You can then use the custom "add" resolver in your `parameters.yml` as follows:

```yaml
model_options:
test_size: ${add:1,2,3}
random_state: 3
```

### How to load credentials through environment variables
The [`OmegaConfigLoader`](/kedro.config.OmegaConfigLoader) enables you to load credentials from environment variables. To achieve this you have to use the `OmegaConfigLoader` and the `omegaconf` [`oc.env` resolver](https://omegaconf.readthedocs.io/en/2.3_branch/custom_resolvers.html#oc-env).
To use the `OmegaConfigLoader` in your project, set the `CONFIG_LOADER_CLASS` constant in your [`src/<package_name>/settings.py`](../kedro_project_setup/settings.md):

```python
from kedro.config import OmegaConfigLoader # new import

CONFIG_LOADER_CLASS = OmegaConfigLoader
```

Now you can use the `oc.env` resolver to access credentials from environment variables in your `credentials.yml`, as demonstrated in the following example:

```yaml
dev_s3:
client_kwargs:
aws_access_key_id: ${oc.env:AWS_ACCESS_KEY_ID}
aws_secret_access_key: ${oc.env:AWS_SECRET_ACCESS_KEY}
```

```{note}
Note that you can only use the resolver in `credentials.yml` and not in catalog or parameter files. This is because we do not encourage the usage of environment variables for anything other than credentials.
```
Loading