document workflow to incrementally create a Kedro project #4305
@@ -0,0 +1,176 @@
# Create a Minimal Kedro Project

This guide explains the essential components of a minimal Kedro project. It begins with a blank project and gradually introduces the necessary elements. While most users start with a [project template](./new_project.md) or adapt an existing Python project, working through this guide will help you understand the core concepts and how to customise them to suit your specific needs.

## Essential Components of a Kedro Project

Kedro is a Python framework designed for creating reproducible data science code. A typical Kedro project consists of two parts: the **mandatory files** that Kedro needs in order to run, and the **opinionated project structure** that the project templates recommend.

### 1. **Recommended Structure**
Kedro projects follow a specific directory structure that promotes best practices for collaboration and maintenance. The default structure includes:

| Directory/File   | Description                                                              |
|------------------|--------------------------------------------------------------------------|
| `conf/`          | Contains configuration files such as `catalog.yml` and `parameters.yml`. |
| `data/`          | Local project data, typically not committed to version control.          |
| `docs/`          | Project documentation files.                                             |
| `notebooks/`     | Jupyter notebooks for experimentation and prototyping.                   |
| `src/`           | Source code for the project, including pipelines and nodes.              |
| `README.md`      | Project overview and instructions.                                       |
| `pyproject.toml` | Metadata about the project, including dependencies.                      |
| `.gitignore`     | Specifies files and directories to be ignored by Git.                    |

### 2. **Mandatory Files**
For a project to be recognised as a Kedro project and support `kedro run`, it must contain three essential files:
- **`pyproject.toml`**: Defines the Python project.
- **`settings.py`**: Defines project settings, including library component registration.
- **`pipeline_registry.py`**: Registers the project's pipelines.

If you want to see some examples of these files, you can either create a project with `kedro new` or check out the [project template on GitHub](https://github.com/kedro-org/kedro-starters/tree/main/spaceflights-pandas).

#### `pyproject.toml`
The `pyproject.toml` file is the standard place to store build metadata and tool settings for a Python project, and Kedro reads its own project configuration from it.

In particular, Kedro requires a `[tool.kedro]` section in `pyproject.toml`, which describes the [project metadata](../kedro_project_setup/settings.md).

Typically, it looks similar to this:

```toml
[tool.kedro]
package_name = "package_name"
project_name = "project_name"
kedro_init_version = "kedro_version"
tools = ""
example_pipeline = "False"
source_dir = "src"
```

This tells Kedro where to find the source code, including `settings.py` and `pipeline_registry.py`.

#### `settings.py`
The `settings.py` file lets you define project-level settings and register hooks. Its main purposes are:
- Project settings: configure project-wide options, such as the configuration source folder (`CONF_SOURCE`) and the arguments passed to the config loader (`CONFIG_LOADER_ARGS`).
- Hooks registration: register custom hooks, which are functions executed at specific points in the Kedro pipeline lifecycle (for example, before or after a node runs). This is useful for adding functionality such as logging or monitoring.
- Integration with plugins: if you are using Kedro plugins, `settings.py` can also be used to configure them.

Even if you do not have any settings, an empty `settings.py` is still required. Typically, it is stored at `src/<package_name>/settings.py`.

#### `pipeline_registry.py`
The `pipeline_registry.py` file provides a centralised way to register and access all pipelines defined in the project. Its key features are:
- Pipeline registration: the file must contain a top-level function called `register_pipelines()` that returns a mapping from pipeline names to `Pipeline` objects. The Kedro CLI and other tools rely on this function to discover and run the defined pipelines.
- Autodiscovery of pipelines: since Kedro 0.18.3, you can use the [`find_pipelines`](../nodes_and_pipelines/pipeline_registry.md#pipeline-autodiscovery) function to automatically discover pipelines defined in your project, without manually updating the registry each time you create a new pipeline.

## Creating a Minimal Kedro Project Step-by-Step
This guide walks you through creating a minimal Kedro project that can successfully run `kedro run` with just three files.

### Step 1: Install Kedro

First, ensure that Python is installed on your machine. Then, install Kedro using pip:

```bash
pip install kedro
```

### Step 2: Create a New Kedro Project
Create a new directory for your project:
```bash
mkdir minikedro
```

Navigate into your newly created project directory:

```bash
cd minikedro
```

### Step 3: Create `pyproject.toml`
Create a new file named `pyproject.toml` in the project directory with the following content:

```toml
[tool.kedro]
package_name = "minikedro"
project_name = "minikedro"
kedro_init_version = "0.19.9"
source_dir = "."
```
Comment on lines +88 to +94:
Interesting, so the Python packaging metadata (

At this point, your working directory should look like this:

```bash
.
└── pyproject.toml
```

```{note}
Here we define `source_dir = "."`. Usually the source code is kept inside a directory called `src`, but this example keeps the structure minimal by placing the source code in the project root.
```

### Step 4: Create `settings.py` and `pipeline_registry.py`
Next, create a folder named `minikedro`, which must match the `package_name` defined in `pyproject.toml`:

```bash
mkdir minikedro
```
Inside this folder, create two empty files, `settings.py` and `pipeline_registry.py`:

```bash
touch minikedro/settings.py minikedro/pipeline_registry.py
```

Now your working directory should look like this:
```bash
.
├── minikedro
│   ├── pipeline_registry.py
│   └── settings.py
└── pyproject.toml
```

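
Because `touch` is not available in every shell, the same layout can also be created from Python. The following is a small sketch using only the standard library; the function name `scaffold_package` is hypothetical and not part of Kedro.

```python
from pathlib import Path


def scaffold_package(project_root: Path, package_name: str = "minikedro") -> list[Path]:
    """Create the package folder with empty settings.py and pipeline_registry.py.

    Hypothetical helper for illustration; it mirrors the mkdir/touch commands above.
    """
    package_dir = project_root / package_name
    package_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for filename in ("settings.py", "pipeline_registry.py"):
        path = package_dir / filename
        path.touch(exist_ok=True)  # leaves an existing file untouched
        created.append(path)
    return created
```

Running `scaffold_package(Path("."))` from the project directory produces the same two empty files as the shell commands.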
Try running the following command in the terminal:
```bash
kedro run
```

You will encounter an error because `pipeline_registry.py` is empty and does not define `register_pipelines`:

```bash
AttributeError: module 'minikedro.pipeline_registry' has no attribute 'register_pipelines'
```

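
The error message has the shape of an ordinary Python attribute lookup on a module. The sketch below reproduces it without Kedro by using a stand-in module object; `load_registered_pipelines` is a hypothetical name, and the assumption is that the framework looks up `register_pipelines` on the imported module in a broadly similar way.

```python
import types


def load_registered_pipelines(registry_module: types.ModuleType) -> dict:
    """Look up and call `register_pipelines` on a module.

    Simplified illustration only; not Kedro's actual loading code.
    """
    # Raises AttributeError when the module does not define the function.
    register = registry_module.register_pipelines
    return register()


# A stand-in for the empty pipeline_registry.py created in Step 4.
empty_registry = types.ModuleType("minikedro.pipeline_registry")
```

Calling `load_registered_pipelines(empty_registry)` raises an `AttributeError` that names the missing `register_pipelines` attribute, which is why the next step adds that function.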
### Step 5: Create a Simple Pipeline
To resolve this issue, add the following code to `pipeline_registry.py`, which defines a simple pipeline to run:

```python
from kedro.pipeline import node, pipeline


def foo():
    # A trivial node function that takes no inputs.
    return "dummy"


def register_pipelines():
    # "__default__" is the pipeline name registered for a plain `kedro run`.
    return {"__default__": pipeline([node(foo, None, "dummy_output")])}
```

If you attempt to run the pipeline again with `kedro run`, you will see another error:
```bash
MissingConfigException: Given configuration path either does not exist or is not a valid directory: /workspace/kedro/minikedro/conf/base
```

### Step 6: Define the Project Settings
This error occurs because Kedro expects a configuration folder named `conf`, along with two environments called `base` and `local`.

To fix this, add these two lines to `settings.py`:
```python
CONF_SOURCE = "."
CONFIG_LOADER_ARGS = {"base_env": ".", "default_run_env": "."}
```

These lines override the default settings so that Kedro looks for configuration in the current directory instead of the expected `conf` folder. For more details, refer to [How to change the setting for a configuration source folder](../configuration/configuration_basics.md#how-to-change-the-setting-for-a-configuration-source-folder) and [Advanced configuration without a full Kedro project](../configuration/advanced_configuration.md#advanced-configuration-without-a-full-kedro-project).

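
To see why these two settings remove the error, it can help to model how the configuration path is put together. This is a deliberately simplified sketch, not Kedro's config loader: the function `config_dir` is hypothetical, and it assumes the path is formed by joining the project path, the configuration source and the base environment name.

```python
from pathlib import PurePosixPath


def config_dir(project_path: str, conf_source: str = "conf", base_env: str = "base") -> PurePosixPath:
    """Join project path, configuration source and base environment.

    Simplified model for illustration; defaults follow the names in the error above.
    """
    return PurePosixPath(project_path) / conf_source / base_env


# With the defaults, the path matches the one in the MissingConfigException.
default_path = config_dir("/workspace/kedro/minikedro")

# With CONF_SOURCE = "." and base_env = ".", the path collapses to the project root.
overridden_path = config_dir("/workspace/kedro/minikedro", ".", ".")
```

In this model the default path points at a `conf/base` folder that does not exist in the minimal project, while the overridden path is the project directory itself, which does exist.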
Now, run the pipeline again:
```bash
kedro run
```

You should see that the pipeline runs successfully!

## Conclusion

Kedro provides a structured approach to developing data pipelines, with a clear separation of concerns through its components and directory structure. By following the steps above, you can set up a minimal Kedro project that serves as a foundation for more complex data processing workflows. If you already have a Python project and want to integrate Kedro into it, the concepts explained here will help you adapt Kedro to your own needs.
I find the second sentence a bit confusing. Maybe it can be "Most users typically start with a project template or adapt an existing Python project. This guide will help you understand the core components of a Kedro project and how to customise them to suit your specific needs."
#4305 (comment)
I think the idea is that starting with a blank project isn't something most people would do, so the goal of this documentation is to explain the concepts, not to be a step-by-step guide that one should follow in reality.
Maybe I should flip the sentence like this?
While most users typically start with a project template or adapt an existing Python project, this guide begins with a blank project and gradually introduces the necessary elements. This will help you understand the core concepts and how to customise them to suit your specific needs.