Create main developer guide for Python #11235

Merged 15 commits on Aug 4, 2022
Changes from 10 commits
3 changes: 3 additions & 0 deletions docs/cudf/source/conf.py
@@ -60,6 +60,9 @@

html_use_modindex = True

# Enable automatic generation of systematic, namespaced labels for sections
myst_heading_anchors = 2

# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

120 changes: 120 additions & 0 deletions docs/cudf/source/developer_guide/contributing_guide.md
@@ -0,0 +1,120 @@
# Contributing Guide

This document focuses on a high-level overview of best practices in cuDF.

## Directory structure and file naming

cuDF generally presents the same importable modules and subpackages as pandas.
All Cython code is contained in `python/cudf/cudf/_lib`.

**Open question**: Should we start enforcing a stricter file/directory layout? Any suggestions?


## Code style

cuDF employs a number of linters to ensure consistent style across the code base.
We manage our linters using [`pre-commit`](https://pre-commit.com/).
Developers are strongly encouraged to set up `pre-commit` prior to any development.
The `.pre-commit-config.yaml` file at the root of the repo is the primary source of truth for linting.
Specifically, cuDF uses the following tools:

- [`flake8`](https://github.com/pycqa/flake8) checks for general code formatting compliance.
- [`black`](https://github.com/psf/black) is an automatic code formatter.
- [`isort`](https://pycqa.github.io/isort/) ensures imports are sorted consistently.
- [`mypy`](http://mypy-lang.org/) performs static type checking.
In conjunction with [type hints](https://docs.python.org/3/library/typing.html),
`mypy` can help catch various bugs that are otherwise difficult to find.
- [`pydocstyle`](https://github.com/PyCQA/pydocstyle/) lints docstring style.
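
To illustrate the kind of bug that `mypy` can catch when type hints are present, here is a minimal sketch.
The functions `first_negative` and `shifted_first_negative` are hypothetical and are not part of cuDF.

```python
from typing import List, Optional


def first_negative(values: List[int]) -> Optional[int]:
    """Return the first negative value, or None if there is none."""
    for value in values:
        if value < 0:
            return value
    return None


def shifted_first_negative(values: List[int]) -> int:
    """Return the first negative value plus one."""
    result = first_negative(values)
    # Without this check, mypy reports an error on `result + 1`
    # because `result` may be None.
    if result is None:
        raise ValueError("no negative value found")
    return result + 1
```

At runtime the missing `None` check would only fail for inputs with no negative values, which is the sort of bug that is easy to miss in tests.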

Linter config data is stored in a number of files.
We generally use `pyproject.toml` over `setup.cfg` and avoid project-specific files (e.g. `setup.cfg` > `python/cudf/setup.cfg`).
However, differences between tools and the different packages in the repo result in the following caveats:

- `flake8` has no plans to support `pyproject.toml`, so its configuration must live in `setup.cfg`.
- `isort` must be configured per project to set which project is the "first party" project.

Additionally, our use of `versioneer` means that each project must have a `setup.cfg`.
As a result, we maintain both root and project-level `pyproject.toml` and `setup.cfg` files.


For more information, see the
[overall contributing guide](https://github.com/rapidsai/cudf/blob/main/CONTRIBUTING.md#python--pre-commit-hooks).

## Deprecating and removing code

cuDF follows the policy of deprecating code for one release prior to removal.
For example, if we decide to remove an API during the 22.08 release cycle,
it will be marked as deprecated in the 22.08 release and removed in the 22.10 release.
All internal usage of deprecated APIs in cuDF should be removed when the API is deprecated.
This prevents users from encountering unexpected deprecation warnings when using other (non-deprecated) APIs.
The documentation for the API should also be updated to reflect its deprecation.

When deprecating an API, developers should open a corresponding GitHub issue to track the API removal.
The GitHub issue should be labeled "deprecation" and added to the next release’s project board.
If necessary, the removal timeline can be discussed on this issue.
Upon removal this issue may be closed.
Additionally, when removing an API, make sure to remove all tests and documentation.

Deprecation messages should:
- emit a FutureWarning;
- consist of a single line with no newline characters;
- indicate a replacement API, if any exists
(deprecation messages are an opportunity to show users better ways to do things);
- not specify a version when removal will occur (this gives us more flexibility).

For example:
```python
import warnings

warnings.warn(
    "`Series.foo` is deprecated and will be removed in a future version of cudf. "
    "Use `Series.new_foo` instead.",
    FutureWarning,
)
```

```{warning}
Deprecations should be signaled using a `FutureWarning`, **not a `DeprecationWarning`**!
`DeprecationWarning` is hidden by default except in code run in the `__main__` module.
```
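
The guidelines above can be sketched end to end as follows.
This is only an illustration: the `Series` class here is a hypothetical stand-in (not the real cuDF class), and `check_deprecation` is an assumed helper showing one way a test might verify the warning.

```python
import warnings


class Series:
    """Hypothetical stand-in for a cuDF class; not the real implementation."""

    def new_foo(self) -> str:
        return "result"

    def foo(self) -> str:
        # Single-line message that names the replacement and gives no removal version.
        warnings.warn(
            "`Series.foo` is deprecated and will be removed in a future "
            "version of cudf. Use `Series.new_foo` instead.",
            FutureWarning,
        )
        return self.new_foo()


def check_deprecation() -> warnings.WarningMessage:
    """Call the deprecated method and return the single warning it emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        Series().foo()
    assert len(caught) == 1
    return caught[0]
```

Note that `new_foo` does not call `foo`, so users of the non-deprecated API never see the warning.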

## `pandas` compatibility

Maintaining compatibility with the [pandas API](https://pandas.pydata.org/docs/reference/index.html) is a primary goal of cuDF.
Developers should always look at pandas APIs when adding a new feature to cuDF.
When introducing a new cuDF API with a pandas analog, we should match pandas as much as possible.
Since we try to maintain compatibility even with various edge cases (such as null handling),
new pandas releases sometimes require changes that break compatibility with old versions.
As a result, our compatibility target is the latest pandas version.

However, there are occasionally good reasons to deviate from pandas behavior.
The most common reasons center around performance.
Some APIs cannot match pandas behavior exactly without incurring exorbitant runtime costs.
Others may require using additional memory, which is always at a premium in GPU workflows.
If you are developing a feature and believe that perfect pandas compatibility is infeasible or undesirable,
you should consult with other members of the team to assess how to proceed.

When such a deviation from pandas behavior is necessary, it should be documented.
For more information on how to do that, see [our documentation on pandas comparison](./documentation.md#comparing-to-pandas).

## Python vs Cython

cuDF makes substantial use of [Cython](https://cython.org/).
Cython is a powerful tool, but it is less user-friendly than pure Python.
It is also more difficult to debug or profile.
Therefore, developers should generally prefer Python code over Cython where possible.

The primary use-case for Cython in cuDF is to expose libcudf C++ APIs to Python.
This Cython usage is generally composed of two parts:
1. A `pxd` file declaring C++ APIs so that they may be used in Cython, and
2. A `pyx` file containing Cython functions that wrap those C++ APIs so that they can be called from Python.

The latter wrappers should generally be kept as thin as possible to minimize Cython usage.
For more information see [our Cython layer design documentation](./library_design.md#the-cython-layer).

In some rare cases we may actually benefit from writing pure Cython code to speed up particular code paths.
Given that most numerical computations in cuDF actually happen in libcudf, however,
such use cases are quite rare.
Any attempt to write pure Cython code for this purpose should be justified with benchmarks.
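
As a rough illustration of what such a justification might look like, the sketch below times two implementations of the same operation with the standard library's `timeit`.
The functions `baseline` and `candidate` are hypothetical placeholders; in practice the candidate would be the proposed Cython code path, and the measurement would use realistic cuDF workloads.

```python
import timeit
from typing import Callable, Dict, List


def baseline(values: List[int]) -> int:
    """Hypothetical existing pure-Python code path."""
    total = 0
    for value in values:
        total += value * value
    return total


def candidate(values: List[int]) -> int:
    """Hypothetical replacement; stands in for the proposed Cython version."""
    return sum(value * value for value in values)


def benchmark(
    funcs: Dict[str, Callable[[List[int]], int]], data: List[int]
) -> Dict[str, float]:
    """Return the best-of-5 time in seconds for 100 calls of each function."""
    return {
        name: min(timeit.repeat(lambda f=func: f(data), number=100, repeat=5))
        for name, func in funcs.items()
    }
```

Taking the minimum over several repeats reduces noise from other processes; the two implementations should also be checked for identical results before their timings are compared.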

## Exception handling

TBD, to be written by Michael.
25 changes: 25 additions & 0 deletions docs/cudf/source/developer_guide/developer_guide.md
@@ -0,0 +1,25 @@
# Developer Guide

```{note}
At present, this guide only covers the main cuDF library.
In the future, it may be expanded to also cover dask_cudf, cudf_kafka, and custreamz.
```

cuDF is a GPU-accelerated, [Pandas-like](https://pandas.pydata.org/) DataFrame library.
Under the hood, all of cuDF's functionality relies on the CUDA-accelerated `libcudf` C++ library.
Thus, cuDF's internals are designed to efficiently and robustly map pandas APIs to `libcudf` functions.
For more information about the `libcudf` library, a good starting point is the
[developer guide](https://github.com/rapidsai/cudf/blob/main/cpp/docs/DEVELOPER_GUIDE.md).

This document assumes familiarity with the
[overall contributing guide](https://github.com/rapidsai/cudf/blob/main/CONTRIBUTING.md).
The goal of this document is to provide more specific guidance for Python developers.
It covers the structure of the Python code and discusses best practices.
Additionally, it includes longer sections on more specific topics like testing and benchmarking.

```{toctree}
:maxdepth: 2

library_design
documentation
```
7 changes: 7 additions & 0 deletions docs/cudf/source/developer_guide/documentation.md
@@ -153,6 +153,13 @@ These pages do not conform to any specific style or set of use cases.
However, if you develop any sufficiently complex new features,
consider whether users would benefit from a more complete demonstration of them.

```{note}
We encourage using links between pages.
We enable [Myst auto-generated anchors](https://myst-parser.readthedocs.io/en/latest/syntax/optional.html#auto-generated-header-anchors),
so links should use the appropriately namespaced anchors rather than manually added ones.

```
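
With `myst_heading_anchors = 2`, anchors are generated for level 1 and level 2 headings.
The sketch below approximates how a heading maps to its anchor, to help when writing cross-page links.
The functions `approximate_anchor` and `link` are hypothetical helpers and only approximate MyST's GitHub-style slug rules; they are not MyST's actual implementation.

```python
import re


def approximate_anchor(heading: str) -> str:
    """Approximate a GitHub-style slug: lowercase, drop punctuation, hyphenate spaces."""
    cleaned = re.sub(r"[^\w\- ]", "", heading.strip().lower())
    return cleaned.replace(" ", "-")


def link(page: str, heading: str) -> str:
    """Build a relative Markdown link target such as ./page.md#anchor."""
    return f"./{page}.md#{approximate_anchor(heading)}"


# The heading "## The Cython layer" in library_design.md becomes:
print(link("library_design", "The Cython layer"))
# → ./library_design.md#the-cython-layer
```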

## Building documentation

### Requirements
7 changes: 0 additions & 7 deletions docs/cudf/source/developer_guide/index.md

This file was deleted.

12 changes: 1 addition & 11 deletions docs/cudf/source/developer_guide/library_design.md
@@ -1,14 +1,5 @@
# Library Design

The cuDF library is a GPU-accelerated, [Pandas-like](https://pandas.pydata.org/) DataFrame library.
Under the hood, all of cuDF's functionality relies on the CUDA-accelerated `libcudf` C++ library.
Thus, cuDF's internals are designed to efficiently and robustly map pandas APIs to `libcudf` functions.

```{note}
For more information about the `libcudf` library, a good starting point is the
[developer guide](https://github.com/rapidsai/cudf/blob/main/cpp/docs/DEVELOPER_GUIDE.md).
```

At a high level, cuDF is structured in three layers, each of which serves a distinct purpose:

1. The Frame layer: The user-facing implementation of pandas-like data structures like `DataFrame` and `Series`.
@@ -219,8 +210,7 @@ A `Buffer` constructed from a preexisting device memory allocation (such as a Cu
Conversely, when constructed from a host object,
`Buffer` uses [`rmm.DeviceBuffer`](https://github.com/rapidsai/rmm#devicebuffers) to allocate new memory.
The data is then copied from the host object into the newly allocated device memory.
You can read more about device memory allocation with RMM [here](https://github.com/rapidsai/rmm).

You can read more about [device memory allocation with RMM here](https://github.com/rapidsai/rmm).

## The Cython layer

2 changes: 1 addition & 1 deletion docs/cudf/source/index.rst
@@ -15,7 +15,7 @@ the details of CUDA programming.

user_guide/index
api_docs/index
developer_guide/index
developer_guide/developer_guide.md


Indices and tables