feat: rework TaggedTable (#680)
Closes #647

### Summary of Changes

* `TaggedTable` is now called `TabularDataset`.
* It has moved from `safeds.data.tabular.containers` to
`safeds.data.labeled.containers`. That is where all dataset classes for
supervised learning will go, such as the upcoming `ImageDataset`.
* `TabularDataset` no longer inherits from `Table`.
* `TabularDataset` now has a very small interface. It is only meant to be
used as input for supervised ML models; table manipulation is now done
solely via the `Table` class.
* `tag_columns` on `Table` is now called `to_tabular_dataset`. This makes
it consistent with other conversion methods and emphasizes that it is a
terminal operation, to be called only once table manipulation is finished.
* `TabularDataset` now has a public `to_table` method to get a `Table`
again (a usage sketch follows below).
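
To make the rename concrete, here is a minimal sketch of the new API, going by the changes above; the column names and values are made up for illustration:

```py
from safeds.data.tabular.containers import Table

table = Table({
    "a":      [1, 2, 3, 4],
    "b":      [5, 6, 7, 8],
    "result": [6, 8, 10, 12],
})

# Terminal operation: convert the Table into a TabularDataset for supervised ML.
# Previously this was table.tag_columns("result", feature_names=["a", "b"]).
dataset = table.to_tabular_dataset("result", feature_names=["a", "b"])

# TabularDataset no longer inherits from Table, so convert back explicitly
# if further table manipulation is needed.
plain_table = dataset.to_table()
```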

---------

Co-authored-by: megalinter-bot <[email protected]>
lars-reimann and megalinter-bot authored May 1, 2024
1 parent 72842dd commit db2b613
Showing 104 changed files with 1,168 additions and 3,443 deletions.
8 changes: 5 additions & 3 deletions .github/linters/.ruff.toml
@@ -1,7 +1,9 @@
ignore-init-module-imports = true
line-length = 120
target-version = "py311"

[lint]
ignore-init-module-imports = true

select = [
"F",
"E",
@@ -91,7 +93,7 @@ ignore = [
"TRY003",
]

[per-file-ignores]
[lint.per-file-ignores]
"*test*.py" = [
# Undocumented declarations
"D100",
@@ -108,5 +110,5 @@ ignore = [
"TCH004",
]

[pydocstyle]
[lint.pydocstyle]
convention = "numpy"
2 changes: 0 additions & 2 deletions docs/development/project_guidelines.md
@@ -488,7 +488,6 @@ when adding new classes to it.
"Column",
"Row",
"Table",
"TaggedTable",
]
```

@@ -497,7 +496,6 @@ when adding new classes to it.
```py
__all__ = [
"Table",
"TaggedTable",
"Column",
"Row",
]
6 changes: 3 additions & 3 deletions docs/glossary.md
@@ -117,9 +117,9 @@ It is analyzed to uncover the meaningful information in the larger data set.
Supervised Learning is a subcategory of ML. This approach uses algorithms to learn given data.
Those Algorithms might be able to find hidden meaning in data - without being told where to look.

## Tagged Table
In addition to a regular table, a Tagged Table will mark one column as tagged, meaning that
an applied algorithm will train to predict its entries. The marked column is referred to as ["target"](#target).
## Tabular Dataset
In addition to a regular table, a tabular dataset will mark one column as ["target"](#target), meaning that an applied
algorithm will train to predict its entries.

## Target
The target variable of a dataset is the feature of a dataset about which you want to gain a deeper understanding.
25 changes: 9 additions & 16 deletions docs/tutorials/classification.ipynb
@@ -50,10 +50,7 @@
"execution_count": null,
"outputs": [],
"source": [
"split_tuple = titanic.split_rows(0.60)\n",
"\n",
"train_table = split_tuple[0]\n",
"testing_table = split_tuple[1]\n",
"train_table, testing_table = titanic.split_rows(0.6)\n",
"\n",
"test_table = testing_table.remove_columns([\"survived\"]).shuffle_rows()"
],
@@ -111,9 +108,7 @@
},
{
"cell_type": "markdown",
"source": [
"5. Tag the `survived` `Column` as the target variable to be predicted. Use the new names of the fitted `Column`s as features, which will be used to make predictions based on the target variable."
],
"source": "5. Mark the `survived` `Column` as the target variable to be predicted. Use the new names of the fitted `Column`s as features, which will be used to make predictions based on the target variable.",
"metadata": {
"collapsed": false
}
@@ -123,7 +118,7 @@
"execution_count": null,
"outputs": [],
"source": [
"tagged_train_table= transformed_table.tag_columns(\"survived\", feature_names=[\n",
"train_tabular_dataset = transformed_table.to_tabular_dataset(\"survived\", feature_names=[\n",
" *new_columns\n",
"])"
],
@@ -133,9 +128,7 @@
},
{
"cell_type": "markdown",
"source": [
"6. Use `RandomForest` classifier as a model for the classification. Pass the \"tagged_titanic\" table to the fit function of the model:"
],
"source": "6. Use `RandomForest` classifier as a model for the classification. Pass the \"train_tabular_dataset\" table to the fit function of the model:",
"metadata": {
"collapsed": false
}
@@ -148,7 +141,7 @@
"from safeds.ml.classical.classification import RandomForestClassifier\n",
"\n",
"model = RandomForestClassifier()\n",
"fitted_model= model.fit(tagged_train_table)"
"fitted_model= model.fit(train_tabular_dataset)"
],
"metadata": {
"collapsed": false
@@ -172,11 +165,11 @@
"encoder = OneHotEncoder().fit(test_table, [\"sex\"])\n",
"transformed_test_table = encoder.transform(test_table)\n",
"\n",
"predicition = fitted_model.predict(\n",
"prediction = fitted_model.predict(\n",
" transformed_test_table\n",
")\n",
"#For visualisation purposes we only print out the first 15 rows.\n",
"predicition.slice_rows(0,15)"
"prediction.to_table().slice_rows(start=0, end=15)"
],
"metadata": {
"collapsed": false
@@ -199,10 +192,10 @@
"encoder = OneHotEncoder().fit(test_table, [\"sex\"])\n",
"testing_table = encoder.transform(testing_table)\n",
"\n",
"tagged_test_table= testing_table.tag_columns(\"survived\", feature_names=[\n",
"test_tabular_dataset = testing_table.to_tabular_dataset(\"survived\", feature_names=[\n",
" *new_columns\n",
"])\n",
"fitted_model.accuracy(tagged_train_table)\n"
"fitted_model.accuracy(test_tabular_dataset)\n"
],
"metadata": {
"collapsed": false
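
Taken together, the updated classification cells boil down to roughly the following. This is only a sketch: it assumes the `transformed_table`, `transformed_test_table`, and `new_columns` variables produced by the notebook's earlier one-hot-encoding cells.

```py
from safeds.ml.classical.classification import RandomForestClassifier

# "survived" becomes the target; the one-hot-encoded columns become the features.
# This call replaces the old transformed_table.tag_columns(...).
train_tabular_dataset = transformed_table.to_tabular_dataset(
    "survived",
    feature_names=[*new_columns],
)

fitted_model = RandomForestClassifier().fit(train_tabular_dataset)

# predict() takes a plain Table and returns a TabularDataset,
# so convert back to a Table before slicing out rows for display.
prediction = fitted_model.predict(transformed_test_table)
prediction.to_table().slice_rows(start=0, end=15)
```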
10 changes: 5 additions & 5 deletions docs/tutorials/machine_learning.ipynb
@@ -7,10 +7,10 @@
"\n",
"This tutorial explains how to train a machine learning model in Safe-DS and use it to make predictions.\n",
"\n",
"## Create a `TaggedTable`\n",
"## Create a `TabularDataset`\n",
"\n",
"First, we need to create a `TaggedTable` from the training data. `TaggedTable`s are used to train supervised machine learning models, because they keep track of the target\n",
"column. A `TaggedTable` can be created from a `Table` by calling the `tag_columns` method:"
"First, we need to create a `TabularDataset` from the training data. `TabularDataset`s are used to train supervised machine learning models, because they keep track of the target\n",
"column. A `TabularDataset` can be created from a `Table` by calling the `to_tabular_dataset` method:"
],
"metadata": {
"collapsed": false
@@ -30,7 +30,7 @@
" \"result\": [6, 7, 10, 13, 9]\n",
"})\n",
"\n",
"tagged_table = training_set.tag_columns(\n",
"tabular_dataset = training_set.to_tabular_dataset(\n",
" target_name=\"result\"\n",
")"
],
@@ -57,7 +57,7 @@
"from safeds.ml.classical.regression import LinearRegressionRegressor\n",
"\n",
"model = LinearRegressionRegressor()\n",
"fitted_model = model.fit(tagged_table)"
"fitted_model = model.fit(tabular_dataset)"
],
"metadata": {
"collapsed": false
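
The prediction cells of this notebook are collapsed in the diff. With the reworked API they would look roughly like the sketch below; the feature column names and values are placeholders, and `fitted_model` is the regressor fitted above.

```py
from safeds.data.tabular.containers import Table

# predict() accepts a plain Table containing the feature columns and returns a
# TabularDataset; call to_table() to inspect the predicted values as a Table.
prediction = fitted_model.predict(Table({
    "a": [1, 2],
    "b": [4, 5],
}))
prediction.to_table()
```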
23 changes: 8 additions & 15 deletions docs/tutorials/regression.ipynb
@@ -50,10 +50,7 @@
"execution_count": null,
"outputs": [],
"source": [
"split_tuple = pricing.split_rows(0.60)\n",
"\n",
"train_table = split_tuple[0]\n",
"testing_table = split_tuple[1]\n",
"train_table, testing_table = pricing.split_rows(0.60)\n",
"\n",
"test_table = testing_table.remove_columns([\"price\"]).shuffle_rows()"
],
@@ -63,9 +60,7 @@
},
{
"cell_type": "markdown",
"source": [
"3. Tag the `price` `Column` as the target variable to be predicted. Use the new names of the fitted `Column`s as features, which will be used to make predictions based on the target variable.\n"
],
"source": "3. Mark the `price` `Column` as the target variable to be predicted. Use the new names of the fitted `Column`s as features, which will be used to make predictions based on the target variable.\n",
"metadata": {
"collapsed": false
}
@@ -77,7 +72,7 @@
"source": [
"feature_columns = set(train_table.column_names) - set([\"price\", \"id\"])\n",
"\n",
"tagged_train_table = train_table.tag_columns(\"price\", feature_names=[\n",
"train_tabular_dataset = train_table.to_tabular_dataset(\"price\", feature_names=[\n",
" *feature_columns])\n"
],
"metadata": {
@@ -86,9 +81,7 @@
},
{
"cell_type": "markdown",
"source": [
"4. Use `Decision Tree` regressor as a model for the regression. Pass the \"tagged_pricing\" table to the fit function of the model:\n"
],
"source": "4. Use `Decision Tree` regressor as a model for the regression. Pass the \"train_tabular_dataset\" table to the fit function of the model:\n",
"metadata": {
"collapsed": false
}
@@ -101,7 +94,7 @@
"from safeds.ml.classical.regression import DecisionTreeRegressor\n",
"\n",
"model = DecisionTreeRegressor()\n",
"fitted_model = model.fit(tagged_train_table)"
"fitted_model = model.fit(train_tabular_dataset)"
],
"metadata": {
"collapsed": false
@@ -125,7 +118,7 @@
" test_table\n",
")\n",
"# For visualisation purposes we only print out the first 15 rows.\n",
"prediction.slice_rows(0,15)"
"prediction.to_table().slice_rows(start=0, end=15)"
],
"metadata": {
"collapsed": false
@@ -154,11 +147,11 @@
}
],
"source": [
"tagged_test_table= testing_table.tag_columns(\"price\", feature_names=[\n",
"test_tabular_dataset = testing_table.to_tabular_dataset(\"price\", feature_names=[\n",
" *feature_columns\n",
"])\n",
"\n",
"fitted_model.mean_absolute_error(tagged_test_table)\n"
"fitted_model.mean_absolute_error(test_tabular_dataset)\n"
],
"metadata": {
"collapsed": false
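
As in the classification tutorial, the updated regression cells reduce to roughly the following sketch; `train_table`, `testing_table`, and `feature_columns` come from the notebook's earlier cells.

```py
from safeds.ml.classical.regression import DecisionTreeRegressor

# "price" is the target; every column except "price" and "id" is a feature.
train_tabular_dataset = train_table.to_tabular_dataset(
    "price",
    feature_names=[*feature_columns],
)

fitted_model = DecisionTreeRegressor().fit(train_tabular_dataset)

# Evaluation also expects a TabularDataset, built from the held-out split.
test_tabular_dataset = testing_table.to_tabular_dataset(
    "price",
    feature_names=[*feature_columns],
)
fitted_model.mean_absolute_error(test_tabular_dataset)
```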
1 change: 1 addition & 0 deletions src/safeds/data/labeled/__init__.py
@@ -0,0 +1 @@
"""Work with labeled data."""
19 changes: 19 additions & 0 deletions src/safeds/data/labeled/containers/__init__.py
@@ -0,0 +1,19 @@
"""Classes that can store labeled data."""

from typing import TYPE_CHECKING

import apipkg

if TYPE_CHECKING:
from ._tabular_dataset import TabularDataset

apipkg.initpkg(
__name__,
{
"TabularDataset": "._tabular_dataset:TabularDataset",
},
)

__all__ = [
"TabularDataset",
]
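
The `apipkg.initpkg` call makes the import lazy: `_tabular_dataset` is only loaded when `TabularDataset` is first accessed, while user code imports it as usual. A small sketch, assuming the package is installed:

```py
from safeds.data.labeled.containers import TabularDataset  # resolved lazily via apipkg
from safeds.data.tabular.containers import Table

dataset = Table({"x": [1, 2, 3], "y": [2, 4, 6]}).to_tabular_dataset("y")
print(isinstance(dataset, TabularDataset))  # True
```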