feat: rework TaggedTable (#680)
Closes #647

### Summary of Changes

* `TaggedTable` is now called `TabularDataset`.
* It has moved from `safeds.data.tabular.containers` to
`safeds.data.labeled.containers`. That is where all dataset classes for
supervised learning will go, such as the upcoming `ImageDataset`.
* `TabularDataset` no longer inherits from `Table`.
* `TabularDataset` now has a very small interface. It is only meant to be
used as input for supervised ML models; table manipulation is now done
solely via the `Table` class.
* `tag_columns` on `Table` is now called `to_tabular_dataset`. This makes
it consistent with other conversion methods and emphasizes that it is a
terminal operation, to be called only once table manipulation is finished.
* `TabularDataset` now has a public `to_table` method to get a `Table`
again (a usage sketch follows below).
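
To make the rename concrete, here is a minimal sketch of the new API, going by the changes above; the column names and values are made up for illustration:

```py
from safeds.data.tabular.containers import Table

table = Table({
    "a":      [1, 2, 3, 4],
    "b":      [5, 6, 7, 8],
    "result": [6, 8, 10, 12],
})

# Terminal operation: convert the Table into a TabularDataset for supervised ML.
# Previously this was table.tag_columns("result", feature_names=["a", "b"]).
dataset = table.to_tabular_dataset("result", feature_names=["a", "b"])

# TabularDataset no longer inherits from Table, so convert back explicitly
# if further table manipulation is needed.
plain_table = dataset.to_table()
```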

---------

Co-authored-by: megalinter-bot <[email protected]>
lars-reimann and megalinter-bot authored May 1, 2024
1 parent 72842dd commit db2b613
Showing 104 changed files with 1,168 additions and 3,443 deletions.
8 changes: 5 additions & 3 deletions .github/linters/.ruff.toml
@@ -1,7 +1,9 @@
ignore-init-module-imports = true
line-length = 120
target-version = "py311"

[lint]
ignore-init-module-imports = true

select = [
"F",
"E",
@@ -91,7 +93,7 @@ ignore = [
"TRY003",
]

[per-file-ignores]
[lint.per-file-ignores]
"*test*.py" = [
# Undocumented declarations
"D100",
@@ -108,5 +110,5 @@ ignore = [
"TCH004",
]

[pydocstyle]
[lint.pydocstyle]
convention = "numpy"
2 changes: 0 additions & 2 deletions docs/development/project_guidelines.md
@@ -488,7 +488,6 @@ when adding new classes to it.
"Column",
"Row",
"Table",
"TaggedTable",
]
```

@@ -497,7 +496,6 @@ when adding new classes to it.
```py
__all__ = [
"Table",
"TaggedTable",
"Column",
"Row",
]
6 changes: 3 additions & 3 deletions docs/glossary.md
@@ -117,9 +117,9 @@ It is analyzed to uncover the meaningful information in the larger data set.
Supervised Learning is a subcategory of ML. This approach uses algorithms to learn given data.
Those Algorithms might be able to find hidden meaning in data - without being told where to look.

## Tagged Table
In addition to a regular table, a Tagged Table will mark one column as tagged, meaning that
an applied algorithm will train to predict its entries. The marked column is referred to as ["target"](#target).
## Tabular Dataset
In addition to a regular table, a tabular dataset will mark one column as ["target"](#target), meaning that an applied
algorithm will train to predict its entries.

## Target
The target variable of a dataset is the feature of a dataset about which you want to gain a deeper understanding.
25 changes: 9 additions & 16 deletions docs/tutorials/classification.ipynb
@@ -50,10 +50,7 @@
"execution_count": null,
"outputs": [],
"source": [
"split_tuple = titanic.split_rows(0.60)\n",
"\n",
"train_table = split_tuple[0]\n",
"testing_table = split_tuple[1]\n",
"train_table, testing_table = titanic.split_rows(0.6)\n",
"\n",
"test_table = testing_table.remove_columns([\"survived\"]).shuffle_rows()"
],
@@ -111,9 +108,7 @@
},
{
"cell_type": "markdown",
"source": [
"5. Tag the `survived` `Column` as the target variable to be predicted. Use the new names of the fitted `Column`s as features, which will be used to make predictions based on the target variable."
],
"source": "5. Mark the `survived` `Column` as the target variable to be predicted. Use the new names of the fitted `Column`s as features, which will be used to make predictions based on the target variable.",
"metadata": {
"collapsed": false
}
@@ -123,7 +118,7 @@
"execution_count": null,
"outputs": [],
"source": [
"tagged_train_table= transformed_table.tag_columns(\"survived\", feature_names=[\n",
"train_tabular_dataset = transformed_table.to_tabular_dataset(\"survived\", feature_names=[\n",
" *new_columns\n",
"])"
],
@@ -133,9 +128,7 @@
},
{
"cell_type": "markdown",
"source": [
"6. Use `RandomForest` classifier as a model for the classification. Pass the \"tagged_titanic\" table to the fit function of the model:"
],
"source": "6. Use `RandomForest` classifier as a model for the classification. Pass the \"train_tabular_dataset\" table to the fit function of the model:",
"metadata": {
"collapsed": false
}
@@ -148,7 +141,7 @@
"from safeds.ml.classical.classification import RandomForestClassifier\n",
"\n",
"model = RandomForestClassifier()\n",
"fitted_model= model.fit(tagged_train_table)"
"fitted_model= model.fit(train_tabular_dataset)"
],
"metadata": {
"collapsed": false
@@ -172,11 +165,11 @@
"encoder = OneHotEncoder().fit(test_table, [\"sex\"])\n",
"transformed_test_table = encoder.transform(test_table)\n",
"\n",
"predicition = fitted_model.predict(\n",
"prediction = fitted_model.predict(\n",
" transformed_test_table\n",
")\n",
"#For visualisation purposes we only print out the first 15 rows.\n",
"predicition.slice_rows(0,15)"
"prediction.to_table().slice_rows(start=0, end=15)"
],
"metadata": {
"collapsed": false
@@ -199,10 +192,10 @@
"encoder = OneHotEncoder().fit(test_table, [\"sex\"])\n",
"testing_table = encoder.transform(testing_table)\n",
"\n",
"tagged_test_table= testing_table.tag_columns(\"survived\", feature_names=[\n",
"test_tabular_dataset = testing_table.to_tabular_dataset(\"survived\", feature_names=[\n",
" *new_columns\n",
"])\n",
"fitted_model.accuracy(tagged_train_table)\n"
"fitted_model.accuracy(test_tabular_dataset)\n"
],
"metadata": {
"collapsed": false
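
Taken together, the updated classification cells boil down to roughly the following. This is only a sketch: it assumes the `transformed_table`, `transformed_test_table`, and `new_columns` variables produced by the notebook's earlier one-hot-encoding cells.

```py
from safeds.ml.classical.classification import RandomForestClassifier

# "survived" becomes the target; the one-hot-encoded columns become the features.
# This call replaces the old transformed_table.tag_columns(...).
train_tabular_dataset = transformed_table.to_tabular_dataset(
    "survived",
    feature_names=[*new_columns],
)

fitted_model = RandomForestClassifier().fit(train_tabular_dataset)

# predict() takes a plain Table and returns a TabularDataset,
# so convert back to a Table before slicing out rows for display.
prediction = fitted_model.predict(transformed_test_table)
prediction.to_table().slice_rows(start=0, end=15)
```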
10 changes: 5 additions & 5 deletions docs/tutorials/machine_learning.ipynb
@@ -7,10 +7,10 @@
"\n",
"This tutorial explains how to train a machine learning model in Safe-DS and use it to make predictions.\n",
"\n",
"## Create a `TaggedTable`\n",
"## Create a `TabularDataset`\n",
"\n",
"First, we need to create a `TaggedTable` from the training data. `TaggedTable`s are used to train supervised machine learning models, because they keep track of the target\n",
"column. A `TaggedTable` can be created from a `Table` by calling the `tag_columns` method:"
"First, we need to create a `TabularDataset` from the training data. `TabularDataset`s are used to train supervised machine learning models, because they keep track of the target\n",
"column. A `TabularDataset` can be created from a `Table` by calling the `to_tabular_dataset` method:"
],
"metadata": {
"collapsed": false
@@ -30,7 +30,7 @@
" \"result\": [6, 7, 10, 13, 9]\n",
"})\n",
"\n",
"tagged_table = training_set.tag_columns(\n",
"tabular_dataset = training_set.to_tabular_dataset(\n",
" target_name=\"result\"\n",
")"
],
@@ -57,7 +57,7 @@
"from safeds.ml.classical.regression import LinearRegressionRegressor\n",
"\n",
"model = LinearRegressionRegressor()\n",
"fitted_model = model.fit(tagged_table)"
"fitted_model = model.fit(tabular_dataset)"
],
"metadata": {
"collapsed": false
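
The prediction cells of this notebook are collapsed in the diff. With the reworked API they would look roughly like the sketch below; the feature column names and values are placeholders, and `fitted_model` is the regressor fitted above.

```py
from safeds.data.tabular.containers import Table

# predict() accepts a plain Table containing the feature columns and returns a
# TabularDataset; call to_table() to inspect the predicted values as a Table.
prediction = fitted_model.predict(Table({
    "a": [1, 2],
    "b": [4, 5],
}))
prediction.to_table()
```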
23 changes: 8 additions & 15 deletions docs/tutorials/regression.ipynb
@@ -50,10 +50,7 @@
"execution_count": null,
"outputs": [],
"source": [
"split_tuple = pricing.split_rows(0.60)\n",
"\n",
"train_table = split_tuple[0]\n",
"testing_table = split_tuple[1]\n",
"train_table, testing_table = pricing.split_rows(0.60)\n",
"\n",
"test_table = testing_table.remove_columns([\"price\"]).shuffle_rows()"
],
@@ -63,9 +60,7 @@
},
{
"cell_type": "markdown",
"source": [
"3. Tag the `price` `Column` as the target variable to be predicted. Use the new names of the fitted `Column`s as features, which will be used to make predictions based on the target variable.\n"
],
"source": "3. Mark the `price` `Column` as the target variable to be predicted. Use the new names of the fitted `Column`s as features, which will be used to make predictions based on the target variable.\n",
"metadata": {
"collapsed": false
}
@@ -77,7 +72,7 @@
"source": [
"feature_columns = set(train_table.column_names) - set([\"price\", \"id\"])\n",
"\n",
"tagged_train_table = train_table.tag_columns(\"price\", feature_names=[\n",
"train_tabular_dataset = train_table.to_tabular_dataset(\"price\", feature_names=[\n",
" *feature_columns])\n"
],
"metadata": {
@@ -86,9 +81,7 @@
},
{
"cell_type": "markdown",
"source": [
"4. Use `Decision Tree` regressor as a model for the regression. Pass the \"tagged_pricing\" table to the fit function of the model:\n"
],
"source": "4. Use `Decision Tree` regressor as a model for the regression. Pass the \"train_tabular_dataset\" table to the fit function of the model:\n",
"metadata": {
"collapsed": false
}
@@ -101,7 +94,7 @@
"from safeds.ml.classical.regression import DecisionTreeRegressor\n",
"\n",
"model = DecisionTreeRegressor()\n",
"fitted_model = model.fit(tagged_train_table)"
"fitted_model = model.fit(train_tabular_dataset)"
],
"metadata": {
"collapsed": false
@@ -125,7 +118,7 @@
" test_table\n",
")\n",
"# For visualisation purposes we only print out the first 15 rows.\n",
"prediction.slice_rows(0,15)"
"prediction.to_table().slice_rows(start=0, end=15)"
],
"metadata": {
"collapsed": false
@@ -154,11 +147,11 @@
}
],
"source": [
"tagged_test_table= testing_table.tag_columns(\"price\", feature_names=[\n",
"test_tabular_dataset = testing_table.to_tabular_dataset(\"price\", feature_names=[\n",
" *feature_columns\n",
"])\n",
"\n",
"fitted_model.mean_absolute_error(tagged_test_table)\n"
"fitted_model.mean_absolute_error(test_tabular_dataset)\n"
],
"metadata": {
"collapsed": false
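
As in the classification tutorial, the updated regression cells reduce to roughly the following sketch; `train_table`, `testing_table`, and `feature_columns` come from the notebook's earlier cells.

```py
from safeds.ml.classical.regression import DecisionTreeRegressor

# "price" is the target; every column except "price" and "id" is a feature.
train_tabular_dataset = train_table.to_tabular_dataset(
    "price",
    feature_names=[*feature_columns],
)

fitted_model = DecisionTreeRegressor().fit(train_tabular_dataset)

# Evaluation also expects a TabularDataset, built from the held-out split.
test_tabular_dataset = testing_table.to_tabular_dataset(
    "price",
    feature_names=[*feature_columns],
)
fitted_model.mean_absolute_error(test_tabular_dataset)
```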
1 change: 1 addition & 0 deletions src/safeds/data/labeled/__init__.py
@@ -0,0 +1 @@
"""Work with labeled data."""
19 changes: 19 additions & 0 deletions src/safeds/data/labeled/containers/__init__.py
@@ -0,0 +1,19 @@
"""Classes that can store labeled data."""

from typing import TYPE_CHECKING

import apipkg

if TYPE_CHECKING:
from ._tabular_dataset import TabularDataset

apipkg.initpkg(
__name__,
{
"TabularDataset": "._tabular_dataset:TabularDataset",
},
)

__all__ = [
"TabularDataset",
]
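
The `apipkg.initpkg` call makes the import lazy: `_tabular_dataset` is only loaded when `TabularDataset` is first accessed, while user code imports it as usual. A small sketch, assuming the package is installed:

```py
from safeds.data.labeled.containers import TabularDataset  # resolved lazily via apipkg
from safeds.data.tabular.containers import Table

dataset = Table({"x": [1, 2, 3], "y": [2, 4, 6]}).to_tabular_dataset("y")
print(isinstance(dataset, TabularDataset))  # True
```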