feat: stabilize Table class (#979)
Closes #875
Closes #877
Partially addresses #977

### Summary of Changes

Stabilize the API of the `Table` class. This PR introduces several
breaking changes to this class:

- All optional parameters are now keyword-only, so we can reposition them later.
- The `data` parameter of `__init__` is now required.
- Rename `remove_columns_except` to `select_columns`
  - The new method can also be called with a callback that determines which columns to select.
- Rename `add_table_as_columns` to `add_tables_as_columns`
  - Multiple tables can now be passed at once.
- Rename `add_table_as_rows` to `add_tables_as_rows`
  - Multiple tables can now be passed at once.
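The keyword-only change can be sketched in plain Python. This is an illustrative stand-in, not the actual safeds code; only the bare `*` pattern is the point:

```python
# Illustrative stand-in for the keyword-only pattern described above;
# not the actual safeds `Table` implementation.

class Table:
    def __init__(self, data):  # `data` is now a required parameter
        self.data = data

    # Everything after the bare `*` must be passed by keyword, so the
    # maintainers can reorder or insert optional parameters later
    # without breaking existing callers.
    def split_rows(self, percentage, *, shuffle=True, random_seed=None):
        return (percentage, shuffle, random_seed)


table = Table({"a": [1, 2, 3]})
table.split_rows(0.8, shuffle=False)   # fine: keyword argument
# table.split_rows(0.8, False)         # TypeError: positional not allowed
```

A positional call like `split_rows(0.8, False)` raises a `TypeError`, which is exactly the freedom this buys: optional parameters can later be repositioned without silently changing the meaning of existing positional calls.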

It also adds new functionality throughout the library:

- New method `Table.add_index_column` to add a column with auto-incrementing integer values to a table.
- New method `Table.filter_rows` to keep only the rows that match some predicate.
- New method `Table.filter_rows_by_column` to keep only the rows whose value in a specific column matches some predicate.
- New parameter `random_seed` for `Table.shuffle_rows` and `Table.split_rows` to control the pseudorandom number generator. Previously, the methods were deterministic, but the seed was hidden.
- New parameter `missing_value_ratio_threshold` for `Table.remove_columns_with_missing_values`, so columns with only a few missing values can be kept.
- Various static factory methods under `ColumnType` to instantiate column types. This prepares for #754.
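The semantics of the new index and filtering methods can be sketched on a list-of-dicts stand-in (the real `Table` is column-oriented, and the helper names and signatures below are assumptions for illustration only):

```python
# Plain-Python sketch of the semantics described above, on a
# list-of-dicts stand-in; names and signatures are illustrative,
# not the real safeds API.

def add_index_column(rows, name):
    # Auto-incrementing integer index, like Table.add_index_column.
    return [{name: i, **row} for i, row in enumerate(rows)]

def filter_rows(rows, predicate):
    # Keep only the rows matched by the predicate, like Table.filter_rows.
    return [row for row in rows if predicate(row)]

def filter_rows_by_column(rows, column, predicate):
    # Keep only rows whose value in `column` matches the predicate,
    # like Table.filter_rows_by_column.
    return [row for row in rows if predicate(row[column])]


rows = [{"age": 22}, {"age": 38}, {"age": 26}]
add_index_column(rows, "id")                          # prepends indices 0, 1, 2
filter_rows(rows, lambda r: r["age"] < 30)            # keeps ages 22 and 26
filter_rows_by_column(rows, "age", lambda v: v > 30)  # keeps age 38
```

`filter_rows` hands the whole row to the predicate, while `filter_rows_by_column` hands over only the value of one column; the second form saves the caller from indexing into the row themselves.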

Finally, the methods `Table.summarize_statistics` and
`Column.summarize_statistics` are now considerably faster.

---------

Co-authored-by: megalinter-bot <[email protected]>
lars-reimann and megalinter-bot authored Jan 12, 2025
1 parent 29fdefa commit db85617
Showing 329 changed files with 7,703 additions and 5,154 deletions.
4 changes: 2 additions & 2 deletions benchmarks/metrics/classification.py
@@ -3,10 +3,10 @@
 from timeit import timeit
 
 import polars as pl
-from safeds.data.tabular.containers import Table
-from safeds.ml.metrics import ClassificationMetrics
 
 from benchmarks.table.utils import create_synthetic_table
+from safeds.data.tabular.containers import Table
+from safeds.ml.metrics import ClassificationMetrics
 
 REPETITIONS = 10
 
3 changes: 1 addition & 2 deletions benchmarks/table/column_operations.py
@@ -1,8 +1,7 @@
 from timeit import timeit
 
-from safeds.data.tabular.containers import Table
-
 from benchmarks.table.utils import create_synthetic_table
+from safeds.data.tabular.containers import Table
 
 REPETITIONS = 10
 
102 changes: 65 additions & 37 deletions docs/tutorials/classification.ipynb
@@ -3,7 +3,10 @@
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "In this tutorial, we use `safeds` on **Titanic passenger data** to predict who will survive and who will not."
@@ -12,7 +15,10 @@
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "### Loading Data\n",
@@ -23,7 +29,10 @@
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "outputs": [
    {
@@ -75,7 +84,7 @@
    "from safeds.data.tabular.containers import Table\n",
    "\n",
    "raw_data = Table.from_csv_file(\"data/titanic.csv\")\n",
-   "#For visualisation purposes we only print out the first 15 rows.\n",
+   "# For visualisation purposes we only print out the first 15 rows.\n",
    "raw_data.slice_rows(length=15)"
   ]
  },
@@ -169,18 +178,18 @@
   "source": [
    "We remove certain columns for the following reasons:\n",
    "1. **high idness**: `id` , `ticket`\n",
-   "2. **high stability**: `parents_children` \n",
+   "2. **high stability**: `parents_children`\n",
    "3. **high missing value ratio**: `cabin`"
   ]
  },
  {
   "cell_type": "code",
-  "execution_count": 4,
+  "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
-   "train_table = train_table.remove_columns([\"id\",\"ticket\", \"parents_children\", \"cabin\"])\n",
-   "test_table = test_table.remove_columns([\"id\",\"ticket\", \"parents_children\", \"cabin\"])"
+   "train_table = train_table.remove_columns([\"id\", \"ticket\", \"parents_children\", \"cabin\"])\n",
+   "test_table = test_table.remove_columns([\"id\", \"ticket\", \"parents_children\", \"cabin\"])"
   ]
  },
  {
@@ -199,15 +208,18 @@
   "source": [
    "from safeds.data.tabular.transformation import SimpleImputer\n",
    "\n",
-   "simple_imputer = SimpleImputer(column_names=[\"age\",\"fare\"],strategy=SimpleImputer.Strategy.mean())\n",
+   "simple_imputer = SimpleImputer(column_names=[\"age\", \"fare\"], strategy=SimpleImputer.Strategy.mean())\n",
    "fitted_simple_imputer_train, transformed_train_data = simple_imputer.fit_and_transform(train_table)\n",
    "transformed_test_data = fitted_simple_imputer_train.transform(test_table)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "### Handling Nominal Categorical Data\n",
@@ -219,13 +231,18 @@
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "outputs": [],
   "source": [
    "from safeds.data.tabular.transformation import OneHotEncoder\n",
    "\n",
-   "fitted_one_hot_encoder_train, transformed_train_data = OneHotEncoder(column_names=[\"sex\", \"port_embarked\"]).fit_and_transform(transformed_train_data)\n",
+   "fitted_one_hot_encoder_train, transformed_train_data = OneHotEncoder(\n",
+   "    column_names=[\"sex\", \"port_embarked\"],\n",
+   ").fit_and_transform(transformed_train_data)\n",
    "transformed_test_data = fitted_one_hot_encoder_train.transform(transformed_test_data)"
   ]
  },
@@ -299,7 +316,10 @@
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "### Marking the Target Column\n",
@@ -314,17 +334,23 @@
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "outputs": [],
   "source": [
-   "tagged_train_table = transformed_train_data.to_tabular_dataset(\"survived\",extra_names=[\"name\"])"
+   "tagged_train_table = transformed_train_data.to_tabular_dataset(\"survived\", extra_names=[\"name\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "### Fitting a Classifier\n",
@@ -335,7 +361,10 @@
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "outputs": [],
   "source": [
@@ -348,7 +377,10 @@
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "### Predicting with the Classifier\n",
@@ -360,7 +392,10 @@
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "outputs": [],
   "source": [
@@ -433,14 +468,17 @@
   ],
   "source": [
    "reverse_transformed_prediction = prediction.to_table().inverse_transform_table(fitted_one_hot_encoder_train)\n",
-   "#For visualisation purposes we only print out the first 15 rows.\n",
+   "# For visualisation purposes we only print out the first 15 rows.\n",
    "reverse_transformed_prediction.slice_rows(length=15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "### Testing the Accuracy of the Model\n",
@@ -449,28 +487,18 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 12,
-  "metadata": {
-   "collapsed": false
-  },
-  "outputs": [
-   {
-    "name": "stdout",
-    "output_type": "stream",
-    "text": [
-     "Accuracy on test data: 79.3893%\n"
-    ]
-   }
-  ],
+  "execution_count": null,
+  "metadata": {},
+  "outputs": [],
   "source": [
    "accuracy = fitted_classifier.accuracy(transformed_test_data) * 100\n",
-   "print(f'Accuracy on test data: {accuracy:.4f}%')"
+   "f\"Accuracy on test data: {accuracy:.4f}%\""
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
-  "display_name": "Python 3",
+  "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
@@ -488,5 +516,5 @@
  }
 },
 "nbformat": 4,
-"nbformat_minor": 0
+"nbformat_minor": 4
}