feat: stabilize Table class (#979)
Closes #875
Closes #877
Partially addresses #977

### Summary of Changes

Stabilize the API of the `Table` class. This PR introduces several
breaking changes to this class:

- All optional parameters are now keyword-only, so we can reposition them later.
- The `data` parameter of `__init__` is now required.
- Rename `remove_columns_except` to `select_columns`
  - The new method can also be called with a callback that determines which columns to select.
- Rename `add_table_as_columns` to `add_tables_as_columns`
  - Multiple tables can now be passed at once.
- Rename `add_table_as_rows` to `add_tables_as_rows`
  - Multiple tables can now be passed at once.
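The keyword-only change can be sketched in plain Python. This is an illustrative stand-in, not the actual safeds code; only the bare `*` pattern is the point:

```python
# Illustrative stand-in for the keyword-only pattern described above;
# not the actual safeds `Table` implementation.

class Table:
    def __init__(self, data):  # `data` is now a required parameter
        self.data = data

    # Everything after the bare `*` must be passed by keyword, so the
    # maintainers can reorder or insert optional parameters later
    # without breaking existing callers.
    def split_rows(self, percentage, *, shuffle=True, random_seed=None):
        return (percentage, shuffle, random_seed)


table = Table({"a": [1, 2, 3]})
table.split_rows(0.8, shuffle=False)   # fine: keyword argument
# table.split_rows(0.8, False)         # TypeError: positional not allowed
```

A positional call like `split_rows(0.8, False)` raises a `TypeError`, which is exactly the freedom this buys: optional parameters can later be repositioned without silently changing the meaning of existing positional calls.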

It also adds new functionality throughout the library:

- New method `Table.add_index_column` to add a column with auto-incrementing integer values to a table.
- New method `Table.filter_rows` to keep only the rows that match some predicate.
- New method `Table.filter_rows_by_column` to keep only the rows whose value in a specific column matches some predicate.
- New parameter `random_seed` for `Table.shuffle_rows` and `Table.split_rows` to control the pseudorandom number generator. Previously, the methods were deterministic, but the seed was hidden.
- New parameter `missing_value_ratio_threshold` for `Table.remove_columns_with_missing_values`, so columns with only a few missing values can be kept.
- Various static factory methods under `ColumnType` to instantiate column types. This prepares for #754.
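The semantics of the new index and filtering methods can be sketched on a list-of-dicts stand-in (the real `Table` is column-oriented, and the helper names and signatures below are assumptions for illustration only):

```python
# Plain-Python sketch of the semantics described above, on a
# list-of-dicts stand-in; names and signatures are illustrative,
# not the real safeds API.

def add_index_column(rows, name):
    # Auto-incrementing integer index, like Table.add_index_column.
    return [{name: i, **row} for i, row in enumerate(rows)]

def filter_rows(rows, predicate):
    # Keep only the rows matched by the predicate, like Table.filter_rows.
    return [row for row in rows if predicate(row)]

def filter_rows_by_column(rows, column, predicate):
    # Keep only rows whose value in `column` matches the predicate,
    # like Table.filter_rows_by_column.
    return [row for row in rows if predicate(row[column])]


rows = [{"age": 22}, {"age": 38}, {"age": 26}]
add_index_column(rows, "id")                          # prepends indices 0, 1, 2
filter_rows(rows, lambda r: r["age"] < 30)            # keeps ages 22 and 26
filter_rows_by_column(rows, "age", lambda v: v > 30)  # keeps age 38
```

`filter_rows` hands the whole row to the predicate, while `filter_rows_by_column` hands over only the value of one column; the second form saves the caller from indexing into the row themselves.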

Finally, the methods `Table.summarize_statistics` and
`Column.summarize_statistics` are now considerably faster.

---------

Co-authored-by: megalinter-bot <[email protected]>
lars-reimann and megalinter-bot authored Jan 12, 2025
1 parent 29fdefa commit db85617
Showing 329 changed files with 7,703 additions and 5,154 deletions.
4 changes: 2 additions & 2 deletions benchmarks/metrics/classification.py
@@ -3,10 +3,10 @@
 from timeit import timeit
 
 import polars as pl
-from safeds.data.tabular.containers import Table
-from safeds.ml.metrics import ClassificationMetrics
 
 from benchmarks.table.utils import create_synthetic_table
+from safeds.data.tabular.containers import Table
+from safeds.ml.metrics import ClassificationMetrics
 
 REPETITIONS = 10
 
3 changes: 1 addition & 2 deletions benchmarks/table/column_operations.py
@@ -1,8 +1,7 @@
 from timeit import timeit
 
-from safeds.data.tabular.containers import Table
-
 from benchmarks.table.utils import create_synthetic_table
+from safeds.data.tabular.containers import Table
 
 REPETITIONS = 10
 
102 changes: 65 additions & 37 deletions docs/tutorials/classification.ipynb
@@ -3,7 +3,10 @@
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "In this tutorial, we use `safeds` on **Titanic passenger data** to predict who will survive and who will not."
@@ -12,7 +15,10 @@
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "### Loading Data\n",
@@ -23,7 +29,10 @@
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "outputs": [
    {
@@ -75,7 +84,7 @@
    "from safeds.data.tabular.containers import Table\n",
    "\n",
    "raw_data = Table.from_csv_file(\"data/titanic.csv\")\n",
-   "#For visualisation purposes we only print out the first 15 rows.\n",
+   "# For visualisation purposes we only print out the first 15 rows.\n",
    "raw_data.slice_rows(length=15)"
   ]
  },
@@ -169,18 +178,18 @@
   "source": [
    "We remove certain columns for the following reasons:\n",
    "1. **high idness**: `id` , `ticket`\n",
-   "2. **high stability**: `parents_children` \n",
+   "2. **high stability**: `parents_children`\n",
    "3. **high missing value ratio**: `cabin`"
   ]
  },
  {
   "cell_type": "code",
-  "execution_count": 4,
+  "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
-   "train_table = train_table.remove_columns([\"id\",\"ticket\", \"parents_children\", \"cabin\"])\n",
-   "test_table = test_table.remove_columns([\"id\",\"ticket\", \"parents_children\", \"cabin\"])"
+   "train_table = train_table.remove_columns([\"id\", \"ticket\", \"parents_children\", \"cabin\"])\n",
+   "test_table = test_table.remove_columns([\"id\", \"ticket\", \"parents_children\", \"cabin\"])"
   ]
  },
  {
@@ -199,15 +208,18 @@
   "source": [
    "from safeds.data.tabular.transformation import SimpleImputer\n",
    "\n",
-   "simple_imputer = SimpleImputer(column_names=[\"age\",\"fare\"],strategy=SimpleImputer.Strategy.mean())\n",
+   "simple_imputer = SimpleImputer(column_names=[\"age\", \"fare\"], strategy=SimpleImputer.Strategy.mean())\n",
    "fitted_simple_imputer_train, transformed_train_data = simple_imputer.fit_and_transform(train_table)\n",
    "transformed_test_data = fitted_simple_imputer_train.transform(test_table)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "### Handling Nominal Categorical Data\n",
@@ -219,13 +231,18 @@
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "outputs": [],
   "source": [
    "from safeds.data.tabular.transformation import OneHotEncoder\n",
    "\n",
-   "fitted_one_hot_encoder_train, transformed_train_data = OneHotEncoder(column_names=[\"sex\", \"port_embarked\"]).fit_and_transform(transformed_train_data)\n",
+   "fitted_one_hot_encoder_train, transformed_train_data = OneHotEncoder(\n",
+   "    column_names=[\"sex\", \"port_embarked\"],\n",
+   ").fit_and_transform(transformed_train_data)\n",
    "transformed_test_data = fitted_one_hot_encoder_train.transform(transformed_test_data)"
   ]
  },
@@ -299,7 +316,10 @@
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "### Marking the Target Column\n",
@@ -314,17 +334,23 @@
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "outputs": [],
   "source": [
-   "tagged_train_table = transformed_train_data.to_tabular_dataset(\"survived\",extra_names=[\"name\"])"
+   "tagged_train_table = transformed_train_data.to_tabular_dataset(\"survived\", extra_names=[\"name\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "### Fitting a Classifier\n",
@@ -335,7 +361,10 @@
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "outputs": [],
   "source": [
@@ -348,7 +377,10 @@
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "### Predicting with the Classifier\n",
@@ -360,7 +392,10 @@
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "outputs": [],
   "source": [
@@ -433,14 +468,17 @@
   ],
   "source": [
    "reverse_transformed_prediction = prediction.to_table().inverse_transform_table(fitted_one_hot_encoder_train)\n",
-   "#For visualisation purposes we only print out the first 15 rows.\n",
+   "# For visualisation purposes we only print out the first 15 rows.\n",
    "reverse_transformed_prediction.slice_rows(length=15)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
-   "collapsed": false
+   "collapsed": false,
+   "jupyter": {
+    "outputs_hidden": false
+   }
   },
   "source": [
    "### Testing the Accuracy of the Model\n",
@@ -449,28 +487,18 @@
  },
  {
   "cell_type": "code",
-  "execution_count": 12,
-  "metadata": {
-   "collapsed": false
-  },
-  "outputs": [
-   {
-    "name": "stdout",
-    "output_type": "stream",
-    "text": [
-     "Accuracy on test data: 79.3893%\n"
-    ]
-   }
-  ],
+  "execution_count": null,
+  "metadata": {},
+  "outputs": [],
   "source": [
    "accuracy = fitted_classifier.accuracy(transformed_test_data) * 100\n",
-   "print(f'Accuracy on test data: {accuracy:.4f}%')"
+   "f\"Accuracy on test data: {accuracy:.4f}%\""
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
-  "display_name": "Python 3",
+  "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
@@ -488,5 +516,5 @@
  }
 },
 "nbformat": 4,
-"nbformat_minor": 0
+"nbformat_minor": 4
}