feat: add schema conversions when adding new rows to a table and schema conversion when creating a new table #432

Merged
merged 66 commits into from
Jul 12, 2023
Changes from 60 commits
Commits
6fc5289
_data_type function
PhilipGutberlet May 26, 2023
f94a226
Columny_type is working without is_nullable
PhilipGutberlet Jun 2, 2023
cd7d745
feat: Added new static method `Schema.merge_multiple_schemas` to merg…
Marsmaennchen221 Jun 2, 2023
faea224
test: Corrected tests for different schemas
Marsmaennchen221 Jun 2, 2023
fbf0411
Columny_type is working except numeric + print statements for Alex
PhilipGutberlet Jun 2, 2023
87f9b47
Merge branch 'main' of https://github.com/Safe-DS/Stdlib into 127-add…
Marsmaennchen221 Jun 2, 2023
c7774d6
feat: Added abstract constructor to `ColumnType`
Marsmaennchen221 Jun 2, 2023
80ee415
style: apply automated linter fixes
megalinter-bot Jun 2, 2023
88bb00b
style: apply automated linter fixes
megalinter-bot Jun 2, 2023
ffbf2d6
Everythings works properly except numeric Columns with None
PhilipGutberlet Jun 2, 2023
cbd7722
Changes by hussi
PhilipGutberlet Jun 2, 2023
0bc211b
Merge branch 'main' into 322-detected-the-true-columntypes-when-initi…
daniaHu Jun 16, 2023
ff6025b
Merge branch 'main' into 322-detected-the-true-columntypes-when-initi…
daniaHu Jun 23, 2023
aae7b81
feat: fixed some tests, now Columns aren't wraped in pd.Series
daniaHu Jun 23, 2023
f54fdd5
Fixed Bug where test would break if the first cell in a column is null
sibre28 Jun 23, 2023
fafce09
Merge remote-tracking branch 'origin/322-detected-the-true-columntype…
daniaHu Jun 23, 2023
655bee0
changes rolled back, couldn't find a way to work with pd.DataFrame an…
daniaHu Jun 23, 2023
923cad0
fix: fix wrong datatype error
alex-senger Jun 30, 2023
e725fd8
Merge remote-tracking branch 'origin/127-add_row-and-add_rows-should-…
alex-senger Jun 30, 2023
cf5cfa3
Merge remote-tracking branch 'origin/322-detected-the-true-columntype…
alex-senger Jun 30, 2023
2431d02
fix: fix merge problems
alex-senger Jun 30, 2023
4844788
fix: fix `add_rows`
alex-senger Jun 30, 2023
d3ba722
Fix merge_multiple_schemas() method to also handle Nothing Types corr…
sibre28 Jul 1, 2023
69d0fb4
Fix add_row() Method to correctly handle a row with a different schema
sibre28 Jul 2, 2023
30e8c1f
fix: fix `remove_rows_with_missing_values` to update schema
alex-senger Jul 7, 2023
7de0f45
fix: fix table transformer error handling
alex-senger Jul 7, 2023
7af83da
fix: fix `one_hot_encoder` to be able to handle `float("nan")` values
alex-senger Jul 7, 2023
9d71c12
Merge branch 'main' of https://github.com/Safe-DS/Stdlib into 404-mer…
alex-senger Jul 7, 2023
ec62ba3
fix: fix test_row parameterize
alex-senger Jul 7, 2023
fafad3a
fix: add typehints
alex-senger Jul 7, 2023
9d3818e
Try stuff to make linter happy
sibre28 Jul 7, 2023
0f474c9
fix: fix error handling and typehint
alex-senger Jul 7, 2023
5567929
Try stuff to make linter happy
sibre28 Jul 7, 2023
6437a90
Merge branch '404-merge-issues-322-and-127' of https://github.com/Saf…
alex-senger Jul 7, 2023
9d44318
Try stuff to make linter happy
sibre28 Jul 7, 2023
7949af1
fix: trying our best to make the linter happy
alex-senger Jul 7, 2023
878c82b
Merge branch '404-merge-issues-322-and-127' of https://github.com/Saf…
alex-senger Jul 7, 2023
4903ae4
style: apply automated linter fixes
megalinter-bot Jul 7, 2023
90a9948
Add comment to linter solution
sibre28 Jul 7, 2023
66585a6
Add comment to linter solution
sibre28 Jul 7, 2023
109b2a7
style: apply automated linter fixes
megalinter-bot Jul 7, 2023
b4881f6
fix: fix `_data_type`
alex-senger Jul 7, 2023
7d89395
style: apply automated linter fixes
megalinter-bot Jul 7, 2023
e9c1201
Merge branch 'main' into 404-merge-issues-322-and-127
alex-senger Jul 7, 2023
c74e2d4
Merge branch 'main' into 404-merge-issues-322-and-127
lars-reimann Jul 7, 2023
6267075
test: add tests for `sort_columns`
alex-senger Jul 10, 2023
9d72cd0
test: add test for unsupported data types
alex-senger Jul 10, 2023
9c7bc9e
Merge branch 'main' of https://github.com/Safe-DS/Stdlib into 404-mer…
alex-senger Jul 10, 2023
8386ac2
fix: remove unnecessary code
alex-senger Jul 10, 2023
a52a12b
fix: remove duplicate code
alex-senger Jul 10, 2023
672b6b1
fix: remove unnecessary code
alex-senger Jul 10, 2023
6ed0990
test: Add test for CodeCov
alex-senger Jul 10, 2023
531e7d6
fix: Fix Typehint and add match to raises
alex-senger Jul 10, 2023
18f27ea
test: remove test because it makes the Linter fail
alex-senger Jul 10, 2023
ecb8796
style: apply automated linter fixes
megalinter-bot Jul 10, 2023
95526cb
fix: replace pass with docstring
alex-senger Jul 11, 2023
50e15fc
Apply suggestions from code review
alex-senger Jul 11, 2023
50be3f6
Merge branch 'main' of https://github.com/Safe-DS/Stdlib into 404-mer…
alex-senger Jul 11, 2023
171de13
style: apply automated linter fixes
megalinter-bot Jul 11, 2023
9c16312
fix: remove `SchemaMismatchError`
alex-senger Jul 11, 2023
4c8f2a9
Update src/safeds/data/tabular/typing/__init__.py
Marsmaennchen221 Jul 11, 2023
80c35d5
Apply suggestions from code review
alex-senger Jul 12, 2023
c944492
test: apply suggestion from codereview
alex-senger Jul 12, 2023
c2b2679
test: add import
alex-senger Jul 12, 2023
918f4b6
Merge branch 'main' of https://github.com/Safe-DS/Stdlib into 404-mer…
alex-senger Jul 12, 2023
551e396
style: apply automated linter fixes
megalinter-bot Jul 12, 2023
4 changes: 2 additions & 2 deletions src/safeds/data/tabular/containers/_column.py
Original file line number Diff line number Diff line change
@@ -76,7 +76,7 @@ def _from_pandas_series(data: pd.Series, type_: ColumnType | None = None) -> Col
result._name = data.name
result._data = data
# noinspection PyProtectedMember
result._type = type_ if type_ is not None else ColumnType._from_numpy_data_type(data.dtype)
result._type = type_ if type_ is not None else ColumnType._data_type(data)

return result
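
The change above switches from reading the pandas dtype to inspecting the column's values. The body of `ColumnType._data_type` is not shown in this diff, so the following is only a sketch of what value-based inference might look like. The function name `infer_column_type`, the string labels, and the rule that mixed integers and floats become `"RealNumber"` are all assumptions made for illustration.

```python
import math


def infer_column_type(values):
    """Return a hypothetical (type_label, is_nullable) pair for a list of values."""

    def is_missing(value):
        # Treat both None and float NaN as missing values.
        return value is None or (isinstance(value, float) and math.isnan(value))

    is_nullable = any(is_missing(v) for v in values)
    present = [v for v in values if not is_missing(v)]
    if not present:
        # Nothing to infer from: empty column or only missing values.
        return ("Nothing", is_nullable)

    def label(value):
        # bool must be tested before int because bool is a subclass of int.
        if isinstance(value, bool):
            return "Boolean"
        if isinstance(value, int):
            return "Integer"
        if isinstance(value, float):
            return "RealNumber"
        if isinstance(value, str):
            return "String"
        return "Anything"

    labels = {label(v) for v in present}
    if len(labels) == 1:
        return (labels.pop(), is_nullable)
    if labels == {"Integer", "RealNumber"}:
        return ("RealNumber", is_nullable)  # assumed widening rule
    return ("Anything", is_nullable)
```

Because missing values are skipped before labelling, a leading `None` does not decide the result, which is the kind of case the commit "Fixed Bug where test would break if the first cell in a column is null" refers to.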

@@ -106,7 +106,7 @@ def __init__(self, name: str, data: Sequence[T] | None = None) -> None:
self._name: str = name
self._data: pd.Series = data.rename(name) if isinstance(data, pd.Series) else pd.Series(data, name=name)
# noinspection PyProtectedMember
self._type: ColumnType = ColumnType._from_numpy_data_type(self._data.dtype)
self._type: ColumnType = ColumnType._data_type(data)
alex-senger marked this conversation as resolved.

def __contains__(self, item: Any) -> bool:
return item in self._data
36 changes: 35 additions & 1 deletion src/safeds/data/tabular/containers/_row.py
@@ -1,7 +1,8 @@
from __future__ import annotations

import copy
from collections.abc import Mapping
import functools
from collections.abc import Callable, Mapping
from typing import TYPE_CHECKING, Any

import pandas as pd
@@ -441,6 +442,39 @@ def get_column_type(self, column_name: str) -> ColumnType:
"""
return self._schema.get_column_type(column_name)

# ------------------------------------------------------------------------------------------------------------------
# Transformations
# ------------------------------------------------------------------------------------------------------------------

def sort_columns(
self,
comparator: Callable[[tuple, tuple], int] = lambda col1, col2: (col1[0] > col2[0]) - (col1[0] < col2[0]),
) -> Row:
"""
Sort the columns of a `Row` with the given comparator and return a new `Row`.

The original row is not modified. The comparator is a function that takes two tuples of (ColumnName, Value) `col1` and `col2` and
returns an integer:

* If `col1` should be ordered before `col2`, the function should return a negative number.
* If `col1` should be ordered after `col2`, the function should return a positive number.
* If the original order of `col1` and `col2` should be kept, the function should return 0.

If no comparator is given, the columns will be sorted alphabetically by their name.

Parameters
----------
comparator : Callable[[tuple, tuple], int]
The function used to compare two tuples of (ColumnName, Value).

Returns
-------
new_row : Row
A new row with sorted columns.
"""
sorted_row_dict = dict(sorted(self.to_dict().items(), key=functools.cmp_to_key(comparator)))
return Row.from_dict(sorted_row_dict)
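
The comparator contract described in the docstring can be tried out without a `Row`. The sketch below applies the same `functools.cmp_to_key` technique to a plain dict; `sort_items` is a hypothetical stand-in for `Row.sort_columns`, not part of the library.

```python
import functools


def sort_items(row, comparator=lambda a, b: (a[0] > b[0]) - (a[0] < b[0])):
    """Return a new dict whose (name, value) pairs are ordered by `comparator`."""
    # Each item handed to the comparator is a (column_name, value) tuple.
    return dict(sorted(row.items(), key=functools.cmp_to_key(comparator)))


row = {"b": 2, "a": 1, "c": 3}

# Default comparator: alphabetical by column name.
by_name = sort_items(row)

# Custom comparator: descending by value (negative result puts `a` first).
by_value_descending = sort_items(row, lambda a, b: b[1] - a[1])
```

The default comparator uses the `(x > y) - (x < y)` idiom, which yields -1, 0, or 1 and so fits the negative/zero/positive convention from the docstring.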

# ------------------------------------------------------------------------------------------------------------------
# Conversion
# ------------------------------------------------------------------------------------------------------------------
85 changes: 43 additions & 42 deletions src/safeds/data/tabular/containers/_table.py
@@ -24,7 +24,6 @@
DuplicateColumnNameError,
IndexOutOfBoundsError,
NonNumericColumnError,
SchemaMismatchError,
UnknownColumnNameError,
WrongFileExtensionError,
)
@@ -302,8 +301,8 @@ def from_rows(rows: list[Row]) -> Table:

Raises
------
SchemaMismatchError
If any of the row schemas does not match with the others.
UnknownColumnNameError
If any of the row column names does not match with the first row.

Examples
--------
@@ -318,17 +317,22 @@ def from_rows(rows: list[Row]) -> Table:
if len(rows) == 0:
return Table._from_pandas_dataframe(pd.DataFrame())

schema_compare: Schema = rows[0]._schema
column_names_compare: list = list(rows[0].column_names)
unknown_column_names = set()
row_array: list[pd.DataFrame] = []

for row in rows:
if schema_compare != row._schema:
raise SchemaMismatchError
unknown_column_names.update(set(column_names_compare) - set(row.column_names))
row_array.append(row._data)
if len(unknown_column_names) > 0:
raise UnknownColumnNameError(list(unknown_column_names))

dataframe: DataFrame = pd.concat(row_array, ignore_index=True)
dataframe.columns = schema_compare.column_names
return Table._from_pandas_dataframe(dataframe)
dataframe.columns = column_names_compare

schema = Schema.merge_multiple_schemas([row.schema for row in rows])

return Table._from_pandas_dataframe(dataframe, schema)
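
The new column-name check in `from_rows` can be reproduced with plain dicts. This is a minimal sketch under the assumption that each row is a dict keyed by column name; `missing_column_names` is a hypothetical helper, not a library function.

```python
def missing_column_names(rows):
    """Collect names from the first row that are absent from any row."""
    if not rows:
        return []
    expected = set(rows[0])
    missing = set()
    for row in rows:
        # Set difference: names the first row has but this row lacks.
        missing.update(expected - set(row))
    return sorted(missing)
```

The difference is one-directional. A later row that carries an additional column, but has every column of the first row, produces an empty result in this sketch, so that case would have to be caught elsewhere (for example when the schemas are merged).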

@staticmethod
def _from_pandas_dataframe(data: pd.DataFrame, schema: Schema | None = None) -> Table:
@@ -906,6 +910,9 @@ def add_row(self, row: Row) -> Table:

If the table happens to be empty beforehand, respective columns will be added automatically.

The order of columns of the new row will be adjusted to the order of columns in the table.
The new table will contain the merged schema.

This table is not modified.

Parameters
@@ -920,8 +927,8 @@ def add_row(self, row: Row) -> Table:

Raises
------
SchemaMismatchError
If the schema of the row does not match the table schema.
UnknownColumnNameError
If the row has different column names than the table.

Examples
--------
@@ -935,20 +942,18 @@ def add_row(self, row: Row) -> Table:
"""
int_columns = []
result = self._copy()
if self.number_of_columns == 0:
return Table.from_rows([row])
if len(set(self.column_names) - set(row.column_names)) > 0:
raise UnknownColumnNameError(list(set(self.column_names) - set(row.column_names)))

if result.number_of_rows == 0:
int_columns = list(filter(lambda name: isinstance(row[name], int | np.int64), row.column_names))
if result.number_of_columns == 0:
for column in row.column_names:
result._data[column] = Column(column, [])
result._schema = Schema._from_pandas_dataframe(result._data)
elif result.column_names != row.column_names:
raise SchemaMismatchError
elif result._schema != row.schema:
raise SchemaMismatchError
int_columns = list(filter(lambda name: isinstance(row[name], int | np.int64 | np.int32), row.column_names))

new_df = pd.concat([result._data, row._data]).infer_objects()
new_df.columns = result.column_names
result = Table._from_pandas_dataframe(new_df)
schema = Schema.merge_multiple_schemas([result.schema, row.schema])
result = Table._from_pandas_dataframe(new_df, schema)

for column in int_columns:
result = result.replace_column(column, [result.get_column(column).transform(lambda it: int(it))])
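
`add_row` now merges the table schema with the row schema instead of rejecting a mismatch. The implementation of `Schema.merge_multiple_schemas` is not part of this diff, so the sketch below only illustrates the idea. The representation (a dict from column name to a `(type_label, is_nullable)` tuple), the label strings, and every merge rule are assumptions.

```python
def merge_schemas(schemas):
    """Merge hypothetical schemas given as {column_name: (type_label, is_nullable)}."""
    merged = dict(schemas[0])  # column order follows the first schema
    for schema in schemas[1:]:
        if set(schema) != set(merged):
            # Report every name that appears on only one side.
            raise KeyError(sorted(set(merged) ^ set(schema)))
        for name, (left, left_nullable) in list(merged.items()):
            right, right_nullable = schema[name]
            nullable = left_nullable or right_nullable
            if left == right:
                merged_type = left
            elif left == "Nothing":
                merged_type, nullable = right, True
            elif right == "Nothing":
                merged_type, nullable = left, True
            elif {left, right} == {"Integer", "RealNumber"}:
                merged_type = "RealNumber"  # assumed numeric widening
            else:
                merged_type = "Anything"  # assumed fallback for unrelated types
            merged[name] = (merged_type, nullable)
    return merged
```

Under these assumed rules, adding a row with a float to an integer column widens the column instead of raising, which matches the intent of replacing `SchemaMismatchError` with a merged schema.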
@@ -959,6 +964,9 @@ def add_rows(self, rows: list[Row] | Table) -> Table:
"""
Add multiple rows to a table.

The order of columns of the new rows will be adjusted to the order of columns in the table.
The new table will contain the merged schema.

This table is not modified.

Parameters
@@ -973,8 +981,8 @@ def add_rows(self, rows: list[Row] | Table) -> Table:

Raises
------
SchemaMismatchError
If the schema of one of the rows does not match the table schema.
UnknownColumnNameError
If at least one of the rows has different column names than the table.

Examples
--------
@@ -990,28 +998,21 @@ def add_rows(self, rows: list[Row] | Table) -> Table:
"""
if isinstance(rows, Table):
rows = rows.to_rows()
int_columns = []
result = self._copy()

if len(rows) == 0:
return self._copy()

different_column_names = set()
for row in rows:
if result.number_of_rows == 0:
int_columns = list(filter(lambda name: isinstance(row[name], int | np.int64), row.column_names))
if result.number_of_columns == 0:
for column in row.column_names:
result._data[column] = Column(column, [])
result._schema = Schema._from_pandas_dataframe(result._data)
elif result.column_names != row.column_names:
raise SchemaMismatchError
elif result._schema != row.schema:
raise SchemaMismatchError

row_frames = (row._data for row in rows)

new_df = pd.concat([result._data, *row_frames]).infer_objects()
new_df.columns = result.column_names
result = Table._from_pandas_dataframe(new_df)
different_column_names.update(set(rows[0].column_names) - set(row.column_names))
if len(different_column_names) > 0:
raise UnknownColumnNameError(list(different_column_names))

for column in int_columns:
result = result.replace_column(column, [result.get_column(column).transform(lambda it: int(it))])
result = self._copy()

for row in rows:
result = result.add_row(row)

return result

@@ -1269,7 +1270,7 @@ def remove_rows_with_missing_values(self) -> Table:
"""
result = self._data.copy(deep=True)
result = result.dropna(axis="index")
return Table._from_pandas_dataframe(result, self._schema)
return Table._from_pandas_dataframe(result)

def remove_rows_with_outliers(self) -> Table:
"""
6 changes: 3 additions & 3 deletions src/safeds/data/tabular/transformation/_label_encoder.py
@@ -152,6 +152,9 @@ def inverse_transform(self, transformed_table: Table) -> Table:
if len(missing_columns) > 0:
raise UnknownColumnNameError(missing_columns)

if transformed_table.number_of_rows == 0:
raise ValueError("The LabelEncoder cannot inverse transform the table because it contains 0 rows")

if transformed_table.keep_only_columns(
self._column_names,
).remove_columns_with_non_numerical_values().number_of_columns < len(self._column_names):
@@ -168,9 +171,6 @@ def inverse_transform(self, transformed_table: Table) -> Table:
),
)

if transformed_table.number_of_rows == 0:
raise ValueError("The LabelEncoder cannot inverse transform the table because it contains 0 rows")

data = transformed_table._data.copy()
data.columns = transformed_table.column_names
data[self._column_names] = self._wrapped_transformer.inverse_transform(data[self._column_names])
12 changes: 9 additions & 3 deletions src/safeds/data/tabular/transformation/_one_hot_encoder.py
@@ -277,6 +277,9 @@ def inverse_transform(self, transformed_table: Table) -> Table:
if len(missing_columns) > 0:
raise UnknownColumnNameError(missing_columns)

if transformed_table.number_of_rows == 0:
raise ValueError("The OneHotEncoder cannot inverse transform the table because it contains 0 rows")

if transformed_table._as_table().keep_only_columns(
_transformed_column_names,
).remove_columns_with_non_numerical_values().number_of_columns < len(_transformed_column_names):
@@ -293,9 +296,6 @@ def inverse_transform(self, transformed_table: Table) -> Table:
),
)

if transformed_table.number_of_rows == 0:
raise ValueError("The OneHotEncoder cannot inverse transform the table because it contains 0 rows")

original_columns = {}
for original_column_name in self._column_names:
original_columns[original_column_name] = [None for _ in range(transformed_table.number_of_rows)]
@@ -306,6 +306,12 @@ def inverse_transform(self, transformed_table: Table) -> Table:
if transformed_table.get_column(constructed_column)[i] == 1.0:
original_columns[original_column_name][i] = value

for original_column_name in self._value_to_column_nans:
constructed_column = self._value_to_column_nans[original_column_name]
for i in range(transformed_table.number_of_rows):
if transformed_table.get_column(constructed_column)[i] == 1.0:
original_columns[original_column_name][i] = np.nan
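
The added loop restores NaN values from a dedicated indicator column. The same step can be sketched on plain Python lists. The helper `restore_nans`, the column names, and the dict-of-lists layout are assumptions for illustration; only the "indicator equals 1.0 means NaN" rule is taken from the diff.

```python
import math


def restore_nans(encoded, nan_columns, restored):
    """Write NaN into `restored` wherever the NaN indicator column is 1.0.

    encoded:     {one_hot_column_name: [0.0 or 1.0, ...]}
    nan_columns: {original_column_name: nan_indicator_column_name}
    restored:    {original_column_name: [value or None, ...]}
    """
    # Copy the lists so the caller's data is left untouched.
    result = {name: list(values) for name, values in restored.items()}
    for original_name, indicator_name in nan_columns.items():
        for i, flag in enumerate(encoded[indicator_name]):
            if flag == 1.0:
                result[original_name][i] = float("nan")
    return result


encoded = {"col__a": [1.0, 0.0, 0.0], "col__nan": [0.0, 1.0, 0.0]}
restored = {"col": ["a", None, None]}
out = restore_nans(encoded, {"col": "col__nan"}, restored)
```

One likely reason for tracking NaN in its own mapping, though the PR does not say so, is that NaN never compares equal to itself, so it cannot be matched by an ordinary value lookup.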

table = transformed_table

for column_name, encoded_column in original_columns.items():
18 changes: 9 additions & 9 deletions src/safeds/data/tabular/transformation/_range_scaler.py
@@ -66,6 +66,9 @@ def fit(self, table: Table, column_names: list[str] | None) -> RangeScaler:
if len(missing_columns) > 0:
raise UnknownColumnNameError(missing_columns)

if table.number_of_rows == 0:
raise ValueError("The RangeScaler cannot be fitted because the table contains 0 rows")

if (
table.keep_only_columns(column_names).remove_columns_with_non_numerical_values().number_of_columns
< table.keep_only_columns(column_names).number_of_columns
@@ -83,9 +86,6 @@ def fit(self, table: Table, column_names: list[str] | None) -> RangeScaler:
),
)

if table.number_of_rows == 0:
raise ValueError("The RangeScaler cannot be fitted because the table contains 0 rows")

wrapped_transformer = sk_MinMaxScaler((self._minimum, self._maximum))
wrapped_transformer.fit(table._data[column_names])
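
The transformer changes in this PR all move the zero-row check ahead of the non-numeric check. The PR does not state the reason. One plausible reading, offered here as an assumption, is that an empty column has no numeric type once types are inferred from values, so the non-numeric error would otherwise be raised first. The sketch below shows that ordering on a dict of lists; `validate_for_fit`, `is_numeric_column`, and the exception types are hypothetical.

```python
def is_numeric_column(values):
    # Assumption: an empty column has no values to infer a numeric type from.
    return len(values) > 0 and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )


def validate_for_fit(table, column_names):
    """Check a {column_name: values} table in the order used after this PR."""
    # 1. Unknown column names.
    missing = sorted(set(column_names) - set(table))
    if missing:
        raise KeyError(missing)

    # 2. Empty table, checked before any type-based validation.
    number_of_rows = max((len(values) for values in table.values()), default=0)
    if number_of_rows == 0:
        raise ValueError("cannot fit: the table contains 0 rows")

    # 3. Non-numeric columns.
    non_numeric = sorted(n for n in column_names if not is_numeric_column(table[n]))
    if non_numeric:
        raise TypeError(f"non-numeric columns: {non_numeric}")
```

With steps 2 and 3 swapped, an empty table would be reported as containing non-numeric columns, which is a less helpful message than the zero-row error.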

@@ -131,6 +131,9 @@ def transform(self, table: Table) -> Table:
if len(missing_columns) > 0:
raise UnknownColumnNameError(missing_columns)

if table.number_of_rows == 0:
raise ValueError("The RangeScaler cannot transform the table because it contains 0 rows")

if (
table.keep_only_columns(self._column_names).remove_columns_with_non_numerical_values().number_of_columns
< table.keep_only_columns(self._column_names).number_of_columns
@@ -148,9 +151,6 @@ def transform(self, table: Table) -> Table:
),
)

if table.number_of_rows == 0:
raise ValueError("The RangeScaler cannot transform the table because it contains 0 rows")

data = table._data.copy()
data.columns = table.column_names
data[self._column_names] = self._wrapped_transformer.transform(data[self._column_names])
@@ -191,6 +191,9 @@ def inverse_transform(self, transformed_table: Table) -> Table:
if len(missing_columns) > 0:
raise UnknownColumnNameError(missing_columns)

if transformed_table.number_of_rows == 0:
raise ValueError("The RangeScaler cannot transform the table because it contains 0 rows")

if (
transformed_table.keep_only_columns(self._column_names)
.remove_columns_with_non_numerical_values()
@@ -210,9 +213,6 @@ def inverse_transform(self, transformed_table: Table) -> Table:
),
)

if transformed_table.number_of_rows == 0:
raise ValueError("The RangeScaler cannot transform the table because it contains 0 rows")

data = transformed_table._data.copy()
data.columns = transformed_table.column_names
data[self._column_names] = self._wrapped_transformer.inverse_transform(data[self._column_names])