Overwrite specified schema (selectively) #754

Open
lars-reimann opened this issue May 11, 2024 · 0 comments

@lars-reimann
Member

Is your feature request related to a problem?

Sometimes, the data type of a column is inferred incorrectly.

Desired solution

Allow overwriting the schema when constructing a table.
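
To make the request concrete, here is a minimal, library-independent sketch of what selective overwriting could look like. The names `infer_type`, `build_schema`, and `schema_overrides` are hypothetical and not part of the library; the sketch only assumes that inferred types are kept for every column the caller does not mention.

```python
from typing import Any, Optional


def infer_type(values: list[Any]) -> str:
    """Toy inference: return the narrowest type name that fits all non-missing values."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, bool) for v in present):
        return "bool"
    if present and all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "int"
    if present and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "float"
    return "string"


def build_schema(
    data: dict[str, list[Any]],
    schema_overrides: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Infer a type per column, then replace only the explicitly overridden ones."""
    overrides = schema_overrides or {}
    unknown = set(overrides) - set(data)
    if unknown:
        raise KeyError(f"Overrides refer to unknown columns: {sorted(unknown)}")
    return {name: overrides.get(name, infer_type(values)) for name, values in data.items()}


# Postal codes look like integers, but should be treated as text.
data = {"id": [1, 2, 3], "zip_code": [1067, 80331, 20095], "score": [0.5, None, 1.0]}

inferred = build_schema(data)
overridden = build_schema(data, schema_overrides={"zip_code": "string"})
```

In this sketch only `zip_code` changes between `inferred` and `overridden`; the other columns keep their inferred types.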

Possible alternatives (optional)

No response

Screenshots (optional)

No response

Additional Context (optional)

No response

github-project-automation bot moved this to Backlog in Library May 11, 2024
lars-reimann added a commit that referenced this issue Jan 12, 2025
Closes #875
Closes #877
Closes partially #977

### Summary of Changes

Stabilize the API of the `Table` class. This PR introduces several
breaking changes to this class:

- All optional parameters are now keyword-only, so we can reposition
them later.
- The `data` parameter of `__init__` is now required.
- Rename `remove_columns_except` to `select_columns`
  - The new method can also be called with a callback that determines
    which columns to select.
- Rename `add_table_as_columns` to `add_tables_as_columns`
  - Multiple tables can now be passed at once.
- Rename `add_table_as_rows` to `add_tables_as_rows`
  - Multiple tables can now be passed at once.
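
Two of the changes above can be sketched without the library: keyword-only optional parameters and column selection by name list or callback. The functions below are stand-ins operating on plain dictionaries; their signatures (in particular the `(name, values)` callback shape and the `max_rows` parameter) are assumptions for illustration, not the signatures of `Table`.

```python
from collections.abc import Callable
from typing import Any, Union

Columns = dict[str, list[Any]]


def select_columns(
    columns: Columns,
    selector: Union[list[str], Callable[[str, list[Any]], bool]],
) -> Columns:
    """Keep the columns named in a list, or those for which the callback returns True."""
    if callable(selector):
        return {name: values for name, values in columns.items() if selector(name, values)}
    missing = [name for name in selector if name not in columns]
    if missing:
        raise KeyError(f"Unknown columns: {missing}")
    return {name: columns[name] for name in selector}


def preview(columns: Columns, *, max_rows: int = 5) -> Columns:
    """Everything after the bare `*` must be passed by keyword."""
    return {name: values[:max_rows] for name, values in columns.items()}


columns = {"id": [1, 2], "name": ["a", "b"], "score": [0.5, 0.7]}

by_name = select_columns(columns, ["score", "id"])
numeric_only = select_columns(
    columns,
    lambda name, values: all(isinstance(v, (int, float)) for v in values),
)
first_row = preview(columns, max_rows=1)  # preview(columns, 1) would raise a TypeError
```

Because callers cannot pass `max_rows` positionally, such parameters can later be reordered or extended without breaking existing call sites, which is the motivation given in the list above.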

It also adds new functionality throughout the library:

- New method `Table.add_index_column` to add a new column with
auto-incrementing integer values to a table.
- New method `Table.filter_rows` to keep only the rows matched by some
predicate.
- New method `Table.filter_rows_by_column` to keep only the rows that
have a value in a specific column that matches some predicate.
- New parameter `random_seed` for `Table.shuffle_rows` and
`Table.split_rows` to control the pseudorandom number generator.
Previously, the methods were deterministic, but the seed was hidden.
- New parameter `missing_value_ratio_threshold` of
`Table.remove_columns_with_missing_values` to be able to keep columns
with only a few missing values.
- Various static factory methods under `ColumnType` to instantiate
column types. This prepares for #754.
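
The `random_seed` and `missing_value_ratio_threshold` parameters can be illustrated with a small stand-alone sketch. The functions below work on plain lists and dictionaries and only borrow the parameter names from the list above; the default values and the exact comparison (`ratio <= threshold` keeps a column) are assumptions, not confirmed library behavior.

```python
import random
from typing import Any


def shuffle_rows(rows: list[Any], *, random_seed: int = 0) -> list[Any]:
    """Return a shuffled copy; the same seed always yields the same order."""
    rng = random.Random(random_seed)  # local generator, global state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


def remove_columns_with_missing_values(
    columns: dict[str, list[Any]],
    *,
    missing_value_ratio_threshold: float = 0.0,
) -> dict[str, list[Any]]:
    """Drop columns whose share of missing (None) values exceeds the threshold."""
    kept = {}
    for name, values in columns.items():
        ratio = sum(v is None for v in values) / len(values) if values else 0.0
        if ratio <= missing_value_ratio_threshold:
            kept[name] = values
    return kept


rows = list(range(10))
first = shuffle_rows(rows, random_seed=42)
second = shuffle_rows(rows, random_seed=42)  # same seed, same order as `first`

columns = {
    "a": [1, None, 3, 4],        # 25 % missing
    "b": [None, None, 1, 2],     # 50 % missing
    "c": [1, 2, 3, 4],           # nothing missing
}
strict = remove_columns_with_missing_values(columns)
lenient = remove_columns_with_missing_values(columns, missing_value_ratio_threshold=0.25)
```

With the assumed default of `0.0`, any column containing a missing value is dropped; raising the threshold to `0.25` additionally keeps column `a`.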

Finally, the methods `Table.summarize_statistics` and
`Column.summarize_statistics` are now considerably faster.

---------

Co-authored-by: megalinter-bot <[email protected]>
lars-reimann added a commit that referenced this issue Jan 14, 2025
Closes partially #754
Closes partially #977

### Summary of Changes

- Improve documentation for all methods of `Column`.
- Add the option to specify the column type when calling the
constructor. If omitted, it is still inferred from the data.
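
The "explicit type, otherwise inferred" behavior described above might look roughly like the following stand-in class. `SketchColumn`, its `column_type` parameter, and the use of plain Python types as column types are all hypothetical; the real `Column` constructor and `ColumnType` factories may differ.

```python
from typing import Any, Callable, Optional


class SketchColumn:
    """Stand-in column: uses the given type if present, otherwise infers one."""

    def __init__(
        self,
        name: str,
        data: list[Any],
        *,
        column_type: Optional[Callable[[Any], Any]] = None,
    ) -> None:
        self.name = name
        if column_type is None:
            # No type given: keep the values and infer the type from them.
            self.data = list(data)
            self.type_name = self._infer(self.data)
        else:
            # Type given: convert every non-missing value to it.
            self.data = [None if v is None else column_type(v) for v in data]
            self.type_name = column_type.__name__

    @staticmethod
    def _infer(values: list[Any]) -> str:
        kinds = {v.__class__.__name__ for v in values if v is not None}
        if len(kinds) == 1:
            return next(iter(kinds))
        return "mixed" if kinds else "unknown"


inferred = SketchColumn("zip_code", ["01067", "80331"])
explicit = SketchColumn("count", ["1", "2", None], column_type=int)
```

Here `inferred` keeps its string values, while `explicit` converts the non-missing entries to integers and leaves the missing one untouched.
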
lars-reimann added this to the v1.0.0 milestone Jan 15, 2025