This repository has been archived by the owner on Sep 13, 2023. It is now read-only.

Support batch-reading for data-types with chunksize parameter #221

Merged
merged 5 commits into from
May 15, 2022

Conversation

terryyylim
Contributor

@terryyylim terryyylim commented May 4, 2022

Context

Some datasets are large, and rather than processing them as one big block, we can split the data into chunks. This PR adds batch-reading support for data formats that provide the chunksize parameter in the Pandas API.

Related PRs: #206, #216

Supported formats:

  • CSV
  • JSON
  • Stata
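For these formats, passing chunksize to the pandas reader turns the call into an iterator of DataFrames instead of one in-memory frame; a minimal sketch of the idea:

```python
import io

import pandas as pd

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
chunks = pd.read_csv(csv_data, chunksize=2)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [2, 1] -- two full rows, then the remainder
```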

Modifications

  • mlem/cli/apply.py - Added a batch parameter when calling the apply method - supports both the on-the-fly and pre-imported data workflows
  • mlem/api/utils.py - Added batch reading support when getting Dataset value
  • mlem/core/errors.py - Added new errors UnsupportedDatasetBatchLoadingType and UnsupportedDatasetBatchLoading for batch-reading workflows
  • mlem/contrib/pandas.py - Added batch-reading support for CSV, JSON, Stata data formats
  • tests/contrib/test_pandas.py - Added tests for supported batch-reading data formats

Which issue(s) this PR fixes:
Fixes #23

@terryyylim
Contributor Author

@mike0sv , some thoughts on the following workflows:

Batching on import (import-data-on-the-fly) workflow
Currently, we call mlem/api/commands.py's import_object method, which calls the ImportHook class's process method.
This call loads the entire dataset into memory via mlem/core/metadata.py's get_object_metadata method and binds it to the dataset attribute.

I think what we could do for batch reading is:

  1. Add a new batch argument to the process method (https://github.com/iterative/mlem/blob/main/mlem/contrib/pandas.py#L535)
  2. Retrieve the object metadata:
    a. If the batch argument is not provided, retrieve the object metadata via the get_object_metadata method (https://github.com/iterative/mlem/blob/main/mlem/core/metadata.py#L25)
    b. If the batch argument is provided, retrieve only the dataset's metadata without binding it to the data attribute,

i.e. populating the reader_cache information and setting the data attribute to None.
Something like:

reader_cache={
    'dataset_type': {
        'columns': ['', 'sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
        'dtypes': ['int64', 'float64', 'float64', 'float64', 'float64'],
        'index_cols': [''],
        'type': 'dataframe',
    },
    'format': 'csv',
    'type': 'pandas',
}
data=None
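The metadata-only step could be sketched like this (dataset_metadata is a hypothetical helper, not mlem's actual API, which builds this via its DatasetType analysis):

```python
import pandas as pd

def dataset_metadata(df: pd.DataFrame) -> dict:
    # Hypothetical helper: capture only reader-cache-style metadata,
    # discarding the data itself.
    return {
        "dataset_type": {
            "columns": list(df.columns),
            "dtypes": [str(dtype) for dtype in df.dtypes],
            "type": "dataframe",
        },
        "format": "csv",
        "type": "pandas",
    }

df = pd.DataFrame({"sepal length (cm)": [5.1], "sepal width (cm)": [3.5]})
meta = dataset_metadata(df)
data = None  # the data attribute stays unbound when batch is requested
print(meta["dataset_type"]["dtypes"])  # ['float64', 'float64']
```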

By doing so, when the apply method (https://github.com/iterative/mlem/blob/main/mlem/cli/apply.py#L98) is called subsequently, it will trigger the batch-reading workflow as implemented in the current PR (https://github.com/iterative/mlem/pull/216/files#diff-7c25a5730147dd254f2b720216ef3c356a9828b85afdbea0c88123a85212bd18R17).
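The batch-reading apply workflow amounts to applying the model one chunk at a time, so only batch_size rows are resident at once. A sketch under those assumptions (apply_in_batches and make_reader are hypothetical names, not mlem's API):

```python
import io

import pandas as pd

def apply_in_batches(make_reader, model_fn, batch_size):
    """Hypothetical sketch: read the dataset in chunks and apply the
    model to each chunk, yielding one result per batch."""
    for chunk in make_reader(batch_size):
        yield model_fn(chunk)

csv_data = "x\n1\n2\n3\n4\n5\n"
make_reader = lambda bs: pd.read_csv(io.StringIO(csv_data), chunksize=bs)
# Toy "model": sum each batch's x column.
results = list(apply_in_batches(make_reader, lambda df: df["x"].sum(), 2))
print(results)  # [3, 7, 5] -- per-batch sums over chunks [1,2], [3,4], [5]
```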


Batch writing
For batch writing, this concerns the save method (https://github.com/iterative/mlem/blob/main/mlem/core/metadata.py#L47), which calls the dump method and triggers writing via the get_artifacts method (https://github.com/iterative/mlem/blob/main/mlem/core/objects.py#L486).

We could expose a new write_batch method that covers two scenarios:

  1. Writing data in batches via a new batch (int) argument passed to the related function calls, much like how batch reading is done now, e.g.:
df.to_csv("path/to/save/file.csv", chunksize=batch)
  2. AND writing data that is suitable for batch reading via a new batch_read (boolean) argument, where internally within e.g. our contrib.pandas module we could utilise PANDAS_FORMATS's write_args, which has been configured for batching in this PR, i.e. https://github.com/iterative/mlem/pull/216/files#diff-706e38d0ea328a5aa14720511896d8e86f7e75edbe79156defa0dea248bfd858R499.

When writing in batches, we also need to consider whether the output parameter is provided (https://github.com/iterative/mlem/blob/main/mlem/cli/apply.py#L35).

If output is provided, we'll only read data into memory in batches (i.e. at any point in time, only a small chunk should be in memory). Currently, chunksize writing via the Pandas API is only supported for CSV and Feather. Assuming CSV format for this description:

Scenario 1: File does not exist

  • use the default mode='w' and chunksize=batch to write the CSV

Scenario 2: File exists

  • use the non-default mode='a' and chunksize=batch to append to the CSV
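The two scenarios above can be sketched in one hypothetical write_csv_batch helper (the name and signature are mine for illustration, not the proposed mlem API):

```python
import os
import tempfile

import pandas as pd

def write_csv_batch(df: pd.DataFrame, path: str, batch: int) -> None:
    """Hypothetical sketch: the first call creates the file (mode='w',
    with a header); later calls append (mode='a', no header)."""
    exists = os.path.exists(path)
    df.to_csv(
        path,
        mode="a" if exists else "w",
        header=not exists,  # write the header only once
        chunksize=batch,
        index=False,
    )

path = os.path.join(tempfile.mkdtemp(), "out.csv")
write_csv_batch(pd.DataFrame({"a": [1, 2]}), path, batch=1)  # scenario 1: file does not exist
write_csv_batch(pd.DataFrame({"a": [3]}), path, batch=1)     # scenario 2: file exists
print(list(pd.read_csv(path)["a"]))  # [1, 2, 3]
```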

If no output is provided, we'll read everything into memory in batches:

Scenario 1: Using API

  • Combine all batches' DataFrames into a single DataFrame and return it to the client
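For the API scenario, combining the per-batch DataFrames is a one-liner with pd.concat; a minimal sketch:

```python
import io

import pandas as pd

# Read in chunks, then combine the per-batch DataFrames into the
# single result that the API caller receives.
csv_data = io.StringIO("a\n1\n2\n3\n")
batches = pd.read_csv(csv_data, chunksize=2)
combined = pd.concat(batches, ignore_index=True)
print(len(combined))  # 3
```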

Scenario 2: Using CLI

@terryyylim terryyylim force-pushed the terence/dataset-batch-reading branch from f8b593b to 7400623 Compare May 8, 2022 02:32
@terryyylim terryyylim force-pushed the terence/dataset-batch-reading branch from 7400623 to fca1b89 Compare May 8, 2022 02:36
@mike0sv
Contributor

mike0sv commented May 8, 2022

Sorry for the long wait! I think the logic part is done. Now please fix the codestyle/formatting and make this PR pass all checks (you can run pre-commit locally).

@mike0sv
Contributor

mike0sv commented May 8, 2022

Also, let's rename the batch arg to batch_size.

@codecov

codecov bot commented May 8, 2022

Codecov Report

Merging #221 (8952c06) into main (9aeac4a) will increase coverage by 0.30%.
The diff coverage is 94.11%.

@@            Coverage Diff             @@
##             main     #221      +/-   ##
==========================================
+ Coverage   89.31%   89.62%   +0.30%     
==========================================
  Files          75       76       +1     
  Lines        5298     5436     +138     
==========================================
+ Hits         4732     4872     +140     
+ Misses        566      564       -2     
Impacted Files Coverage Δ
mlem/cli/clone.py 100.00% <ø> (ø)
mlem/cli/import_object.py 100.00% <ø> (ø)
mlem/config.py 98.64% <ø> (-0.02%) ⬇️
mlem/core/meta_io.py 93.92% <ø> (ø)
mlem/runtime/client/base.py 90.41% <ø> (-0.13%) ⬇️
mlem/cli/types.py 22.44% <25.00%> (ø)
mlem/cli/main.py 76.44% <66.66%> (ø)
mlem/runtime/interface/base.py 84.78% <66.66%> (ø)
mlem/ext.py 86.45% <69.23%> (-1.55%) ⬇️
mlem/core/metadata.py 92.72% <87.50%> (-1.72%) ⬇️
... and 28 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d720df4...8952c06.

@terryyylim terryyylim force-pushed the terence/dataset-batch-reading branch 2 times, most recently from d449905 to a6c4b99 Compare May 8, 2022 15:42
@terryyylim
Contributor Author

@mike0sv I'm guessing codecov/patch is failing even though codecov/project is passing because of the missing test for the apply workflow with the batch_size parameter. Would you like me to refactor this (https://github.com/iterative/mlem/blob/main/tests/conftest.py#L108) to separate the pandas and numpy fixtures so that we can reuse the pandas fixtures for cli/test_apply.py, or should we merge as is?

@mike0sv
Contributor

mike0sv commented May 9, 2022 via email

@terryyylim terryyylim force-pushed the terence/dataset-batch-reading branch from a6c4b99 to 3222655 Compare May 9, 2022 15:38
@terryyylim terryyylim force-pushed the terence/dataset-batch-reading branch from 3222655 to a8791f1 Compare May 9, 2022 15:47
@mike0sv
Contributor

mike0sv commented May 9, 2022

I just merged a big refactor PR into main and resolved the conflicts, but there are a couple of places where you need to rename the classes yourself. So DatasetMeta should become MlemDataset, and maybe some others.

@terryyylim
Contributor Author

Okay @mike0sv, I think the conflicts have been resolved.

Successfully merging this pull request may close these issues.

Interface for batch loading for datasets