This repository has been archived by the owner on Sep 13, 2023. It is now read-only.

Support batch-reading for data-types with chunksize parameter #221

Merged
merged 5 commits into from
May 15, 2022

Conversation

terryyylim
Contributor

@terryyylim terryyylim commented May 4, 2022

Context

Some datasets are large, and rather than processing them as one big block, we can split the data into chunks. This PR adds batch-reading support for data formats that provide the chunksize parameter in the Pandas API.

Related PRs: #206, #216

Supported formats:

  • CSV
  • JSON
  • Stata
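For these formats, passing chunksize to the pandas reader turns the call into an iterator of DataFrames instead of one in-memory frame; a minimal sketch of the idea:

```python
import io

import pandas as pd

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
csv_data = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
chunks = pd.read_csv(csv_data, chunksize=2)
sizes = [len(chunk) for chunk in chunks]
print(sizes)  # [2, 1] -- two full rows, then the remainder
```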

Modifications

  • mlem/cli/apply.py - Added a batch parameter when calling the apply method - supports both the on-the-fly and pre-imported data workflows
  • mlem/api/utils.py - Added batch reading support when getting Dataset value
  • mlem/core/errors.py - Added new errors UnsupportedDatasetBatchLoadingType and UnsupportedDatasetBatchLoading for batch-reading workflows
  • mlem/contrib/pandas.py - Added batch-reading support for CSV, JSON, Stata data formats
  • tests/contrib/test_pandas.py - Added tests for supported batch-reading data formats

Which issue(s) this PR fixes:
Fixes #23

@terryyylim
Contributor Author

@mike0sv , some thoughts on the following workflows:

Batching on import (import-data-on-the-fly) workflow
Currently, we call mlem/api/commands.py's import_object method, which calls the ImportHook class's process method.
This call loads the entire dataset into memory via mlem/core/metadata.py's get_object_metadata method and binds it to the dataset attribute.

I think what we could do for batch reading is:

  1. Add a new batch argument to the process method (https://github.com/iterative/mlem/blob/main/mlem/contrib/pandas.py#L535)
  2. Retrieve the object metadata:
    a. If the batch argument is not provided, retrieve the object metadata via the get_object_metadata method (https://github.com/iterative/mlem/blob/main/mlem/core/metadata.py#L25)
    b. If the batch argument is provided, retrieve only the dataset's metadata without binding it to the data attribute,

i.e. populating the reader_cache information and setting the data attribute to None.
Something like:

reader_cache={
    'dataset_type': {
        'columns': ['', 'sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'],
        'dtypes': ['int64', 'float64', 'float64', 'float64', 'float64'],
        'index_cols': [''],
        'type': 'dataframe',
    },
    'format': 'csv',
    'type': 'pandas',
}
data=None
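The metadata-only step could be sketched like this (dataset_metadata is a hypothetical helper, not mlem's actual API, which builds this via its DatasetType analysis):

```python
import pandas as pd

def dataset_metadata(df: pd.DataFrame) -> dict:
    # Hypothetical helper: capture only reader-cache-style metadata,
    # discarding the data itself.
    return {
        "dataset_type": {
            "columns": list(df.columns),
            "dtypes": [str(dtype) for dtype in df.dtypes],
            "type": "dataframe",
        },
        "format": "csv",
        "type": "pandas",
    }

df = pd.DataFrame({"sepal length (cm)": [5.1], "sepal width (cm)": [3.5]})
meta = dataset_metadata(df)
data = None  # the data attribute stays unbound when batch is requested
print(meta["dataset_type"]["dtypes"])  # ['float64', 'float64']
```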

By doing so, when the apply method (https://github.com/iterative/mlem/blob/main/mlem/cli/apply.py#L98) is called subsequently, it will trigger the batch-reading workflow as implemented in the current PR (https://github.com/iterative/mlem/pull/216/files#diff-7c25a5730147dd254f2b720216ef3c356a9828b85afdbea0c88123a85212bd18R17).
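The batch-reading apply workflow amounts to applying the model one chunk at a time, so only batch_size rows are resident at once. A sketch under those assumptions (apply_in_batches and make_reader are hypothetical names, not mlem's API):

```python
import io

import pandas as pd

def apply_in_batches(make_reader, model_fn, batch_size):
    """Hypothetical sketch: read the dataset in chunks and apply the
    model to each chunk, yielding one result per batch."""
    for chunk in make_reader(batch_size):
        yield model_fn(chunk)

csv_data = "x\n1\n2\n3\n4\n5\n"
make_reader = lambda bs: pd.read_csv(io.StringIO(csv_data), chunksize=bs)
# Toy "model": sum each batch's x column.
results = list(apply_in_batches(make_reader, lambda df: df["x"].sum(), 2))
print(results)  # [3, 7, 5] -- per-batch sums over chunks [1,2], [3,4], [5]
```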


Batch writing
For batch writing, this concerns the save method (https://github.com/iterative/mlem/blob/main/mlem/core/metadata.py#L47), which calls the dump method and triggers writing via the get_artifacts method (https://github.com/iterative/mlem/blob/main/mlem/core/objects.py#L486).

We could expose a new write_batch method that covers two scenarios:

  1. Writing data in batches via a new batch (int) argument passed to the related function calls, much like how batch reading is done now, e.g.:
df.to_csv("path/to/save/file.csv", chunksize=batch)
  2. AND writing data that is suitable for batch reading via a new batch_read (boolean) argument, where internally within e.g. our contrib.pandas module we could utilise PANDAS_FORMATS's write_args, which has been configured for batching in this PR, i.e. https://github.com/iterative/mlem/pull/216/files#diff-706e38d0ea328a5aa14720511896d8e86f7e75edbe79156defa0dea248bfd858R499.

When writing in batches, we also need to consider whether the output parameter is provided (https://github.com/iterative/mlem/blob/main/mlem/cli/apply.py#L35).

If output is provided, we'll only read data into memory in batches (i.e. at any point in time, only a small chunk should be in memory). Currently, chunksize writing via the Pandas API is only supported for CSV and Feather. Assuming CSV format for this description:

Scenario 1: File does not exist

  • use the default mode='w' and chunksize=batch to write the CSV

Scenario 2: File exists

  • use the non-default mode='a' and chunksize=batch to append to the CSV
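The two scenarios above can be sketched in one hypothetical write_csv_batch helper (the name and signature are mine for illustration, not the proposed mlem API):

```python
import os
import tempfile

import pandas as pd

def write_csv_batch(df: pd.DataFrame, path: str, batch: int) -> None:
    """Hypothetical sketch: the first call creates the file (mode='w',
    with a header); later calls append (mode='a', no header)."""
    exists = os.path.exists(path)
    df.to_csv(
        path,
        mode="a" if exists else "w",
        header=not exists,  # write the header only once
        chunksize=batch,
        index=False,
    )

path = os.path.join(tempfile.mkdtemp(), "out.csv")
write_csv_batch(pd.DataFrame({"a": [1, 2]}), path, batch=1)  # scenario 1: file does not exist
write_csv_batch(pd.DataFrame({"a": [3]}), path, batch=1)     # scenario 2: file exists
print(list(pd.read_csv(path)["a"]))  # [1, 2, 3]
```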

If no output is provided, we'll read everything into memory in batches:

Scenario 1: Using API

  • Combine all batches' DataFrames into a single DataFrame and return it to the client
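For the API scenario, combining the per-batch DataFrames is a one-liner with pd.concat; a minimal sketch:

```python
import io

import pandas as pd

# Read in chunks, then combine the per-batch DataFrames into the
# single result that the API caller receives.
csv_data = io.StringIO("a\n1\n2\n3\n")
batches = pd.read_csv(csv_data, chunksize=2)
combined = pd.concat(batches, ignore_index=True)
print(len(combined))  # 3
```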

Scenario 2: Using CLI

@terryyylim terryyylim force-pushed the terence/dataset-batch-reading branch from f8b593b to 7400623 Compare May 8, 2022 02:32
@terryyylim terryyylim force-pushed the terence/dataset-batch-reading branch from 7400623 to fca1b89 Compare May 8, 2022 02:36
@mike0sv
Contributor

mike0sv commented May 8, 2022

Sorry for the long wait! I think the logic part is done. Now please fix the codestyle/formatting and make this PR pass all checks (you can run pre-commit locally).

@mike0sv
Contributor

mike0sv commented May 8, 2022

Also, let's rename the batch arg to batch_size.

@codecov

codecov bot commented May 8, 2022

Codecov Report

Merging #221 (8952c06) into main (9aeac4a) will increase coverage by 0.30%.
The diff coverage is 94.11%.

@@            Coverage Diff             @@
##             main     #221      +/-   ##
==========================================
+ Coverage   89.31%   89.62%   +0.30%     
==========================================
  Files          75       76       +1     
  Lines        5298     5436     +138     
==========================================
+ Hits         4732     4872     +140     
+ Misses        566      564       -2     
Impacted Files Coverage Δ
mlem/cli/clone.py 100.00% <ø> (ø)
mlem/cli/import_object.py 100.00% <ø> (ø)
mlem/config.py 98.64% <ø> (-0.02%) ⬇️
mlem/core/meta_io.py 93.92% <ø> (ø)
mlem/runtime/client/base.py 90.41% <ø> (-0.13%) ⬇️
mlem/cli/types.py 22.44% <25.00%> (ø)
mlem/cli/main.py 76.44% <66.66%> (ø)
mlem/runtime/interface/base.py 84.78% <66.66%> (ø)
mlem/ext.py 86.45% <69.23%> (-1.55%) ⬇️
mlem/core/metadata.py 92.72% <87.50%> (-1.72%) ⬇️
... and 28 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d720df4...8952c06.

@terryyylim terryyylim force-pushed the terence/dataset-batch-reading branch 2 times, most recently from d449905 to a6c4b99 Compare May 8, 2022 15:42
@terryyylim
Contributor Author

@mike0sv I'm guessing codecov/patch is failing even though codecov/project is passing because of the missing test for the apply workflow with the batch_size parameter. Would you like me to refactor this (https://github.com/iterative/mlem/blob/main/tests/conftest.py#L108) to separate the pandas and numpy fixtures so that we can reuse the pandas fixtures for cli/test_apply.py, or should we merge as is?

@mike0sv
Contributor

mike0sv commented May 9, 2022 via email

@terryyylim terryyylim force-pushed the terence/dataset-batch-reading branch from a6c4b99 to 3222655 Compare May 9, 2022 15:38
@terryyylim terryyylim force-pushed the terence/dataset-batch-reading branch from 3222655 to a8791f1 Compare May 9, 2022 15:47
@mike0sv
Contributor

mike0sv commented May 9, 2022

I just merged a big refactor PR into main and resolved the conflicts, but there are a couple of places where you need to rename the classes yourself. So DatasetMeta should become MlemDataset, and maybe some others.

@terryyylim
Contributor Author

Okay @mike0sv, I think the conflicts have been resolved.

Successfully merging this pull request may close these issues.

Interface for batch loading for datasets