Support batch-reading for data-types with chunksize parameter #221
Conversation
@mike0sv, some thoughts on the following workflows:

Batching on import (import data-on-the-fly) workflow
I think what we can do for batch reading would be to populate the reader_cache information and set … By doing so, when the apply method (https://github.com/iterative/mlem/blob/main/mlem/cli/apply.py#L98) is subsequently called, it'll trigger the batch-reading workflow as implemented in the current PR (https://github.com/iterative/mlem/pull/216/files#diff-7c25a5730147dd254f2b720216ef3c356a9828b85afdbea0c88123a85212bd18R17).

Batch writing
We could expose a new … When writing in batches, we also need to consider whether the target file already exists:
Scenario 1: File does not exist
Scenario 2: File exists
If no …:
Scenario 1: Using API
Scenario 2: Using CLI
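The two write scenarios above can be sketched with plain pandas. This is a hypothetical helper, not mlem's API; the function name, CSV format, and file name are assumptions:

```python
import os
import pandas as pd

def write_batches(chunks, path):
    # Hypothetical helper illustrating the two scenarios above;
    # not part of the mlem API.
    for chunk in chunks:
        exists = os.path.exists(path)
        # Scenario 1: file does not exist -> write fresh, with header.
        # Scenario 2: file exists -> append, without repeating the header.
        chunk.to_csv(path, mode="a" if exists else "w",
                     header=not exists, index=False)

if os.path.exists("batched.csv"):
    os.remove("batched.csv")  # start from a clean slate for the demo

df = pd.DataFrame({"a": range(6)})
write_batches((df.iloc[i:i + 2] for i in range(0, 6, 2)), "batched.csv")
print(len(pd.read_csv("batched.csv")))  # 6
```

Checking existence per chunk (rather than once up front) also covers the case where the first chunk creates the file mid-run.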
Sorry for the long wait! I think the logic part is done. Now please fix the code style/formatting and make this PR pass all checks (you can run pre-commit locally).
Also, let's rename …
Codecov Report
@@ Coverage Diff @@
## main #221 +/- ##
==========================================
+ Coverage 89.31% 89.62% +0.30%
==========================================
Files 75 76 +1
Lines 5298 5436 +138
==========================================
+ Hits 4732 4872 +140
+ Misses 566 564 -2
Continue to review full report at Codecov.
@mike0sv I'm guessing codecov/patch is failing even though codecov/project is passing because of a missing test for the apply workflow with the batch_size parameter. Would you want me to refactor this (https://github.com/iterative/mlem/blob/main/tests/conftest.py#L108) to separate pandas and numpy so that we can reuse the pandas fixtures for cli/test_apply.py, or do we merge as is?
We probably should refactor those fixtures, but separately. You can create the new fixture that you need for the test.
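A separate pandas-only fixture for the new test might look roughly like this (a sketch with hypothetical names, not the actual conftest.py change):

```python
import pandas as pd
import pytest

@pytest.fixture
def pandas_batch_df():
    # Hypothetical pandas-only fixture, split out from the combined
    # pandas/numpy fixture in tests/conftest.py.
    return pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})

def test_apply_batch(pandas_batch_df):
    # Iterate the frame in chunks of 2 rows, as a batch-apply would.
    chunks = [pandas_batch_df.iloc[i:i + 2]
              for i in range(0, len(pandas_batch_df), 2)]
    assert sum(len(c) for c in chunks) == len(pandas_batch_df)
```

Keeping the fixture pandas-specific lets cli/test_apply.py reuse it without dragging in the numpy cases.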
I just merged a big refactor PR to main and resolved the conflicts, but there are a couple of places where you need to rename the classes yourself. So, …
Okay @mike0sv, I think the conflicts have been resolved.
Context
Some datasets are large; rather than dealing with them as one big block, we can split the data into chunks. This PR adds batch-reading support for data formats that provide the chunksize parameter in the pandas API.

Related PRs: #206, #216
Supported formats:
Modifications
- mlem/cli/apply.py - Added batch parameter when calling the apply method; supports both the importing-data-on-the-fly and off-the-fly workflows
- mlem/api/utils.py - Added batch-reading support when getting the Dataset value
- mlem/core/errors.py - Added new errors UnsupportedDatasetBatchLoadingType and UnsupportedDatasetBatchLoading for batch-reading workflows
- mlem/contrib/pandas.py - Added batch-reading support for CSV, JSON, and Stata data formats
- tests/contrib/test_pandas.py - Added tests for supported batch-reading data formats

Which issue(s) this PR fixes:
Fixes #23