
WIP: Support batch-reading for data-types with chunksize parameter #206

Closed
wants to merge 7 commits

Conversation

@terryyylim (Contributor) commented Apr 24, 2022

Context

Some datasets are large, and rather than dealing with them as one big block, we can split the data into chunks. This PR adds batch-reading support for the data formats that provide a chunksize parameter in the pandas API (a short sketch of this pandas behaviour follows the list of supported formats below).

Supported formats:

  • CSV
  • JSON
  • Stata
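For reference, a minimal sketch of what chunked reading looks like with the plain pandas API; the file names and chunk size are placeholders for illustration:

import pandas as pd

# Passing chunksize makes read_csv return an iterator of DataFrames
# instead of one DataFrame; each chunk holds at most `chunksize` rows.
for chunk in pd.read_csv("data.csv", chunksize=2):
    print(chunk.shape)

# JSON supports the same pattern, but only for line-delimited files
# (one record per line), which also requires lines=True.
# pd.read_stata accepts chunksize in the same way.
for chunk in pd.read_json("data.jsonl", lines=True, chunksize=2):
    print(chunk.shape)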

Modifications

  • mlem/cli/apply.py - Added a batch parameter to the apply call; supports both the on-the-fly data import workflow and the pre-saved dataset workflow
  • mlem/api/utils.py - Added batch-reading support when getting the Dataset value
  • mlem/core/errors.py - Added new errors UnsupportedDatasetBatchLoadingType and DatasetBatchLoadingJSONError for batch-reading workflows
  • mlem/contrib/pandas.py - Added batch-reading support for the CSV and JSON data formats
  • tests/contrib/test_pandas.py - Added tests for supported batch-reading data formats and exception tests for unsupported ones

Which issue(s) this PR fixes:
Fixes #23

@terryyylim requested a review from a team on April 24, 2022 04:02
@terryyylim (Contributor, Author) left a comment

Left some comments to explain what I changed, hopefully it helps with the review!

def read_html(*args, **kwargs):
    # read_html returns list of dataframes
    return pd.read_html(*args, **kwargs)[0]


PANDAS_FORMATS = {
@terryyylim (Contributor, Author) commented:

Refactored this into a non-global variable to avoid needing hacks around the batch-reading tests.

def test_simple_batch_df(data, format):
    writer = PandasWriter(format=format)
    # Batch-reading JSON files require line-delimited data
    if format == "json":
        writer.fmt.write_args = {"orient": "records", "lines": True}
    dataset_write_read_check(
        DatasetType.create(data), writer, PandasReader, pd.DataFrame.equals, batch=2
    )
    # Need reset if PANDAS_FORMATS is a global variable
    if format == "csv":
        writer.fmt.write_args = {"index": False}
        writer.fmt.read_args = {}
        writer.fmt.read_func = read_csv_with_unnamed
    if format == "json":
        writer.fmt.write_args = {"date_format": "iso", "date_unit": "ns"}
        writer.fmt.read_args = {}
        writer.fmt.read_func = read_json_reset_index

@mike0sv (Contributor) commented Apr 25, 2022:

Not sure why you need to change something for tests? This means you're testing something else, no?
And you don't need the function anyway. You can just do writer.fmt = PandasFormat(<whatever you need>) before the check, or maybe use fmt = writer.fmt.copy() and then change what you need.
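A rough sketch of that suggestion, assuming PandasFormat is a pydantic model (so .copy() is available) and that writer.fmt can be reassigned; the write_args values just mirror the test above:

def test_simple_batch_df(data, format):
    writer = PandasWriter(format=format)
    # Mutate a copy of the format so the shared definition is never touched
    # and no reset is needed after the check
    fmt = writer.fmt.copy(deep=True)
    if format == "json":
        # Batch-reading JSON files requires line-delimited data
        fmt.write_args = {"orient": "records", "lines": True}
    writer.fmt = fmt
    dataset_write_read_check(
        DatasetType.create(data), writer, PandasReader, pd.DataFrame.equals, batch=2
    )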

dataset: DatasetType,
storage: Storage,
path: str,
writer_fmt_args: Optional[Dict[str, Any]] = None,
@terryyylim (Contributor, Author) commented:

Added an optional parameter to update writer_fmt_args, which is required for JSON batch-reading (the file needs to be in line-delimited form).

This parameter is currently only used in the batch-reading tests, to write the JSON dataset in line-delimited form for batch-reading. It can be further exposed to users here - https://github.com/iterative/mlem/blob/main/mlem/core/objects.py#L659, via the write_value call.

Let me know if you want me to look into adding this as part of the PR.

@mike0sv (Contributor) commented:

Let's leave batch writing out of the scope of this PR. If you're only using it for tests, let's remove this for now. In tests you can write the test dataset manually without mlem, like our forefathers did.

@terryyylim self-assigned this Apr 24, 2022
@mike0sv (Contributor) left a comment

I commented in the code a bit, but the main idea is the following: you implemented batch as a flag for the regular read, but we actually need a separate type of data loading.
It should return a lazy iterable object that yields batches, preferably of the same type as the whole dataset. For example, the apply flow would change to something like this:

if batch is None:
    # existing logic
    res = apply(model, dataset.get_value())
else:
    res = [apply(model, part) for part in dataset.iter_batches(batch)]

You don't need to worry about batch writing or merging different batches for now
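A minimal sketch of what such a lazy loader could look like for the pandas CSV case; iter_batches and its internals here are illustrative, not the final mlem API:

from typing import Iterator

import pandas as pd


def iter_batches(path: str, batch: int) -> Iterator[pd.DataFrame]:
    """Lazily yield the dataset in chunks of `batch` rows."""
    # pd.read_csv with chunksize never materializes the whole file;
    # each chunk is read only when the consumer asks for it.
    for chunk in pd.read_csv(path, chunksize=batch):
        yield chunk

In the real implementation each yielded chunk would be wrapped into (and aligned with) the same DatasetType that describes the whole dataset.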

unnamed = {}
for i, df_chunk in enumerate(df_iterator):
    # Instantiate Pandas DataFrame with columns if it is the first chunk
    if i == 0:
@mike0sv (Contributor) commented:

this could just be `df is None`

for col in df_chunk.columns:
    if col.startswith("Unnamed: "):
        unnamed[col] = ""
df = pd.concat([df, df_chunk], ignore_index=True)
@mike0sv (Contributor) commented:

You still read the whole file into memory. The idea is to apply the model to each part before you read the next one.

@terryyylim (Contributor, Author) commented:

I see, sorry, I misunderstood. I'll refactor and implement an iterator for these batch functions.


@@ -505,16 +548,38 @@ def read(self, artifacts: Artifacts) -> DatasetType:
            self.dataset_type.align(self.fmt.read(artifacts))
        )

    def read_batch(self, artifacts: Artifacts, batch: int) -> DatasetType:
        fmt = update_batch_args(self.format, self.fmt, batch)
@mike0sv (Contributor) commented:

I think I get this thing above now. I'd say we just need a separate set of args for batch and non-batch reading, so you don't have to change the state every time.

@terryyylim (Contributor, Author) commented:

Refactored this bit, so there's no longer a need to update args based on batch/non-batch reads.

# Pandas supports batch-reading for JSON only if the JSON file is line-delimited
# https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#line-delimited-json
if self.format == "json":
    dataset_lines = sum(1 for line in open(artifacts["data"].uri))
@mike0sv (Contributor) commented:

  1. You are reading the whole file here, but you actually only need 2 lines to know there are enough.
  2. If it is one-line JSON, you still read the whole dataset, and then you read it again if everything is ok.
  3. It will fail if your dataset has just 1 row.
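A sketch of the first point, reading at most two lines instead of counting them all (note this still cannot tell a one-row line-delimited file apart, which is the third point):

from itertools import islice

def looks_line_delimited(path: str) -> bool:
    # Two lines are enough to distinguish a single-line JSON dump
    # from a multi-line (line-delimited) file.
    with open(path) as f:
        return len(list(islice(f, 2))) > 1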

@mike0sv (Contributor) commented:

let's assume the dataset file is in the right format for now


@terryyylim changed the title from "Support batch-reading for data-types with chunksize parameter" to "WIP: Support batch-reading for data-types with chunksize parameter" on Apr 26, 2022
@terryyylim (Contributor, Author) commented:

@mike0sv I think my forked repo is no longer synced with this repository because it's been made public now (my commits are no longer reflected here). Shall I make a new PR?

if df is None:
    df = pd.DataFrame(columns=chunk.columns, dtype=col_types)
col_types = {
    chunk.columns[idx]: chunk.dtypes[idx]
@mike0sv (Contributor) commented:

zip would fit nicely here
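For illustration, the zip version would look roughly like this (a sketch using the names from the snippet above):

col_types = dict(zip(chunk.columns, chunk.dtypes))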

@@ -404,6 +414,10 @@ class Config:
    def read(self, artifacts: Artifacts) -> DatasetType:
        raise NotImplementedError

    @abstractmethod
    def read_batch(self, artifacts: Artifacts, batch: int) -> Iterator:
@mike0sv (Contributor) commented:

This should actually be Iterator[DatasetType]
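Roughly, the suggested signature would be (a sketch of the reviewer's point, not the merged code):

@abstractmethod
def read_batch(self, artifacts: Artifacts, batch: int) -> Iterator[DatasetType]:
    raise NotImplementedError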

@@ -142,6 +145,7 @@ def load_meta(
    follow_links: bool = True,
    load_value: bool = False,
    fs: Optional[AbstractFileSystem] = None,
    batch: Optional[int] = None,
@mike0sv (Contributor) commented:

load_meta should not have a batch arg: you are only loading metadata; load_value here is for convenience. If you need batching, you should set load_value=False and then call read_batch on the DatasetMeta directly.
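A sketch of that calling pattern (read_batch on DatasetMeta is the method proposed in this thread, not an existing API; the path and batch size are placeholders):

# Load only the metadata; the value stays on disk
meta = load_meta("path/to/dataset", load_value=False)

# Iterate over batches directly on the DatasetMeta
for batch_dt in meta.read_batch(2):
    ...  # work with each DatasetType-sized piece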

@@ -668,6 +668,9 @@ def write_value(self) -> Artifacts:
    def load_value(self):
        self.dataset = self.reader.read(self.relative_artifacts)

    def load_batch_value(self, batch: int):
@mike0sv (Contributor) commented:

This should be changed to something like read_batch, return Iterator[DatasetType] and not set any values. The self.dataset field is for the whole dataset value, and if you wish to use batching, that means you don't want to load the whole dataset into memory. So read_batch will return a lazy iterator that you can iterate over to get DatasetTypes to work with. For now let's say that those DatasetTypes should be the same as the one the reader is holding (== the DatasetType of the whole dataset).
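A rough sketch of that shape on DatasetMeta, assuming the reader grows a matching read_batch (names are illustrative, following the thread):

def read_batch(self, batch: int) -> Iterator[DatasetType]:
    # Nothing is stored on self.dataset: the reader lazily yields
    # one DatasetType per batch as the caller iterates.
    return self.reader.read_batch(self.relative_artifacts, batch)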

repo=data_repo,
rev=data_rev,
type_=import_type,
batch=batch,
@mike0sv (Contributor) commented:

For now let's ignore this branch and focus on the other one without import. Once we're done, we can discuss how to approach this (I am not sure myself).

@@ -92,6 +99,7 @@ def apply(
        data_rev,
        load_value=True,
@mike0sv (Contributor) commented:

as per comments in metadata.py and objects.py, this should be load_meta(..., load_value=batch is None)

@@ -85,7 +86,7 @@ def apply(
        resolved_method = PREDICT_METHOD_NAME
    echo(EMOJI_APPLY + f"Applying `{resolved_method}` method...")
    res = [
-        w.call_method(resolved_method, get_dataset_value(part))
+        w.call_method(resolved_method, get_dataset_value(part, batch))
@mike0sv (Contributor) commented:

Here you should flatten all the batches if there are any. Here's an example of how I see it:
Suppose you have dataframe=[1,2,3,4] (I mean 1 column, 4 rows) saved to a csv file. You load its metadata without loading the value, let's say dt = DatasetMeta(dataset_type=DataFrameType(...), ...). And you call apply with data=[dt]. If the batch arg is not provided, what will happen is get_dataset_value will load the actual dataframe and in the end res = [w.call_method(..., dataframe([1,2,3,4]))]. But if you provided batch=2, dt.read_batch should be called. If you iterate through it, you will get 2 parts of the dataframe, and in the end res = [w.call_method(..., dataframe([1,2])), w.call_method(..., dataframe([3,4]))].

@terryyylim (Contributor, Author) commented:

👍🏻 Makes sense, I implemented something similar yesterday, but I couldn't sync the changes to this PR because the private fork no longer points to this repository.

@terryyylim (Contributor, Author) commented:

Closing the PR because private forks can't sync to this repository anymore since it went public. I've pushed the changes to a separate PR (#216) after forking the public repository.

Successfully merging this pull request may close these issues.

Interface for batch loading for datasets