Support combining batched prediction results on DatasetType #262
Conversation
Force-pushed from c3c8cd9 to b7a8f64
Force-pushed from b7a8f64 to 8f9dbbf
mlem/api/commands.py
Outdated
for part in data:
    batch_dataset = get_dataset_value(part, batch_size)
    for chunk in batch_dataset:
        preds = w.call_method(resolved_method, chunk.data)
        res += [*preds]  # TODO: merge results
        dt = DatasetAnalyzer.analyze(preds)
Should be safe to assume dt = w.methods[resolved_method].returns
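A minimal sketch of that suggestion, assuming the names from the diff above (data, batch_size, get_dataset_value, w, resolved_method) are in scope; the predict_batched wrapper is hypothetical and the exact mlem internals may differ:

```python
# Sketch only: take the declared return DatasetType once, up front,
# instead of re-running DatasetAnalyzer on every chunk of predictions.
def predict_batched(w, resolved_method, data, batch_size):
    dt = w.methods[resolved_method].returns  # declared return type of the method
    res = []
    for part in data:
        batch_dataset = get_dataset_value(part, batch_size)  # assumed in scope
        for chunk in batch_dataset:
            res.append(w.call_method(resolved_method, chunk.data))
    return dt, res
```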
mlem/api/commands.py
Outdated
for part in data:
    batch_dataset = get_dataset_value(part, batch_size)
    for chunk in batch_dataset:
        preds = w.call_method(resolved_method, chunk.data)
        res += [*preds]  # TODO: merge results
        dt = DatasetAnalyzer.analyze(preds)
        res = dt.combine(res, preds)
It's suboptimal to combine every new part: each time you allocate a new array in memory. It's better to collect all of them and combine them in one call.
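A small, self-contained numpy illustration of this point (not mlem code; the chunk sizes are made up):

```python
import numpy as np

chunks = [np.ones((1000, 8)) for _ in range(100)]

# Combining after every part: the growing result is reallocated and copied
# on each iteration, so total copying grows quadratically with the number of chunks.
res = chunks[0]
for chunk in chunks[1:]:
    res = np.concatenate([res, chunk])

# Collecting everything and combining once: a single allocation of the final array.
res_once = np.concatenate(chunks)

assert np.array_equal(res, res_once)
```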
mlem/contrib/numpy.py
Outdated
@@ -101,6 +101,12 @@ class NumpyNdarrayType(
    def _abstract_shape(shape):
        return (None,) + shape[1:]

    @staticmethod
    def combine(original: np.ndarray, new: np.ndarray):
Should not be static; for example, you should be able to check whether the datasets you are combining have the same structure.
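A hedged sketch of what an instance-level combine could check. The class and attribute names (dtype, shape) are stand-ins and may not match NumpyNdarrayType's actual fields:

```python
import numpy as np

class NumpyNdarrayTypeSketch:
    """Stand-in for NumpyNdarrayType, only to illustrate a non-static combine."""

    def __init__(self, dtype: str, shape: tuple):
        self.dtype = dtype
        self.shape = shape  # e.g. (None, 8): the batch dimension is unconstrained

    def combine(self, original: np.ndarray, new: np.ndarray) -> np.ndarray:
        # Because combine is an instance method, it can verify that both arrays
        # match the structure this dataset type describes before concatenating.
        for arr in (original, new):
            if arr.dtype != np.dtype(self.dtype) or arr.shape[1:] != self.shape[1:]:
                raise ValueError(
                    f"array {arr.dtype}{arr.shape} does not match "
                    f"dataset type {self.dtype}{self.shape}"
                )
        return np.concatenate([original, new])

# Usage (hypothetical):
# dt = NumpyNdarrayTypeSketch("float64", (None, 8))
# merged = dt.combine(np.ones((2, 8)), np.zeros((3, 8)))  # shape (5, 8)
```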
I know you are not done with it, but I left some comments :)
@mike0sv No worries! Early feedback is always good to prevent redundant work. I've updated the implementation, let me know if it's aligned with what you had in mind, thank you!
Codecov Report
@@            Coverage Diff             @@
##             main     #262      +/-   ##
==========================================
+ Coverage   89.23%   89.28%   +0.05%
==========================================
  Files          77       76       -1
  Lines        5544     5629      +85
==========================================
+ Hits         4947     5026      +79
- Misses        597      603       +6
Continue to review full report at Codecov.
Hey Terence! Sorry that it took so long, we've been busy with the upcoming release. I will review this shortly.
@@ -48,6 +51,10 @@ def check_type(obj, exp_type, exc_type):
            f"given dataset is of type: {type(obj)}, expected: {exp_type}"
        )

    @abstractmethod
    def combine(self, batched_data: List[List[T]]) -> List[T]:
Why did you choose this signature instead of (List[T]) -> T?
Ah, didn't consider treating T as List[T] on its own. I'll refactor this later this week; a little tied up at the moment with other stuff. Will also fix the conflicts then! Let me know if there are any other concerns with this PR 🙏🏻
Actually, come to think of it, I used (List[List[T]]) -> List[T] because T is a TypeVar; type checks via mypy will be incorrect if we do (List[T]) -> T instead.
I mean, np.concatenate accepts a sequence of array-likes, and you should pass it a List[np.array]. And it returns a single np.array. So the correct signature is List[T] -> T, where T is np.array or pd.DataFrame or something like that.
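A runnable example of the signature described here, using a hypothetical free function rather than mlem's actual API:

```python
from typing import List

import numpy as np

def combine(batches: List[np.ndarray]) -> np.ndarray:
    # List[T] -> T: take a list of arrays and return a single concatenated array.
    return np.concatenate(batches)

parts = [np.arange(4).reshape(2, 2), np.arange(4, 8).reshape(2, 2)]
print(combine(parts).shape)  # (4, 2)
```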
Closing because the original fork was detached when the repository became public.
Context
A bug was introduced in #221, where the approach to combining prediction results incorrectly assumed the type to be a numpy array. This PR uses DatasetAnalyzer to first determine the type of the results prior to combining them, since the same model can output different data types.

Modifications

mlem/api/commands.py - Interpret the data type prior to combining prediction results
mlem/contrib/numpy.py - Add support for combining batched prediction results for np.ndarray

Which issue(s) this PR fixes:
Fixes #257
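For illustration, a hedged sketch of the flow described in the Context above: analyze the first batch of predictions to get the DatasetType, then combine every batch with it. The helper and its arguments are hypothetical; mlem's actual signatures may differ.

```python
def combine_batched_predictions(analyzer, batched_preds):
    # analyzer is expected to behave like DatasetAnalyzer: analyze() returns a
    # DatasetType whose combine() merges a list of batches into one result.
    if not batched_preds:
        return None
    dt = analyzer.analyze(batched_preds[0])  # works for ndarray, DataFrame, etc.
    return dt.combine(batched_preds)
```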