Skip to content
This repository has been archived by the owner on Sep 13, 2023. It is now read-only.

Support combining batched prediction results on DatasetType #262

Closed

Conversation

terryyylim
Copy link
Contributor

Context

A bug was introduced in #221, where approach to combining prediction results incorrectly assumed type to be numpy array. This PR utilizes DatasetAnalyzer to first determine the type of the results prior to combining them, since the same model can output different data types.

Modifications

  • mlem/api/commands.py - Interpret data type prior to combining prediction results
  • mlem/contrib/numpy.py - Add support for combining batch reading prediction results for nd.array

Which issue(s) this PR fixes:
Fixes #257

@terryyylim terryyylim requested a review from a team May 22, 2022 03:29
@terryyylim terryyylim force-pushed the fix/remove-numpy-dependency branch from c3c8cd9 to b7a8f64 Compare May 22, 2022 04:43
@terryyylim terryyylim force-pushed the fix/remove-numpy-dependency branch from b7a8f64 to 8f9dbbf Compare May 22, 2022 04:48
for part in data:
batch_dataset = get_dataset_value(part, batch_size)
for chunk in batch_dataset:
preds = w.call_method(resolved_method, chunk.data)
res += [*preds] # TODO: merge results
dt = DatasetAnalyzer.analyze(preds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be safe to assume dt = w.methods[resolved_method].returns

for part in data:
batch_dataset = get_dataset_value(part, batch_size)
for chunk in batch_dataset:
preds = w.call_method(resolved_method, chunk.data)
res += [*preds] # TODO: merge results
dt = DatasetAnalyzer.analyze(preds)
res = dt.combine(res, preds)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's suboptimal to combine every new part. Each time you allocate new array in memory. It's better to collect all of them and combine in one call

@@ -101,6 +101,12 @@ class NumpyNdarrayType(
def _abstract_shape(shape):
return (None,) + shape[1:]

@staticmethod
def combine(original: np.ndarray, new: np.ndarray):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not be static, for example you should be able to check if datasets you are combining have the same structure

@mike0sv
Copy link
Contributor

mike0sv commented May 22, 2022

I know you are not done with it, but I left some comments :)

@terryyylim
Copy link
Contributor Author

terryyylim commented May 22, 2022

@mike0sv No worries! Early feedback is always good to prevent redundant work. I've updated the implementation, let me know if it's aligned with what you had in mind, thank you!

@codecov
Copy link

codecov bot commented May 24, 2022

Codecov Report

Merging #262 (24ae103) into main (83f169f) will increase coverage by 0.05%.
The diff coverage is 94.03%.

@@            Coverage Diff             @@
##             main     #262      +/-   ##
==========================================
+ Coverage   89.23%   89.28%   +0.05%     
==========================================
  Files          77       76       -1     
  Lines        5544     5629      +85     
==========================================
+ Hits         4947     5026      +79     
- Misses        597      603       +6     
Impacted Files Coverage Δ
mlem/core/artifacts.py 99.40% <ø> (ø)
mlem/core/hooks.py 95.08% <ø> (ø)
mlem/core/import_objects.py 97.56% <ø> (ø)
mlem/core/index.py 67.32% <ø> (ø)
mlem/core/meta_io.py 93.92% <ø> (ø)
mlem/core/model.py 93.38% <ø> (ø)
mlem/core/requirements.py 95.05% <ø> (ø)
mlem/ext.py 88.29% <ø> (+1.84%) ⬆️
mlem/runtime/interface.py 84.78% <ø> (ø)
mlem/utils/github.py 94.00% <ø> (ø)
... and 47 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 575241b...24ae103. Read the comment docs.

@terryyylim terryyylim self-assigned this May 25, 2022
@terryyylim terryyylim requested a review from mike0sv May 25, 2022 06:39
@mike0sv
Copy link
Contributor

mike0sv commented May 28, 2022

Hey Terence! Sorry that it took so long, we've been busy with the upcoming release. I will review this shortly

@@ -48,6 +51,10 @@ def check_type(obj, exp_type, exc_type):
f"given dataset is of type: {type(obj)}, expected: {exp_type}"
)

@abstractmethod
def combine(self, batched_data: List[List[T]]) -> List[T]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose this signature instead of (List[T]) -> T?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't consider to treat T as List[T] on its own, I'll refactor this later this week, a little tied up at the moment with other stuff. Will also fix the conflicts then! Lmk if there's any other concerns with his PR 🙏🏻

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, come to think of it, I used (List[List[T]) -> List[T]) because T is a TypeVar, type checks via mypy will be incorrect if we do (List[T]) -> T instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, np.concatenate accepts a sequence of array-like and you should pass it a List[np.array]. And it returns siingle np.array. So correct signature is List[T] -> T where T is np.array or pd.DataFrame of something like this

@terryyylim
Copy link
Contributor Author

Closing because original fork was detached when repository became public.

@terryyylim terryyylim closed this Jun 8, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add numpy or refactor this
2 participants