Support combining batched prediction results on DatasetType #262
Conversation
Force-pushed from c3c8cd9 to b7a8f64
Force-pushed from b7a8f64 to 8f9dbbf
mlem/api/commands.py
Outdated
for part in data:
    batch_dataset = get_dataset_value(part, batch_size)
    for chunk in batch_dataset:
        preds = w.call_method(resolved_method, chunk.data)
        res += [*preds]  # TODO: merge results
        dt = DatasetAnalyzer.analyze(preds)
Should be safe to assume dt = w.methods[resolved_method].returns
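A minimal sketch of that suggestion, assuming the names from the diff above (data, batch_size, get_dataset_value, w, resolved_method) are in scope; the predict_batched wrapper is hypothetical and the exact mlem internals may differ:

```python
# Sketch only: take the declared return DatasetType once, up front,
# instead of re-running DatasetAnalyzer on every chunk of predictions.
def predict_batched(w, resolved_method, data, batch_size):
    dt = w.methods[resolved_method].returns  # declared return type of the method
    res = []
    for part in data:
        batch_dataset = get_dataset_value(part, batch_size)  # assumed in scope
        for chunk in batch_dataset:
            res.append(w.call_method(resolved_method, chunk.data))
    return dt, res
```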
mlem/api/commands.py
Outdated
for part in data:
    batch_dataset = get_dataset_value(part, batch_size)
    for chunk in batch_dataset:
        preds = w.call_method(resolved_method, chunk.data)
        res += [*preds]  # TODO: merge results
        dt = DatasetAnalyzer.analyze(preds)
        res = dt.combine(res, preds)
It's suboptimal to combine every new part: each time you allocate a new array in memory. It's better to collect all of them and combine them in one call.
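A small, self-contained numpy illustration of this point (not mlem code; the chunk sizes are made up):

```python
import numpy as np

chunks = [np.ones((1000, 8)) for _ in range(100)]

# Combining after every part: the growing result is reallocated and copied
# on each iteration, so total copying grows quadratically with the number of chunks.
res = chunks[0]
for chunk in chunks[1:]:
    res = np.concatenate([res, chunk])

# Collecting everything and combining once: a single allocation of the final array.
res_once = np.concatenate(chunks)

assert np.array_equal(res, res_once)
```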
mlem/contrib/numpy.py
Outdated
@@ -101,6 +101,12 @@ class NumpyNdarrayType(
    def _abstract_shape(shape):
        return (None,) + shape[1:]

    @staticmethod
    def combine(original: np.ndarray, new: np.ndarray):
Should not be static; for example, you should be able to check whether the datasets you are combining have the same structure.
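A hedged sketch of what an instance-level combine could check. The class and attribute names (dtype, shape) are stand-ins and may not match NumpyNdarrayType's actual fields:

```python
import numpy as np

class NumpyNdarrayTypeSketch:
    """Stand-in for NumpyNdarrayType, only to illustrate a non-static combine."""

    def __init__(self, dtype: str, shape: tuple):
        self.dtype = dtype
        self.shape = shape  # e.g. (None, 8): the batch dimension is unconstrained

    def combine(self, original: np.ndarray, new: np.ndarray) -> np.ndarray:
        # Because combine is an instance method, it can verify that both arrays
        # match the structure this dataset type describes before concatenating.
        for arr in (original, new):
            if arr.dtype != np.dtype(self.dtype) or arr.shape[1:] != self.shape[1:]:
                raise ValueError(
                    f"array {arr.dtype}{arr.shape} does not match "
                    f"dataset type {self.dtype}{self.shape}"
                )
        return np.concatenate([original, new])

# Usage (hypothetical):
# dt = NumpyNdarrayTypeSketch("float64", (None, 8))
# merged = dt.combine(np.ones((2, 8)), np.zeros((3, 8)))  # shape (5, 8)
```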
I know you are not done with it, but I left some comments :)
@mike0sv No worries! Early feedback is always good to prevent redundant work. I've updated the implementation, let me know if it's aligned with what you had in mind, thank you!
Codecov Report
@@            Coverage Diff             @@
##             main     #262      +/-   ##
==========================================
+ Coverage   89.23%   89.28%   +0.05%
==========================================
  Files          77       76       -1
  Lines        5544     5629      +85
==========================================
+ Hits         4947     5026      +79
- Misses        597      603       +6
Continue to review full report at Codecov.
Hey Terence! Sorry that it took so long, we've been busy with the upcoming release. I will review this shortly.
@@ -48,6 +51,10 @@ def check_type(obj, exp_type, exc_type):
            f"given dataset is of type: {type(obj)}, expected: {exp_type}"
        )

    @abstractmethod
    def combine(self, batched_data: List[List[T]]) -> List[T]:
Why did you choose this signature instead of (List[T]) -> T?
Ah, didn't consider treating T as List[T] on its own. I'll refactor this later this week; a little tied up at the moment with other stuff. Will also fix the conflicts then! Let me know if there are any other concerns with this PR 🙏🏻
Actually, come to think of it, I used (List[List[T]]) -> List[T] because T is a TypeVar; type checks via mypy will be incorrect if we do (List[T]) -> T instead.
I mean, np.concatenate accepts a sequence of array-likes, and you should pass it a List[np.array]. And it returns a single np.array. So the correct signature is List[T] -> T, where T is np.array or pd.DataFrame or something like that.
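A runnable example of the signature described here, using a hypothetical free function rather than mlem's actual API:

```python
from typing import List

import numpy as np

def combine(batches: List[np.ndarray]) -> np.ndarray:
    # List[T] -> T: take a list of arrays and return a single concatenated array.
    return np.concatenate(batches)

parts = [np.arange(4).reshape(2, 2), np.arange(4, 8).reshape(2, 2)]
print(combine(parts).shape)  # (4, 2)
```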
Closing because the original fork was detached when the repository became public.
Context
A bug was introduced in #221, where the approach to combining prediction results incorrectly assumed the type to be a numpy array. This PR uses DatasetAnalyzer to first determine the type of the results prior to combining them, since the same model can output different data types.

Modifications

mlem/api/commands.py - Interpret the data type prior to combining prediction results
mlem/contrib/numpy.py - Add support for combining batched prediction results for np.ndarray

Which issue(s) this PR fixes:
Fixes #257
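For illustration, a hedged sketch of the flow described in the Context above: analyze the first batch of predictions to get the DatasetType, then combine every batch with it. The helper and its arguments are hypothetical; mlem's actual signatures may differ.

```python
def combine_batched_predictions(analyzer, batched_preds):
    # analyzer is expected to behave like DatasetAnalyzer: analyze() returns a
    # DatasetType whose combine() merges a list of batches into one result.
    if not batched_preds:
        return None
    dt = analyzer.analyze(batched_preds[0])  # works for ndarray, DataFrame, etc.
    return dt.combine(batched_preds)
```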