Features/510 split #677
Conversation
GPU cluster tests are currently disabled on this Pull Request.
Codecov Report
@@ Coverage Diff @@
## master #677 +/- ##
==========================================
+ Coverage 97.54% 97.59% +0.04%
==========================================
Files 87 87
Lines 17698 18030 +332
==========================================
+ Hits 17264 17596 +332
Misses 434 434
Continue to review full report at Codecov.
run tests
this all looks good to me. However, there is a larger comment to be discussed.
Do we want to have this return actual views of the DNDarray? If so, it would mean that we cannot balance the DNDarrays, and it might change some of the logic involved in the function itself. @ClaudiaComito @Markus-Goetz , thoughts?
# hsplit special case: numpy allows a 1-D input even though the split is
# (logically) performed along axis 1, so temporarily add a dimension,
# split, and flatten the pieces again to stay consistent with numpy
if len(ary.lshape) < 2:
    ary = reshape(ary, (1, ary.lshape[0]))
    result = split(ary, indices_or_sections, 1)
    result = [flatten(sub_array) for sub_array in result]
else:
    result = split(ary, indices_or_sections, 1)
is this needed in `dsplit` as well?
@coquelin77 No, it isn't. `hsplit` is something of a special case among the three `split` variations. In contrast to the corresponding general `split` call with axis 1, `ary` is allowed to be 1-dimensional (in numpy) even though it is (functionally) split along the second axis. As that is not allowed in torch, `ary` has to be reshaped (adding one dimension). The subsequent flattening is needed for full consistency with numpy.
As this exception doesn't occur within the other split functions, no shape manipulation is needed there.
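A quick numpy check of the behaviour described above (illustrative only, not part of the PR):

import numpy as np

x = np.arange(6)
# np.hsplit accepts a 1-D input and splits it along its only axis
print(np.hsplit(x, 3))      # [array([0, 1]), array([2, 3]), array([4, 5])]
# the equivalent generic call raises, because axis 1 does not exist:
# np.split(x, 3, axis=1)    # AxisError: axis 1 is out of bounds for array of dimension 1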
for sub_DNDarray in sub_arrays_ht:
    sub_DNDarray.balance_()
I think that this should be communicated in the docs. Since all of the sub-arrays are balanced, they are not linked to the original data points... this spawns a larger question.
@coquelin77 I agree, but I guess this strongly depends on the general discussion above.
rerun tests
Hi Lena,
before I look into the details, something important jumps out at me: `np.split` & co. return lists of subarrays as views into the original array. The current heat implementation also says so, but it really returns copies. To exemplify this:
>>> import numpy as np
>>> import heat as ht
>>> a = np.arange(16).reshape(4,4)
>>> a
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
>>> b, c = np.split(a, 2)
>>> b[0,0] = 10
>>> b
array([[10, 1, 2, 3],
[ 4, 5, 6, 7]])
>>> a
array([[10, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
>>> ht_a = ht.arange(16).reshape((4,4))
>>> ht_a
DNDarray([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]], dtype=ht.int32, device=cpu:0, split=None)
>>> ht_b, ht_c = ht.split(ht_a, 2)
>>> ht_b[0,0] = 10
>>> ht_b
DNDarray([[10, 1, 2, 3],
[ 4, 5, 6, 7]], dtype=ht.int32, device=cpu:0, split=None)
>>> ht_a
DNDarray([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]], dtype=ht.int32, device=cpu:0, split=None)
here we should raise an exception if ary.ndim < 3
If that is the case, an exception is raised within `split`, so I think raising an additional one in `dsplit` would be superfluous.
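For comparison, numpy performs that check inside np.dsplit itself (illustrative snippet, not part of the PR):

import numpy as np

a2d = np.arange(12).reshape(3, 4)
try:
    np.dsplit(a2d, 2)
except ValueError as e:
    print(e)    # dsplit only works on arrays of 3 or more dimensions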
I'm with you @coquelin77, no balancing. We need …
this isn't a major change from the code, it could be done with relative ease if the balance is dropped. however, if the hard splits result in a partial view on two processes, this would be hard to communicate to users
@coquelin77 @ClaudiaComito For now it should return a copy, generally however I would obviously like to have something like a …
looks like the comment about views vs copies didn't make it into the docs of `dsplit`, `hsplit`, or `vsplit`
heat/core/manipulations.py
Returns
-------
sub_arrays : list of DNDarrays
    A list of sub-DNDarrays as views into ary.
copies not views; also `ary` can be a code block (`ary`)
also can you add the copy statement to the description of the function as well. i think that people will often only read that part
Oh dear, sorry I missed that. I'll fix it and wrap `ary` into a code block. Are all markdown features available in the docstrings/correctly rendered in Sphinx? And you're right, the general description is probably the most relevant part for most readers, I'll add the hint.
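A possible wording for the corrected Returns block (illustrative sketch of the requested change, not the actual patch):

Returns
-------
sub_arrays : list of DNDarrays
    A list of sub-DNDarrays. The sub-DNDarrays are copies of (not views into) ``ary``.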
rerun tests
1 similar comment
rerun tests
looks like the comment about views vs copies didn't make it into the docs of `dsplit`, `hsplit`, or `vsplit`
sorry, this was a remnant of before, ignore this
Description
Implementation of the functions `split()`, `vsplit()`, `hsplit()`, `dsplit()` based on their np analogues. For process-local operations, I use `torch.split`.
Docs numpy:
Docs pytorch: https://pytorch.org/docs/stable/generated/torch.split.html
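For orientation, the numpy counterparts the new functions are modelled on (illustrative snippet; the heat functions are intended to behave analogously on DNDarrays):

import numpy as np

a = np.arange(24).reshape(2, 3, 4)
np.split(a, 3, axis=1)   # 3 sub-arrays of shape (2, 1, 4) along axis 1
np.vsplit(a, 2)          # 2 sub-arrays along axis 0
np.hsplit(a, 3)          # 3 sub-arrays along axis 1
np.dsplit(a, 2)          # 2 sub-arrays along axis 2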
Differences between numpy and torch
This affects the input parameter `indices_or_sections` (np) / `split_size_or_sections` (torch). As the different names already suggest, the parameters carry different types of information depending on the library used.
Numpy:
If `indices_or_sections` is an integer, it indicates into how many parts the input `ary` shall be split. Therefore, `indices_or_sections = 2` will result in a list containing 2 DNDarrays. On the other hand, `indices_or_sections` can be given as a list containing the boundaries of a series of slices. For instance, `indices_or_sections = [1, 3, 4]` indicates the following data chunks:
- [: 1]
- [1: 3]
- [3: 4]
- [4: ]
Torch:
In contrast to numpy, the torch analogue always contains the size(s) of the resulting tensor(s). If it is an integer, all (function-)split tensors will have the same size; otherwise, the possibly varying split sizes can be given as an array_like.
According to these differences, the parameter semantics of numpy are mapped to those of torch as explained in the following.
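A minimal standalone illustration of the two parameter semantics (plain numpy/torch only, no heat; variable names are made up):

import numpy as np
import torch

a = np.arange(6)
t = torch.arange(6)

# numpy: the list holds cut points along the axis -> chunks [:1], [1:3], [3:4], [4:]
np_chunks = np.split(a, [1, 3, 4])                  # lengths 1, 2, 1, 2

# torch: the list holds the sizes of the resulting tensors, i.e. the
# 1st discrete difference of the padded cut points (cf. ht.diff below)
sizes = np.diff([0, 1, 3, 4, a.shape[0]]).tolist()  # -> [1, 2, 1, 2]
torch_chunks = torch.split(t, sizes)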
Strategy
As `vsplit`, `dsplit`, `hsplit` are no more than calls of `split` with a specific axis, I'll reduce the explanation to the algorithm of the latter. Depending on whether `ary` is distributed along the same dimension as it is split within the function (i.e. `ary.split == axis`), the strategy varies, as the data chunks on each node have to be reorganized correctly in the resulting list of (function-)split DNDarrays. (Challenge: the data of the (function-)split DNDarrays is distributed via MPI.)

axis != ary.split
Mapping np -> torch:
If `indices_or_sections` is an integer, the resulting size for all DNDarrays is calculated from the size of the affected axis. If `indices_or_sections` is array_like, the resulting split sizes are calculated using `ht.diff` (1st discrete difference), analogous to the illustration above.

axis == ary.split
`indices_or_sections` is an integer:
In this case, the split matches the distribution, so the data on the node can be used directly and inserted into the resulting list. The remaining DNDarrays are filled with empty DNDarrays of the needed shape. The goal is to use `torch.split` with the process-adapted/correctly calculated `indices_or_sections` parameter (`new_indices`).

`indices_or_sections` is array_like / a DNDarray after sanitation:
The goal is to use `torch.split` with the process-adapted/correctly calculated `indices_or_sections` parameter. Therefore, the input information of `indices_or_sections` is reduced to the process-relevant part, i.e. the given indices (of the global DNDarray) which correspond to the (local) data on the node; see the sketch below. The needed information is provided via the slice obtained from `ht.comm.chunk`. Afterwards, the numpy semantics are mapped to the torch semantics, as described above for `axis != ary.split`.
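A library-agnostic sketch of this index localization (hypothetical helper and names; the actual implementation obtains the local offset and shape via `ht.comm.chunk`):

import numpy as np

def localize_indices(global_indices, offset, local_length):
    # clip the global cut points to the slice [offset, offset + local_length)
    # owned by this process and shift them into local coordinates
    local = [min(max(idx - offset, 0), local_length) for idx in global_indices]
    # torch semantics: sizes of the local pieces = 1st discrete difference
    # of the padded local cut points
    sizes = np.diff([0] + local + [local_length]).tolist()
    return local, sizes

# example: global axis of length 8, cut at [2, 5]; this process owns rows 4..7
print(localize_indices([2, 5], offset=4, local_length=4))   # ([0, 1], [0, 1, 3])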
In all cases, every DNDarray within the list is balanced before being returned.
Overview split functions and related axis
Hint: In contrast to the corresponding general split function call with axis 1, the input of `hsplit` is allowed to be 1-dimensional. This results in some dimension-changing operations in the HeAT version, more specifically a reshape adding one dimension and a subsequent flattening, to meet full consistency with numpy.
Issue/s resolved: #510
Changes proposed:
- `manipulations.split()`
- `manipulations.hsplit()`
- `manipulations.vsplit()`
- `manipulations.dsplit()`
Type of change
Due Diligence
Does this change modify the behaviour of other functions? If so, which?
no