
[Relax] Implement relax.op.view #16955

Merged
merged 3 commits into apache:main from Lunderberg:relax_implement_view_operator on May 9, 2024

Conversation

Lunderberg
Contributor

This commit implements relax.op.view (R.view in TVMScript) to produce a view into an existing array. This returned view shares the same backing allocation as the existing array.

Because R.view comes with potential trade-offs, such as an increased memory footprint, the performance cost of applying a non-zero DLTensor::byte_offset, and potential misalignment for vectorized operators, this PR does not use R.view outside of unit tests. Applications of R.view, either in specific compute kernels or in optimization passes, are instead left for follow-up PRs.
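As a usage illustration (an editor's sketch, not code from this PR), the view operator can be constructed from the Python API roughly as follows. This sketch assumes the relax.op.memory.view name that the op receives later in this thread, the optional shape, dtype, and relative_byte_offset arguments described in the review, and arbitrary example shapes.

```python
# Editor's sketch, not code from this PR.  Assumes the relax.op.memory.view
# signature discussed in this thread: optional shape, dtype, and
# relative_byte_offset arguments, each given as a relax expression.
from tvm import relax

x = relax.Var("x", relax.TensorStructInfo([16, 16], "float32"))

# View the first half of x's backing allocation as an (8, 16) tensor.
upper = relax.op.memory.view(x, shape=relax.ShapeExpr([8, 16]))

# View the second half: offset by 8 * 16 float32 elements = 512 bytes
# into the same allocation, with no copy performed.
lower = relax.op.memory.view(
    x,
    shape=relax.ShapeExpr([8, 16]),
    relative_byte_offset=relax.PrimValue(512),
)
```

Both calls share x's backing allocation; only the shape and byte offset of the resulting view differ.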

@Lunderberg Lunderberg requested a review from masahi April 29, 2024 17:15
@Lunderberg
Contributor Author

By line count, this PR looks bigger than it actually is. Because this functionality can easily be misused, there's a lot more error checking than usual, and a lot more test cases to validate those error-checking paths. Breaking down the changes in this commit, the majority are test cases, and the majority of what remains are error-checking paths.

  • ~100 lines implementation in view.cc
  • ~260 lines error checking in view.cc
  • ~100 lines exposing the functionality through the Python API
  • ~750 lines testing the functionality in test_op_view.py

@Lunderberg Lunderberg force-pushed the relax_implement_view_operator branch from 0cb58ce to ed4fd50 Compare April 29, 2024 17:34
@masahi masahi requested review from tqchen and Hzfengsy April 29, 2024 18:36
}

StructInfoDeriveFunc infer_sinfo_env_func;
infer_sinfo_env_func = EnvFunc::Get("tvm.relax.struct_info.infer_view_sinfo");
Member

Why does this need to be a packed func? Can't we use the C++ function directly?

Contributor Author

If we want to define the StructInfo for the generated Call node, we could call the C++ function directly. Ideally, though, if the arguments change due to some downstream transform, the shape inference should be repeated with the new arguments. There are a few ways to do that:

  • Using an operator's FInferStructInfo function. This only applies to tvm.ir.Op instances, and we've just removed that as part of LegalizeOps.
  • Using the params and ret fields from the FuncStructInfo. This works for static cases, and most dynamic shapes, but doesn't support inferring the output shape based on a ShapeExpr argument.
  • Using FuncStructInfo::OpaqueFunc with a derivation func. This is the most general method, and essentially lets you pack a FInferStructInfo into a FuncStructInfo.

The third option doesn't come up very often, and I could only find one previous location where this functionality is used. (Previous usage is here, which defines the use of CallNode::sinfo_args as the default inferred struct info for external functions.)
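For illustration, a minimal sketch of the third option from the Python side (an editor's example, not code from this PR). It assumes the tvm.relax.struct_info.infer_view_sinfo EnvFunc registered by this change; the rest is the pre-existing FuncStructInfo API.

```python
# Editor's sketch, not code from this PR.  The EnvFunc name is the one
# registered by this change for R.view's shape inference.
import tvm
from tvm import relax

# Look up the registered struct-info derivation function by its global name.
derive = tvm.ir.EnvFunc.get("tvm.relax.struct_info.infer_view_sinfo")

# An opaque FuncStructInfo whose output StructInfo is re-derived from the
# call's arguments each time, rather than frozen when the call is constructed.
sinfo = relax.FuncStructInfo.opaque_func(derive_func=derive)
```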

@tqchen
Member

tqchen commented Apr 30, 2024

Just want to note here: having a view operation can in general cause problems, mainly because most ops (including generated and external ones) assume elem_offset = 0 for performance reasons (alignment, fewer kernel arguments).

Ideally we would like to ensure this assumption holds. That being said, there are cases that need to slice out sub-arrays, e.g. LoRA elements. There are a few ways to go with this:

  • Enable special ops that handle inputs which can come from a view (likely only the LoRA ones), which can inline view operations into the ops.
  • Add an R.memory.ensure_compact operation, which may result in a copy for backends that do not have direct memory pointer access, but can do pointer editing for backends that support it (this would need a target-dependent lowering).
  • R.view should perhaps be renamed to R.memory.view; this is a more advanced operator that carries certain assumptions and is likely not something we want to advertise generally for now.

tqchen previously requested changes Apr 30, 2024
@Lunderberg
Contributor Author

Having a view operation can in general cause problems, mainly because most ops (including generated and external ones) assume elem_offset = 0

So long as the operations assert that the element offset is zero when this assumption is being made, this makes sense. This is what we do in MakePackedAPI for PrimFuncs that require the elem_offset to be zero.

For external operations that accept an aligned pointer to data, I like your suggestion of ensuring that we provide aligned data. My plan was to only introduce views that maintain the same alignment that is provided by existing allocations.

For external operations that accept an NDArray and assume the offset is zero without validating that assumption, I think we should view this as a bug in those external operations. If we know of specific cases that ignore the offset, we can definitely insert alignment operators to provide aligned buffers. I wouldn't want to add it for every single external operation, though, because at some point we need to trust that functions accept the arguments they say they accept.

  • Add an R.memory.ensure_compact operation, which may result in a copy for backends that do not have direct memory pointer access, but can do pointer editing for backends that support it (this would need a target-dependent lowering).

To be clear, do you mean R.memory.ensure_aligned instead of R.memory.ensure_compact? This has come up a few times in this discussion, and I want to make sure that we are discussing the same thing. The view operation cannot be applied to a strided NDArray, and its output is always compact.

Having the dedicated operation for it would also work well for dynamically-shaped arguments. In those cases, we wouldn't know until runtime whether the operation requires a copy or not in order to provide an aligned argument.

Enable special ops that handle inputs which can come from a view (likely only the LoRA ones), which can inline view operations into the ops.

I agree with aiming to have views be fused with later operations where possible, though I'd add that this is not LoRA-specific functionality. Anywhere that CombineParallelMatmul can be used to improve the matmul performance, a view into the result can be used to avoid an unnecessary copy from the output.
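To make that concrete, a hedged sketch (an editor's illustration, not code from this PR): assuming the fused matmul result is laid out with the three projections stacked along the leading axis, so that each slice is contiguous, the outputs can be exposed as zero-copy views rather than as copies. The shapes and the 4-byte float32 element size below are illustrative assumptions.

```python
# Editor's sketch, not code from this PR.  Assumes a fused (3*seq, hidden)
# float32 result with the three outputs stacked along the leading axis.
from tvm import relax

seq, hidden = 128, 256
fused = relax.Var("fused", relax.TensorStructInfo([3 * seq, hidden], "float32"))

slice_bytes = seq * hidden * 4  # bytes in one (seq, hidden) float32 slice
q = relax.op.memory.view(fused, shape=relax.ShapeExpr([seq, hidden]))
k = relax.op.memory.view(
    fused,
    shape=relax.ShapeExpr([seq, hidden]),
    relative_byte_offset=relax.PrimValue(slice_bytes),
)
v = relax.op.memory.view(
    fused,
    shape=relax.ShapeExpr([seq, hidden]),
    relative_byte_offset=relax.PrimValue(2 * slice_bytes),
)
```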

  • R.view should perhaps be renamed to R.memory.view; this is a more advanced operator that carries certain assumptions and is likely not something we want to advertise generally for now.

I think the only assumption it makes is that a platform supports casting of pointers.

Regarding names, I agree that R.memory.view is a better name for it, and will update the PR.

- Rename `R.view` to `R.memory.view`
- Rename `relax.op.view` to `relax.op.memory.view`
@Lunderberg
Contributor Author

Changes have been made as requested, ready for re-review.

@tqchen
Member

tqchen commented May 1, 2024

R.memory.ensure_compact will actually return a new NDArray whose elem_offset = 0, and for devices that allow explicit pointer moving, this is something that can accelerate backends.

@Lunderberg
Contributor Author

Having R.memory.ensure_aligned makes sense, as that will allow us to explicitly mark where additional copies are required after applying R.memory.view, before passing to a compute kernel. I'm putting it together as a follow-up PR.

Any other concerns? I think this PR is ready to merge.

@tqchen tqchen dismissed their stale review May 2, 2024 18:46

dismissing my previous requests as they are addressed

@tqchen
Member

tqchen commented May 2, 2024

Thank you! My previous comments are addressed. Unfortunately I didn't get a chance to do a thorough read; it would be good to get a review from another person, so I just dismissed my previous comment.

@Lunderberg
Contributor Author

Sounds good, and thank you for the feedback!

# specific language governing permissions and limitations
# under the License.

"""Operations that act on the DLTensor container """
Member

Need update here

Contributor Author

Sounds good, and updated.

dtype: Optional[Expr] = None,
relative_byte_offset: Optional[Expr] = None,
) -> Expr:
"""Broadcasts a tensor to a specified shape.
Member

Need update here

Contributor Author

Thank you for the catch, and updated.

@@ -296,13 +298,17 @@ class CallableProxy(StructInfoProxy):
purity : bool
Whether the callable is pure.

derive_func: Optional[Union[str, tvm.ir.EnvFunc]]
The derivation function for the outputq
Member

Typo outputq?

And what is a derivation function?

Contributor Author

Typo, updated to output.

The derivation function is equivalent to the FInferStructInfo attribute set for relax operations, but can be applied to an arbitrary function. It can be used in cases where a PackedFunc may be called and the output StructInfo should be derived from the input arguments, rather than taken from sinfo_args. This functionality has existed in the C++ API for a while, but it looks like this is the first time it's been exposed through the Python API, so I fleshed out the description here.

@masahi
Member

masahi commented May 8, 2024

@tvm-bot rerun

@masahi masahi merged commit 4c1ebcf into apache:main May 9, 2024
20 checks passed
@Lunderberg Lunderberg deleted the relax_implement_view_operator branch May 9, 2024 13:21
Lunderberg pushed a commit that referenced this pull request Aug 6, 2024
… R.view (#17145)

Previously, `R.view` was legalized to an extern call to
`runtime.TVMArrayCreateView` during `LegalizeOps`. This call to an extern
func can't be properly handled by `StaticBlockPlanMemory`, because it
assumes the extern func does not retain the input buffer; an extern func
returning a view of its input would break the ref count of the
buffer. This PR defers the legalization of `R.view` so that it can be
explicitly handled by memory planning.

A new op `R.ensure_aligned` is added as discussed in #16955