-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support metadata for passing by value task arguments #5527
Conversation
Test FAILed. |
Test PASSed. |
Test PASSed. |
Test PASSed. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I left a few comments.
src/ray/core_worker/common.h
Outdated
|
||
/// Id of the argument, if passed by reference, otherwise nullptr. | ||
const std::shared_ptr<ObjectID> id_; | ||
/// Data of the argument, if passed by value, otherwise nullptr. | ||
const std::shared_ptr<Buffer> data_; | ||
/// Metadata of the argument, if passed by value, otherwise nullptr. | ||
const std::shared_ptr<Buffer> metadata_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to reuse the existing RayObject
to present the by-value param in the TaskArg
? That way it should be more clear and more consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also for the data_ and metadata_, could we maintain the following invariants to make the code simpler?
- one of them must be non-null;
- if it's not null, then the buffer must be valid, that is, the size of the buffer must be greater than 0. Thus can only reply on the validity of the pointer itself, without need to further check its size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether forbidding empty buffers is a better choice. Maybe we can add two more functions, HasData
and HasMetadata
.
Do you have any concerns about processing empty buffers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just found that RayObject
already has the HasMetadata
function. I think it's better to ensure that data_
cannot be nullptr yet is allowed to be empty. As for metadata_
, we don't set any restrictions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain a bit more on why it's better to ensure data_ cannot be nullptr?
I think for error conditions data_
could nullptr?
I preferred to not have empty buffers because:
- it's more intuitive and more clear conceptually. When it's not a nullptr, it's always valid.
- in most cases we only have
data_
but notmetadata_
, and we can save a memory allocation/free in this case. - I'm a bit concerned that having empty buffers might be error-prone. e.g. it's possible that people might only check the validity of the pointer, but don't check if buffer size is empty, then it might cause some bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed my mind. I think we should ensure that the same set of rules apply to both data_
and metadata_
to avoid confusion. I tried to add checks in the constructor of RayObject
to avoid empty buffers, but it just made it worse. Tests failed due to the checks and I can't find out the source of empty buffers. And it cost me hours.
I suggest to keep it as is. We can add more checks after we have an agreement on nullptr
vs. empty buffer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Then how about just keep it for now, and leave a todo that we can revisit this later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todo added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added checks to avoid empty buffers when constructing a RayObject
instance.
Test FAILed. |
Test PASSed. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly looks good, just please be very clear about the semantics of the RayObject
as it is getting more and more complicated.
src/ray/core_worker/common.h
Outdated
} | ||
|
||
RAY_CHECK((data_ && data_->Size()) || (metadata_ && metadata_->Size())) | ||
<< "Data and metadat cannot be both empty."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
<< "Data and metadat cannot be both empty."; | |
<< "Data and metadata cannot both empty."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
src/ray/core_worker/common.h
Outdated
} | ||
|
||
/// Whether this object has metadata. | ||
bool HasMetadata() const { return metadata_ != nullptr; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the invariant is that the object has either data_
or metadata_
, please also add a HasData
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
src/ray/core_worker/common.h
Outdated
@@ -20,6 +20,63 @@ struct RayFunction { | |||
const std::vector<std::string> function_descriptor; | |||
}; | |||
|
|||
/// Binary representation of ray object. | |||
class RayObject { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please comment that only one of data_
or metadata_
must be present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/ray/core_worker/common.h
Outdated
// If this object is required to hold a copy of the data, | ||
// make a copy if the passed in buffers don't already have a copy. | ||
if (data_ && !data_->OwnsData()) { | ||
data_ = std::make_shared<LocalMemoryBuffer>(data_->Data(), data_->Size(), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data_ = std::make_shared<LocalMemoryBuffer>(data_->Data(), data_->Size(), true); | |
data_ = std::make_shared<LocalMemoryBuffer>(data_->Data(), data_->Size(), /*flag_name=*/true); |
Same for below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/ray/core_worker/common.h
Outdated
RayObject(const std::shared_ptr<Buffer> &data, | ||
const std::shared_ptr<Buffer> &metadata = nullptr, bool copy_data = false) | ||
: data_(data), metadata_(metadata), has_data_copy_(copy_data) { | ||
if (has_data_copy_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that this PR didn't introduce this flag, but can you please explain why we need the has_data_copy_
flag and its semantics in the header comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when it's set, when should the caller set it, what could go wrong if it isn't set when it should be, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zhijunfu Can you help answer the questions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is related to Buffer
definition, please refer to src/ray/common/buffer.h.
For example, by default when you initialize a LocalMemoryBuffer
with a data pointer and a length, it just assigns the pointer and length, but doesn't copy the data content, this is for performance reasons, but in this case the buffer cannot ensure data validity, it instead relies the lifetime passed in data pointer.
This is fine for most cases - for example when you put an object into store, you create a temporary RayObject and don't want to do an extra copy. But in some cases you do want to always hold a valid data - for example, memory store uses RayObject
to represent objects, in this case you actually want the object data to remain valid after user puts it into store.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that seems reasonable but we need to document it better (basically, the content of this comment). The comment in src/ray/common/buffer.h
also doesn't have much more information.
src/ray/core_worker/common.h
Outdated
@@ -20,6 +20,63 @@ struct RayFunction { | |||
const std::vector<std::string> function_descriptor; | |||
}; | |||
|
|||
/// Binary representation of ray object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this comment more useful or remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the updated comment.
src/ray/core_worker/common.h
Outdated
bool HasMetadata() const { return metadata_ != nullptr; } | ||
|
||
private: | ||
// TODO (kfstorm): Currently both a null pointer and a pointer points to a buffer with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix this before merging. Shouldn't be too hard and can easily get lost afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said in the discussion with @zhijunfu, I tried it, but a lot of tests failed and I can't find out all the sources of empty buffers after hours. You may continue to discuss it on that thread.
src/ray/core_worker/common.h
Outdated
// zero size means empty data/metadata. We'd better pick one and treat the other as | ||
// invalid. | ||
|
||
/// Data of the ray object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/ray/core_worker/common.h
Outdated
|
||
/// Data of the ray object. | ||
std::shared_ptr<Buffer> data_; | ||
/// Metadata of the ray object. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -9,8 +11,15 @@ | |||
public byte[] metadata; | |||
|
|||
public NativeRayObject(byte[] data, byte[] metadata) { | |||
Preconditions.checkNotNull(data); | |||
Preconditions.checkNotNull(metadata); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should metadata
be a zero-length array when there's no metadata? If so, we can document this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. This is something I didn't keep consistent with the C++ code. I'll make null pointer and zero-length array both valid. If you prefer to allow only one of them, we can discuss it in the first thread.
byte[] data = nativeRayObject.data; | ||
|
||
// If meta is not null, deserialize the object from meta. | ||
if (meta != null && meta.length > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meta != null
isn't needed now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I'll make null pointer valid, it is needed.
|
||
private FunctionArg(ObjectId id, byte[] data) { | ||
private FunctionArg(ObjectId id, NativeRayObject value) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check (id == null) != (value == null)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a private constructor, but OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this check can help catch bugs eariler
src/ray/core_worker/common.h
Outdated
@@ -20,6 +20,63 @@ struct RayFunction { | |||
const std::vector<std::string> function_descriptor; | |||
}; | |||
|
|||
/// Binary representation of ray object. | |||
class RayObject { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better put this RayObject
class in src/common/
, because raylet may also need to use this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. can you fix the conflict? thx
Test PASSed. |
@raulchen Conflicts solved. |
/// \param[in] data Data of the ray object. | ||
/// \param[in] metadata Metadata of the ray object. | ||
/// \param[in] copy_data Whether this class should hold a copy of data. | ||
RayObject(const std::shared_ptr<Buffer> &data, const std::shared_ptr<Buffer> &metadata, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment explaining the has_data_copy
flag and why it's necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/ray/common/ray_object.h
Outdated
bool HasMetadata() const { return metadata_ != nullptr && metadata_->Size() > 0; } | ||
|
||
private: | ||
// TODO (kfstorm): Currently both a null pointer and a pointer points to a buffer with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address this before we merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added checks to avoid empty buffers.
RAY_CHECK(data_ != nullptr) << "This argument isn't passed by value."; | ||
return data_; | ||
const RayObject &GetValue() const { | ||
RAY_CHECK(value_ != nullptr) << "This argument isn't passed by value."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try not to use RAY_CHECK
in the core worker as it causes an Abort trap: 6
, making it very inconvenient to debug (and horrible for users if it happens somehow). Better to use RAY_RETURN_NO_OK
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is that possible? RAY_RETURN_NOT_OK
is used for functions with the return type of Status
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This RAY_CHECK
should not get triggered by user code. If it fails, it must be a bug in ray codebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it's better to use RAY_CHECK
to catch system bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, sorry - of course you can't use RAY_RETURN_NOT_OK
. Actually, isn't it better to define separate classes for by value and by reference arguments instead of having these checks? That way the compiler can catch these bugs for us...
Test FAILed. |
Test PASSed. |
@edoakes All comments should be addressed. Please let us know if you have other comments. I'll postpone merging this PR until tomorrow. |
Test PASSed. |
LGTM aside from my comment about defining separate structs/classes for by value and by reference args. You can push that to another PR if this one is blocking things, but seems much better than |
@edoakes By-value and by-reference arguments need different methods ( I'll merge this PR first, as the refactor may need more discussions. |
Why are these changes needed?
This feature is needed by direct actor call, as it only accepts passing by value arguments. When passing byte array between Java and Python, we need metadata in task arguments to make passing byte array as value possible. Hence, this PR is a prerequisite of #5504.
What do these changes do?
Add
matadata
field forTaskArg
incommon.proto
.Update the object serialization code of both Java and Python.
Extract
serialize
anddeserialzie
methods ofObjectStore
to a new classObjectSerializer
.Related issue number
#5029
Linter
scripts/format.sh
to lint the changes in this PR.