Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support metadata for passing by value task arguments #5527

Merged
merged 15 commits into from
Sep 8, 2019

Conversation

kfstorm
Copy link
Member

@kfstorm kfstorm commented Aug 24, 2019

Why are these changes needed?

This feature is needed by direct actor call, as it only accepts passing by value arguments. When passing byte array between Java and Python, we need metadata in task arguments to make passing byte array as value possible. Hence, this PR is a prerequisite of #5504.

What do these changes do?

Add matadata field for TaskArg in common.proto.

Update the object serialization code of both Java and Python.

Extract serialize and deserialzie methods of ObjectStore to a new class ObjectSerializer.

Related issue number

#5029

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16519/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16521/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16520/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16523/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16522/
Test PASSed.

Copy link
Contributor

@zhijunfu zhijunfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I left a few comments.


/// Id of the argument, if passed by reference, otherwise nullptr.
const std::shared_ptr<ObjectID> id_;
/// Data of the argument, if passed by value, otherwise nullptr.
const std::shared_ptr<Buffer> data_;
/// Metadata of the argument, if passed by value, otherwise nullptr.
const std::shared_ptr<Buffer> metadata_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to reuse the existing RayObject to present the by-value param in the TaskArg? That way it should be more clear and more consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also for the data_ and metadata_, could we maintain the following invariants to make the code simpler?

  • one of them must be non-null;
  • if it's not null, then the buffer must be valid, that is, the size of the buffer must be greater than 0. Thus can only reply on the validity of the pointer itself, without need to further check its size.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether forbidding empty buffers is a better choice. Maybe we can add two more functions, HasData and HasMetadata.

Do you have any concerns about processing empty buffers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just found that RayObject already has the HasMetadata function. I think it's better to ensure that data_ cannot be nullptr yet is allowed to be empty. As for metadata_, we don't set any restrictions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain a bit more on why it's better to ensure data_ cannot be nullptr?

I think for error conditions data_ could nullptr?

I preferred to not have empty buffers because:

  • it's more intuitive and more clear conceptually. When it's not a nullptr, it's always valid.
  • in most cases we only have data_ but not metadata_, and we can save a memory allocation/free in this case.
  • I'm a bit concerned that having empty buffers might be error-prone. e.g. it's possible that people might only check the validity of the pointer, but don't check if buffer size is empty, then it might cause some bugs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed my mind. I think we should ensure that the same set of rules apply to both data_ and metadata_ to avoid confusion. I tried to add checks in the constructor of RayObject to avoid empty buffers, but it just made it worse. Tests failed due to the checks and I can't find out the source of empty buffers. And it cost me hours.

I suggest to keep it as is. We can add more checks after we have an agreement on nullptr vs. empty buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Then how about just keep it for now, and leave a todo that we can revisit this later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Todo added.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added checks to avoid empty buffers when constructing a RayObject instance.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16537/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16540/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16559/
Test PASSed.

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good, just please be very clear about the semantics of the RayObject as it is getting more and more complicated.

}

RAY_CHECK((data_ && data_->Size()) || (metadata_ && metadata_->Size()))
<< "Data and metadat cannot be both empty.";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<< "Data and metadat cannot be both empty.";
<< "Data and metadata cannot both empty.";

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

}

/// Whether this object has metadata.
bool HasMetadata() const { return metadata_ != nullptr; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the invariant is that the object has either data_ or metadata_, please also add a HasData method.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@@ -20,6 +20,63 @@ struct RayFunction {
const std::vector<std::string> function_descriptor;
};

/// Binary representation of ray object.
class RayObject {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please comment that only one of data_ or metadata_ must be present.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// If this object is required to hold a copy of the data,
// make a copy if the passed in buffers don't already have a copy.
if (data_ && !data_->OwnsData()) {
data_ = std::make_shared<LocalMemoryBuffer>(data_->Data(), data_->Size(), true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
data_ = std::make_shared<LocalMemoryBuffer>(data_->Data(), data_->Size(), true);
data_ = std::make_shared<LocalMemoryBuffer>(data_->Data(), data_->Size(), /*flag_name=*/true);

Same for below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

RayObject(const std::shared_ptr<Buffer> &data,
const std::shared_ptr<Buffer> &metadata = nullptr, bool copy_data = false)
: data_(data), metadata_(metadata), has_data_copy_(copy_data) {
if (has_data_copy_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this PR didn't introduce this flag, but can you please explain why we need the has_data_copy_ flag and its semantics in the header comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when it's set, when should the caller set it, what could go wrong if it isn't set when it should be, etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhijunfu Can you help answer the questions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to Buffer definition, please refer to src/ray/common/buffer.h.

For example, by default when you initialize a LocalMemoryBuffer with a data pointer and a length, it just assigns the pointer and length, but doesn't copy the data content, this is for performance reasons, but in this case the buffer cannot ensure data validity, it instead relies the lifetime passed in data pointer.

This is fine for most cases - for example when you put an object into store, you create a temporary RayObject and don't want to do an extra copy. But in some cases you do want to always hold a valid data - for example, memory store uses RayObject to represent objects, in this case you actually want the object data to remain valid after user puts it into store.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that seems reasonable but we need to document it better (basically, the content of this comment). The comment in src/ray/common/buffer.h also doesn't have much more information.

@@ -20,6 +20,63 @@ struct RayFunction {
const std::vector<std::string> function_descriptor;
};

/// Binary representation of ray object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this comment more useful or remove it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the updated comment.

bool HasMetadata() const { return metadata_ != nullptr; }

private:
// TODO (kfstorm): Currently both a null pointer and a pointer points to a buffer with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix this before merging. Shouldn't be too hard and can easily get lost afterwards.

Copy link
Member Author

@kfstorm kfstorm Sep 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said in the discussion with @zhijunfu, I tried it, but a lot of tests failed and I can't find out all the sources of empty buffers after hours. You may continue to discuss it on that thread.

// zero size means empty data/metadata. We'd better pick one and treat the other as
// invalid.

/// Data of the ray object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/// Data of the ray object.
std::shared_ptr<Buffer> data_;
/// Metadata of the ray object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/ray/core_worker/common.h Show resolved Hide resolved
@@ -9,8 +11,15 @@
public byte[] metadata;

public NativeRayObject(byte[] data, byte[] metadata) {
Preconditions.checkNotNull(data);
Preconditions.checkNotNull(metadata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should metadata be a zero-length array when there's no metadata? If so, we can document this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. This is something I didn't keep consistent with the C++ code. I'll make null pointer and zero-length array both valid. If you prefer to allow only one of them, we can discuss it in the first thread.

byte[] data = nativeRayObject.data;

// If meta is not null, deserialize the object from meta.
if (meta != null && meta.length > 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meta != null isn't needed now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I'll make null pointer valid, it is needed.


private FunctionArg(ObjectId id, byte[] data) {
private FunctionArg(ObjectId id, NativeRayObject value) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check (id == null) != (value == null)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a private constructor, but OK.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this check can help catch bugs eariler

@@ -20,6 +20,63 @@ struct RayFunction {
const std::vector<std::string> function_descriptor;
};

/// Binary representation of ray object.
class RayObject {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better put this RayObject class in src/common/, because raylet may also need to use this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

@kfstorm kfstorm requested review from raulchen and edoakes September 2, 2019 11:13
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16723/
Test PASSed.

Copy link
Contributor

@raulchen raulchen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. can you fix the conflict? thx

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16831/
Test PASSed.

@kfstorm
Copy link
Member Author

kfstorm commented Sep 6, 2019

@raulchen Conflicts solved.

edoakes
edoakes previously requested changes Sep 7, 2019
src/ray/common/ray_object.h Outdated Show resolved Hide resolved
src/ray/common/ray_object.h Outdated Show resolved Hide resolved
/// \param[in] data Data of the ray object.
/// \param[in] metadata Metadata of the ray object.
/// \param[in] copy_data Whether this class should hold a copy of data.
RayObject(const std::shared_ptr<Buffer> &data, const std::shared_ptr<Buffer> &metadata,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment explaining the has_data_copy flag and why it's necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

bool HasMetadata() const { return metadata_ != nullptr && metadata_->Size() > 0; }

private:
// TODO (kfstorm): Currently both a null pointer and a pointer points to a buffer with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address this before we merge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added checks to avoid empty buffers.

src/ray/common/task/task_spec.cc Outdated Show resolved Hide resolved
RAY_CHECK(data_ != nullptr) << "This argument isn't passed by value.";
return data_;
const RayObject &GetValue() const {
RAY_CHECK(value_ != nullptr) << "This argument isn't passed by value.";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try not to use RAY_CHECK in the core worker as it causes an Abort trap: 6, making it very inconvenient to debug (and horrible for users if it happens somehow). Better to use RAY_RETURN_NO_OK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is that possible? RAY_RETURN_NOT_OK is used for functions with the return type of Status.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This RAY_CHECK should not get triggered by user code. If it fails, it must be a bug in ray codebase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think it's better to use RAY_CHECK to catch system bugs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sorry - of course you can't use RAY_RETURN_NOT_OK. Actually, isn't it better to define separate classes for by value and by reference arguments instead of having these checks? That way the compiler can catch these bugs for us...

src/ray/core_worker/common.h Outdated Show resolved Hide resolved
src/ray/core_worker/common.h Outdated Show resolved Hide resolved
src/ray/core_worker/test/core_worker_test.cc Outdated Show resolved Hide resolved
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16856/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16859/
Test PASSed.

@raulchen
Copy link
Contributor

raulchen commented Sep 7, 2019

@edoakes All comments should be addressed. Please let us know if you have other comments. I'll postpone merging this PR until tomorrow.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16864/
Test PASSed.

@edoakes
Copy link
Contributor

edoakes commented Sep 7, 2019

LGTM aside from my comment about defining separate structs/classes for by value and by reference args. You can push that to another PR if this one is blocking things, but seems much better than RAY_CHECKing the fields.

@raulchen
Copy link
Contributor

raulchen commented Sep 8, 2019

@edoakes By-value and by-reference arguments need different methods (GetReference and GetValue). If we use 2 separate classes, we'll need to check type and downcast the pointer, e.g., reinterpret_cast<ByValueTaskArg>(task_arg)->GetValue(). This looks more inconvenient.

I'll merge this PR first, as the refactor may need more discussions.

@raulchen raulchen merged commit d8f5804 into ray-project:master Sep 8, 2019
@raulchen raulchen deleted the pass_by_value_metadata branch September 8, 2019 03:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants