Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UpdateBy gRPC #2635

Merged
merged 5 commits into from
Aug 29, 2022
Merged

UpdateBy gRPC #2635

merged 5 commits into from
Aug 29, 2022

Conversation

devinrsmith
Copy link
Member

@devinrsmith devinrsmith commented Jul 11, 2022

Fixes #2607

Copy link
Member

@niloc132 niloc132 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👎 as it stands - this is introducing a lot (all? not sure) of QST via gRPC in this one new feature, and that means there is a lot of code that isn't obvious from the pull request which is now reachable from any client. Easiest example: a Expression.type.raw is an easy way to smuggle arbitrary java for execution, but there are no safeguards introduced to prevent this (the updateby group_by field may hold those raw expression strings, and engine-table's SelectColumn.ExpressionAdapter will invoke SelectColumnFactory.getExpression(..) on them without a second look).

At the very least, if this feature can only be implemented by using QST's Expression types, I would a) like to see the qst protos moved to a new .proto file, and b) a safe-by-default config option preventing the use of the UpdateByGrpcImpl type.


message UpdateByRequest {

message Options {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC not all languages qualify these with namespacing, it probably would be a good idea to give more descriptive names to avoid collisions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's unfortunate :/ but good to know.

Expression expression = 2;
}

message Pair {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convention elsewhere has been to just encode these as foo=bar/foo strings - ComboAggregateRequest.Aggregate.match_pairs for example.

i see the point that we can be more descriptive by describing it this way, but for a=a we have to write the string twice anyways, and a plain string for each of the two fields is also not accurate, since it can only be java identifiers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add some documentation - but the idea is that input_column_name can be empty for the a=a case:

    public static Pair adapt(io.deephaven.proto.backplane.grpc.Pair pair) {
        final ColumnName output = ColumnName.of(pair.getOutputColumnName());
        return pair.getInputColumnName().isEmpty() ? output : Pair.of(ColumnName.of(pair.getInputColumnName()), output);
    }


optional int32 chunk_capacity = 2;

optional double maxStaticSparseMemoryOverhead = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

snake case for message fields (maximumLoadFactor, targetLoadFactor too)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

Spec spec = 1;
repeated Pair pair = 2;
}
oneof type {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a oneof?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's explicitly modeled this way at the table api level b/c there is room for additional types in the future:

public interface UpdateByClause {
...
    interface Visitor<T> {
        T visit(ColumnUpdateClause clause);
    }


enum BadDataBehavior {
// Reset the state for the bucket to {@code null} when invalid data is encountered.
RESET = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since 0 is the "This is not set" default, consider making it fail-safe with THROW, so as to not surprise users who fail to set this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In context, at least right now, this is always being prefaced w/ optional. The idea is that if the client does not provide a value for it, it will inherit the appropriate server defaults (the defaults vary depending on the specific field):

message UpdateByEmaOptions {
  optional BadDataBehavior on_null_value = 1;
  optional BadDataBehavior on_nan_value = 2;
  ...

I'll add documentation at this layer to explain that.

That said, I'm not against making THROW = 0, but there may be reasons to want the order to be the same as the java enum it is mapping? (We can update the java enum as well...)

oneof type {
string column_name = 1;
int64 long_value = 2;
string raw = 3;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only raw and long_value? if various other numbers/etc can be raw, why can't long?

Also it seems odd to refer to the "type" as being "column_name" or "raw" - perhaps expression could have a "value" which could be "column reference" or "long literal", "raw X value", etc?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just found the Expression type hierarchy in existing code, and I'm only more confused - why is RawString not a Value, but ColumnName is? Shouldn't ColumnName be a "Reference" or something rather than a Value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

column name, long value, and raw are the only plumbed through parts at the table api layer right now, but I agree we should flesh these out more (#830). I can complete the loop a bit here in these regards.

As it is right now,

Expression = Value | RawString
Value = ColumnName | long (literal)

Maybe we should add more hierarchy here, with "Reference" and "Literal" instead of "Value"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed this proto.

@devinrsmith
Copy link
Member Author

At first I was a bit confused by your comment on "easy way to smuggle arbitrary java for execution" - is there anything preventing the equivalent via a view or select? Looking at the code now though, I see io.deephaven.server.table.validation.ColumnExpressionValidator#validateColumnExpressions... it looks like this is the protection you are referring to?

I'll adapt these checks in.

I'm hoping that we can (eventually) migrate from grpc -> table api into grpc -> qst -> table api, as I think it reduces duplication and provides other benefits IMO.

@devinrsmith
Copy link
Member Author

Looking deeper, I'm having trouble finding out if we disable arbitrary code execution via gRPC table operations api today. Afaict, we only do "this looks like a valid operation" validation. I would also argue, this should be the responsibility of the engine, not the grpc layer. Trying to get clarification from @rcaudy

Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments. It's a lot of code for not a lot of functionality.


private static io.deephaven.proto.backplane.grpc.Pair adapt(ColumnName pair) {
return io.deephaven.proto.backplane.grpc.Pair.newBuilder()
.setOutputColumnName(pair.name())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It surprised me that you were opting to not make input column name explicit.

@rcaudy
Copy link
Member

rcaudy commented Jul 28, 2022

I don't hate this. I do sort of feel like the layering is getting excessive. I think the evidence of this can be inferred from how long it took you (@devinrsmith ) and I to trace through some of the table creation logic for parent resolution.

@devinrsmith devinrsmith requested review from niloc132 and rcaudy July 29, 2022 00:35
@devinrsmith
Copy link
Member Author

Note: this PR also brings selectDistinct in-line with the security of view, update, etc (previously, it was very locked down):

image

@devinrsmith devinrsmith mentioned this pull request Jul 29, 2022
Copy link
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marker, I've reviewed up to here.

@devinrsmith
Copy link
Member Author

I've force pushed removing any changes to TableSpec and gRPC impl via TableSpec. I haven't resolved any concerns wrt the .proto structure yet.

@pete-petey pete-petey modified the milestones: Jul 2022, Aug 2022 Aug 16, 2022
// public TableSpec apply(TableSpec spec, String[] formulas) {
// return spec.selectDistinct(formulas);
// }
// }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not meant to be added?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove it, and looks like I need to also fix some merge conflicts.

@devinrsmith devinrsmith requested a review from niloc132 August 18, 2022 17:48
@devinrsmith devinrsmith merged commit dffa8cc into deephaven:main Aug 29, 2022
@devinrsmith devinrsmith deleted the updateby-grpc branch August 29, 2022 14:21
@github-actions github-actions bot locked and limited conversation to collaborators Aug 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UpdateByTable gRPC impl
4 participants