{
"name": "partition by ROWKEY in join on non-ROWKEY",
"statements": [
"CREATE STREAM L (A STRING, B STRING) WITH (kafka_topic='LEFT', value_format='JSON', KEY='A');",
"CREATE STREAM R (C STRING, D STRING) WITH (kafka_topic='RIGHT', value_format='JSON', KEY='C');",
"CREATE STREAM OUTPUT AS SELECT L.A, L.B, R.C, R.D, L.ROWKEY, R.ROWKEY FROM L JOIN R WITHIN 10 SECONDS ON L.B = R.D PARTITION BY L.ROWKEY;"
],
"comments": [
"This test demonstrates a problem when we JOIN on a non-ROWKEY field and then PARTITION BY ",
"a ROWKEY field. Note that the key is 'join' when it should be 'a' and the key-field is 'B' ",
"when it should be 'L_ROWKEY'"
],
"inputs": [
{"topic": "LEFT", "key": "a", "value": {"A": "a", "B": "join"}},
{"topic": "RIGHT", "key": "c", "value": {"C": "c", "D": "join"}}
],
"outputs": [
{"topic": "OUTPUT", "key": "join", "value": {"A": "a", "B": "join", "C": "c", "D": "join", "L_ROWKEY": "a", "R_ROWKEY": "c"}}
],
"post": {
"sources": [
{"name": "OUTPUT", "type": "stream", "keyField": "B"}
]
}
}
To Reproduce
Run the partition-by.json QTT test.
Expected behavior
The key field is L_ROWKEY and the key value is a.
Actual behaviour
The key field is B and the key value is join.
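For comparison, here is a sketch of how the `outputs` and `post` sections of the test would presumably read if the behaviour matched the expectation. It is inferred only from the test's own comments (key 'a', key-field 'L_ROWKEY'), not taken from a passing run:

```json
{
  "outputs": [
    {"topic": "OUTPUT", "key": "a", "value": {"A": "a", "B": "join", "C": "c", "D": "join", "L_ROWKEY": "a", "R_ROWKEY": "c"}}
  ],
  "post": {
    "sources": [
      {"name": "OUTPUT", "type": "stream", "keyField": "L_ROWKEY"}
    ]
  }
}
```

Only the record key and the `keyField` differ from the current test; the value is unchanged.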
and therein lies the challenge with this whole debate, if I've followed it
correctly: whether we want a user to consider the PARTITION BY to
be something that happens "based on the source, as a part of processing the
query", or something that happens "afterwards" and affects the output of
whatever else the query specified. I would argue that it's far more SQL-y
to think of it the latter way - SQL is supposed to be declarative after all
;) - it should say "what I want to come out of this query" and very little
about "how the system should execute this query". I do agree that we're
blurring the lines a little with PARTITION BY in that it forcibly mandates
something about how the results are written to storage, but the same base
principle should apply. I'm OK to say that "PARTITION BY considers the
columns from your input(s) in its evaluation", which I think was the
genesis of this, but that seems entirely different than saying that "your
INPUT is re-partitioned by what you specify". PARTITION BY should have its
effects applied to the _output_ of the statement.
On Mon, Dec 9, 2019 at 12:00 PM Almog Gavra ***@***.***> wrote:
I'm actually not sure this is a bug - if we think of PARTITION BY as
happening before the JOIN (which it has to, if it is to work on the
source schema) then the partition actually happens on the source (L)
before the join repartition. This behavior is pretty useless and can be
confusing, so I think we should throw an error in this situation.