JOIN on non-ROWKEY and PARTITION BY ROWKEY is broken #4053

agavra · 2019-12-05T17:03:01Z

Describe the bug

    {
      "name": "partition by ROWKEY in join on non-ROWKEY",
      "statements": [
        "CREATE STREAM L (A STRING, B STRING) WITH (kafka_topic='LEFT', value_format='JSON', KEY='A');",
        "CREATE STREAM R (C STRING, D STRING) WITH (kafka_topic='RIGHT', value_format='JSON', KEY='C');",
        "CREATE STREAM OUTPUT AS SELECT L.A, L.B, R.C, R.D, L.ROWKEY, R.ROWKEY FROM L JOIN R WITHIN 10 SECONDS ON L.B = R.D PARTITION BY L.ROWKEY;"
      ],
      "comments": [
        "This test demonstrates a problem when we JOIN on a non-ROWKEY field and then PARTITION BY ",
        "a ROWKEY field. Note that the key is 'join' when it should be 'a' and the key-field is 'B' ",
        "when it should be 'L_ROWKEY'"
      ],
      "inputs": [
        {"topic": "LEFT", "key": "a", "value": {"A": "a", "B": "join"}},
        {"topic": "RIGHT", "key": "c", "value": {"C": "c", "D": "join"}}
      ],
      "outputs": [
        {"topic": "OUTPUT", "key": "join", "value": {"A": "a", "B": "join", "C": "c", "D": "join", "L_ROWKEY": "a", "R_ROWKEY": "c"}}
      ],
      "post": {
        "sources": [
          {"name": "OUTPUT", "type": "stream", "keyField": "B"}
        ]
      }
    }

To Reproduce
Run the partition-by.json QTT test.

Expected behavior
The key field is L_ROWKEY and the key is a.

Actual behavior
The key field is B and the key is join.


blueedgenick · 2019-12-09T21:36:08Z

And therein lies the challenge with this whole debate, if I've followed it correctly: whether we want a user to consider the PARTITION BY to be something that happens "based on the source, as a part of processing the query", or something that happens "afterwards" and affects the output of whatever else the query specified. I would argue that it's far more SQL-y to think of it the latter way - SQL is supposed to be declarative after all ;) - it should say "what I want to come out of this query" and very little about "how the system should execute this query". I do agree that we're blurring the lines a little with PARTITION BY in that it forcibly mandates something about how the results are written to storage, but the same base principle should apply. I'm OK to say that "PARTITION BY considers the columns from your input(s) in its evaluation", which I think was the genesis of this, but that seems entirely different than saying that "your INPUT is re-partitioned by what you specify". PARTITION BY should have its effects applied to the _output_ of the statement.

…

On Mon, Dec 9, 2019 at 12:00 PM Almog Gavra wrote:

> I'm actually not sure this is a bug - if we think of PARTITION BY as happening before the JOIN (which it has to, if it is to work on the source schema) then the partition actually happens on the source (L) before the join repartition. This behavior is pretty useless and can be confusing, so I think we should throw an error in this situation.
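To make the "PARTITION BY applies to the output" reading concrete, here is a minimal Python sketch of the test case from the bug report. It is a toy model, not ksqlDB code: `Record`, `stream_join`, and `partition_by` are hypothetical names, the `WITHIN 10 SECONDS` window is ignored, and the assumption that the join result is keyed by the join column is inferred only from the actual output reported above.

```python
from typing import NamedTuple


class Record(NamedTuple):
    key: str
    value: dict


def stream_join(left, right, left_col, right_col):
    """Toy inner join (assumption: the result is keyed by the join column).

    Each side's original key is carried along as L_ROWKEY / R_ROWKEY.
    The join window from the original statement is not modeled.
    """
    out = []
    for l in left:
        for r in right:
            if l.value[left_col] == r.value[right_col]:
                value = {**l.value, **r.value,
                         "L_ROWKEY": l.key, "R_ROWKEY": r.key}
                out.append(Record(key=l.value[left_col], value=value))
    return out


def partition_by(records, column):
    """Re-key the *output* records by one of their columns."""
    return [Record(key=rec.value[column], value=rec.value) for rec in records]


# Inputs taken from the test case in the bug report.
left = [Record("a", {"A": "a", "B": "join"})]
right = [Record("c", {"C": "c", "D": "join"})]

joined = stream_join(left, right, "B", "D")
output = partition_by(joined, "L_ROWKEY")

print(joined[0].key)   # key after the join alone
print(output[0].key)   # key after PARTITION BY is applied to the output
```

In this model the join alone leaves the record keyed by the join column, which matches the actual behavior in the report, while applying `partition_by` afterwards yields the expected key taken from `L_ROWKEY`.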

agavra added the bug label Dec 5, 2019

agavra self-assigned this Dec 5, 2019

agavra mentioned this issue Dec 5, 2019

feat: expression support for PARTITION BY #4032

Merged

2 tasks

agavra mentioned this issue Dec 10, 2019

fix: properly set key when partition by ROWKEY and join on non-ROWKEY #4090

Merged

2 tasks

agavra closed this as completed in #4090 Dec 10, 2019
