Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow statement to specify the casing (camel case, uppercase, etc) for field names when serialized to output topic #1039

Closed
dgcaron opened this issue Mar 23, 2018 · 25 comments · Fixed by #3477

Comments

@dgcaron
Copy link

dgcaron commented Mar 23, 2018

We are getting data from several sources using kafka connect and transform the data using ksql. One of the target databases is a document database (couchbase). When we store the data that is transformed by ksql all fields are uppercased. Aliasing fields did not change the behavior. The systems that rely on the data in couchbase expect the data to be camelcased which is not possible with v0.4 of ksql. Any workarounds or ways to retain the casing set by an alias (or source topic)?

@rodesai
Copy link
Contributor

rodesai commented Mar 28, 2018

This is not possible with KSQL at the moment. We would have to implement this as a feature that lets you specify (to some degree) how you want your field names serialized. Lets use this issue to track the feature request.

@rodesai rodesai changed the title All fieldnames are uppercased after a select as Allow statement to specify format for field names when serialized to output topic Mar 28, 2018
@apurvam apurvam changed the title Allow statement to specify format for field names when serialized to output topic Allow statement to specify the casing (camel case, uppercase, etc) for field names when serialized to output topic Mar 29, 2018
@dgcaron
Copy link
Author

dgcaron commented Apr 1, 2018

for the time being, i was thinking about implementing a SMT that uses a schema to rename the fields. is that a plausible work-around for now? The SMT would match the name of the field to the schema (case insensitive) and then replace the source field for the schema field.

@apurvam
Copy link
Contributor

apurvam commented Apr 1, 2018

You mean your sink connector writing to couch base will transform the field names to the correct casing? That should work.

@sneko
Copy link

sneko commented Sep 15, 2018

Any news since april to avoid UPPERCASING?

Thank you :)

@dgcaron
Copy link
Author

dgcaron commented Sep 16, 2018

I haven't seen any updates yet. I think this is quite a feature so I am hesitant to create a PR myself as I don't oversee all the parts involved (to be honest, I wouldn't know where to start what). We are using a simple SMT at the moment to rename the fields before they are written to the db. Unfortunately, that code is in an internal repo of a company.

The SMT is configured with a avro schema, with the uppercased fields in an alias property. When a new message arrives, the SMT searches for an occurrence in the aliases and then substitutes the field for the name in the schema.

    "name" : "camelCasedName",
    "type" : [ "null", "string" ],
    "default" : null,
    "aliases" : [ "UPPERCASEDNAME" ]
  }

@rmoff
Copy link
Member

rmoff commented Jan 21, 2019

I'm not sure in which version this changed, but as of 5.1 you can quote fields (and objects) to retain their casing.

CREATE STREAM "MixedCaseStream" AS \
            SELECT COL1, \
                   COL1 AS "lowercase_col1", \
                   COL1 AS "MixedCase_COL1", \
                   COL1 AS "UPPERCASE_COL1", \
                   COL1 AS NOTQUOTED_MixedCase_COL1 \
          FROM SOURCE;

More: https://rmoff.net/post/how-ksql-handles-case/

@maeglindeveloper
Copy link

maeglindeveloper commented Feb 14, 2019

Hi everyone, I'm facing and issue when I'm trying to PARTITION BY using a column name change as mentionned above (with quote). I'm using the 5.1.0 version of ksql.

Here is my command
CREATE STREAM "mynewstream" AS SELECT "userId" AS "userId" FROM "mystream" PARTITION BY "userId";

The error is the following one

"message": "io.confluent.ksql.util.KsqlException: Column USERID does not exist in the result schema. Error in Partition By clause.\n\tat

It seems that the column name is not resolved.
Any idea?
@rmoff @miguno

@mayankiiitm
Copy link

I'm not sure in which version this changed, but as of 5.1 you can quote fields (and objects) to retain their casing.

CREATE STREAM "MixedCaseStream" AS \
            SELECT COL1, \
                   COL1 AS "lowercase_col1", \
                   COL1 AS "MixedCase_COL1", \
                   COL1 AS "UPPERCASE_COL1", \
                   COL1 AS NOTQUOTED_MixedCase_COL1 \
          FROM SOURCE;

More: https://rmoff.net/post/how-ksql-handles-case/

When I rename and column name contains any lower case character, column becomes null. Source topic is getting the data. As long column name does not contains any lower case character everything works fine.

@mayankiiitm
Copy link

In the example above, lowercase_col1, MixedCase_COL1 will be null while UPPERCASE_COL1 will be equal to COL1

@JaredDarling
Copy link

JaredDarling commented Aug 14, 2019

👍 for getting this looked at. This is unexpected behavior and causes integration issues.

@apurvam
Copy link
Contributor

apurvam commented Aug 15, 2019

related to #2589

@macthestack
Copy link

We should allow setting a configuration value so that materialised tables that are created based on streams with AVRO schemas will honour original casing.

@pavelpe
Copy link

pavelpe commented Feb 23, 2020

Hi everyone, I'm facing and issue when I'm trying to PARTITION BY using a column name change as mentioned above (with quote). I'm using the 5.1.0 version of ksql.

Here is my command
CREATE STREAM "mynewstream" AS SELECT "userId" AS "userId" FROM "mystream" PARTITION BY "userId";

The error is the following one

"message": "io.confluent.ksql.util.KsqlException: Column USERID does not exist in the result schema. Error in Partition By clause.\n\tat

It seems that the column name is not resolved.
Any idea?
@rmoff @miguno

Any update regarding this issue? Will it be solved in future release?
I'm using KSQL 5.4.0 and it still happens.
Due to this issue, I have to create 2 streams instead one (single stream with mixed-cases field names and re-partitioning ). We use it extensively for with ES Sink connectors.

@agavra
Copy link
Contributor

agavra commented Feb 25, 2020

@pavelpe - there was a bug handling PARTITION BY clauses that was fixed in #4018 which is available in 0.6.0+ if you can use our docker based ksqldb deployments and will be available in 5.5.0 when it is released if you need to wait for confluent packaging.

@pavelpe
Copy link

pavelpe commented Feb 26, 2020

@pavelpe - there was a bug handling PARTITION BY clauses that was fixed in #4018 which is available in 0.6.0+ if you can use our docker based ksqldb deployments and will be available in 5.5.0 when it is released if you need to wait for confluent packaging.

Maybe I did not entirely understand bug behind #4018 but I just want to be sure that this fix also takes care of mentioned bug, doing PARTITION BY on quoted column name (specifically used in lower cased column names).

From @maeglindeveloper example:
CREATE STREAM "mynewstream" AS SELECT "userId" AS "userId" FROM "mystream" PARTITION BY "userId"; <= such column name.

I'm trying to clarify it because #4018 doesn't say anything about quoted column names.
Having query that consists (as mentioned in #4018):
PARTITION BY col1*col2 AS new_col1;
will result in schema:
NEW_COL1 INT KEY, NEW_COL2 INT. <= all columns names are upper-cased (default KSQL behavior).

Just want to be sure we talk here about same thing.

@fish-face
Copy link

I don't believe that #3477 really fixed this issue; while you can now specify the casing manually for each column, this is not feasible for data sources with many columns.

Is there some rationale behind the upper-casing? When auto-creating columns from Avro, KSQL already has case information which it can use. When creating tables, streams and topics, the user has just typed in the identifier and overriding what they typed without reason seems like a bad idea.

Worse, not all databases are case insensitive. In particular, postgres is case sensitive in exactly the opposite way to KSQL; unquoted identifiers are coerced to lower case by default. This makes a very bad user experience!

To me the logical default where identifiers absolutely must be coerced to some case or another would be to lower them. Not only because of postgres, but because identifiers should contrast with SQL keywords.

If there is no particular rationale I firmly believe that the default should be to leave column names unmolested or, at worst, lower-case them. This is of course a backwards incompatible change and would have to wait for a major version bump. Pending that, a setting which can alter the behaviour would be a great benefit.

@fish-face
Copy link

@agavra sorry to highlight you directly on an old issue but do you have any comment on the above? I can open a new issue if it is felt that this is too different.

@agavra
Copy link
Contributor

agavra commented Mar 2, 2020

hey @fish-face - thanks for your detailed thoughts! You definitely have a lot of valid concerns and we should address them.

while you can now specify the casing manually for each column, this is not feasible for data sources with many columns.

I totally agree that manually casing each column for something that already exists in schema registry is bad behavior, we should verify that this is the behavior and if so create a new ticket to track that.

On the other hand, I don't think we should change the default case sensitivity to "case sensitive" as this is not SQL standard (5.2.13 in SQL92):


         <regular identifier> ::= <identifier body>

         <identifier body> ::=
              <identifier start> [ { <underscore> | <identifier part> }... ]


         <identifier start> ::= !! See the Syntax Rules

         <identifier part> ::=
                <identifier start>
              | <digit>

         <delimited identifier> ::=
              <double quote> <delimited identifier body> <double quote>

...

     13)A <regular identifier> and a <delimited identifier> are equiva-
        lent if the <identifier body> of the <regular identifier> (with
        every letter that is a lower-case letter replaced by the equiva-
        lent upper-case letter or letters) and the <delimited identifier
        body> of the <delimited identifier> (with all occurrences of
        <quote> replaced by <quote symbol> and all occurrences of <dou-
        blequote symbol> replaced by <double quote>), considered as
        the repetition of a <character string literal> that specifies a
        <character set specification> of SQL_TEXT and an implementation-
        defined collation that is sensitive to case, compare equally
        according to the comparison rules in Subclause 8.2, "<comparison
        predicate>".

You can see here that the sql standard definition of identifier equivalence is actually to replace lower-case letters with upper-case ones and I'm guessing that's why we implemented it that way historically - but I agree that's a little aggressive and it would have identical behavior if we flipped that logic to lowercase instead of uppercase.

When creating tables, streams and topics, the user has just typed in the identifier and overriding what they typed without reason seems like a bad idea.

I think this is the default behavior for most databases. I believe postgres is actually also case-insensitive (but you are right that it implements this by lowercasing everything as opposed to uppercasing):

postgres=# CREATE TABLE foo (ID VARCHAR, COL2 BIGINT);
CREATE TABLE
postgres=# \d foo
                     Table "public.foo2"
 Column |       Type        | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
 id     | character varying |           |          |
 col2   | bigint            |           |          |

postgres=# INSERT INTO foo (Id, cOL2) VALUES ('hi', 1);
INSERT 0 1
postgres=# SELECT * FROM FOO;
 id | col2
----+------
 hi |    1
(1 row)

To me the logical default where identifiers absolutely must be coerced to some case or another would be to lower them. Not only because of postgres, but because identifiers should contrast with SQL keywords.

I don't think contrasting with SQL keywords is a valid concern. SQL keywords are case insensitive and some people prefer typing the keywords in lower case (again an example from postgres):

postgres=# cReaTe Table FOo3 (ID varchar);
CREATE TABLE
postgres=# \d foo3;
                     Table "public.foo3"
 Column |       Type        | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
 id     | character varying |           |          |

Let me know if this answered any lingering doubts you may have!

@fish-face
Copy link

@agavra thanks for the reply.

I will double-check the schema case and raise another issue if I can confirm it. There is also the matter of creating topics, which we notice KSQL upper-cases as well - let me know if you would like that as a separate issue.

I agree introducing case sensitivity would be a backwards step! (Of course topic names are case sensitive due to kafka itself so we don't have a choice there)

Postgres is in fact case-sensitive but the lowering masks this somewhat:

postgres=# CREATE TABLE "FOO" ("ID" VARCHAR, "COL2" BIGINT);
CREATE TABLE
postgres=# SELECT * FROM FOO;
ERROR:  relation "foo" does not exist
LINE 1: select * from FOO
postgres=# SELECT * FROM "FOO";
 ID 
----
(0 rows)

Behaviour for columns is identical. Notice the lowercase "relation "foo" does not exist" caused by postgres' lowering. This does seem contrary to the SQL standard you quote though I'm not familiar enough with it to be sure (e.g. I don't know for sure that what we're calling identifiers is what they mean).

Would it help to look at some other major databases to see if postgres (and hence our particular irritation here) is an anomaly?

Regarding case-contrast, I agree this is less important, though whether by convention or by standard the advantage remains with upper case keywords and lower/mixed case identifiers IMO :)

@fish-face
Copy link

the reason your examples in postgres are case sensitive is because you added the double quotes in the create statement

This is to emulate what happens if a KTable is pushed, via connect, to postgres, with upper-case columns. Sorry I wasn't clear. To further clarify: Postgres is never case insensitive, but if you don't double-quote, it will forcibly lower case everything. Thus if you, for any reason, have upper case column or table names, you must from that point on quote those names every time you use them, which is quite annoying for users.

This remains, of course, a bit of a postgres oddity. However I think it does highlight the dangers of interfering with cases.

I will double-check the schema case and raise another issue if I can confirm it.

It looks like the case where you're creating a stream from an avro-formatted topic is covered by this issue.

@agavra
Copy link
Contributor

agavra commented Mar 5, 2020

@fish-face I had realized what you meant after I had submitted my comment and deleted it, but I suppose not before you had responded! (For context for people looking at this discussion) I'm giving it another thought

@fish-face
Copy link

Haha, I happened to be writing part of a reply at the time so I saw it immediately :P

@Deninc
Copy link

Deninc commented May 13, 2020

All the databases I've worked on have lowercase convention for field names. KSQL is the only exception and it has made a lot of confusion and trouble.. especially then connecting with other data sources.

@emerzonic
Copy link

emerzonic commented May 3, 2021

I don't know why ksql decided to uppercased data when kafka is not doing so. Could we have ksql just leave the data just as the schema in the schema registry or as it receives it from the source kafka topic? This makes working with ksql impossible especially when you have large data set.

@agavra
Copy link
Contributor

agavra commented May 3, 2021

hey @Deninc @emerzonic - thanks for your (totally valid) concerns, since this ticket is specifically to allow casing can you add your comments here: #2415? Otherwise comments on closed issues tend to slip through the cracks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.