Allow statement to specify the casing (camel case, uppercase, etc) for field names when serialized to output topic #1039

dgcaron · 2018-03-23T22:28:39Z

We are getting data from several sources using kafka connect and transform the data using ksql. One of the target databases is a document database (couchbase). When we store the data that is transformed by ksql all fields are uppercased. Aliasing fields did not change the behavior. The systems that rely on the data in couchbase expect the data to be camelcased which is not possible with v0.4 of ksql. Any workarounds or ways to retain the casing set by an alias (or source topic)?

rodesai · 2018-03-28T18:15:34Z

This is not possible with KSQL at the moment. We would have to implement this as a feature that lets you specify (to some degree) how you want your field names serialized. Lets use this issue to track the feature request.

dgcaron · 2018-04-01T20:35:37Z

for the time being, i was thinking about implementing a SMT that uses a schema to rename the fields. is that a plausible work-around for now? The SMT would match the name of the field to the schema (case insensitive) and then replace the source field for the schema field.

apurvam · 2018-04-01T21:44:30Z

You mean your sink connector writing to couch base will transform the field names to the correct casing? That should work.

sneko · 2018-09-15T23:22:06Z

Any news since april to avoid UPPERCASING?

Thank you :)

dgcaron · 2018-09-16T05:49:05Z

I haven't seen any updates yet. I think this is quite a feature so I am hesitant to create a PR myself as I don't oversee all the parts involved (to be honest, I wouldn't know where to start what). We are using a simple SMT at the moment to rename the fields before they are written to the db. Unfortunately, that code is in an internal repo of a company.

The SMT is configured with a avro schema, with the uppercased fields in an alias property. When a new message arrives, the SMT searches for an occurrence in the aliases and then substitutes the field for the name in the schema.

    "name" : "camelCasedName",
    "type" : [ "null", "string" ],
    "default" : null,
    "aliases" : [ "UPPERCASEDNAME" ]
  }

rmoff · 2019-01-21T14:35:01Z

I'm not sure in which version this changed, but as of 5.1 you can quote fields (and objects) to retain their casing.

CREATE STREAM "MixedCaseStream" AS \
            SELECT COL1, \
                   COL1 AS "lowercase_col1", \
                   COL1 AS "MixedCase_COL1", \
                   COL1 AS "UPPERCASE_COL1", \
                   COL1 AS NOTQUOTED_MixedCase_COL1 \
          FROM SOURCE;

More: https://rmoff.net/post/how-ksql-handles-case/

maeglindeveloper · 2019-02-14T08:13:10Z

Hi everyone, I'm facing and issue when I'm trying to PARTITION BY using a column name change as mentionned above (with quote). I'm using the 5.1.0 version of ksql.

Here is my command
CREATE STREAM "mynewstream" AS SELECT "userId" AS "userId" FROM "mystream" PARTITION BY "userId";

The error is the following one

"message": "io.confluent.ksql.util.KsqlException: Column USERID does not exist in the result schema. Error in Partition By clause.\n\tat

It seems that the column name is not resolved.
Any idea?
@rmoff @miguno

mayankiiitm · 2019-07-31T09:25:26Z

I'm not sure in which version this changed, but as of 5.1 you can quote fields (and objects) to retain their casing.
CREATE STREAM "MixedCaseStream" AS \
            SELECT COL1, \
                   COL1 AS "lowercase_col1", \
                   COL1 AS "MixedCase_COL1", \
                   COL1 AS "UPPERCASE_COL1", \
                   COL1 AS NOTQUOTED_MixedCase_COL1 \
          FROM SOURCE;
More: https://rmoff.net/post/how-ksql-handles-case/

When I rename and column name contains any lower case character, column becomes null. Source topic is getting the data. As long column name does not contains any lower case character everything works fine.

mayankiiitm · 2019-07-31T09:31:09Z

In the example above, lowercase_col1, MixedCase_COL1 will be null while UPPERCASE_COL1 will be equal to COL1

JaredDarling · 2019-08-14T00:12:05Z

👍 for getting this looked at. This is unexpected behavior and causes integration issues.

apurvam · 2019-08-15T05:18:20Z

related to #2589

macthestack · 2020-01-13T14:14:14Z

We should allow setting a configuration value so that materialised tables that are created based on streams with AVRO schemas will honour original casing.

pavelpe · 2020-02-23T09:31:48Z

Hi everyone, I'm facing and issue when I'm trying to PARTITION BY using a column name change as mentioned above (with quote). I'm using the 5.1.0 version of ksql.

Here is my command
CREATE STREAM "mynewstream" AS SELECT "userId" AS "userId" FROM "mystream" PARTITION BY "userId";

The error is the following one

"message": "io.confluent.ksql.util.KsqlException: Column USERID does not exist in the result schema. Error in Partition By clause.\n\tat

It seems that the column name is not resolved.
Any idea?
@rmoff @miguno

Any update regarding this issue? Will it be solved in future release?
I'm using KSQL 5.4.0 and it still happens.
Due to this issue, I have to create 2 streams instead one (single stream with mixed-cases field names and re-partitioning ). We use it extensively for with ES Sink connectors.

agavra · 2020-02-25T18:26:14Z

@pavelpe - there was a bug handling PARTITION BY clauses that was fixed in #4018 which is available in 0.6.0+ if you can use our docker based ksqldb deployments and will be available in 5.5.0 when it is released if you need to wait for confluent packaging.

pavelpe · 2020-02-26T12:33:03Z

@pavelpe - there was a bug handling PARTITION BY clauses that was fixed in #4018 which is available in 0.6.0+ if you can use our docker based ksqldb deployments and will be available in 5.5.0 when it is released if you need to wait for confluent packaging.

Maybe I did not entirely understand bug behind #4018 but I just want to be sure that this fix also takes care of mentioned bug, doing PARTITION BY on quoted column name (specifically used in lower cased column names).

From @maeglindeveloper example:
CREATE STREAM "mynewstream" AS SELECT "userId" AS "userId" FROM "mystream" PARTITION BY "userId"; <= such column name.

I'm trying to clarify it because #4018 doesn't say anything about quoted column names.
Having query that consists (as mentioned in #4018):
PARTITION BY col1*col2 AS new_col1;
will result in schema:
NEW_COL1 INT KEY, NEW_COL2 INT. <= all columns names are upper-cased (default KSQL behavior).

Just want to be sure we talk here about same thing.

fish-face · 2020-02-26T14:45:59Z

I don't believe that #3477 really fixed this issue; while you can now specify the casing manually for each column, this is not feasible for data sources with many columns.

Is there some rationale behind the upper-casing? When auto-creating columns from Avro, KSQL already has case information which it can use. When creating tables, streams and topics, the user has just typed in the identifier and overriding what they typed without reason seems like a bad idea.

Worse, not all databases are case insensitive. In particular, postgres is case sensitive in exactly the opposite way to KSQL; unquoted identifiers are coerced to lower case by default. This makes a very bad user experience!

To me the logical default where identifiers absolutely must be coerced to some case or another would be to lower them. Not only because of postgres, but because identifiers should contrast with SQL keywords.

If there is no particular rationale I firmly believe that the default should be to leave column names unmolested or, at worst, lower-case them. This is of course a backwards incompatible change and would have to wait for a major version bump. Pending that, a setting which can alter the behaviour would be a great benefit.

fish-face · 2020-02-28T11:04:11Z

@agavra sorry to highlight you directly on an old issue but do you have any comment on the above? I can open a new issue if it is felt that this is too different.

agavra · 2020-03-02T17:50:47Z

hey @fish-face - thanks for your detailed thoughts! You definitely have a lot of valid concerns and we should address them.

while you can now specify the casing manually for each column, this is not feasible for data sources with many columns.

I totally agree that manually casing each column for something that already exists in schema registry is bad behavior, we should verify that this is the behavior and if so create a new ticket to track that.

On the other hand, I don't think we should change the default case sensitivity to "case sensitive" as this is not SQL standard (5.2.13 in SQL92):


         <regular identifier> ::= <identifier body>

         <identifier body> ::=
              <identifier start> [ { <underscore> | <identifier part> }... ]


         <identifier start> ::= !! See the Syntax Rules

         <identifier part> ::=
                <identifier start>
              | <digit>

         <delimited identifier> ::=
              <double quote> <delimited identifier body> <double quote>

...

     13)A <regular identifier> and a <delimited identifier> are equiva-
        lent if the <identifier body> of the <regular identifier> (with
        every letter that is a lower-case letter replaced by the equiva-
        lent upper-case letter or letters) and the <delimited identifier
        body> of the <delimited identifier> (with all occurrences of
        <quote> replaced by <quote symbol> and all occurrences of <dou-
        blequote symbol> replaced by <double quote>), considered as
        the repetition of a <character string literal> that specifies a
        <character set specification> of SQL_TEXT and an implementation-
        defined collation that is sensitive to case, compare equally
        according to the comparison rules in Subclause 8.2, "<comparison
        predicate>".

You can see here that the sql standard definition of identifier equivalence is actually to replace lower-case letters with upper-case ones and I'm guessing that's why we implemented it that way historically - but I agree that's a little aggressive and it would have identical behavior if we flipped that logic to lowercase instead of uppercase.

When creating tables, streams and topics, the user has just typed in the identifier and overriding what they typed without reason seems like a bad idea.

I think this is the default behavior for most databases. I believe postgres is actually also case-insensitive (but you are right that it implements this by lowercasing everything as opposed to uppercasing):

postgres=# CREATE TABLE foo (ID VARCHAR, COL2 BIGINT);
CREATE TABLE
postgres=# \d foo
                     Table "public.foo2"
 Column |       Type        | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
 id     | character varying |           |          |
 col2   | bigint            |           |          |

postgres=# INSERT INTO foo (Id, cOL2) VALUES ('hi', 1);
INSERT 0 1
postgres=# SELECT * FROM FOO;
 id | col2
----+------
 hi |    1
(1 row)

To me the logical default where identifiers absolutely must be coerced to some case or another would be to lower them. Not only because of postgres, but because identifiers should contrast with SQL keywords.

I don't think contrasting with SQL keywords is a valid concern. SQL keywords are case insensitive and some people prefer typing the keywords in lower case (again an example from postgres):

postgres=# cReaTe Table FOo3 (ID varchar);
CREATE TABLE
postgres=# \d foo3;
                     Table "public.foo3"
 Column |       Type        | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
 id     | character varying |           |          |

Let me know if this answered any lingering doubts you may have!

fish-face · 2020-03-05T11:06:54Z

@agavra thanks for the reply.

I will double-check the schema case and raise another issue if I can confirm it. There is also the matter of creating topics, which we notice KSQL upper-cases as well - let me know if you would like that as a separate issue.

I agree introducing case sensitivity would be a backwards step! (Of course topic names are case sensitive due to kafka itself so we don't have a choice there)

Postgres is in fact case-sensitive but the lowering masks this somewhat:

postgres=# CREATE TABLE "FOO" ("ID" VARCHAR, "COL2" BIGINT);
CREATE TABLE
postgres=# SELECT * FROM FOO;
ERROR:  relation "foo" does not exist
LINE 1: select * from FOO
postgres=# SELECT * FROM "FOO";
 ID 
----
(0 rows)

Behaviour for columns is identical. Notice the lowercase "relation "foo" does not exist" caused by postgres' lowering. This does seem contrary to the SQL standard you quote though I'm not familiar enough with it to be sure (e.g. I don't know for sure that what we're calling identifiers is what they mean).

Would it help to look at some other major databases to see if postgres (and hence our particular irritation here) is an anomaly?

Regarding case-contrast, I agree this is less important, though whether by convention or by standard the advantage remains with upper case keywords and lower/mixed case identifiers IMO :)

fish-face · 2020-03-05T16:36:18Z

the reason your examples in postgres are case sensitive is because you added the double quotes in the create statement

This is to emulate what happens if a KTable is pushed, via connect, to postgres, with upper-case columns. Sorry I wasn't clear. To further clarify: Postgres is never case insensitive, but if you don't double-quote, it will forcibly lower case everything. Thus if you, for any reason, have upper case column or table names, you must from that point on quote those names every time you use them, which is quite annoying for users.

This remains, of course, a bit of a postgres oddity. However I think it does highlight the dangers of interfering with cases.

I will double-check the schema case and raise another issue if I can confirm it.

It looks like the case where you're creating a stream from an avro-formatted topic is covered by this issue.

agavra · 2020-03-05T16:39:28Z

@fish-face I had realized what you meant after I had submitted my comment and deleted it, but I suppose not before you had responded! (For context for people looking at this discussion) I'm giving it another thought

fish-face · 2020-03-05T16:41:03Z

Haha, I happened to be writing part of a reply at the time so I saw it immediately :P

Deninc · 2020-05-13T14:21:17Z

All the databases I've worked on have lowercase convention for field names. KSQL is the only exception and it has made a lot of confusion and trouble.. especially then connecting with other data sources.

emerzonic · 2021-05-03T00:31:49Z

I don't know why ksql decided to uppercased data when kafka is not doing so. Could we have ksql just leave the data just as the schema in the schema registry or as it receives it from the source kafka topic? This makes working with ksql impossible especially when you have large data set.

agavra · 2021-05-03T16:01:42Z

hey @Deninc @emerzonic - thanks for your (totally valid) concerns, since this ticket is specifically to allow casing can you add your comments here: #2415? Otherwise comments on closed issues tend to slip through the cracks.

rodesai added enhancement data-accessibility labels Mar 28, 2018

rodesai changed the title ~~All fieldnames are uppercased after a select as~~ Allow statement to specify format for field names when serialized to output topic Mar 28, 2018

apurvam changed the title ~~Allow statement to specify format for field names when serialized to output topic~~ Allow statement to specify the casing (camel case, uppercase, etc) for field names when serialized to output topic Mar 29, 2018

TobiasHauck mentioned this issue Apr 12, 2019

KLIP-3: WITH OUTPUT_SCHEMA #2679

Closed

2 tasks

mayankiiitm mentioned this issue Jul 31, 2019

>Lowercase character in column name makes the column null #3152

Closed

agavra mentioned this issue Oct 4, 2019

feat: support quoted identifiers in column names #3477

Merged

2 tasks

agavra closed this as completed in #3477 Oct 10, 2019

big-andy-coates mentioned this issue Oct 11, 2019

JSON elements in CAPS #3195

Closed

rayzyar mentioned this issue Feb 26, 2020

Expression support for PARTITION BY clause #4018

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow statement to specify the casing (camel case, uppercase, etc) for field names when serialized to output topic #1039

Allow statement to specify the casing (camel case, uppercase, etc) for field names when serialized to output topic #1039

dgcaron commented Mar 23, 2018

rodesai commented Mar 28, 2018

dgcaron commented Apr 1, 2018

apurvam commented Apr 1, 2018

sneko commented Sep 15, 2018

dgcaron commented Sep 16, 2018 •

edited

Loading

rmoff commented Jan 21, 2019

maeglindeveloper commented Feb 14, 2019 •

edited

Loading

mayankiiitm commented Jul 31, 2019

mayankiiitm commented Jul 31, 2019

JaredDarling commented Aug 14, 2019 •

edited

Loading

apurvam commented Aug 15, 2019

macthestack commented Jan 13, 2020

pavelpe commented Feb 23, 2020

agavra commented Feb 25, 2020

pavelpe commented Feb 26, 2020

fish-face commented Feb 26, 2020

fish-face commented Feb 28, 2020

agavra commented Mar 2, 2020

fish-face commented Mar 5, 2020

fish-face commented Mar 5, 2020

agavra commented Mar 5, 2020

fish-face commented Mar 5, 2020

Deninc commented May 13, 2020

emerzonic commented May 3, 2021 •

edited

Loading

agavra commented May 3, 2021 •

edited

Loading

Allow statement to specify the casing (camel case, uppercase, etc) for field names when serialized to output topic #1039

Allow statement to specify the casing (camel case, uppercase, etc) for field names when serialized to output topic #1039

Comments

dgcaron commented Mar 23, 2018

rodesai commented Mar 28, 2018

dgcaron commented Apr 1, 2018

apurvam commented Apr 1, 2018

sneko commented Sep 15, 2018

dgcaron commented Sep 16, 2018 • edited Loading

rmoff commented Jan 21, 2019

maeglindeveloper commented Feb 14, 2019 • edited Loading

mayankiiitm commented Jul 31, 2019

mayankiiitm commented Jul 31, 2019

JaredDarling commented Aug 14, 2019 • edited Loading

apurvam commented Aug 15, 2019

macthestack commented Jan 13, 2020

pavelpe commented Feb 23, 2020

agavra commented Feb 25, 2020

pavelpe commented Feb 26, 2020

fish-face commented Feb 26, 2020

fish-face commented Feb 28, 2020

agavra commented Mar 2, 2020

fish-face commented Mar 5, 2020

fish-face commented Mar 5, 2020

agavra commented Mar 5, 2020

fish-face commented Mar 5, 2020

Deninc commented May 13, 2020

emerzonic commented May 3, 2021 • edited Loading

agavra commented May 3, 2021 • edited Loading

dgcaron commented Sep 16, 2018 •

edited

Loading

maeglindeveloper commented Feb 14, 2019 •

edited

Loading

JaredDarling commented Aug 14, 2019 •

edited

Loading

emerzonic commented May 3, 2021 •

edited

Loading

agavra commented May 3, 2021 •

edited

Loading