Empty string is converted to null #331

drewrobb · 2017-03-02T20:45:51Z

spark-redshift (v3.0.0-preview1) converts an empty string '' into a null value when reading data from Redshift:

spark.read
  .format("com.databricks.spark.redshift")
  .option("url", url)
  .option("aws_iam_role", iamRole)
  .option("tempdir", tmpDir)        
  .option("query", "select '' as foo, null as bar")
  .load()
  .show()

[info] +----+----+
[info] | foo| bar|
[info] +----+----+
[info] |null|null|
[info] +----+----+

In Redshift itself, null and '' are distinct:

redshift-user=> select foo is null as foonull, bar is null as barnull from (select '' as foo, null as bar);
 foonull | barnull 
---------+---------
 f       | t
(1 row)

spark-sql also supports this distinction:

spark.sql("select '' as foo, null as bar").show()

[info] +---+----+
[info] |foo| bar|
[info] +---+----+
[info] |   |null|
[info] +---+----+

JoshRosen · 2017-03-02T21:04:40Z

This is a known issue (#49) but we could do a better job of documenting it in the README. Do you have any suggestions for how we can fix this?

jstultz · 2017-03-02T22:15:15Z

I was discussing this with @drewrobb (it bit me this morning). I hadn't realized where the distinction was lost; it looks like it's lost in the UNLOAD query sent to Redshift. One (not awesome) option would be to allow a value to be passed through to NULL AS and let the caller choose a value that's not expected to be in the dataset:

http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD_command_examples.html#unload-examples-null-as
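
To make the NULL AS idea concrete, here is a minimal sketch of how an UNLOAD statement with a caller-chosen null marker might be assembled. `UnloadSketch`, `buildUnload`, and the `nullAs` parameter are hypothetical names for illustration, not part of spark-redshift, and the doubling of single quotes inside the statement is my reading of the Redshift UNLOAD documentation rather than something confirmed in this thread.

```scala
// Hypothetical sketch, not spark-redshift code.
object UnloadSketch {
  // Double single quotes so a value can sit inside a SQL string literal.
  private def sqlQuote(s: String): String = s.replace("'", "''")

  /** Build an UNLOAD statement that writes `nullAs` in place of null values. */
  def buildUnload(query: String, s3Path: String, iamRole: String, nullAs: String): String =
    s"UNLOAD ('${sqlQuote(query)}') TO '$s3Path' " +
      s"IAM_ROLE '$iamRole' ESCAPE NULL AS '${sqlQuote(nullAs)}'"
}
```

Called with the query from the report and a marker such as "@NULL@", this asks Redshift to write the marker wherever a value is null instead of a zero-length field. The marker has to be a value that is not expected in the dataset.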

drewrobb · 2017-03-02T22:31:53Z

In conjunction with the unload option ADDQUOTES and NULL AS '@NULL@', you can get output that distinguishes between '' and null correctly through the UNLOAD step. Some further work would be necessary in https://github.com/databricks/spark-redshift/blob/1092c7cd03bb751ba4e93b92cd7e04cffff10eb0/src/main/scala/com/databricks/spark/redshift/RedshiftInputFormat.scala to remove the quotes before returning the data to Spark.
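
The quote-removal step might look roughly like this. It is a sketch only: `UnloadFieldDecoder` and `decode` are hypothetical names, not code from RedshiftInputFormat.scala, and it assumes each field has already been split on the delimiter. This thread does not settle whether Redshift wraps the NULL AS marker in quotes when ADDQUOTES is on, so the sketch accepts both forms. It does not undo ESCAPE backslashes.

```scala
// Hypothetical sketch, not spark-redshift code.
object UnloadFieldDecoder {
  /**
   * Decode one raw field from a file unloaded with ADDQUOTES and NULL AS.
   * Returns None for the null marker and Some(value) otherwise, so an
   * empty string ("" in the file) survives as Some("").
   */
  def decode(raw: String, nullMarker: String = "@NULL@"): Option[String] = {
    // Strip one pair of surrounding double quotes, if present.
    val unquoted =
      if (raw.length >= 2 && raw.startsWith("\"") && raw.endsWith("\""))
        raw.substring(1, raw.length - 1)
      else raw
    // A data value equal to the marker is indistinguishable from null here,
    // which is why the marker must not occur in the dataset.
    if (unquoted == nullMarker) None else Some(unquoted)
  }
}
```

With this, a quoted empty field decodes to an empty string and the marker decodes to null, which is the distinction the current read path loses.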

JoshRosen · 2017-03-03T19:08:40Z

Adding an option for configuration of the null value sounds reasonable to me.

botchniaque · 2018-04-18T12:13:31Z

Is there any update on this?

drewrobb · 2018-04-18T20:08:00Z

Databricks has decided to make new work on this project closed source, so I don't expect any update here or any PR to be accepted. Maybe a community fork will get going. I'm disappointed and will avoid Databricks open source projects going forward.

foivosana · 2019-02-19T09:25:52Z

Is this supposed to be fixed? Writing data with empty strings to Redshift keeps giving me errors about a non-nullable column, but I would like to keep the empty strings as they are and not convert them to null.

Enigma-v · 2022-08-18T00:05:37Z

I was wondering if you have seen Redshift return NULL in one database but an empty string in a different database for the same dataset. Is the data the same in both databases?

drewrobb closed this as completed Apr 18, 2018

dichiarafrancesco mentioned this issue May 12, 2018

Empty string is converted to null Yelp/spark-redshift#4

Merged
