Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty string is converted to null #331

Closed
drewrobb opened this issue Mar 2, 2017 · 8 comments
Closed

Empty string is converted to null #331

drewrobb opened this issue Mar 2, 2017 · 8 comments

Comments

@drewrobb
Copy link

drewrobb commented Mar 2, 2017

spark-redshift (v3.0.0-preview1) will convert an empty string '' into a null value when reading data from redshift:

spark.read
  .format("com.databricks.spark.redshift")
  .option("url", url)
  .option("aws_iam_role", iamRole)
  .option("tempdir", tmpDir)        
  .option("query", "select '' as foo, null as bar")
  .load()
  .show()

[info] +----+----+
[info] | foo| bar|
[info] +----+----+
[info] |null|null|
[info] +----+----+

In redshift, there is a distinction between null and '':

redshift-user=> select foo is null as foonull, bar is null as barnull from (select '' as foo, null as bar);
 foonull | barnull 
---------+---------
 f       | t
(1 row)

spark-sql also supports this distinction:

spark.sql("select '' as foo, null as bar").show()

[info] +---+----+
[info] |foo| bar|
[info] +---+----+
[info] |   |null|
[info] +---+----+
@JoshRosen
Copy link
Contributor

This is a known issue (#49) but we could do a better job of documenting it in the README. Do you have any suggestions for how we can fix this?

@jstultz
Copy link

jstultz commented Mar 2, 2017

I was discussing this with @drewrobb (it bit me this morning); I hadn't realized where the distinction was lost, and it looks like it's lost in the UNLOAD query to redshift. One (not awesome) option would be to allow a value to be passed through to NULL AS, and let the caller choose a value that's not expected to be in the dataset

http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD_command_examples.html#unload-examples-null-as

@drewrobb
Copy link
Author

drewrobb commented Mar 2, 2017

In conjunction with the unload option ADDQUOTES and NULL AS '@NULL@, you can get some output that could distinguish between '' and null correctly through the UNLOAD step. Some further work would be necessary in https://github.com/databricks/spark-redshift/blob/1092c7cd03bb751ba4e93b92cd7e04cffff10eb0/src/main/scala/com/databricks/spark/redshift/RedshiftInputFormat.scala to remove the quotes before returning the data to spark.

@JoshRosen
Copy link
Contributor

Adding an option for configuration of the null value sounds reasonable to me.

@botchniaque
Copy link

Is there any update on this?

@drewrobb
Copy link
Author

Databricks has decided to close source new work on this project, so I don't expect any update or any PR accepted here. Maybe a community fork will get going, I'm disappointed and will avoid Databricks open source projects going forward.

@foivosana
Copy link

Is this supposed to be fixed? Writing data with empty strings to Redshift keeps giving me errors regarding non-nullable column, but i would like to keep the empty strings as they are and not convert them to null

@Enigma-v
Copy link

I was wondering if you had an issue with Redshift that returns NULL in one database and on the same dataset but in different database it returns empty string. The data is the same on both of the databases?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants