Generated table creation SQL fails for strings longer than 256 characters #29

Closed
Aerlinger opened this issue Jul 30, 2015 · 14 comments

@Aerlinger

Thanks to @jaley for the recent milestone on getting data into Redshift from a DataFrame; the latest updates are pretty awesome. One edge case I encountered, however, is that during schema generation, String datatypes in the DataFrame always get mapped to the TEXT type in the generated SQL schema. This is problematic because TEXT is actually just an alias for VARCHAR(256) in Redshift, which will subsequently cause a failure during the data import if any row contains text exceeding 256 characters (the specific Redshift error is "String length exceeds DDL length").

Digging through the code, it looks like the source of this problem actually originates in the spark.sql.jdbc.JDBCWrapper class, which defines the mapping (https://github.com/apache/spark/blob/8a94eb23d53e291441e3144a1b800fe054457040/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L128).
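
For reference, the relevant part of that mapping looks roughly like this (a paraphrase, not the verbatim Spark code; only a few types are shown):

```scala
import org.apache.spark.sql.types._

// Rough paraphrase of Spark's default JDBC type mapping (not verbatim):
// StringType carries no width information, so it falls through to TEXT,
// which Redshift treats as VARCHAR(256).
def defaultJdbcTypeName(dt: DataType): String = dt match {
  case IntegerType => "INTEGER"
  case LongType    => "BIGINT"
  case DoubleType  => "DOUBLE PRECISION"
  case StringType  => "TEXT" // the problematic default for Redshift
  case other       => sys.error(s"No default JDBC type for $other")
}
```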

Redshift only supports fixed-length character fields, so I don't see another way to preserve data integrity without replacing TEXT in the SQL schema with VARCHAR(N), where N is the longest string length for that column in the DataFrame. Obviously this would introduce a small amount of overhead and complexity to the code.

@jaley
Contributor

jaley commented Jul 30, 2015

Yeah, sorry, I noticed this when I was visually inspecting the generated schemata from Redshift and forgot to document it as a known issue. You're right that the issue comes from the fact that we're calling out to the JDBC schema generation code, which defaults to a 256-character VARCHAR for StringType.

The ideal solution would be for us to set the column to VARCHAR(N), as you suggest, where N is the length of the longest String in the column. There are actually performance benefits to making your VARCHAR columns minimal width in Redshift, so this would be nice in that respect, as it's common in ETL use cases to just pick a default that's probably bigger than necessary. I'm not sure of the best implementation strategy for this, though... It might be possible to map a kind of identity function over all String columns during the unload to Avro, which uses a shared accumulator variable to track the max width for all String columns?
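
As a very rough sketch of that idea, using the Spark 1.x accumulator API (the column name is illustrative, and `sc` and `df` are assumed to be in scope; this is not an actual implementation):

```scala
import org.apache.spark.AccumulatorParam

// Accumulator that merges partial results by taking the maximum.
object MaxIntParam extends AccumulatorParam[Int] {
  def zero(initialValue: Int): Int = 0
  def addInPlace(a: Int, b: Int): Int = math.max(a, b)
}

// Assumes `sc` (SparkContext) and `df` (the DataFrame being written) exist.
val maxLen = sc.accumulator(0)(MaxIntParam)

// Identity pass that records the longest value seen in a string column
// ("comment" is purely illustrative).
val passthrough = df.rdd.map { row =>
  val s = row.getAs[String]("comment")
  if (s != null) maxLen += s.length
  row
}
passthrough.count() // force evaluation before reading the accumulator on the driver

val n = maxLen.value // candidate width for VARCHAR(n) in the generated DDL
```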

If we did go with something like that, we would probably want to raise another ticket to allow users to explicitly set the width of columns too. It could easily be that they know the maximum width for valid data in a particular column, and want to raise an error if any data is bigger. It might actually be better to deal with this situation by allowing users to specify entire column spec overrides, thus enabling them to tune compression settings, etc.

@Aerlinger
Author

Yup, your thoughts are definitely in line with mine. I suppose there are a few options with varying tradeoffs: (1) use an accumulator to calculate N for VARCHAR(N); (2) optimistically create a large text column; (3) allow the length to be manually specified/overridden by the user.

These could all be supported by delegating the decision to the user via configuration, but item (1) definitely seems like a requirement and is most relevant to my use cases, so I'm going to work on it now and submit a PR when I have something worth submitting.

@jaley
Contributor

jaley commented Jul 30, 2015

Cool, that's great!

I guess another option, which some users might want, is to enable the TRUNCATECOLUMNS flag. That would probably need a parameter. We might just want a way to allow users to specify an arbitrary string of COPY parameters, as it'll be hard to keep up with new Redshift parameters anyway.
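
Hypothetically, such a pass-through might look like this from the user's side (the `extracopyoptions` name and all connection settings below are made-up placeholders, not an existing parameter):

```scala
// Hypothetical usage sketch: a single option whose value is appended verbatim
// to the generated COPY statement. The option name and settings are illustrative.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)                        // assumed connection settings
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("extracopyoptions", "TRUNCATECOLUMNS") // pass-through COPY parameters
  .mode("append")
  .save()
```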

@marmbrus
Contributor

What about using column metadata to specify the length when desired? By default we can choose a "safe" large column width so that data is never truncated and insertions always work. When users care about extra performance, they can add metadata to string columns and we can create more efficient fixed-length ones.
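
The mechanism for this already exists on the Spark side: StructField carries a Metadata bag that a user-supplied schema could populate. A minimal sketch, assuming a placeholder "size" key:

```scala
import org.apache.spark.sql.types._

// Sketch: carry an optional width hint in column metadata. "size" is a
// placeholder key; a real feature may settle on a different name.
val widthHint = new MetadataBuilder().putLong("size", 1024).build()

val userSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("comment", StringType, nullable = true, metadata = widthHint) // -> VARCHAR(1024)
))
// Columns without the hint would fall back to a large, safe default width.
```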

@marmbrus
Contributor

Note, I think this is compatible with other suggestions as there can be an option to run a job and automatically populate this metadata using accumulators when desired.

@jaley
Contributor

jaley commented Jul 30, 2015

That sounds good. I guess that would mean, though, that we either stop using the existing JDBC schemaString function, or add the functionality there to support VARCHAR length configuration from the schema column metadata? Unless maybe it already supports it and I missed that?

@Aerlinger
Author

@jaley I think you're right that if we want a calculated value of N for VARCHAR(N), we'll have to forgo using the JDBC schemaString function. However, for generating a large default VARCHAR, we could simply add a new JDBC dialect for Redshift (https://github.com/apache/spark/blob/db81b9d89f62f18a3c4c2d9bced8486bfdea54a2/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala), which is actually just what's being called by the schemaString method.
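
For the large-default route, a minimal dialect sketch might look like the following (the URL check and the choice of 65535, Redshift's maximum VARCHAR width, are illustrative):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Sketch of a custom dialect that widens the default string mapping so the
// generated schema uses a wide VARCHAR instead of TEXT.
object RedshiftDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:redshift") || url.startsWith("jdbc:postgresql")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(65535)", Types.VARCHAR))
    case _          => None // defer to the default mappings
  }
}

JdbcDialects.registerDialect(RedshiftDialect)
```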

@jaley
Contributor

jaley commented Jul 30, 2015

@Aerlinger The JDBCDialect looks like the "right" way to do it, but I don't think it'll give you a way to pass through metadata from a provided user schema? You can only implement getJDBCType to convert from Spark SQL types to Redshift's types, and all you're given is the DataType -- no metadata from a user schema.

I would avoid increasing the VARCHAR default size too, as Amazon recommend making the columns as small as possible for best performance. 256 is already too big for most enum-like categorical fields.

How about this as a rough approach?

  • Do schemaString ourselves inside spark-redshift. We only need to support Redshift's column types, there aren't very many, and it'll mean we have one less dependency on the package-private JDBC code.
  • Use @marmbrus's suggestion of checking the field metadata in the schema for a "size" attribute. That'll mean users can set a custom schema through the Data Source API as normal, just adding the size settings for the fields that matter (see the sketch after this list).
  • If there's no explicit size setting, figure out a sensible setting dynamically while writing the data to Avro, by mapping a function over the DataFrame before save() that puts the max string lengths in an accumulator. Shouldn't be too bad... right? :-)
  • Maybe have a parameter to disable the above and fall back to a default value.
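
A rough sketch of the first two points combined, assuming a placeholder "size" metadata key and covering only a few Redshift types:

```scala
import org.apache.spark.sql.types._

// Sketch of an in-house schemaString that honours a width hint stored in
// column metadata. "size" is a placeholder key; a real implementation would
// need to cover all Redshift column types.
def redshiftSchemaString(schema: StructType): String = {
  val columns = schema.fields.map { field =>
    val columnType = field.dataType match {
      case StringType if field.metadata.contains("size") =>
        s"VARCHAR(${field.metadata.getLong("size")})"
      case StringType  => "VARCHAR(65535)" // safe fallback instead of TEXT / VARCHAR(256)
      case IntegerType => "INTEGER"
      case LongType    => "BIGINT"
      case DoubleType  => "DOUBLE PRECISION"
      case BooleanType => "BOOLEAN"
      case other       => sys.error(s"Unsupported column type: $other")
    }
    val nullability = if (field.nullable) "" else " NOT NULL"
    "\"" + field.name + "\" " + columnType + nullability
  }
  columns.mkString(", ")
}
```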

Generating the schema string ourselves and incorporating hints from the schema metadata would actually give us a good basis for adding column compression hints later. That can be useful, as with medium-sized tables it's not always a good plan to have Redshift decide.

What do you guys think? I'm happy to lend a hand with code or reviews if I can be of help.

@marmbrus
Contributor

+1 to removing our dependence on the private schemaString.

I'd have a preference towards not doing an extra pass over the data unless the user requests it, but instead just picking a large, safe default. The extra pass could be very expensive, and I'm not sure how it would work when appending to an existing table. If someone has numbers showing how big a deal this is on the Redshift read side, I could be swayed.

@Aerlinger
Author

@jaley You're right that you can't pass metadata through the JDBC dialect, and I agree that a fresh implementation of schemaString to replace the current hack is the way to go.

I don't think it's possible to use an accumulator to infer the size for VARCHAR(N), as accumulators are only readable by the driver, but we'd obviously need to read this value to determine whether a new maximum text length has been encountered. Not sure if I'm missing something, however.

@Aerlinger
Author

On second thought, it may be possible to use an accumulator within the fold function.

@JoshRosen
Contributor

In #46, I removed the dependence on Spark's schemaString by inlining the code here inside of spark-redshift, so I think that this issue should now be easier to fix.

@traviscrawford
Contributor

@eduardoramirez and I noticed this TEXT field issue today. We're very interested in optionally specifying VARCHAR field lengths as we know the field sizes (through custom options on protobuf messages that model our tables).

JoshRosen modified the milestone: 0.5 on Aug 25, 2015
JoshRosen added a commit that referenced this issue Aug 27, 2015
This patch allows users to specify a `maxlength` column metadata entry for string columns in order to control the width of `VARCHAR` columns in generated Redshift table schemas. This is necessary in order to support string columns that are wider than 256 characters. In addition, this configuration can be used as an optimization to achieve space-savings in Redshift. For more background on the motivation of this feature, see #29.

See also: #53 to improve error reporting when LOAD fails.

Author: Josh Rosen <[email protected]>

Closes #54 from JoshRosen/max-length.
@JoshRosen
Contributor

In #54, I added support for configuring VARCHAR column lengths via column metadata. I think that the original issue here has therefore been fixed. If we want to continue discussing automatic inference of column lengths then I propose that we do so in a separate issue / thread in order to make things easier to track.
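
For anyone landing here later, usage looks roughly like this (the column name and the 4096 width are illustrative; see the project README for the authoritative syntax):

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Usage sketch for the `maxlength` column metadata added in #54.
val meta = new MetadataBuilder().putLong("maxlength", 4096).build()
val widened = df.withColumn("comment", df("comment").as("comment", meta))
// Writing `widened` through spark-redshift should now generate
// VARCHAR(4096) for the "comment" column instead of the old TEXT default.
```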

JoshRosen self-assigned this on Sep 9, 2015
blowenstein93 pushed a commit to blowenstein93/spark-redshift that referenced this issue Oct 26, 2019
…en-central-full-release

Stable 4.0.0 release to publish to maven central