Generated table creation SQL fails for strings longer than 256 characters #29

Closed
Aerlinger opened this issue Jul 30, 2015 · 14 comments

@Aerlinger

Thanks to @jaley for the recent milestone on getting data into Redshift from a DataFrame; the latest updates are pretty awesome. One edge case I encountered, however, is that during schema generation, String datatypes in the DataFrame always get mapped to the TEXT type in the generated SQL schema. This is problematic because TEXT is actually just an alias for VARCHAR(256) in Redshift, which will subsequently cause a failure during the data import if any row contains text exceeding 256 characters (the specific Redshift error is "String length exceeds DDL length").

Digging through the code, it looks like the source of this problem actually originates in the spark.sql.jdbc.JDBCWrapper class, which defines the mapping (https://github.com/apache/spark/blob/8a94eb23d53e291441e3144a1b800fe054457040/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L128).
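
For reference, the relevant part of that mapping looks roughly like this (a paraphrase, not the verbatim Spark code; only a few types are shown):

```scala
import org.apache.spark.sql.types._

// Rough paraphrase of Spark's default JDBC type mapping (not verbatim):
// StringType carries no width information, so it falls through to TEXT,
// which Redshift treats as VARCHAR(256).
def defaultJdbcTypeName(dt: DataType): String = dt match {
  case IntegerType => "INTEGER"
  case LongType    => "BIGINT"
  case DoubleType  => "DOUBLE PRECISION"
  case StringType  => "TEXT" // the problematic default for Redshift
  case other       => sys.error(s"No default JDBC type for $other")
}
```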

Redshift only supports fixed-length character fields, so I don't see another way to preserve data integrity without replacing TEXT in the SQL schema with VARCHAR(N), where N is the longest string length for that column in the DataFrame. Obviously this would introduce a small amount of overhead and complexity to the code.

@jaley
Contributor

jaley commented Jul 30, 2015

Yeah, sorry, I noticed this when I was visually inspecting the generated schemata from Redshift and forgot to document it as a known issue. You're right that the issue comes from the fact that we're calling out to the JDBC schema generation code, which defaults to a 256-character VARCHAR for StringType.

The ideal solution would be for us to set the column to VARCHAR(N), as you suggest, where N is the length of the longest String in the column. There are actually performance benefits to making your VARCHAR columns minimal width in Redshift, so this would be nice in that respect, as it's common in ETL use cases to just pick a default that's probably bigger than necessary. I'm not sure of the best implementation strategy for this, though... It might be possible to map a kind of identity function over all String columns during the unload to Avro, which uses a shared accumulator variable to track the max width for all String columns?
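
As a very rough sketch of that idea, using the Spark 1.x accumulator API (the column name is illustrative, and `sc` and `df` are assumed to be in scope; this is not an actual implementation):

```scala
import org.apache.spark.AccumulatorParam

// Accumulator that merges partial results by taking the maximum.
object MaxIntParam extends AccumulatorParam[Int] {
  def zero(initialValue: Int): Int = 0
  def addInPlace(a: Int, b: Int): Int = math.max(a, b)
}

// Assumes `sc` (SparkContext) and `df` (the DataFrame being written) exist.
val maxLen = sc.accumulator(0)(MaxIntParam)

// Identity pass that records the longest value seen in a string column
// ("comment" is purely illustrative).
val passthrough = df.rdd.map { row =>
  val s = row.getAs[String]("comment")
  if (s != null) maxLen += s.length
  row
}
passthrough.count() // force evaluation before reading the accumulator on the driver

val n = maxLen.value // candidate width for VARCHAR(n) in the generated DDL
```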

If we did go with something like that, we would probably want to raise another ticket to allow users to explicitly set the width of columns too. It could easily be that they know the maximum width for valid data in a particular column, and want to raise an error if any data is bigger. It might actually be better to deal with this situation by allowing users to specify entire column spec overrides, thus enabling them to tune compression settings, etc.

@Aerlinger
Author

Yup, your thoughts are definitely in line with mine. I suppose there are a few options with varying tradeoffs: (1) use an accumulator to calculate N for VARCHAR(N); (2) optimistically create a large text column; (3) allow the length to be manually specified/overridden by the user.

These could all be supported by delegating the decision to the user via configuration, but item (1) definitely seems like a requirement and is most relevant to my use cases, so I'm going to work on it now and submit a PR when I have something worth submitting.

@jaley
Contributor

jaley commented Jul 30, 2015

Cool, that's great!

I guess another option, which some users might want, is to enable the TRUNCATECOLUMNS flag. That would probably need a parameter. We might just want a way to allow users to specify an arbitrary string of COPY parameters, as it'll be hard to keep up with new Redshift parameters anyway.
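
Hypothetically, such a pass-through might look like this from the user's side (the `extracopyoptions` name and all connection settings below are made-up placeholders, not an existing parameter):

```scala
// Hypothetical usage sketch: a single option whose value is appended verbatim
// to the generated COPY statement. The option name and settings are illustrative.
df.write
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)                        // assumed connection settings
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-bucket/tmp")
  .option("extracopyoptions", "TRUNCATECOLUMNS") // pass-through COPY parameters
  .mode("append")
  .save()
```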

@marmbrus
Contributor

What about using column metadata to specify the length when desired? By default we can choose a "safe" large column width so that data is never truncated and insertions always work. When users care about extra performance, they can add metadata to string columns and we can create more efficient fixed-length ones.
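
The mechanism for this already exists on the Spark side: StructField carries a Metadata bag that a user-supplied schema could populate. A minimal sketch, assuming a placeholder "size" key:

```scala
import org.apache.spark.sql.types._

// Sketch: carry an optional width hint in column metadata. "size" is a
// placeholder key; a real feature may settle on a different name.
val widthHint = new MetadataBuilder().putLong("size", 1024).build()

val userSchema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("comment", StringType, nullable = true, metadata = widthHint) // -> VARCHAR(1024)
))
// Columns without the hint would fall back to a large, safe default width.
```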

@marmbrus
Contributor

Note, I think this is compatible with other suggestions as there can be an option to run a job and automatically populate this metadata using accumulators when desired.

@jaley
Contributor

jaley commented Jul 30, 2015

That sounds good. I guess that would mean, though, that we either stop using the existing JDBC schemaString function, or add the functionality there to support VARCHAR length configuration from the schema column metadata? Unless maybe it already supports it and I missed that?

@Aerlinger
Author

@jaley I think you're right that if we want a calculated value of N for VARCHAR(N), we'll have to forgo using the JDBC schemaString function. However, for generating a large default VARCHAR, we could simply add a new JDBC dialect for Redshift (https://github.com/apache/spark/blob/db81b9d89f62f18a3c4c2d9bced8486bfdea54a2/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala), which is actually just what's being called by the schemaString method.
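
For the large-default route, a minimal dialect sketch might look like the following (the URL check and the choice of 65535, Redshift's maximum VARCHAR width, are illustrative):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Sketch of a custom dialect that widens the default string mapping so the
// generated schema uses a wide VARCHAR instead of TEXT.
object RedshiftDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:redshift") || url.startsWith("jdbc:postgresql")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(65535)", Types.VARCHAR))
    case _          => None // defer to the default mappings
  }
}

JdbcDialects.registerDialect(RedshiftDialect)
```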

@jaley
Contributor

jaley commented Jul 30, 2015

@Aerlinger The JDBCDialect looks like the "right" way to do it, but I don't think it'll give you a way to pass through metadata from a provided user schema? You can only implement getJDBCType to convert from Spark SQL types to Redshift's types, and all you're given is the DataType -- no metadata from a user schema.

I would avoid increasing the VARCHAR default size too, as Amazon recommend making the columns as small as possible for best performance. 256 is already too big for most enum-like categorical fields.

How about this as a rough approach?

  • Do schemaString ourselves inside spark-redshift. We only need to support Redshift's column types, there aren't very many, and it'll mean we have one less dependency on the package-private JDBC code.
  • Use @marmbrus's suggestion of checking the field metadata in the schema for a "size" attribute. That'll mean users can set a custom schema through the Data Source API as normal, just adding the size settings for the fields that matter (see the sketch after this list).
  • If there's no explicit size setting, figure out a sensible setting dynamically while writing the data to Avro, by mapping a function over the DataFrame before save() that puts the max string lengths in an accumulator. Shouldn't be too bad... right? :-)
  • Maybe have a parameter to disable the above and fall back to a default value.
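
A rough sketch of the first two points combined, assuming a placeholder "size" metadata key and covering only a few Redshift types:

```scala
import org.apache.spark.sql.types._

// Sketch of an in-house schemaString that honours a width hint stored in
// column metadata. "size" is a placeholder key; a real implementation would
// need to cover all Redshift column types.
def redshiftSchemaString(schema: StructType): String = {
  val columns = schema.fields.map { field =>
    val columnType = field.dataType match {
      case StringType if field.metadata.contains("size") =>
        s"VARCHAR(${field.metadata.getLong("size")})"
      case StringType  => "VARCHAR(65535)" // safe fallback instead of TEXT / VARCHAR(256)
      case IntegerType => "INTEGER"
      case LongType    => "BIGINT"
      case DoubleType  => "DOUBLE PRECISION"
      case BooleanType => "BOOLEAN"
      case other       => sys.error(s"Unsupported column type: $other")
    }
    val nullability = if (field.nullable) "" else " NOT NULL"
    "\"" + field.name + "\" " + columnType + nullability
  }
  columns.mkString(", ")
}
```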

Generating the schema string ourselves and incorporating hints from the schema metadata would actually give us a good basis for adding column compression hints later. That can be useful, as with medium-sized tables it's not always a good plan to have Redshift decide.

What do you guys think? I'm happy to lend a hand with code or reviews if I can be of help.

@marmbrus
Contributor

+1 to removing our dependence on the private schemaString.

I'd have a preference towards not doing an extra pass over the data unless the user requests it, but instead just picking a large, safe default. The extra pass could be very expensive, and I'm not sure how it would work when appending to an existing table. If someone has numbers showing how big a deal this is on the Redshift read side, I could be swayed.

@Aerlinger
Author

@jaley You're right that you can't pass metadata through the JDBC dialect, and I agree that a fresh implementation of schemaString to replace the current hack is the way to go.

I don't think it's possible to use an accumulator to infer the size for VARCHAR(N), as accumulators are only readable by the driver, but we'd obviously need to read this value to determine whether a new maximum text length has been encountered. Not sure if I'm missing something, however.

@Aerlinger
Author

On second thought, it may be possible to use an accumulator within the fold function.

@JoshRosen
Contributor

In #46, I removed the dependence on Spark's schemaString by inlining the code here inside of spark-redshift, so I think that this issue should now be easier to fix.

@traviscrawford
Contributor

@eduardoramirez and I noticed this TEXT field issue today. We're very interested in optionally specifying VARCHAR field lengths as we know the field sizes (through custom options on protobuf messages that model our tables).

JoshRosen modified the milestone: 0.5 on Aug 25, 2015
JoshRosen added a commit that referenced this issue Aug 27, 2015
This patch allows users to specify a `maxlength` column metadata entry for string columns in order to control the width of `VARCHAR` columns in generated Redshift table schemas. This is necessary in order to support string columns that are wider than 256 characters. In addition, this configuration can be used as an optimization to achieve space-savings in Redshift. For more background on the motivation of this feature, see #29.

See also: #53 to improve error reporting when LOAD fails.

Author: Josh Rosen <[email protected]>

Closes #54 from JoshRosen/max-length.
@JoshRosen
Contributor

In #54, I added support for configuring VARCHAR column lengths via column metadata. I think that the original issue here has therefore been fixed. If we want to continue discussing automatic inference of column lengths then I propose that we do so in a separate issue / thread in order to make things easier to track.
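
For anyone landing here later, usage looks roughly like this (the column name and the 4096 width are illustrative; see the project README for the authoritative syntax):

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Usage sketch for the `maxlength` column metadata added in #54.
val meta = new MetadataBuilder().putLong("maxlength", 4096).build()
val widened = df.withColumn("comment", df("comment").as("comment", meta))
// Writing `widened` through spark-redshift should now generate
// VARCHAR(4096) for the "comment" column instead of the old TEXT default.
```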

JoshRosen self-assigned this on Sep 9, 2015
blowenstein93 pushed a commit to blowenstein93/spark-redshift that referenced this issue Oct 26, 2019
…en-central-full-release

Stable 4.0.0 release to publish to maven central