Generated table creation SQL fails for strings longer than 256 characters #29
Comments
Yeah, sorry, I noticed this when I was visually inspecting the generated schemata from Redshift and forgot to document it as a known issue. You're right that the issue comes from the fact that we're calling out to the JDBC schema generation code, which has a default 256-character varchar for StringType. The ideal solution would be for us to set the column to `VARCHAR(N)` based on the data, as you suggest.

If we did go with something like that, we would probably want to raise another ticket to allow users to explicitly set the width of columns too. It could easily be that they know the maximum width of valid data in a particular column and want to raise an error if any data is bigger. It might actually be better to deal with this situation by allowing users to specify entire column spec overrides, thus enabling them to tune compression settings, etc.
Yup, your thoughts are definitely in line with mine. I suppose there are a few options with varying tradeoffs: (1) use an accumulator to calculate N for `VARCHAR(N)`. These could all be supported by delegating the decision to the user via configuration, but item (1) definitely seems like a requirement and is most relevant to my use cases, so I'm going to work on it now and submit a PR when I have something worth submitting.
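For illustration, a minimal sketch of what option (1) could look like with the Spark 1.x accumulator API; the helper name and column handling here are hypothetical, and it costs an extra action over the data unless it can be folded into the write job itself:

```scala
import org.apache.spark.AccumulatorParam
import org.apache.spark.sql.DataFrame

// A max-tracking accumulator: merging two partial results keeps the larger one.
object MaxIntParam extends AccumulatorParam[Int] {
  override def zero(initialValue: Int): Int = initialValue
  override def addInPlace(r1: Int, r2: Int): Int = math.max(r1, r2)
}

// Hypothetical helper: find the longest string in `column` so that the
// generated DDL can use VARCHAR(N) instead of the 256-character default.
def maxStringLength(df: DataFrame, column: String): Int = {
  val maxLen = df.sqlContext.sparkContext.accumulator(0)(MaxIntParam)
  df.foreach { row =>
    val value = row.getAs[String](column)
    if (value != null) maxLen += value.length
  }
  maxLen.value
}
```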
Cool, that's great! I guess another option, which some users might want, is to enable the
What about using column metadata to specify the length when desired? By default we can choose a "safe" large column width so that data is never truncated and insertions always work. When users care about extra performance, they can add metadata to string columns and we can create more efficient fixed-length ones.
Note, I think this is compatible with the other suggestions, as there could be an option to run a job that automatically populates this metadata using accumulators when desired.
That sounds good. I guess that would mean, though, that we either stop using
@jaley I think you're right that if we want a calculated value of N for VARCHAR(N) we'll have to forgo using the JDBC schemaString function. However, for generating a large default VARCHAR we could simply add a new JDBC dialect for Redshift (https://github.com/apache/spark/blob/db81b9d89f62f18a3c4c2d9bced8486bfdea54a2/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala), which is actually just what's being called by the `schemaString` function.
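To make that concrete, a minimal sketch of such a dialect might look like the following; the object name and URL check are assumptions, and 65535 is used as the wide default because it is Redshift's maximum VARCHAR width:

```scala
import java.sql.Types

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

// Sketch: a Redshift dialect that widens StringType to Redshift's maximum
// VARCHAR length instead of relying on the 256-character TEXT default.
object RedshiftDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:redshift")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR(65535)", Types.VARCHAR))
    case _          => None // fall back to the default JDBC type mappings
  }
}

// Register it once so that schema generation picks it up for matching URLs.
JdbcDialects.registerDialect(RedshiftDialect)
```

Note that this only changes the type mapping: `getJDBCType` receives just the `DataType`, so it has no access to per-column metadata from a user-provided schema.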
@Aerlinger The JDBCDialect looks like the "right" way to do it, but I don't think it'll give you a way to pass through metadata from a provided user schema? You can only implement the mapping from a Spark SQL DataType to a JDBC type, which never sees per-column metadata.

I would avoid increasing the default VARCHAR width across the board, though; wide columns have a cost on the Redshift side.

How about this as a rough approach?
Generating the schema string ourselves and incorporating hints from the schema metadata would actually give us a good basis for adding column compression hints later. That can be useful: with medium-sized tables, it's not always a good plan to have Redshift decide. What do you guys think? I'm happy to lend a hand with code or reviews if I can be of help.
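As a rough illustration of that direction, generating the column definitions ourselves from the StructType could look something like this; the "maxlength" metadata key, the 65535 fallback, and the type coverage are assumptions for the sketch, not the library's eventual implementation:

```scala
import org.apache.spark.sql.types._

// Sketch: build the column list for CREATE TABLE ourselves, honoring a
// "maxlength" metadata hint on string columns and falling back to a wide
// default otherwise.
def redshiftColumnDefs(schema: StructType): String = {
  schema.fields.map { field =>
    val sqlType = field.dataType match {
      case StringType if field.metadata.contains("maxlength") =>
        s"VARCHAR(${field.metadata.getLong("maxlength")})"
      case StringType    => "VARCHAR(65535)" // Redshift's maximum VARCHAR width
      case IntegerType   => "INTEGER"
      case LongType      => "BIGINT"
      case DoubleType    => "DOUBLE PRECISION"
      case BooleanType   => "BOOLEAN"
      case TimestampType => "TIMESTAMP"
      case other         => sys.error(s"Unsupported type for Redshift: $other")
    }
    val nullable = if (field.nullable) "" else " NOT NULL"
    "\"" + field.name + "\" " + sqlType + nullable
  }.mkString(", ")
}
```

The write path could then issue the `CREATE TABLE` statement with this column list over JDBC, and per-column compression hints could later be appended in the same place.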
+1 to removing our dependence on the private JDBC schema generation code.

I'd have a preference towards not doing an extra pass over the data unless the user requests it, but instead just picking a large safe default. It could be very expensive, and I'm not sure how that would work when appending to an existing table. If someone has numbers to show how big of a deal this is on the Redshift read side, I could be swayed.
@jaley You're right that you can't pass metadata through the JDBC dialect, and I agree that a fresh implementation of the schema generation is the way to go. I don't think it's possible to use an accumulator to infer the size for VARCHAR(N) columns, though.
On second thought, it may be possible to use an accumulator within the write job itself.
In #46, I removed the dependence on Spark's JDBC schema generation code.
@eduardoramirez and I noticed this.
This patch allows users to specify a `maxlength` column metadata entry for string columns in order to control the width of `VARCHAR` columns in generated Redshift table schemas. This is necessary in order to support string columns that are wider than 256 characters. In addition, this configuration can be used as an optimization to achieve space savings in Redshift.

For more background on the motivation of this feature, see #29. See also #53, which improves error reporting when LOAD fails.

Author: Josh Rosen <[email protected]>

Closes #54 from JoshRosen/max-length.
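For reference, using the `maxlength` metadata from user code looks roughly like this; the column name and chosen width are illustrative placeholders, and the write step is elided:

```scala
import org.apache.spark.sql.types.MetadataBuilder

// df is an existing DataFrame with a string column named "comments".
// Attach a maxlength hint to that column; the generated CREATE TABLE will
// then use VARCHAR(4096) for it instead of the 256-character default.
val meta = new MetadataBuilder().putLong("maxlength", 4096).build()
val annotated = df.withColumn("comments", df("comments").as("comments", meta))

// Write `annotated` through the Redshift data source as usual; the url,
// dbtable, tempdir, and credentials options are omitted here.
```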
In #54, I added support for configuring VARCHAR column lengths via column metadata. I think that the original issue here has therefore been fixed. If we want to continue discussing automatic inference of column lengths then I propose that we do so in a separate issue / thread in order to make things easier to track.
Thanks to @jaley for the recent milestone on getting data into Redshift from a DataFrame; the latest updates are pretty awesome. One edge case I encountered, however, is during schema generation: `String` datatypes in the DataFrame always get mapped to the `TEXT` type in the generated SQL schema. This is problematic as the `TEXT` type is actually just an alias for `varchar(256)` in Redshift, and will subsequently cause a failure during the data import to Redshift if any rows contain text exceeding 256 characters (the specific error in Redshift is `String length exceeds DDL length`).

Digging through the code, it looks like the source of this problem actually originates from the `spark.sql.jdbc.JDBCWrapper` class, which defines the mapping (https://github.com/apache/spark/blob/8a94eb23d53e291441e3144a1b800fe054457040/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L128). Redshift only supports fixed-length fields, so I don't see another way to preserve data integrity without replacing `TEXT` in the SQL schema with `VARCHAR(N)`, where `N` is the longest string length for that column in the DataFrame. Obviously this would introduce a small amount of overhead and complexity to the code.