Improve shading scope #1398
Comments
Issue: The Hive class packages are relocated in the RAPIDS UDF code, but these Hive classes are not included in the shaded jar.
The type of "function" is org.apache.hadoop.hive.ql.exec.UDF. If it is not excluded in aggregator.xml, this line is relocated to
The package of UDF in the Spark runtime hive-exec jar is still org.apache.hadoop.hive.ql.exec.UDF.
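To make the mismatch concrete, here is a minimal sketch that models relocation as a plain prefix rewrite on class names. It is not how the Maven shade plugin is implemented, and the shaded prefix `com.example.shaded` and the helper names `relocate` and `resolves` are assumptions made up for illustration.

```python
# Toy model of a relocation rule: rewrite a package prefix unless excluded.
HIVE_PREFIX = "org.apache.hadoop.hive"
SHADED_PREFIX = "com.example.shaded.org.apache.hadoop.hive"  # hypothetical prefix


def _matches(name, prefix):
    """True if name is the prefix itself or lives underneath it."""
    return name == prefix or name.startswith(prefix + ".")


def relocate(class_name, excludes=()):
    """Return the class reference as it would appear after relocation."""
    if any(_matches(class_name, e) for e in excludes):
        return class_name  # excluded references keep their original package
    if _matches(class_name, HIVE_PREFIX):
        return SHADED_PREFIX + class_name[len(HIVE_PREFIX):]
    return class_name


# The Spark runtime hive-exec jar still provides the class under its original name.
RUNTIME_CLASSES = {"org.apache.hadoop.hive.ql.exec.UDF"}


def resolves(reference):
    """True if the (possibly relocated) reference exists at runtime."""
    return reference in RUNTIME_CLASSES


udf = "org.apache.hadoop.hive.ql.exec.UDF"
print(resolves(relocate(udf)))                   # reference was rewritten
print(resolves(relocate(udf, excludes=(udf,))))  # reference left alone
```

Without the exclusion the rewritten reference points at a class that nothing provides, which is the failure described above; with the exclusion the reference still matches the class in hive-exec.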
The problem with Option 2 is that it's more or less where we already are now, just using an include approach rather than an exclude approach. Either way, we have to carefully maintain the include/exclude list as the code evolves. Otherwise we get a nasty problem that is not caught at compile time but only at runtime, where an incorrect shading policy results in an unresolved or incompatible class error.

Also, Option 2 as written seems to have problems since it fails to address the Hive classes that are needed by ORC. If we run against a Spark version that does not provide an ORC-with-hive classifier, we'll get class not found errors for the Hive classes ORC needs at runtime unless we also shade Hive. And then we're back to the 'oops, I shaded too much of Hive' problem.

Option 1 is a lot more up-front code motion, but it has the nice property that once the code has been moved, we no longer need to manually keep the shading inclusion/exclusion list in sync with the code as the code evolves. We can shade all of the ORC and Hive packages for the plugin I/O module without needing to worry about accidentally breaking the UDF code or any other code in the plugin that needs to access raw ORC or Hive classes directly for some reason.

Basically this option means we need to split sql-plugin into three modules. Note I'm using just some sample names here, I'm not suggesting these are the best names:
To be clear, I'm not adamant that we use option 1, but I'm not a big fan of option 2 since it seems essentially equivalent to what we're already doing today, just switching to an include list rather than an exclude list (and I don't think it will actually work in all cases because it ignores Hive). Whether that is better or worse depends only on which part of the code base we expect to evolve faster, as both require manual maintenance of the inclusion or exclusion list for shading. So from my perspective, the discussion should be about whether we do option 1 or stick with the manually maintained shading lists we already have today.
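As a rough illustration of why both list styles need upkeep, the sketch below models an exclude-list policy and an include-list policy as simple prefix checks. The list contents are hypothetical, and the two class names are only examples of "a Hive class new UDF code might touch" and "a Hive class ORC depends on"; none of this is taken from the actual build files.

```python
def _matches(name, prefix):
    """True if name is the prefix itself or lives underneath it."""
    return name == prefix or name.startswith(prefix + ".")


def relocated_with_excludes(name, shaded_roots, excludes):
    """Exclude-list policy: shade everything under the roots except the exclusions."""
    if any(_matches(name, e) for e in excludes):
        return False
    return any(_matches(name, r) for r in shaded_roots)


def relocated_with_includes(name, includes):
    """Include-list policy: shade only what is explicitly listed."""
    return any(_matches(name, i) for i in includes)


# Hypothetical lists, written before the code evolved.
SHADED_ROOTS = ("org.apache.orc", "org.apache.hadoop.hive")
EXCLUDES = ("org.apache.hadoop.hive.ql.exec.UDF",)
INCLUDES = ("org.apache.orc",)

# Example: UDF code later starts referencing another Hive class.
new_udf_ref = "org.apache.hadoop.hive.ql.udf.generic.GenericUDF"
# Example: a Hive class that ORC needs at runtime.
orc_hive_dep = "org.apache.hadoop.hive.ql.io.sarg.SearchArgument"

# Exclude list goes stale: the new UDF reference is shaded by accident.
print(relocated_with_excludes(new_udf_ref, SHADED_ROOTS, EXCLUDES))
# Include list is incomplete: the Hive class ORC needs is left unshaded.
print(relocated_with_includes(orc_hive_dep, INCLUDES))
```

Neither outcome is visible at compile time in this model; the stale exclude list shades too much and the short include list shades too little, and both only show up when the class is looked up at runtime.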
Agree, thanks @res-life!
…IDIA#1398) Signed-off-by: spark-rapids automation <[email protected]>
Is your feature request related to a problem? Please describe.
Currently the build system is shading things like ORC and Hive across the entire SQL plugin, and this triggers some issues with areas of the plugin that need to access Hive classes when interacting with spark-hive Catalyst classes (e.g.: #1393).
Describe the solution you'd like
I think it would be cleaner if the portions of the plugin that require shading were in a separate Maven submodule that was shaded, so we can avoid shading across the entire SQL plugin code. Shading would be localized to the code that requires it.
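A small sketch of what "localized" shading means in practice, under the assumption that relocation is applied per module rather than across the whole plugin. The module names `plugin-io` and `plugin-udf`, the `com.example.shaded.` prefix, and the `shade_module` helper are all hypothetical.

```python
SHADED_ROOTS = ("org.apache.orc", "org.apache.hadoop.hive")
SHADE_PREFIX = "com.example.shaded."  # hypothetical relocation prefix

# Hypothetical module layout: only the I/O module opts in to shading.
MODULES = {
    "plugin-io": {
        "shaded": True,
        "references": [
            "org.apache.orc.OrcFile",
            "org.apache.hadoop.hive.ql.io.sarg.SearchArgument",
        ],
    },
    "plugin-udf": {
        "shaded": False,
        "references": ["org.apache.hadoop.hive.ql.exec.UDF"],
    },
}


def _matches(name, prefix):
    """True if name is the prefix itself or lives underneath it."""
    return name == prefix or name.startswith(prefix + ".")


def shade_module(module):
    """Rewrite references only when the module is marked as shaded."""
    if not module["shaded"]:
        return list(module["references"])
    return [
        SHADE_PREFIX + ref if any(_matches(ref, r) for r in SHADED_ROOTS) else ref
        for ref in module["references"]
    ]


for name, module in MODULES.items():
    print(name, shade_module(module))
```

In this model the shaded module can relocate every ORC and Hive reference wholesale, while the unshaded module keeps referring to the Hive classes that Spark provides, with no per-class include or exclude entries to maintain.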
Describe alternatives you've considered
If we could get out of the business of shading entirely that would be awesome, but I don't see how we can support multiple versions of Apache Spark that could contain conflicting versions of Parquet/ORC if we try to use those dependencies directly.