-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hive's CombineTextInputFormat #21842
Comments
This should be easy to fix. Try adding the following lines to the .put(new SerdeAndInputFormat(LAZY_SIMPLE_SERDE_CLASS, "org.apache.hadoop.mapred.lib.CombineTextInputFormat"), TEXTFILE)
.put(new SerdeAndInputFormat(LAZY_SIMPLE_SERDE_CLASS, "org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat"), TEXTFILE) If that works, then please send a PR. Trino has never supported combined splits, so there should be no performance difference. |
I added that and I do get back data, however, the trino logs are flooded with exceptions. They are DEBUG, so, perhaps that is fine and normal. They don't show up with INFO logging. If these are fine, then I'll get out a PR for the change. Thanks!
|
These logs are normal. The |
I use the hive metastore pretty heavily and we currently have hundreds of tables that make use of the CombineTextInputFormat for the storage descriptor. I was recently trying to upgrade to Trino 445 from Trino 409 and found #15921 which mentions removing Hive dependencies in the code base. I spun it up and tried querying against my metastore and was getting errors like this when querying text based tables:
This error tracks with the code changes in the HiveStorageFormat and the HiveClassNames. Our hive tables are used to query with Trino but also run spark jobs against them. So, using the CombineTextInputFormat is ideal to help deal with the small file problem in spark. However, that means those tables can't be queried via Trino.
A few questions:
The text was updated successfully, but these errors were encountered: