Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iceberg: INSERT INTO ... SELECT creates corrupted parquet files #10787

Closed
clemensvonschwerin opened this issue Jan 25, 2022 · 2 comments
Closed

Comments

@clemensvonschwerin
Copy link
Contributor

Issue: On some asserted INSERT INTO ... SELECT query the data part of resulting parquet files is corrupted, footer is readable, though (verified via trino SELECT and parquet-tools)

Trino version: 367

Connector: Iceberg

Compression: GZIP

Example query:

INSERT INTO insert_table
SELECT RANK() OVER (ORDER BY coalesce(rank_dim_1.rank,0),coalesce(rank_dim_2.rank,0),coalesce(rank_dim_3.rank,0),coalesce(
rank_dim_4.rank,0),coalesce(rank_dim_5.rank,0)), rl.id AS line_item_id 
FROM main_table rl 
LEFT OUTER JOIN 
(SELECT DENSE_RANK() OVER (order by dim_1) AS rank, dim_1 
FROM (SELECT DISTINCT LOWER(dim_1) AS dim_1 FROM main_table)) rank_dim_1 ON ( rank_dim_1.dim_1 = LOWER(rl.dim_1) ) 
LEFT OUTER JOIN 
(SELECT DENSE_RANK() OVER
 (order by dim_2) AS rank, dim_2 
 FROM (SELECT DISTINCT LOWER(dim_2) AS dim_2 FROM main_table)) rank_dim_2 ON ( rank_dim_2.dim_2 = LOWER(rl.dim_2) ) 
LEFT OUTER JOIN 
(SELECT DENSE_RANK() OVER (order by dim_3) AS rank, dim_3 
FROM (SELECT DISTINCT LOWER(dim_3) AS dim_3 
FROM main_table)) rank_dim_3 ON ( rank_dim_3.dim_3 = LOWER(rl.dim_3) ) 
LEFT OUTER JOIN (SELECT DENSE_RANK() OVER (order by dim_4) AS rank, dim_4 
FROM (SELECT DISTINCT LOWER(dim_4) AS dim_4 
FROM main_table)) rank_dim_4 ON ( rank_dim_4.dim_4 = LOWER(rl.dim_4) ) 
LEFT OUTER JOIN (SELECT DENSE_RANK() OVER (order by dim_5) AS rank, dim_5 
FROM (SELECT DISTINCT LOWER(dim_5) AS dim_5 FROM main_table)
) rank_dim_5 ON ( rank_dim_5.dim_5 = LOWER(rl.dim_5) ) 

Number of rows in main_table: approx. 63M

Unfortunately I cannot share any data to reproduce, because it is company internal data and not too small in size. Any ideas towards what could cause this issue and how to work around it are highly appreciated.

@hashhar
Copy link
Member

hashhar commented Jan 25, 2022

See #10710.

Can you verify whether disabling S3 streaming upload resolves the issue for you?

@clemensvonschwerin
Copy link
Contributor Author

@hashhar thank you so much for pointing me to the thread. I applied #10729 to our trino setup and this seems to fix the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants