Trino lineage fails to capture upstream columns when join and transformation is used #10272

Closed
amalakar opened this issue Dec 10, 2021 · 7 comments
Labels: bug (Something isn't working)

@amalakar
Contributor

amalakar commented Dec 10, 2021

Here is how to reproduce. The following query:

create table amalakar.new_query_log_1 as
with queries as (
  select *
  from hive.default.event_presto_query_logged p2
  where ds = '2021-12-08' and hr = 3
)
select
  p1.occurred_at as occurred_at,
  substr(p2.query_id, 1, 10) as new_query_id
from queries p1
inner join queries p2
  on p1.query_id = p2.query_id
limit 10

Produces the following lineage:

{
  "hive.amalakar.new_query_log_1.new_query_id": [],
  "hive.amalakar.new_query_log_1.occurred_at": [
    {
      "columnName": "hive.default.event_presto_query_logged.occurred_at"
    }
  ]
}

Notice how the upstream of new_query_id is not being captured.
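
A gap like this can be spotted mechanically. The following is a minimal sketch, assuming the lineage is available as JSON in the shape shown above (output column name mapped to a list of upstream entries). The helper name `columns_missing_upstream` is hypothetical and not part of Trino.

```python
import json


def columns_missing_upstream(lineage):
    """Return the output columns whose upstream list is empty."""
    return sorted(col for col, upstreams in lineage.items() if not upstreams)


# Lineage copied from the output above.
lineage = json.loads("""
{
  "hive.amalakar.new_query_log_1.new_query_id": [],
  "hive.amalakar.new_query_log_1.occurred_at": [
    {"columnName": "hive.default.event_presto_query_logged.occurred_at"}
  ]
}
""")

print(columns_missing_upstream(lineage))
# ['hive.amalakar.new_query_log_1.new_query_id']
```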

I did an impact analysis: at Lyft this bug affects 79% of our lineage, and only 21% is currently being captured accurately.

num_valid_upstream  num_invalid_upstream  valid_percentage
111622              410457                21

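
For reference, the percentage follows directly from the two counts. This is a small sketch of the arithmetic, assuming `valid_percentage` is the valid count over the total, rounded to a whole number (the issue does not say how it was rounded).

```python
num_valid_upstream = 111622
num_invalid_upstream = 410457

total = num_valid_upstream + num_invalid_upstream
# Assumption: percentage rounded to the nearest integer.
valid_percentage = round(100 * num_valid_upstream / total)
invalid_percentage = round(100 * num_invalid_upstream / total)

print(total, valid_percentage, invalid_percentage)  # 522079 21 79
```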
@kokosing
Member

@Praveen2112 would you like to take a look?

Praveen2112 self-assigned this Dec 13, 2021
@Praveen2112
Member

Taking a look at it. This is seen for columns produced by a function whose arguments come from an AliasedRelation.
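
To illustrate the expected behavior, here is a toy sketch, not Trino's analyzer: it resolves alias-qualified column references, including those used as function arguments, back to their source table. The `ALIASES` mapping, the regex, and the `upstream_columns` helper are all hypothetical and exist only for this example.

```python
import re

# Toy model: each relation alias maps to the base table it ultimately reads from.
ALIASES = {
    "p1": "hive.default.event_presto_query_logged",
    "p2": "hive.default.event_presto_query_logged",
}

# Matches references of the form alias.column, wherever they appear.
COLUMN_REF = re.compile(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b")


def upstream_columns(expression, aliases):
    """Return fully qualified source columns referenced in an expression,
    including references nested inside function arguments."""
    found = []
    for alias, column in COLUMN_REF.findall(expression):
        if alias in aliases:
            qualified = f"{aliases[alias]}.{column}"
            if qualified not in found:
                found.append(qualified)
    return found


print(upstream_columns("substr(p2.query_id, 1, 10)", ALIASES))
# ['hive.default.event_presto_query_logged.query_id']
```

A real analyzer works on the parsed query tree rather than on text, so this only shows which upstream column the lineage should report for new_query_id.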

findepi added the bug label Dec 13, 2021
@kokosing
Member

Does #10319 fix your issue? @amalakar, are you able to test it in your setup?

@amalakar
Contributor Author

@Praveen2112 thanks for the amazing turnaround time, appreciate it. The least we can do is help test it. Our setup and deploy process takes some time before we can test anything, though. If it is okay, could we wait until the review comments are resolved and the PR is approved? We would be happy to deploy it then and give it a spin. Does that sound reasonable?

@kokosing
Member

Yes. Thank you for your help. I think you can test it now. The comments in the PR appear to be about implementation details, and addressing them should not change the scope of the fix. Testing now would show whether our tests cover all the needed cases.

@amalakar
Contributor Author

This is working now, thanks! I ran the following query again:

create table amalakar.new_query_log_new_patch as
with queries as (
  select *
  from hive.default.event_presto_query_logged p2
  where ds = '2019-01-20'
)
select
  p1.occurred_at as occurred_at,
  substr(p2.query_id, 1, 10) as new_query_id
from queries p1
inner join queries p2
  on p1.query_id = p2.query_id
limit 10

The lineage I am seeing now is:

{
  "hive.amalakar.new_query_log_new_patch.occurred_at": [
    {
      "columnName": "hive.default.event_presto_query_logged.occurred_at"
    }
  ],
  "hive.amalakar.new_query_log_new_patch.new_query_id": [
    {
      "columnName": "hive.default.event_presto_query_logged.query_id"
    }
  ]
}

cc: Thanks @akashkatipally for helping test this!

@Praveen2112
Member

Fixed by #10319
