Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sql commands with same column name to dbt rpc produce weird results in table #3147

Closed
1 of 5 tasks
nave91 opened this issue Mar 5, 2021 · 3 comments · Fixed by #3158
Closed
1 of 5 tasks

Sql commands with same column name to dbt rpc produce weird results in table #3147

nave91 opened this issue Mar 5, 2021 · 3 comments · Fixed by #3158
Labels
bug Something isn't working good_first_issue Straightforward + self-contained changes, good for new contributors!
Milestone

Comments

@nave91
Copy link

nave91 commented Mar 5, 2021

Describe the bug

When we run sql like select 1 as a, 2 as b, 3 as a, 4 as d produces weird table results as shown in screenshot below.
Screen Shot 2021-03-04 at 10 26 08 PM

Steps To Reproduce

  1. Start dbt rpc server with dbt rpc.
  2. Send curl to create a request in rpc server
 curl -X POST --header "Content-Type: application/json" "http://localhost:8580/jsonrpc" --data '{"jsonrpc": "2.0", "id": "1_1_2021-03-04T22:28:34.621-05:00", "method": "run_sql", "params": {"sql": "c2VsZWN0IDEgYXMgYSwgMiBhcyBiLCAzIGFzIGEsIDQgYXMgZApsaW1pdCA1MDAKLyogbGltaXQgYWRkZWQgYXV0b21hdGljYWxseSBieSBkYnQgY2xvdWQgKi8=", "name": "request", "task_tags": {"uriValue": "file:///1614005139931/__unsaved/Statement 1.sql", "clientStart": "2021-03-04T22:28:34.597-05:00", "branch_name": "master", "cli": "run_sql", "entityId": "ac36a481-aa3e-4503-96d3-c0ccbb2a5dbb"}}}'
  1. Run curl to see output of above query request:
curl  -X POST --header "Content-Type: application/json" "http://localhost:8580/jsonrpc" --data '{"jsonrpc": "2.0", "id": "1_1_2021-03-04T22:28:34.621-05:00", "method": "poll", "params": {"request_token": "a76096de-a72b-4bd8-ba9e-8d70efd65237", "logs": true, "logs_start": 0}}'

Result looks something like below and we are looking for table key

{
   "result":{
      "state":"success",
      "start":"2021-03-05T04:16:02.795158Z",
      "end":"2021-03-05T04:16:04.445599Z",
      "elapsed":1.650441,
      "logs":[
         ....
            "table":{
               "column_names":[
                  "a",
                  "b",
                  "a_2",
                  "d"
               ],
               "rows":[
                  [
                     3.0,
                     2.0,
                     4.0,
                     null
                  ]
               ]
            }
         }
      ],
      "generated_at":"2021-03-05T04:16:04.232967Z",
      "elapsed_time":0.134774
   },
   "id":"1_1_2021-03-04T22:28:34.621-05:00",
   "jsonrpc":"2.0"
}

Expected behavior

Result should've been:

...
'column_names': ['a', 'b', 'a_2', 'd']
'rows': [[1.0, 2.0, 3.0, 4.0]]
...

Screenshots and log output

Output of dbt rpc:
Screen Shot 2021-03-04 at 10 26 08 PM

System information

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: ____________)

The output of dbt --version:

installed version: 0.17.0
   latest version: 0.19.0

Your version of dbt is out of date! You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 0.17.0
  - snowflake: 0.17.0
  - redshift: 0.17.0
  - postgres: 0.17.0

The operating system you're using:
macOS Catalina 10.15.7

The output of python --version:
Python 3.7.4

Additional context

Add any other context about the problem here.

@nave91 nave91 added bug Something isn't working triage labels Mar 5, 2021
@jtcohen6 jtcohen6 added good_first_issue Straightforward + self-contained changes, good for new contributors! and removed triage labels Mar 5, 2021
@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 5, 2021

Thanks @nave91! I think this is a dupe of #2625, but you've provided much more detail here, the reproduction case is even weirder, and I'm better equipped to investigate this now than I was last summer.

I don't actually think this is an RPC bug; I think it's a more foundational bug with adapter.execute. That's the method doing the legwork here to return the result table:
https://github.com/fishtown-analytics/dbt/blob/b70fb543f5825744b7705009aef815b8071ab516/core/dbt/rpc/node_runners.py#L89-L97

And it's possible to reproduce this using run_query, an entirely separate dbt entry point into adapter.execute, by putting this in a dbt SQL file (model, analysis, etc):

{% if execute %}

    {% set my_bad_query %}
    select 1 as a, 2 as b, 3 as a, 4 as d
    {% endset %}

    {% set results = run_query(my_bad_query) %}

    {{ results.column_names }}
    {% for row in results %}
        {{ row }}
    {% endfor %}

{% endif %}

And compiling it:

```sql
('a', 'b', 'a_2', 'd')
        <agate.Row: (Decimal('3'), Decimal('2'), Decimal('4'), None)>

After a little more digging, it looks like the culprit here executeget_result_from_cursorprocess_results:

https://github.com/fishtown-analytics/dbt/blob/b70fb543f5825744b7705009aef815b8071ab516/core/dbt/adapters/sql/connections.py#L97-L103

I popped ipbd onto line 103, compile the query above, and bingo:

> /Users/jerco/dev/product/dbt/core/dbt/adapters/sql/connections.py(105)process_results()
    104
--> 105         return [dict(zip(column_names, row)) for row in rows]
    106

ipdb> column_names
['a', 'b', 'a', 'd']
ipdb> rows
[(1, 2, 3, 4)]
ipdb> for row in rows:
    dict(zip(column_names, row))

{'a': 3, 'b': 2, 'd': 4}

It's not possible to have a dictionary with two keys both named a!

The full array of column_names (four in all, including both a) and this data dict (with only three items) are passed to agate_helper.table_from_data_flat(), and that's where the second a is aliased:

ipdb> dbt.clients.agate_helper.table_from_data_flat(
            data,
            column_names
        ).print_table()
| a | b | a_2 | d |
| - | - | --- | - |
| 3 | 2 |   4 |   |

I think the _n alias is perfectly reasonable if a column name is duplicated. Could we update process_results() to efficiently perform the same sort of aliasing, and return all results (rather than just some)? I'm going to call this a good first issue for someone who knows python better than I do :)

@nave91
Copy link
Author

nave91 commented Mar 8, 2021

Aah! My bad, I could've added details to that existing ticket. I'll keep in mind to look for existing tickets before creating a new one. Thank you for clarifying the details there @jtcohen6!

@jtcohen6
Copy link
Contributor

jtcohen6 commented Mar 8, 2021

@nave91 All good, thanks for writing up such a detailed issue! You really helped to clue me into what's going on here. And if, in your boundless free time, you ever want to contribute a few clever lines of python, now there's a clear place to do it ;)

techytushar pushed a commit to techytushar/dbt that referenced this issue Mar 11, 2021
techytushar pushed a commit to techytushar/dbt that referenced this issue Mar 11, 2021
jtcohen6 added a commit that referenced this issue Mar 17, 2021
Feature to add _n alias to same column names #3147
@jtcohen6 jtcohen6 added this to the Margaret Mead milestone Apr 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good_first_issue Straightforward + self-contained changes, good for new contributors!
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants