Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(sqllab): TRINO_EXPAND_ROWS: expand columns from ROWs #25809

Merged
merged 1 commit into from
Nov 20, 2023

Conversation

giftig
Copy link
Contributor

@giftig giftig commented Oct 31, 2023

SUMMARY

Reimplement, in a simpler way, part of the PRESTO_EXPAND_DATA feature for trino.

Make use of the trino library's type expansion logic added since the original feature was introduced, as we now have sqla ROW types parsed out recursively in a way we can analyse.

Analyse those ROWs and expand our definition of get_columns out to include dotted.path references to all fields it's possible to query by dotted path (i.e. all ROWs, nested arbitrarily deep).

Add an extra optional query_as field to the column definition so that we can override the way it's queried: otherwise sqlalchemy is going to quote the whole thing as a single column name, which isn't correct, it should actually be quoted per dotted segment, and we'll want to alias it to the full dotted string.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before / without feature flag

Schema

screenshot_2023-10-31_13-54-30_166e8c1b

Preview

screenshot_2023-10-31_13-54-36_f74999ec

After

Schema

screenshot_2023-10-31_13-51-16_950c457a

Preview

screenshot_2023-10-31_13-51-26_ea82fadf

Database config

Last checkbox is new, to enable this feature:
screenshot_2023-11-02_11-36-58_52d73911

TESTING INSTRUCTIONS

  1. Configure a trino database and check "Enable row expansion in schemas" in Advanced > SQL Lab settings.
  2. Query tables in trino containing rows and/or nested rows
  3. Observe that dotted paths are expanded out on the schema preview, and queried in the table preview

To easily construct some data containing nested rows, you can use the following instructions:

  1. Run trinodb/trino:latest in docker
  2. docker exec -it trino trino to cli into it
  3. Create a table with nested fields like this:
USE memory.default;
CREATE TABLE foo AS (
  WITH
    data AS (
      SELECT
        (
          SELECT
            ARRAY[1, 2, 3] arr1,
            ARRAY[4, 5, 6] arr2,
            (
              SELECT
                (SELECT 'foo' foo, 'bar' bar) nest1a,
                (SELECT 'hodor' foo, 'HODOR HODOR HODOR' bar) nest1b
            ) nest1,
            (
              SELECT
                (SELECT 'baz' baz, 'buzz' buzz) nest2a,
                (SELECT 'bzzzzzzt' buzzing, 'BZZZZZT!' buzzsaw) nest2b
            ) nest2
        ) field1,
        'toplevel' field2
    )
  SELECT * FROM data

);
  1. Connect to the dockerised trino via superset; I did this by simply adding my trino container into the standard docker-compose definiton, so I could connect to it via trino://root@trino:8080/memory/default

ADDITIONAL INFORMATION

  • Has associated discussion: Implement PRESTO_EXPAND_DATA for trino #25713 (reply in thread)
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

Copy link
Member

@john-bodley john-bodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @giftig for the PR. The one question that I have is whether this is the best UI treatment for displaying complex/nested types as it's somewhat difficult to grok and/or can be somewhat overwhelming without the ability to expand/collapse these rows.

@justinpark is actually working on a SIP which would address the UI treatment for handling these nested types (using a tree structure). I think it's fine (from a UI perspective) as is, based on knowledge that the UI will be improved in the future.

superset/config.py Outdated Show resolved Hide resolved
@john-bodley john-bodley added the review:checkpoint Last PR reviewed during the daily review standup label Oct 31, 2023
@giftig
Copy link
Contributor Author

giftig commented Oct 31, 2023

@john-bodley there are some difficulties in expanding out the data to be more readable which I thought about while working on this:

  1. we can't really do much about arrays; the PRESTO_EXPAND_DATA thing tried to spread them across multiple rows in the output but this seemed very confusing for a few reasons (I went into some detail in the linked discussion)
  2. the original way that was being done was too fragile; it was actually parsing the junk you see there and manually expanding it out, without awareness of types and with too much reliance on being able to parse the syntax we see here. I considered a part 2 to this work being to do that in a more sensible way, either wrapping the user's query with an exhaustive SELECT on the fields to expand out the ROWs for you, or giving the user more control by providing some macros to just make it easier to do exactly that.

There's more, I left a fair few notes on some challenges I found on the related discussion if you're interested.

But I think this fixes a good chunk of the problem and I have no prejudice on whether we also add on some display improvements as well; it's fairly well-encapsulated so hopefully it'll work fine with what @justinpark is working on.

Currently from our end we consider it acceptable that querying a ROW or querying SELECT * directly will produce an unreadable blob since users can see the field breakdown in the schema, see more readable data in the table preview, and autocomplete foo.bar.baz to let them query fields more directly. We'd rather users selected a handful of relevant fields than build datasets with hundreds of selected fields anyway. We did notice that when we query via the trino cli we actually get a slightly different and nicer format though, like this:

{foo=123, bar="something"}

which is a lot easier to understand than [123, "something"] with no labels. So we intended to try to get that same format into superset as our next step, but if there's already work being done to make it a tree we can expand out then sounds like a better solution is already in progress. I'll take a look at the SIP, we'd definitely be interested in what's being done there and this stuff is very fresh in my mind right now.

@giftig giftig force-pushed the EXPAND_DATA-simpler branch from f45ad23 to 7763c01 Compare November 1, 2023 11:32
@john-bodley john-bodley removed the review:checkpoint Last PR reviewed during the daily review standup label Nov 1, 2023
@giftig giftig force-pushed the EXPAND_DATA-simpler branch 6 times, most recently from fbdcc42 to 302c036 Compare November 2, 2023 14:11
Reimplement, in a simpler way, part of the PRESTO_EXPAND_DATA feature
for trino.

Make use of the trino library's type expansion logic added since the
original feature was introduced, as we now have sqla ROW types parsed
out recursively in a way we can analyse.

Analyse those ROWs and expand our definition of get_columns out to
include dotted.path references to all fields it's possible to query by
dotted path (i.e. all ROWs, nested arbitrarily deep).

Add an extra optional query_as field to the column definition so that we
can override the way it's queried: otherwise sqlalchemy is going to
quote the whole thing as a single column name, which isn't correct, it
should actually be quoted per dotted segment, and we'll want to alias it
to the full dotted string.

Add a setting in the database modal to enable this feature.
Copy link
Member

@john-bodley john-bodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the entry in the UI. I've cc'ed in some frontend engineers to review said portion of the code.

@giftig
Copy link
Contributor Author

giftig commented Nov 8, 2023

Hi @john-bodley @villebro @michael-s-molina gentle nudge about this one when you have some time.

@john-bodley
Copy link
Member

The backend changes LGTM. @michael-s-molina or @villebro would you mind taking a look at the frontend changes?

@michael-s-molina
Copy link
Member

@justinpark Should we test this PR with our database to see how the data is displayed?

@giftig
Copy link
Contributor Author

giftig commented Nov 20, 2023

Hi @michael-s-molina, @justinpark have you had chance to test this one? Happy to merge?

@justinpark
Copy link
Member

@giftig LGTM with the current UI design.
One thing I would like to revise the response structure to accommodate a future tree view user interface (UI) that is organized hierarchically. This UI will consist of columns, with each column containing child elements.
Notification_Center

1 similar comment
@justinpark
Copy link
Member

@giftig LGTM with the current UI design.
One thing I would like to revise the response structure to accommodate a future tree view user interface (UI) that is organized hierarchically. This UI will consist of columns, with each column containing child elements.
Notification_Center

@rusackas rusackas merged commit 8d73ab9 into apache:master Nov 20, 2023
josedev-union pushed a commit to Ortege-xyz/studio that referenced this pull request Jan 22, 2024
cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 3.1.0 labels Mar 8, 2024
sfirke pushed a commit to sfirke/superset that referenced this pull request Mar 22, 2024
@giftig giftig deleted the EXPAND_DATA-simpler branch October 21, 2024 14:41
vinothkumar66 pushed a commit to vinothkumar66/superset that referenced this pull request Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L 🚢 3.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants